Learned image compression (LIC) has achieved remarkable coding efficiency, where entropy modeling plays a pivotal role in minimizing bitrate through informative priors. Existing methods predominantly exploit internal contexts within the input image, yet the rich external priors embedded in large-scale training data remain largely underutilized. Recent advances in dictionary-based entropy models have demonstrated that incorporating external priors can substantially enhance compression performance. However, current approaches organize heterogeneous external priors within a single-level dictionary, resulting in imbalanced utilization and limited representational capacity. Moreover, effective entropy modeling requires not only expressive priors but also a parameter estimation network capable of interpreting them. To address these challenges, we propose HiDE, a Hierarchical Dictionary-based Entropy modeling framework for learned image compression. HiDE decomposes external priors into global structural and local detail dictionaries with cascaded retrieval, enabling structured and efficient utilization of external information. Moreover, a context-aware parameter estimator with parallel multi-receptive-field design is introduced to adaptively exploit heterogeneous contexts for accurate conditional probability estimation. Experimental results show that HiDE achieves 18.5%, 21.99%, and 24.01% BD-rate savings over VTM-12.1 on the Kodak, CLIC, and Tecnick datasets, respectively.
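The cascaded global-to-local retrieval described above can be sketched as follows. This is an illustrative NumPy toy, not the HiDE architecture: the dictionary sizes, the dot-product attention, and the conditioning of the local lookup on the retrieved global prior are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

d = 32
global_dict = rng.normal(size=(8, d))    # small dictionary of coarse structural atoms
local_dict = rng.normal(size=(64, d))    # larger dictionary of fine detail atoms

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def retrieve(query):
    # Stage 1: attend over the global dictionary to get a structural prior.
    g_w = softmax(global_dict @ query / np.sqrt(d))
    g_prior = g_w @ global_dict
    # Stage 2: condition the local (detail) lookup on the retrieved global prior.
    l_w = softmax(local_dict @ (query + g_prior) / np.sqrt(d))
    l_prior = l_w @ local_dict
    # Concatenated external prior fed to the entropy parameter estimator.
    return np.concatenate([g_prior, l_prior])

prior = retrieve(rng.normal(size=d))
print(prior.shape)  # (64,)
```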
In modern power systems, edge devices serve as local hubs that collect data, perform on-site computing, sense electrical parameters, execute control actions, and communicate with neighboring edge devices as part of the larger grid. However, as the number of monitored nodes and control loops grows, traditional edge devices face serious limits. They can become overloaded by complex signal processing and decision tasks, causing delays and higher energy use. Standard sensors hit a noise floor that prevents them from detecting minute changes, making it harder to spot early signs of faults or instability. Meanwhile, conventional communication links struggle with bandwidth limits, security risks, and rising encryption demands, which together slow down and weaken the transfer of critical grid information. Quantum technologies have the potential to overcome these challenges. Quantum computers promise substantial, in some cases exponential, speed-ups for optimization and machine-learning tasks that overwhelm ordinary processors. Quantum sensors can measure signals with atomic precision, giving edge devices a more accurate view of grid dynamics. Quantum communication techniques, including quantum key distribution, offer methods to achieve information-theoretic security and ensure that information arrives quickly and without tampering. We explore how quantum technologies can be integrated into edge devices, highlighting both opportunities and challenges.
This paper presents an Adaptive Gain Nonlinear Observer (AGNO) for estimating the external interaction wrench (forces and torques) in human-UAV physical interaction for assistive payload transportation. The proposed AGNO uses the full nonlinear dynamic model to achieve accurate and robust wrench estimation without relying on dedicated force-torque sensors. A key feature of this approach is the explicit consideration of the non-constant inertia matrix, which is essential for aerial systems with asymmetric mass distribution or shifting payloads. A comprehensive dynamic model of a cooperative transportation system composed of two quadrotors and a shared payload is derived, and the stability of the observer is rigorously established using Lyapunov-based analysis. Simulation results validate the effectiveness of the proposed observer in enabling intuitive and safe human-UAV interaction. Comparative evaluations demonstrate that the proposed AGNO outperforms an Extended Kalman Filter (EKF) in terms of estimation root-mean-square error (RMSE), particularly for torque estimation under nonlinear interaction conditions. This approach reduces system weight and cost by eliminating additional sensing hardware, enhancing practical feasibility.
Uncoordinated electric vehicle (EV) charging is altering residential load patterns and pushing distribution transformers to operate beyond their limits. These adverse effects can be offset by exploiting the flexibility in work schedules (hybrid, remote vs. in-person) of EV owners, particularly when combined with rooftop photovoltaic (PV) generation. However, this phenomenon has not yet been explored in depth. This paper addresses this research gap by introducing weekly work schedule-aware robust and chance-constrained optimization formulations for EV charging coordination to determine a transformer's EV hosting capacity. The results obtained using data from a residential feeder in Arizona indicate that an intelligent combination of work schedule flexibility with PV generation can help power utilities effectively manage changing grid demands.
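A scenario-based reading of such a chance-constrained hosting-capacity calculation can be sketched as below. The transformer rating, base load, PV scenario distribution, and 5% violation budget are all illustrative stand-ins, not values or formulations from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

limit = 100.0     # transformer rating (kVA), illustrative
base = 70.0       # non-EV residential load (kW), illustrative
epsilon = 0.05    # allowed probability of exceeding the rating

# Afternoon-hour rooftop PV output scenarios (kW), illustrative distribution.
pv = rng.uniform(5, 25, size=2000)

def max_ev_hosting():
    # Largest coincident EV charging power for which the chance constraint
    # P(base + ev - pv > limit) <= epsilon still holds across the scenarios.
    for ev in np.linspace(0, 80, 200)[::-1]:
        if np.mean(base + ev - pv > limit) <= epsilon:
            return ev
    return 0.0

print(f"chance-constrained EV hosting: {max_ev_hosting():.1f} kW")
```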
In practical data-driven applications on electrical equipment fault diagnosis, training data can be poisoned by sensor failures, which can severely degrade the performance of machine learning (ML) models. However, once the ML model has been trained, removing the influence of such harmful data is challenging, as full retraining is both computationally intensive and time-consuming. To address this challenge, this paper proposes a SISA (Sharded, Isolated, Sliced, and Aggregated)-based machine unlearning (MU) framework for power transformer inter-turn short-circuit fault (ITSCF) localization. The SISA method partitions the training data into shards and slices, ensuring that the influence of each data point is isolated within specific constituent models through independent training. When poisoned data are detected, only the affected shards are retrained, avoiding retraining the entire model from scratch. Experiments on simulated ITSCF conditions demonstrate that the proposed framework achieves almost identical diagnostic accuracy to full retraining, while reducing retraining time significantly.
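The SISA mechanics can be illustrated with a toy example. The nearest-centroid constituent models, the shard count, and the round-robin shard assignment are simplifying assumptions for illustration; the paper's framework additionally slices each shard, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: two Gaussian classes (stand-ins for fault / no-fault features).
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(3, 1, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

N_SHARDS = 4

def shard_of(i):
    # Deterministic shard assignment isolates each sample's influence.
    return i % N_SHARDS

def train_shard(s):
    idx = [i for i in range(len(X)) if shard_of(i) == s and i not in removed]
    Xs, ys = X[idx], y[idx]
    # Nearest-centroid "constituent model" for this shard.
    return {c: Xs[ys == c].mean(axis=0) for c in np.unique(ys)}

removed = set()
models = [train_shard(s) for s in range(N_SHARDS)]

def predict(x):
    # Aggregate the constituent models by majority vote.
    votes = [min(m, key=lambda c: np.linalg.norm(x - m[c])) for m in models]
    return max(set(votes), key=votes.count)

def unlearn(poisoned_indices):
    # Retrain only the shards containing poisoned points, not the whole model.
    removed.update(poisoned_indices)
    for s in {shard_of(i) for i in poisoned_indices}:
        models[s] = train_shard(s)

unlearn([3, 7])                   # both poisoned samples fall in one shard
print(predict(np.full(4, 3.0)))   # 1: a point near the class-1 centroid
```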
Extreme weather events and cyberattacks can cause component failures and disrupt the operation of power distribution networks (DNs), during which reconfiguration and load shedding are often adopted for resilience enhancement. This study introduces a topology-aware graph reinforcement learning (RL) framework for outage management that embeds higher-order topological features of the DN into a graph-based RL model, enabling reconfiguration and load shedding to maximize energy supply while maintaining operational stability. Results on the modified IEEE 123-bus feeder across 300 diverse outage scenarios demonstrate that incorporating the topological data analysis (TDA) tool of persistent homology (PH) yields 9-18% higher cumulative rewards, up to a 6% increase in power delivery, and 6-8% fewer voltage violations compared to a baseline graph-RL model. These findings highlight the potential of integrating RL with TDA to enable self-healing in DNs, facilitating fast, adaptive, and automated restoration.
Power distribution systems increasingly rely on dense sensor networks for real-time monitoring, yet unreliable communication links and equipment malfunctions often result in missing or incomplete measurement sets at the operating center, requiring accurate data recovery techniques. Most existing approaches operate solely on the available measurements and overlook the role of the communication network that delivers sensor data, leading to large, spatially correlated losses when multiple sensors share failing communication links. This paper proposes a communication-aware framework that integrates routing constraints with low-rank matrix completion to improve data recovery accuracy under communication failures. Sensors are grouped into balanced clusters, and routing paths are designed to limit intracluster sensors sharing a common communication path, preventing complete data loss within any cluster. The remaining measurements for each cluster are then recovered using an optimal singular value thresholding (OSVT) method. Simulation results on the IEEE standard test feeder with real-world data demonstrate that the proposed framework significantly improves recovery accuracy compared to communication-agnostic, measurement-only methods.
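The low-rank recovery step can be illustrated with a simplified singular value thresholding loop. This hard-impute-style sketch with a fixed threshold is not the paper's OSVT (which selects the threshold optimally), and the matrix sizes, rank, and loss pattern are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth low-rank measurement matrix (sensors x time samples).
U = rng.normal(size=(30, 3))
V = rng.normal(size=(3, 40))
M = U @ V

# Communication failures: about 30% of entries never reach the operating center.
mask = rng.random(M.shape) > 0.3
X = np.where(mask, M, 0.0)

def svt_complete(X, mask, tau=2.0, iters=300):
    # Alternate between shrinking singular values (low-rank projection)
    # and re-imposing the observed entries.
    Z = X.copy()
    for _ in range(iters):
        Uo, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z = (Uo * np.maximum(s - tau, 0.0)) @ Vt
        Z[mask] = X[mask]
    return Z

Z = svt_complete(X, mask)
err = np.linalg.norm((Z - M)[~mask]) / np.linalg.norm(M[~mask])
print(f"relative error on missing entries: {err:.3f}")
```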
Cognitive Radars (CRs) employ a perception-action cycle to adapt their sensing and transmission strategies based on their perception of the target kinematic states and mission objectives. This paper considers an inverse-learning Electronic Counter Measure (ECM) that infers both the perception and the perception-driven action policy of an adversarial CR from its actions, i.e., the sensing and transmission actions taken by the CR. Existing frameworks in the literature assume knowledge of either the perception or the perception-action policy and infer the other. However, this assumption is unrealistic in an adversarial setting. We address this gap by proposing an online, nonparametric Bayesian machine learning framework and developing the Inverse Particle Filter with Dependent Dirichlet Process (IPFDDP) algorithm, which characterizes the perception-dependent action policy using a Dependent Dirichlet Process (DDP) and embeds kernel-based DDP inference within a Bayesian inverse particle filtering framework to jointly estimate the CR's perception and perception-action policy. Extensive numerical simulations demonstrate that IPFDDP outperforms existing inverse-learning methods in terms of mean squared error, Kullback-Leibler divergence between the estimated and true policy, and accuracy in identifying relative action preferences. Unlike existing techniques, the proposed Bayesian formulation naturally quantifies uncertainty in the inferred perception and perception-action policy, enabling active probing strategies for sample-efficient inverse learning. Simulation results show that active probing integrated with IPFDDP achieves, on average, a 40% faster reduction in KL divergence compared to randomized probing.
This paper presents a novel method for distribution-free robust trajectory optimization and control of discrete-time, nonlinear, and non-Gaussian stochastic systems, with closed-loop guarantees on chance constraint satisfaction. Our framework employs conformal inference to generate coverage-based confidence sets for the closed-loop dynamics around arbitrary reference trajectories, by constructing a joint nonconformity score to quantify both the validity of contraction (i.e., incremental stability) conditions and the impact of external stochastic disturbance on the closed-loop dynamics, without any distributional assumptions. Via appropriate constraint tightening, chance constraints can be reformulated into tractable, statistically valid deterministic constraints on the reference trajectories. This enables a formal pathway to leverage and validate learning-based motion planners and controllers, such as those with neural contraction metrics, in safety-critical real-world applications. Notably, our statistical guarantees are non-diverging and can be computed with finite samples of the underlying uncertainty, without overly conservative structural priors. We demonstrate our approach in motion planning problems for designing safe, dynamically feasible trajectories in both numerical simulation and hardware experiments.
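The distribution-free coverage machinery behind such confidence sets is split conformal prediction. A minimal sketch with a synthetic nonconformity-score distribution follows; the paper's actual scores jointly encode contraction validity and disturbance impact, which this toy does not model.

```python
import numpy as np

rng = np.random.default_rng(3)

alpha = 0.1                                    # target miscoverage
scores = rng.exponential(1.0, size=500)        # calibration nonconformity scores

# Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration
# score upper-bounds a fresh score with probability >= 1 - alpha,
# with no assumption on the score distribution.
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

# Empirical check on fresh draws from the same (unknown) distribution.
fresh = rng.exponential(1.0, size=20_000)
print(f"coverage: {np.mean(fresh <= q):.3f}")
```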
The millimeter wave (mmW) frequency spectrum has been explored recently for large-bandwidth communication. At these frequencies, narrow directional beams are required for communication since signal attenuation is high due to atmospheric absorption. This work presents an AMD RFSoC and Sivers Semiconductors analog front-end based hardware testbed capable of directional communication via analog beamforming at mmW. The proposed testbed comprises an orthogonal frequency division multiplexing (OFDM)-based baseband physical layer and a digital front end, implemented on an ARM processor and a field programmable gate array (FPGA), respectively, integrated with the high-speed data converters of the RFSoC. The RFSoC output at sub-6GHz is integrated with a mmW multi-antenna analog front end for over-the-air communication at 29.8 GHz. We demonstrate end-to-end communication over the air and present bit error rate (BER) analysis in the presence of radio frequency impairments and beam misalignments in real radio channels.
An integrated sensing and communication (ISAC) framework employs radar sensing to enable reliable directional beam-based communication between a base station (BS) and a mobile user (MU). ISAC is expected to be an integral part of 6G, with potential applications in high-speed vehicular communications. Existing works have explored azimuth and Doppler velocity estimated via radar sensing for beam identification in dynamic environments. In this work, we propose radar-enabled modulation scheme selection for ISAC, thereby eliminating the conventional time-consuming downlink-uplink feedback-based modulation scheme selection. We analyze the performance of the proposed approach for four different trajectories and show throughput improvements of 54-209% over state-of-the-art ISAC.
The IEEE 802.11ad standard uses analog beamforming for high-speed directional communication with a mobile user (MU) in the millimeter wave (mmWave) spectrum. However, the lengthy beam alignment procedures involving large data packets between the base station (BS) and the MU introduce considerable overhead, deteriorating the overall throughput. Prior works have proposed 802.11ad-based integrated sensing and communication (ISAC) BS transceivers to eliminate time-consuming beam alignment. Instead, the radar and communication functionalities use the same waveform, spectrum, and millimeter wave front end (MFE) with a common spatial field of view. The radar detects and localizes the MU, enabling subsequent directional communication with the MU. This work proposes an end-to-end IEEE 802.11ad-based ISAC BS transceiver prototype, wherein the digital baseband hardware front end at the edge is integrated with a Simulink-based MFE. The proposed prototype facilitates a systematic link budget and detailed performance analysis for different wireless channels, target motions, signal-to-noise ratios, hardware configurations, and impairments. We also investigate how these impairments affect radar performance and, in turn, the communication metrics, since the performances of both systems are uniquely interrelated in an ISAC system. Our results show that even with hardware impairments, the 802.11ad-based ISAC offers 34% higher throughput than the standard with an ideal MFE.
This paper presents a Vehicle-to-Grid (V2G) coordination framework using reinforcement learning (RL). An intelligent control strategy based on the soft actor-critic algorithm is developed for voltage regulation through single- and multi-hub charging systems while respecting realistic fleet constraints. A two-phase training approach integrates stability-focused learning with battery-aware deployment to ensure practical feasibility. Simulation studies on the IEEE 34-bus system validate the framework against a standard Volt-Var/Volt-Watt droop controller. Results indicate that the RL agent achieves performance comparable to the baseline control strategy in nominal scenarios. Under aggressive overloading, it provides robust voltage recovery (within 10% of the baseline) while prioritizing fleet availability and state-of-charge preservation, demonstrating the viability of constraint-aware learning for critical grid services.
Integrated Energy Systems (IES) are systems of interconnected electricity, gas, heating, and cooling networks, where the carriers interact and depend on one another. Beyond these core vectors, IES may also incorporate additional infrastructures, such as hydrogen, transportation and water networks, whenever sector coupling or cross-vector exchanges are relevant. Although modern cities already function as multi-energy systems, these networks are still planned and operated in isolation, which leads to inefficiencies and unused flexibility. As distributed energy resources (DERs) grow, local coupling among electricity, heating, and gas networks becomes stronger, so coordinated operation across carriers and infrastructures is essential. IES can improve efficiency, flexibility, and renewable integration, yet operation is challenging because of complex interdependencies, non-convex behaviors, and multi-scale dynamics of the energy networks. A key point that the literature often overlooks is the explicit role of network constraints and topology, which shape feasible operating regions, affect scalability, and determine how uncertainty and formal guarantees can be addressed. This review provides a first comprehensive analysis of network-aware modeling, optimization, and control methods for IES. We identify methodological limitations related to tractability, feasibility guarantees, and scalability. Building on these insights, we outline research directions that include distributed optimization with theoretical guarantees and control approaches informed by operational data. The review offers a foundation for scalable, network-aware operational frameworks for future low-carbon energy systems.
We propose a Vocos-based bandwidth extension model that enhances audio at 8-48 kHz by generating missing high-frequency content. Inputs are resampled to 48 kHz and processed by a neural vocoder backbone, enabling a single network to support arbitrary upsampling ratios. A lightweight Linkwitz-Riley-inspired refiner merges the original low band with the generated high frequencies via a smooth crossover. On validation, the model achieves competitive log-spectral distance while running at a real-time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8-core CPU, demonstrating practical, high-quality BWE at extreme throughput.
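A Linkwitz-Riley-style crossover merge of an original low band with generated highs can be sketched with SciPy. The crossover frequency, filter order, and test tones here are illustrative assumptions; zero-phase `sosfiltfilt` is used because forward-backward filtering squares the Butterworth magnitude response, yielding Linkwitz-Riley (LR4) slopes whose two branches sum flat at the crossover.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 48_000
fc = 4_000   # hypothetical crossover; a real BWE system would tie this to the input bandwidth

# 2nd-order Butterworth prototypes; sosfiltfilt squares them into LR4 slopes.
sos_lo = butter(2, fc, btype="low", fs=fs, output="sos")
sos_hi = butter(2, fc, btype="high", fs=fs, output="sos")

t = np.arange(fs) / fs
low_src = np.sin(2 * np.pi * 440 * t)            # original low-band content
high_gen = 0.3 * np.sin(2 * np.pi * 9_000 * t)   # stand-in for generated highs

# Smooth crossover merge: keep the source below fc, the generated signal above.
merged = sosfiltfilt(sos_lo, low_src) + sosfiltfilt(sos_hi, high_gen)

mag = np.abs(np.fft.rfft(merged))
print(int(np.argmax(mag)))   # 440: the source tone dominates the merged output
```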
Radars provide robust perception of vehicle surroundings by effectively functioning in poor light and adverse weather conditions. Synthetic aperture radar (SAR) algorithms are employed to address the limited angular resolution of radars by enlarging antenna aperture size synthetically as the radar moves. An autofocus algorithm is essential to improve the SAR image quality by compensating for errors mainly caused by inaccurate radar localization. Existing autofocus algorithms are mostly tailored for the frequency domain SAR techniques which are prevalent in aviation and spaceborne applications thanks to their lower complexity in large data processing. However, in the automotive context, the backprojection algorithm (BPA) is often preferred since it provides less distorted images at the cost of more complexity. Addressing the gap in efficient autofocus solutions for time-domain algorithms, this paper introduces a dual-layered autofocus strategy that integrates the Polar Format Algorithm (PFA) with BPA. The first layer employs a novel Localization Error Compensation Autofocus (LECA) processing pipeline to estimate and correct the localization errors within the PFA domain, leveraging its computational efficiency. The second layer seamlessly transfers these corrections to BPA, enabling high-quality SAR imaging while maintaining low complexity. Additionally, the strategy extends Phase Gradient Autofocus (PGA) techniques to enhance the efficiency of localization error compensation for BPA. Validated through real-world automotive experiments, the proposed pipeline delivers state-of-the-art image focus and resolution, setting a new benchmark for computationally efficient SAR imaging.
Applications in fifth-generation (5G) networks rely on stable radio-frequency (RF) environments to support mission-critical services in mobility, automation, and connected intelligence. Their exposure to intentional interference or low-power jamming threatens availability and reliability, especially when such attacks remain below link-layer observability. This paper investigates lightweight, explainable, and hardware-efficient jamming detection using the Convolutional Tsetlin Machine (CTM) operating directly on 5G Synchronization Signal Block (SSB) features. The CTM forms Boolean logic clauses over quantized inputs, enabling bit-level inference and deterministic deployment on FPGA fabrics. These properties make the CTM well suited for the real-time, resource-constrained edge environments anticipated in 5G. The proposed approach is experimentally validated on a real 5G testbed using over-the-air SSB data, emulating practical downlink conditions. We benchmark the CTM against a convolutional neural network (CNN) baseline under identical preprocessing and training pipelines. On the real dataset, the CTM achieves comparable detection performance (accuracy 91.53 ± 1.01 vs. 96.83 ± 1.19 for the CNN) while training 9.5x faster and requiring 14x less memory (45 MB vs. 624 MB). Furthermore, we outline a compact FPGA-oriented design for the Zybo Z7 (Zynq-7000) and provide resource projections (not measured) under three deployment profiles optimized for latency, power, and accuracy trade-offs. The results show that the CTM provides a practical, interpretable, and resource-efficient alternative to conventional DNNs for RF-domain jamming detection, establishing it as a strong candidate for edge-deployed, low-latency, and security-critical 5G applications while laying the groundwork for B5G systems.
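The clause structure that makes CTM inference bit-level can be illustrated with a toy, hand-set classifier. In a real CTM the clauses are learned by Tsetlin automata; the features and clauses below are invented purely to show the mechanics.

```python
import numpy as np

# Literal indices 0..3 are features x0..x3; indices 4..7 are their negations.
# Each clause is an AND over its included literals; the class score is the
# sum of positive-polarity clause outputs minus the negative-polarity ones.
clauses = [
    ({"include": [0, 5]}, +1),   # x0 AND (NOT x1)  -> votes "jammed"
    ({"include": [1]},    -1),   # x1               -> votes "clean"
]

def literals(x):
    return np.concatenate([x, 1 - x])

def clause_value(clause, x):
    lits = literals(x)
    return int(all(lits[i] == 1 for i in clause["include"]))

def score(x):
    # Pure Boolean/integer arithmetic: maps directly onto FPGA logic.
    return sum(pol * clause_value(c, x) for c, pol in clauses)

x = np.array([1, 0, 1, 0])                      # quantized SSB features (invented bits)
print("jammed" if score(x) > 0 else "clean")    # jammed
```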
Open Radio Access Network (RAN) enables flexible, AI-driven control of mobile networks through disaggregated, multi-vendor components. In this architecture, xApps handle real-time functions, whereas rApps in the non-real-time controller generate strategic policies. However, current rApp development remains largely manual, brittle, and poorly scalable as xApp diversity proliferates. In this work, we propose a Multi-Agentic AI framework to automate rApp policy generation and orchestration. The architecture integrates three specialized large language model (LLM)-based agents, Perception, Reasoning, and Refinement, supported by retrieval-augmented generation (RAG) and memory-based analogical reasoning. These agents collectively analyze potential conflicts, synthesize intent-aligned control pipelines, and incrementally refine deployment decisions. Experiments across diverse deployment scenarios demonstrate that the proposed system achieves over 70% improvement in deployment accuracy and 95% reduction in reasoning cost compared to baseline methods, while maintaining zero-shot generalization to unseen intents. These results establish a scalable and conflict-aware solution for fully autonomous, zero-touch rApp orchestration in Open RAN.
The Internet of Underwater Things (IoUT) is becoming a critical infrastructure for ocean observation, marine resource management, and climate science. Its development, however, is hindered by severe acoustic attenuation, propagation delays far exceeding those of terrestrial wireless systems, strict energy constraints, and dynamic topologies shaped by ocean currents. Machine learning (ML) has emerged as a key enabler for addressing these limitations, offering data-driven mechanisms that enhance performance across all layers of underwater wireless sensor networks. This tutorial survey synthesises ML methodologies (supervised, unsupervised, reinforcement, and deep learning) specifically contextualised for underwater communication environments. It outlines the algorithmic principles of each paradigm and examines the conditions under which particular approaches deliver superior performance. A layer-wise analysis highlights physical-layer gains in localisation and channel estimation, MAC-layer adaptations that improve channel utilisation, network-layer routing strategies that extend operational lifetime, and transport-layer mechanisms capable of reducing packet loss by up to 91 percent. At the application layer, ML enables substantial data compression and object detection accuracies reaching 92 percent. Drawing on 300 studies from 2012 to 2025, the survey documents energy efficiency gains of 7 to 29 times, throughput improvements over traditional protocols, and cross-layer optimisation benefits of up to 42 percent. It also identifies persistent barriers, including limited datasets, computational constraints, and the gap between theoretical models and real-world deployment. The survey concludes with emerging research directions and a technology roadmap supporting ML adoption in operational underwater networks.
Networked low Earth orbit (LEO) satellite constellations enabled by inter-satellite links offer a promising path toward ubiquitous broadband non-terrestrial services. However, fast orbital motion induces frequent scheduling updates and handovers, while stringent on-board constraints (e.g., limited radio-frequency chains) tightly couple user scheduling with cooperative beamforming. This paper investigates handover-aware power-efficient downlink transmission in networked LEO systems under statistical channel state information. We introduce a two-segment frame structure that separates handover-related operations from user-plane transmission, and propose a power consumption model that captures both the switching cost of newly established satellite-user links and the reduced effective transmission window during handover. Using a hardening-bound ergodic-rate metric, we formulate a per-frame network-wide power minimization problem with joint cooperative beamforming and implicit scheduling under segmented quality-of-service constraints, per-satellite power budgets, and serving-cardinality limits. To address scheduling-induced combinatorial sparsity and nonconvex fractional rate constraints, we develop an iterative algorithm that combines a reweighted $\ell_2$ surrogate with a penalty-based relaxation and a fractional-programming inner loop, yielding a sequence of convex second-order cone programs. Simulations based on time-varying orbital dynamics with frame-wise serving-set evolution and maritime user data quantify the power-handover tradeoff and demonstrate consistent power savings and improved feasibility over non-cooperative and pre-scheduled cooperative baselines.
When the concept of fluid antenna system (FAS) is applied to multiple-input multiple-output (MIMO) systems, this gives rise to MIMO-FAS, a.k.a.~fluid MIMO. Under rich scattering, the spatial correlation matrices are governed by the zeroth-order Bessel function $J_0(\cdot)$ through the continuously adjustable antenna positions, creating a highly non-convex optimization landscape with fluctuating local optima -- the \emph{Bessel landscape}. In this paper, we tackle the joint transmitter (TX) and receiver (RX) antenna position optimization problem in fluid MIMO to maximize the ergodic capacity by shaping this landscape. Using the Kronecker channel decomposition, we first develop a suite of analytical results that expose the problem's intrinsic structure: (i) a high signal-to-noise ratio (SNR) capacity approximation that decomposes the objective into separable log-determinant terms of the TX and RX correlation matrices, $\mathbf{R}_T$ and $\mathbf{R}_R$, respectively, (ii) a closed-form capacity loss bound linking $\det(\mathbf{R}_T)\det(\mathbf{R}_R)$ to the performance gap relative to the independent and identically distributed (i.i.d.) ideal MIMO channel, and (iii) the globally optimal inter-element spacing for $N=2$ fluid elements at the TX, which lies at the first zero of $J_0$. Guided by these insights, we propose two algorithms within an alternating optimization (AO) framework. The first is AO with particle swarm optimization (PSO), which deploys a particle swarm to explore the Bessel landscape globally without gradient information. The second uses successive convex approximation (SCA), obtaining the gradient in closed form via $J_1(\cdot)$ to construct convex surrogates, for orders-of-magnitude faster convergence.
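Result (iii) above has a simple numerical reading: for $N=2$, the $2\times 2$ correlation matrix has $\det(\mathbf{R}_T) = 1 - J_0(2\pi d)^2$ (with $d$ in wavelengths), which is maximized where the correlation vanishes, i.e. at the first zero of $J_0$. A quick SciPy check:

```python
import numpy as np
from scipy.special import j0, jn_zeros

# Under rich scattering, the correlation between two fluid-antenna ports a
# distance d (in wavelengths) apart is rho(d) = J0(2*pi*d).
rho = lambda d: j0(2 * np.pi * d)

# For N=2 TX elements, R_T = [[1, rho], [rho, 1]] and det(R_T) = 1 - rho^2,
# maximized where the correlation vanishes: 2*pi*d at J0's first zero.
d_opt = jn_zeros(0, 1)[0] / (2 * np.pi)
print(round(d_opt, 4))  # 0.3827 wavelengths
```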
Input delays are a common source of performance degradation and instability in control systems. This paper addresses the $\mathcal{H}_\infty$ output-feedback control problem for LPV systems with time-varying input delays under the integral quadratic constraint (IQC) framework. By integrating parameter-dependent Lyapunov functions with dynamic IQC multipliers, we derive convex, delay-dependent synthesis conditions formulated as parameter-dependent LMIs, enabled by the proposed exact-memory controller structure. An explicit controller reconstruction formula is provided to recover the LPV controller from the LMI solution, avoiding the need to specify the functional form of the parameter-dependent controller gains. While the synthesis problem for memoryless control is inherently non-convex, the proposed approach demonstrates significant performance improvement, reduced conservatism, and computational efficiency for standard output-feedback design. Numerical examples illustrate the effectiveness and broad applicability of the method to LPV systems with time-varying input delays.
Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In this work, we investigate model adaptation in realistic settings with dynamic acoustic scene changes and propose a lightweight framework that augments a frozen backbone with low-rank adapters updated via self-supervised training. Experiments on sequential scene evaluations spanning 111 environments across 37 noise types and three signal-to-noise ratio ranges, including the challenging [-8, 0] dB range, show that our method updates fewer than 1% of the base model's parameters while achieving an average 1.51 dB SI-SDR improvement within only 20 updates per scene. Compared to state-of-the-art approaches, our framework achieves competitive or superior perceptual quality with smoother and more stable convergence, demonstrating its practicality for lightweight on-device adaptation of speech enhancement models under real-world acoustic conditions.
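The frozen-backbone-plus-low-rank-adapter arrangement can be sketched for a single linear layer. The dimensions, rank, and zero-initialized $B$ factor below are conventional LoRA-style choices assumed for illustration, not details from the paper (the sub-1% figure refers to the whole model, not one layer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone layer: y = W x (weights never updated after deployment).
d_in, d_out, rank = 64, 64, 4
W = rng.normal(0, 0.1, (d_out, d_in))

# Low-rank adapter: delta W = B @ A with B initialized to zero, so the
# adapted layer starts exactly equal to the frozen backbone; only A and B
# are updated by the self-supervised adaptation loop.
A = rng.normal(0, 0.1, (rank, d_in))
B = np.zeros((d_out, rank))

def forward(x):
    return W @ x + B @ (A @ x)

full = W.size
adapter = A.size + B.size
print(f"adapter params: {adapter} ({100 * adapter / full:.1f}% of the layer)")
```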
The transition to Extremely Large Antenna Arrays (ELAA) in 6G introduces significant near-field effects, necessitating robust near-field beam training strategies in multi-path environments. Because signal phases are frequently compromised by hardware impairments such as phase noise and frequency offsets, amplitude-only channel recovery is a critical alternative to coherent beam training. However, existing near-field amplitude-based training methods often assume simplistic line-of-sight conditions. Conversely, far-field phase retrieval (PR) methods lack the sensing flexibility required to optimize training efficiency and are fundamentally limited by plane-wave models, making them ill-suited for near-field propagation. We propose a two-stage sparse PR framework for amplitude-only near-field beam training in multipath channels. Stage I performs adaptive support discovery on the standard 2D DFT beamspace by exploiting a physics-guided prior induced by near-field beam patterns. Stage II then refines the channel estimate by restricting sensing and sparse PR to the learned subspace. Numerical results show that the proposed adaptive pipeline consistently outperforms non-adaptive baselines, improving beamforming gain by over 70% at low SNR.
Accurate estimation of the vehicle's sideslip angle and tire forces is essential for enhancing safety and handling performance in unknown driving scenarios. To this end, the present paper proposes an innovative observer that combines a linear single-track model with a distributed representation of the tires and information collected from standard sensors. In particular, by adopting a comprehensive representation of the tires in terms of hyperbolic partial differential equations (PDEs), the proposed estimation strategy exploits dynamical inversion to reconstruct the lumped and distributed vehicle states solely from yaw rate and lateral acceleration measurements. Simulation results demonstrate the effectiveness of the observer in estimating the sideslip angle and tire forces even in the presence of noise and model uncertainties.
The increasing penetration of renewable energy necessitates unlocking demand-side flexibility. While air conditioning (AC) systems offer significant thermal inertia, existing physical and data-driven models struggle with parameter acquisition, interpretability, and data scarcity. This paper proposes VB-NET, a physics-constrained gray-box deep learning framework that transforms complex AC thermodynamics into a standardized Virtual Battery (VB) model. We first mathematically prove the isomorphic equivalence between the AC and VB models. Subsequently, VB-NET is designed to strictly enforce physical laws by decoupling shared meteorological drivers from private building thermal fingerprints and embedding a differentiable physics layer. Experimental results demonstrate that VB-NET significantly outperforms conventional black-box models in state-of-charge tracking while successfully recovering underlying thermodynamic laws to yield physically consistent parameters. Furthermore, utilizing multi-task learning and terminal sensitivity modulation, VB-NET overcomes the cold-start dilemma, achieving high-precision modeling for new AC units using only 2% to 6% of historical data. Ultimately, this study provides an interpretable and data-efficient pathway for aggregating decentralized AC resources for grid regulation.
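The AC-to-virtual-battery mapping that VB-NET builds on can be illustrated with the standard first-order equivalent thermal parameter (ETP) model. The sketch below uses a generic textbook mapping with made-up parameter values; it is not the paper's learned model.

```python
# First-order equivalent thermal parameter (ETP) model of an AC-cooled room:
#   C dT/dt = (T_out - T) / R - eta * P
# and a common virtual-battery (VB) reading of its state, x = C * (T_set - T).
# All parameter values below are illustrative, not taken from the paper.
C, R, eta = 2.0, 2.5, 2.8       # thermal capacitance (kWh/degC), resistance (degC/kW), COP
T_set, T_out = 24.0, 32.0       # setpoint and outdoor temperature (degC)
dt = 0.25                        # time step (h)

def step(T, P):
    """One Euler step of the room temperature under AC electrical power P (kW)."""
    return T + dt * ((T_out - T) / R - eta * P) / C

T = T_set
x = []
for P in [1.5, 1.5, 0.5, 0.5]:   # a toy dispatch profile
    T = step(T, P)
    x.append(C * (T_set - T))    # VB "state of charge" in kWh-equivalent
```

A positive VB state means the room has been pre-cooled below the setpoint, i.e., flexibility has been "charged" that the grid can later discharge by throttling the AC.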
Many previous works on spike sorting treat spike classification and compression independently. In this paper, a novel algorithm called MetaSort is proposed to address both problems jointly. For compression, a novel adaptive level-crossing algorithm is proposed to approximate spike shapes with high fidelity, while the latent feature representation is used to handle classification. In addition, to make MetaSort robust and discriminative, the geometric information of the data is simultaneously exploited in the proposed framework via meta-transfer learning. Empirical experiments with in-vivo spike data demonstrate that MetaSort delivers promising performance, highlighting its potential and motivating continued development toward an ultra-low-power, on-chip implementation.
This paper investigates a downlink multi-satellite integrated sensing and communication (ISAC) network, in which multiple satellites simultaneously transmit ISAC signals to provide communication services to ground user equipments and enable cooperative sensing of airborne targets through multiple gateways. To support this dual functionality, we introduce communication and sensing beamforming designs based on uniform planar arrays with optimized power allocation. Building on these designs, we propose two cooperative sensing frameworks, namely centralized and distributed. In the centralized framework, each gateway forwards its sensing observations to a central unit (CU), where the positions of multiple targets are jointly estimated from the aggregated data using a sparse signal recovery formulation. To mitigate the signaling overhead inherent in centralized processing, a distributed framework is further proposed, in which each gateway independently estimates target positions and transmits only the local estimates to the CU. To associate estimates from different gateways, a data association problem based on the squared Euclidean distance is formulated and efficiently solved using the Hungarian algorithm. The final target positions are then obtained by minimizing the distance estimation error. Simulation results demonstrate that the proposed centralized and distributed frameworks significantly outperform existing sensing schemes while satisfying communication performance requirements. We also evaluate the sensing-communication trade-off from the viewpoints of sensing accuracy and communication power consumption under the proposed frameworks.
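The data-association step in the distributed framework above can be sketched with SciPy's Hungarian solver. The gateway estimates below are invented for illustration, and the final averaging is a simple stand-in for the paper's distance-error minimization.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical local position estimates (x, y, z in km) from two gateways,
# each observing the same three airborne targets in a different order.
est_gateway_a = np.array([[10.0, 5.0, 8.0], [22.0, 14.0, 9.5], [5.0, 30.0, 7.0]])
est_gateway_b = np.array([[21.8, 14.3, 9.4], [5.2, 29.7, 7.1], [10.1, 4.9, 8.1]])

# Cost matrix: squared Euclidean distance between every pair of estimates.
diff = est_gateway_a[:, None, :] - est_gateway_b[None, :, :]
cost = np.sum(diff ** 2, axis=-1)

# The Hungarian algorithm solves the assignment problem in polynomial time.
rows, cols = linear_sum_assignment(cost)

# Fused estimate: average of each associated pair (a simple stand-in for the
# paper's distance-estimation-error minimization).
fused = 0.5 * (est_gateway_a[rows] + est_gateway_b[cols])
```

With the toy numbers above, the solver pairs each gateway-A estimate with its obvious gateway-B counterpart, since the correct pairings cost far less than any mismatch.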
Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker's voice from a mixed audio signal using the corresponding visual cues. While most existing AVTSE methods rely exclusively on frontal-view videos, this limitation restricts their robustness in real-world scenarios where non-frontal views are prevalent. Such visual perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both single-view and multi-view inputs. Experimental results show that with single-view inputs, our framework leverages multi-view knowledge to achieve significant performance gains, while in the multi-view mode it further improves overall performance and enhances robustness. Our demo, code and data are available at this https URL.
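The pairwise outer-product fusion at the heart of MVTF can be sketched in a few lines. The embeddings and dimensions below are invented, and appending a constant 1 (a common tensor-fusion trick that preserves the unimodal terms) is an assumption rather than necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension per view (hypothetical)

# Lip embeddings from two synchronized views (e.g., frontal and profile).
frontal = rng.standard_normal(d)
profile = rng.standard_normal(d)

# Append a constant 1 so the outer product also carries each view's
# unimodal terms alongside the cross-view products.
f = np.concatenate([frontal, [1.0]])
p = np.concatenate([profile, [1.0]])

# Pairwise outer product: every multiplicative interaction between views.
fusion = np.outer(f, p)            # shape (d+1, d+1)
fused_vec = fusion.reshape(-1)     # flattened tensor fed to downstream layers
```

The last row and column of the fusion tensor reproduce the raw profile and frontal embeddings, so the downstream network can fall back on unimodal cues when one view is uninformative.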
Simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) offer a transformative approach for integrated sensing and communication (ISAC) systems, particularly for enhancing physical layer security (PLS). This paper investigates a robust, secure downlink transmission framework for a STAR-RIS-empowered multi-user multiple-input multiple-output (MU-MIMO) system, in which a multi-antenna dual-function radar and communication base station (DFRC-BS) simultaneously transmits confidential messages to multiple intended users (IUs) and performs target sensing in the presence of malicious eavesdroppers. To optimize system security, we formulate a worst-case robust beamforming problem to maximize the secrecy rate. This formulation jointly designs the active transmit beamforming at the BS and the passive reflection and transmission coefficients at the STAR-RIS, while adhering to transmit power budgets, user quality-of-service (QoS) thresholds, sensing signal-to-interference-plus-noise ratio (SINR) requirements, maximum tolerable eavesdropping leakage, and practical phase-shift constraints. To efficiently tackle the formulated problem, we develop an alternating optimization (AO) algorithm. Specifically, the S-procedure is employed to handle the semi-infinite channel uncertainty constraints, while semidefinite relaxation (SDR) and penalty convex-concave programming (CCP) are applied to obtain tractable suboptimal solutions. Extensive simulation results validate the efficacy of the proposed framework and demonstrate significant improvement in spectral efficiency compared to conventional reflecting-only RIS (R-RIS) systems under stringent sensing conditions.
Tunable input-to-state safety (TISSf) generalizes the input-to-state safety (ISSf) framework by incorporating a tuning function that regulates safety conservatism while preserving robustness against perturbations. Despite its flexibility, the TISSf tuning function is often designed without explicitly incorporating actuator limits, which can lead to incompatibility with input constraints. To address this gap, this paper proposes a framework that integrates general compact input constraints into tuning function synthesis. Leveraging a geometric perspective, we characterize the TISSf condition as a state-dependent half-space constraint and derive a verifiable certificate for input compatibility using support functions. This characterization transforms the compatibility requirement into a design constraint on the tuning function, yielding a prescriptive lower bound that defines an admissible family of tunings under input constraints. These results are specialized to norm-bounded, polyhedral, and box constraints, yielding tractable control design conditions. We show that these conditions, combined with tuning function monotonicity, guarantee input compatibility and recursive feasibility of the resulting quadratic program (QP)-based safety filter. Furthermore, an offline parameter selection procedure using a covering-based sampling strategy ensures compatibility across the entire safe set via a linear program (LP). A connected cruise control (CCC) application demonstrates robust safety under TISSf while enforcing input constraints by design.
This article considers robust cooperative output regulation of discrete-time uncertain heterogeneous (in dimension) multi-agent systems (MASs). We show that the solvability of this problem with an internal model-based distributed control law reduces to the existence of a structured control gain that makes the nominal closed-loop system matrix of the MAS Schur. Accordingly, this article focuses on global and agent-wise local sufficient conditions for the existence and design of such a structured control gain. Based on a structured Lyapunov inequality, we present a convexification that yields a linear matrix inequality (LMI), whose feasibility is a global sufficient condition for the existence and design. Considering the individual nominal dynamics of each agent, the existence is also ensured if each agent solves a structure-free control problem. Its convexification yields LMIs that allow each agent to separately design its structure-free control gain. Lastly, we study the relationships between the sets of control gains emerging from both global and local perspectives.
The environmental impact of Large Language Models (LLMs) on the data centers hosting these models is becoming a significant concern. While many efforts have focused on reducing the substantial training overhead of LLMs, carbon emissions and water consumption during the inference phase can often surpass those associated with training. The cooling systems of data centers are crucial in this context, but they are frequently modeled with a location-independent efficiency term. However, their energy efficiency is highly influenced by ambient temperature, which can vary significantly across different geographical locations. Leveraging this temperature diversity can help reduce total cooling energy costs and improve the performance of edge data centers. To address these critical sustainability issues related to LLMs, this study proposes a temperature-aware approach that co-optimizes LLM energy costs, carbon emissions, time-to-first-token, and water consumption. The approach employs a distributed optimization algorithm based on the alternating direction method of multipliers, aimed at enhancing the sustainability of LLM hosting across geo-distributed edge data centers in Australia. Our method demonstrates reductions in cooling energy consumption and improves overall cost efficiency for geo-distributed cloud environments.
This paper presents the design and simulation of a new curved monopole antenna optimized for skywave HF radar applications, with a systematic investigation of the effects of curvature and fixed-section length on antenna performance. The proposed design achieves improved impedance matching, broader bandwidth, and enhanced realized gain compared to a conventional quarter-wavelength monopole at 15 MHz. Parametric analysis shows that fully bending the monopole degrades performance, whereas introducing a straight section and carefully optimizing the curvature enables an 18.5% gain increase and a 400 kHz bandwidth expansion. The single-element design is further extended to a 12-element linear array with 0.45λ spacing (where λ is the wavelength), demonstrating stable embedded-element behavior and improved low-to-moderate elevation gain for skywave over-the-horizon radar operation. At θ = 30°, the proposed array achieves 14.04 dBi compared to 13.11 dBi for the reference array, corresponding to a 24% gain enhancement, which is significant in high-power HF radar systems. These results confirm that the proposed curved monopole antenna provides a compact, broadband, and scalable solution for next-generation HF radar arrays.
Hydrogen integration into microgrids facilitates the absorption of intermittencies from renewable energy resources. However, significant challenges remain due to complex optimization problems, particularly in large-scale applications involving multiple fuel cells (FCs) and electrolyzers (ELs) with numerous binary decision variables. This paper presents a hierarchical quantum annealing (QA) model predictive control-based power allocation framework aimed at accelerating these optimization problems. First, in a day-ahead stage, the framework determines the startup and shutdown of the FCs and ELs. The short-term stage then refines the output power of the FCs and the hydrogen generation rate of the ELs. The feasibility is evaluated through a case study consisting of multiple households in Australia. Our findings demonstrate that while the traditional optimization approach performs satisfactorily in scenarios with a small number of households, the QA approach becomes more appropriate and solves the problem to an acceptable accuracy as the number of connected households increases.
The paper documents the implementation of a novel phase-noise analysis module within the open-source QUCS circuit simulator environment. The underlying algorithm is based on a rigorous, unified time-domain methodology of (coupled) oscillator noise response, recently proposed by the authors. The theoretical approach used to develop this model is entirely unconstrained by any empirical and/or phenomenological modelling techniques, such as LTI and LTV theory, which differentiates it from all prior proposals on this topic. The paper introduces important, previously unpublished extensions to this framework, in the form of novel unified closed-form expressions for both the amplitude and phase-amplitude correlation response of a general coupled oscillating circuit perturbed by noise. The research discussed herein has many important scientific and industrial applications with respect to predicting, synthesizing, and optimizing the performance of noise-perturbed free-running and coupled autonomous circuits operating under large-signal steady-state conditions. These timing circuits are ubiquitous in all modern communication and remote-sensing systems, and the developed simulation tools are expected to have significant impact in various areas of industrial circuit design. This paper represents the second part of a two-part series, with the first part discussing the implementation of the underlying steady-state analysis module. The open-source simulator discussed and developed herein applies advanced state-of-the-art stochastic modelling techniques in order to produce noise simulation tools with capabilities and scope which, in many areas, exceed what is found in the commercial EDAs currently on the market.
This paper develops a robust control synthesis method for uncertain linear systems with input saturation in the framework of integral quadratic constraints (IQCs). The system is reformulated as a linear fractional representation (LFR) that captures both dead-zone nonlinearity and time-varying uncertainties. By combining mixed IQC-based dissipation inequalities with quadratic Lyapunov functions, sufficient conditions for robust stabilization are established. Compared with conventional approaches based on a single static sector condition for the dead-zone nonlinearity, the proposed method yields improved $\mathcal{L}_2$-gain performance through the use of scaled mixed IQCs. For systems subject to time-varying structured uncertainties, a new scaled bounded real lemma is further developed based on the IQC characterization. The resulting $\mathcal{H}_\infty$ synthesis conditions are expressed as linear matrix inequalities (LMIs), which are numerically tractable in all decision variables, including the scaling factors in the IQC multipliers. The proposed method is validated using a second-order uncertain system in linear fractional form, and its superiority over an anti-windup design is further illustrated by a cart-pendulum example.
In response to the trade-off between control performance and computational burden hindering the deployment of Deep Reinforcement Learning (DRL) in power inverters, this paper presents a novel model-free control framework leveraging policy distillation. To handle the convergence instability and steady-state errors inherent in model-free agents, an error energy-guided hybrid reward mechanism is established to theoretically constrain the exploration space. More specifically, an adaptive importance weighting mechanism is integrated into the distillation architecture to amplify the significance of fluctuation regions, ensuring high-quality transfer of transient control logic by mitigating the observational bias dominated by steady-state data. This approach efficiently compresses the heavy DRL policy into a lightweight neural network, retaining the desired control performance while overcoming the computational bottleneck during deployment. The proposed method is validated through a hardware-based kilowatt-level experimental platform. Experimental comparison results with traditional methods demonstrate that the proposed technique reduces inference time to the microsecond level and achieves superior transient response speed and parameter robustness.
In this paper we focus on the distributed quantized average consensus problem in open multi-agent systems consisting of dynamic directed communication links among active nodes. We propose three communication-efficient distributed algorithms designed for different scenarios. Our first algorithm solves the quantized averaging problem over the currently active node set under finite network openness (i.e., when the active set eventually stabilizes). Our second algorithm extends the aforementioned approach for the case where nodes suffer from arbitrary bounded processing delays. Our third algorithm operates over indefinitely open multi-agent networks with dynamic communication links (i.e., with continuous node arrivals and departures), computing the average that incorporates both active and historically active nodes. We analyze our algorithms' operation, establish their correctness, and present novel necessary and sufficient topological conditions ensuring their finite-time convergence. Numerical simulations on distributed sensor fusion for environmental monitoring demonstrate fast finite-time convergence and robustness across varying network sizes, departure/arrival rates, and processing delays. Finally, it is shown that our proposed algorithms compare favorably to algorithms in the existing literature.
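Quantized average consensus itself can be illustrated with the classical integer-preserving pairwise averaging step. This is a generic textbook scheme shown only to fix ideas; the paper's algorithms additionally handle directed links, bounded processing delays, and node arrivals and departures.

```python
# Classical quantized gossip: a meeting pair (a, b) is replaced by
# (floor((a+b)/2), ceil((a+b)/2)), so the network sum is invariant and all
# states stay integer. Repeated sweeps drive every state to within one
# quantization level of the true average.
values = [10, 0, 3, 7, 5]          # integer node states, sum = 25, average = 5
n = len(values)

for _ in range(100):               # repeated deterministic sweeps over all pairs
    for i in range(n):
        for j in range(i + 1, n):
            s = values[i] + values[j]
            values[i], values[j] = s // 2, s - s // 2  # floor/ceil split
```

Each balancing of a pair differing by two or more strictly decreases the sum of squares, so the sweeps terminate at a state where all values lie within one unit of the average; here the sum is exactly divisible, so every node reaches 5.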
Stacked intelligent metasurfaces (SIMs) have recently emerged as a key enabler for realizing electromagnetic wave-domain signal processing in next-generation wireless networks. However, practical SIM implementations often suffer from noticeable mismatches between theoretical models and measured responses due to fabrication and assembly imperfections. This article systematically investigates the problem of interlayer error calibration in SIMs. We first classify representative modeling and hardware-induced imperfections. Then, we outline the major challenges in SIM calibration and further develop a general framework that integrates a calibration protocol with the relevant solution strategies. Moreover, we investigate the effectiveness of the multi-stage calibration approach in mitigating geometric deviations and improving the alignment between the calibrated and practical propagation coefficients. Finally, we elaborate on key research opportunities and practical challenges toward realizing physically consistent and hardware-compliant SIM implementations for future research.
This paper presents a systematic framework for computing formally guaranteed trajectory tracking error bounds for autonomous helicopters based on Robust Positive Invariant (RPI) sets. The approach focuses on establishing closed-loop translational error dynamics, which are cast into a polytopic linear parameter-varying form with bounded additive and state-dependent disturbances. Ellipsoidal RPI sets are computed, yielding explicit position error bounds suitable as certified buffer zones in upper-level trajectory planning. Three controller architectures are compared with respect to the conservatism of their error bounds and tracking performance. Simulation results on a nonlinear helicopter model demonstrate that all architectures respect the derived bounds, while highlighting trade-offs between dynamical fidelity and conservatism in invariant set computation.
The proliferation of smart and autonomous systems has motivated a shift toward executing intelligence directly on edge devices. This shift becomes particularly challenging for zero-energy devices (ZEDs), where severe constraints on memory, energy availability, and inference accuracy must be addressed simultaneously. In this paper, we present a unified approach to managing these constraints for smart ZEDs. Specifically, we design, train, and deploy a tiny machine learning (TinyML) model for person detection on a ZED. The proposed architecture stores a single model in memory while enabling adaptive inference through multiple exit points, allowing computational effort to scale with input difficulty. As a result, low-energy inference is performed for easy instances, while higher-precision inference is selectively employed for harder cases. This strategy significantly reduces energy consumption without sacrificing detection accuracy. Furthermore, to enhance device autonomy and prevent power failures, we introduce auxiliary energy-aware circuits that dynamically regulate system operation based on available energy. Compared with a state-of-the-art energy-aware single-exit TinyML approach, the proposed method achieves an energy consumption reduction of approximately $29.6\%$. Overall, the proposed framework is appealing for enabling accurate and energy-efficient intelligence on ZED platforms.
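The adaptive multi-exit inference described above can be sketched as a confidence-thresholded loop over backbone stages. The stages, heads, and threshold below are hypothetical stand-ins, not the deployed person-detection model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_exit_infer(x, stages, exit_heads, threshold=0.9):
    """Run backbone stages in order; stop at the first confident exit head.

    `stages` and `exit_heads` are hypothetical callables standing in for the
    shared-backbone blocks and their attached lightweight classifiers.
    """
    h = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        h = stage(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:        # easy input: exit early, save energy
            return int(np.argmax(probs)), i
    return int(np.argmax(probs)), len(stages) - 1  # hard input: ran everything

# Toy demo: the first head is undecided, the second is confident.
stages = [lambda h: h, lambda h: h]
heads = [lambda h: np.array([0.1, 0.2]),    # softmax max ~0.52 < threshold
         lambda h: np.array([5.0, 0.0])]    # softmax max ~0.99 >= threshold
pred, exit_idx = multi_exit_infer(np.zeros(4), stages, heads)
```

Because only the stages actually executed consume energy, easy inputs pay for one block while hard inputs pay for the full network, which is the scaling behavior the abstract describes.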
Agile earth observation satellites employ multiple actuators to enable flexible and responsive imaging capabilities. While significant advancements in actuator technology have enhanced satellites' torque and momentum, relatively little attention has been given to control strategies specifically tailored to improve satellite agility. This paper provides a comparative analysis of different Model Predictive Control (MPC) formulations and introduces an augmented-MPC method that effectively balances agility requirements with hardware implementation constraints. The proposed method achieves the high-performance characteristics of nonlinear MPC while preserving the computational simplicity of linear MPC. Numerical simulations and physical experiments are conducted to validate the effectiveness and feasibility of the proposed approach.
Multilingual speaker verification (SV) remains challenging due to limited cross-lingual data and language-dependent information in speaker embeddings. This paper presents a language-invariant multilingual SV system for the TidyVoice 2026 Challenge. We adopt the multilingual self-supervised w2v-BERT 2.0 model as the backbone, enhanced with Layer Adapters and Multi-scale Feature Aggregation to better exploit multi-layer representations. A language-adversarial training strategy with a Gradient Reversal Layer is applied to promote language-invariant speaker embeddings. Moreover, a multilingual zero-shot text-to-speech system is used to synthesize speech in multiple languages, improving language diversity. Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness. In addition, synthetic speech augmentation provides additional gains under limited training data conditions. Source code is available at this https URL.
This paper presents a novel symbiotic radio system for integrated sensing and backscatter communication (ISABC) that enables signal-domain interference-free coexistence of the primary communication signal and the backscatter communication (BC) signal within the same spectrum. The proposed system design allows simultaneous sensing of backscatter devices (BDs) and data transmission without mutual interference by exploiting waveform-domain orthogonality between orthogonal frequency division multiplexing (OFDM) and affine frequency division multiplexing (AFDM) signals. Specifically, a chirp-based AFDM waveform is adopted due to its inherent processing gain, which enhances the detectability and reliability of the weak backscatter signal while simultaneously supporting high-resolution sensing. Unlike conventional methods that attempt to suppress direct-link interference (DLI), this approach embeds the backscatter transmission within the affine domain while maintaining reliable OFDM-based primary communication. Furthermore, by assigning distinct affine-domain shifts to each backscatter device, the proposed framework inherently suppresses inter-backscatter device interference (IBDI). Comprehensive simulation results demonstrate that the proposed coexistence scheme effectively mitigates interference without affecting the error rate of the primary link and improves the miss-detection probability performance of the BC, making it a promising candidate for future low-power and interference-resilient systems.
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
This paper introduces a method for detecting, estimating, and localising a soft fault in wired communication networks. The proposed method is based on analysing the transmission coefficients (TC) in the time domain under both fault-free and faulty situations. An orthogonal frequency-division multiplexing (OFDM)-based scheme is used to estimate the TC. A fault-severity ratio is derived to estimate the fault intensity, while a residual-based function is proposed to determine its location. Experimental validation is carried out on a Y-shaped test setup to demonstrate the efficiency of the proposed approach.
In this paper, we propose a novel estimator of the instantaneous frequencies (IFs) of the modes making up multicomponent signals (MCSs). We are particularly interested in noisy MCSs containing close modes in the time-frequency plane. Although the Prony approach can be adapted to estimate IFs in such situations, interference between the modes generates oscillations in the resulting estimates. After investigating the nature of these oscillations, we propose an algorithm based on spline approximation to remove them from the IF estimates. Numerical experiments in various situations illustrate the benefit of combining the Prony technique with spline approximation for IF estimation in noisy MCSs containing close modes.
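The spline-based smoothing step can be illustrated with a generic smoothing spline applied to an oscillatory IF estimate. The signal, ripple model, and smoothing parameter below are invented for illustration and do not reproduce the paper's Prony-based estimator.

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Hypothetical ridge: a linear-chirp IF corrupted by interference-induced
# oscillations plus noise (a stand-in for a raw Prony-based IF estimate).
t = np.linspace(0.0, 1.0, 400)
true_if = 50.0 + 30.0 * t                              # ground-truth IF (Hz)
rng = np.random.default_rng(1)
raw_if = (true_if + 2.0 * np.sin(2 * np.pi * 25 * t)   # mode-interference ripple
          + 0.3 * rng.standard_normal(t.size))         # measurement noise

# Smoothing spline: `s` bounds the sum of squared residuals, so a large
# enough value lets the fit ignore the ripple and track only the trend.
tck = splrep(t, raw_if, s=2000.0)
smooth_if = splev(t, tck)

rmse_raw = float(np.sqrt(np.mean((raw_if - true_if) ** 2)))
rmse_spline = float(np.sqrt(np.mean((smooth_if - true_if) ** 2)))
```

Because the ripple is fast and roughly zero-mean, the smooth fit averages it out and lands much closer to the underlying IF than the raw estimate does.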
Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations. We propose Skill-Evolving grounded Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions. Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%.
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
Near-field (NF) passive radar imaging depends on the illumination of the imaging scene by a non-cooperative transmitter (Tx). It is demonstrated that combining imaging results obtained with Tx antennas at different positions can enhance the performance of passive radar imaging. On the one hand, multiple Tx antennas provide diverse illumination perspectives, reducing the likelihood of unilluminated regions on the targets of interest (TOIs). On the other hand, the coherent summation of imaging results obtained for different illuminations helps to suppress potential artifacts. This approach is in particular advantageous for imaging complex objects with concave structures such as dihedral arrangements, where the ghosts due to multiple reflections are highly configuration-dependent. For each illuminating configuration, a single-frequency inverse source solver is utilized to reconstruct the equivalent sources of the TOIs and the resulting single-frequency images are then superimposed coherently with corresponding phase and magnitude correction methods. The obtained multi-frequency images are finally coherently combined to enhance the imaging quality. Both simulation and measurement results are presented to validate the effectiveness of the approach.
The extension of 5G connectivity through Low-Earth Orbit satellite systems introduces significant technical challenges, particularly due to time-varying propagation delays and high Doppler shifts resulting from satellite motion. While the Third Generation Partnership Project Release 17 established the initial framework for non-terrestrial networks, the ongoing developments in Release 19 further enhance this effort by introducing support for regenerative payload architectures, where part of the communication protocol stack is processed directly on board the satellite. In this work, we present the design of a 5G user equipment adapted for Low-Earth Orbit satellite connectivity, with specific focus on strategies for managing variable delay and Doppler compensation. Additionally, we describe a custom experimental platform based on a drone-mounted software-defined radio platform capable of emulating both transparent and regenerative satellite payloads. Although full end-to-end system validation is not yet complete, initial laboratory tests confirm the feasibility of the architecture and lay the groundwork for future experimental campaigns.
Paralinguistic speech tasks are often considered relatively language-agnostic, as they rely on extralinguistic acoustic cues rather than lexical content. However, prior studies report performance degradation under cross-lingual conditions, indicating non-negligible language dependence. Still, these studies typically focus on isolated language pairs or task-specific settings, limiting comparability and preventing a systematic assessment of task-level language dependence. We introduce the Cross-Lingual Transfer Matrix (CLTM), a systematic method to quantify cross-lingual interactions between pairs of languages within a given task. We apply the CLTM to two paralinguistic tasks, gender identification and speaker verification, using a multilingual HuBERT-based encoder, to analyze how donor-language data affects target-language performance during fine-tuning. Our results reveal distinct transfer patterns across tasks and languages, reflecting systematic, language-dependent effects.
The revolutionary convergence of fluid antenna systems (FAS) and reconfigurable intelligent surfaces (RIS) creates unprecedented opportunities for secure wireless communications, yet the practical implications of hardware impairments on this promising combination remain largely unexplored. This paper investigates the security performance of non-orthogonal multiple access (NOMA) systems when fluid antennas (FAs) meet intelligent surfaces under realistic hardware constraints. We develop a comprehensive analytical framework that captures the complex interplay between adaptive spatial diversity, intelligent signal reflection, and hardware-induced distortions in short-packet communications. Through novel piecewise linear approximations and block-correlation models, we derive tractable expressions for average secure block error rate (BLER) that reveal fundamental performance limits imposed by hardware impairments. Our analysis demonstrates that while the synergy between FAs and intelligent surfaces offers remarkable degrees of freedom for security enhancement, practical hardware imperfections create performance ceilings that persist regardless of spatial diversity gains. The theoretical framework exposes critical design trade-offs between system complexity and achievable security performance, showing that hardware quality becomes a decisive factor in realizing the full potential of FAS-RIS architectures. Extensive simulations validate our analytical insights and provide practical design guidelines for implementing secure NOMA systems that effectively balance the benefits of fluid-intelligent cooperation against the constraints of realistic hardware limitations.
European Member States are increasingly introducing national capacity mechanisms (CMs) to manage growing adequacy risks. However, isolated national CMs are inefficient in highly interconnected electricity systems, such as the European system. While progress has been made in facilitating cross-border participation by generation capacity in CMs, existing arrangements are prone to under- or over-investment and do not properly value the contribution of interconnection capacity to Member States' adequacy targets. In this paper, we propose a novel conceptual design for a coupled European capacity market that utilises the logic of flow-based market coupling. In a comparative analysis of different market design scenarios in an illustrative multi-zone case study, using a bespoke long-run equilibrium problem, we show that the proposed flow-based coupling of capacity markets reduces system costs by harnessing available capacity in neighbouring market zones while ensuring deliverability with respect to network constraints in all scarcity situations.
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and far less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
This paper presents an adaptive control framework for Euler-Lagrange (E-L) systems that enforces user-defined time-varying state and input constraints in the presence of parametric uncertainties and bounded disturbances. The proposed design integrates a time-varying barrier Lyapunov Function (TVBLF) with a saturated control law to guarantee constraint satisfaction without resorting to real-time optimization. A key contribution is the development of an offline, verifiable feasibility condition that certifies the existence of a feasible control policy for any prescribed pair of time-varying state and input envelopes. Additionally, we prove boundedness of all closed-loop signals. Real-time experiments conducted on a 2-DoF helicopter model validate the efficacy and practical viability of the proposed method.
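As a point of reference for the barrier construction (a generic log-type form commonly used in the BLF literature, not necessarily the paper's exact TVBLF), a tracking error $e$ can be confined to a prescribed time-varying envelope $b(t)$ via

```latex
V(e,t) \;=\; \frac{1}{2}\,\ln\!\frac{b^{2}(t)}{b^{2}(t)-e^{2}}, \qquad |e(0)| < b(0).
```

Because $V \to \infty$ as $|e| \to b(t)$, proving boundedness of $V$ along closed-loop trajectories implies $|e(t)| < b(t)$ for all $t \ge 0$; this is the mechanism by which barrier-Lyapunov arguments turn signal boundedness into constraint satisfaction.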
Nowadays, the energy transition is ongoing in many countries, aiming to reduce dependence on fossil fuels and CO2 emissions. Besides its positive environmental impacts, this transition brings technical challenges to system operators, such as the intricacies of energy system integration, mitigating uncertainty, and incentivizing customers with advanced transaction models. Coordination between the Transmission system operator (TSO) and the Distribution system operator (DSO) is one of the most important aspects of overcoming these obstacles. This coordination enhances the utilization of flexibility from Distributed energy resources (DERs) by incentivizing market parties with better willingness-to-pay schemes. This paper provides an overview of coordination schemes (CS), their classification, an assessment of the current situation, and the challenges associated with applying these schemes in a practical context. The main purpose is to investigate the most effective way for TSOs/DSOs to use flexibility resources to maintain the balance of the entire system while ensuring no congestion occurs in the network. A broad range of possible coordination schemes, along with the flexibility services they exploit, is presented, and their pros and cons are analyzed. Additionally, the study presents a general scenario that describes the interaction between the operators and a third party in providing services to the balancing market, considering cases with and without coordination.
Electrocardiogram (ECG) analysis is vital for detecting cardiac abnormalities, yet robust automated classification is challenging due to the complexity and variability of physiological signals. In this work, we investigate transformer-based ECG classification using features derived from the Koopman operator and wavelet transforms. Two tasks are studied: (1) binary classification (Normal vs. Non-normal), and (2) four-class classification (Normal, Atrial Fibrillation, Ventricular Arrhythmia, Block). We use Extended Dynamic Mode Decomposition (EDMD) to approximate the Koopman operator. Our results show that wavelet features excel in binary classification, while Koopman features, when paired with transformers, achieve superior performance in the four-class setting. A simple hybrid of Koopman and wavelet features does not improve accuracy. However, selecting an appropriate EDMD dictionary -- specifically a radial basis function dictionary with tuned parameters -- yields significant gains, surpassing the wavelet-only baseline and the hybrid wavelet-Koopman system. We also present a Koopman-based reconstruction analysis for interpretable insights into the learned dynamics and compare against a recurrent neural network baseline. Overall, our findings demonstrate the effectiveness of Koopman-based feature learning with transformers and highlight promising directions for integrating dynamical systems theory into time-series classification.
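The EDMD step in this pipeline can be sketched in a few lines (an illustrative toy, not the authors' implementation; the RBF dictionary, data, and `gamma` below are placeholder choices):

```python
import numpy as np

def rbf_dictionary(X, centers, gamma):
    # Lift states with Gaussian RBFs plus a constant and the state itself.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(X), 1)), X, np.exp(-gamma * d2)])

def edmd(X, Y, centers, gamma):
    # Least-squares Koopman approximation: Psi(Y) ~= Psi(X) @ K
    PX = rbf_dictionary(X, centers, gamma)
    PY = rbf_dictionary(Y, centers, gamma)
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K

# toy linear system x_{k+1} = A x_k; its dynamics lie in the dictionary span
A = np.array([[0.9, 0.1], [0.0, 0.8]])
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
Y = X @ A.T
centers = rng.uniform(-1, 1, (10, 2))
K = edmd(X, Y, centers, gamma=1.0)   # (13, 13) Koopman matrix approximation
```

Spectral analysis of `K` (eigenvalues, eigenfunctions) then provides the Koopman features fed to the classifier; dictionary choice, as the abstract notes, matters greatly.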
State-space analysis is widely employed for examining power system dynamics but faces challenges in large-scale power systems integrated with numerous inverter-based resources (IBRs), where the significant increase of system states complicates modal analysis. Notably, renewable energy power systems often consist of multiple homogeneous generation units. This uniformity, termed symmetry in this paper, can facilitate the system stability analysis. Eigenvalue patterns and participation factors in three types of symmetric renewable energy power systems are investigated, including ideally-, quasi-, and group-symmetric systems. An ideally-symmetric (quasi-symmetric) system comprises a group of identical (similar) subsystems connected to an external grid. A system containing multiple such groups is termed group-symmetric. In these symmetric systems, two types of modes are defined to characterize different interactions: inner-group modes, which describe the interactions among subsystems within a single group, and group-grid modes, which describe the interactions between the groups and the external grid. A new concept termed group participation factor is also proposed to extend the use of conventional participation factors for repeated and close modes. In addition, the invariance properties of the inner-group modes and group-grid modes are discussed. The findings provide insights for stability analysis and targeted optimization in power systems. Theoretical advances are validated through numerical results and electromagnetic transient (EMT) simulations on example power systems of varied types and scales.
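The repeated-eigenvalue structure behind inner-group modes can be seen in a toy example (hypothetical first-order subsystems standing in for the paper's IBR models): N identical units coupled only through a common grid state yield the subsystem eigenvalue with multiplicity N-1, plus two group-grid modes from a reduced 2x2 system.

```python
import numpy as np

# N identical first-order subsystems coupled only through one grid state:
#   x_i' = a*x_i + b*z,    z' = c*z + d*sum_i(x_i)
a, b, c, d, N = -1.0, 0.5, -2.0, 0.2, 5
A = np.zeros((N + 1, N + 1))
A[:N, :N] = a * np.eye(N)   # identical subsystems
A[:N, N] = b                # grid-to-subsystem coupling
A[N, :N] = d                # subsystem-to-grid coupling
A[N, N] = c

eig = np.linalg.eigvals(A)
# inner-group modes: eigenvalue a repeated N-1 times (subsystem differences
# decouple from the grid); the remaining two are group-grid modes of the
# reduced matrix [[a, b], [N*d, c]] acting on the symmetric coordinate.
inner_multiplicity = int(np.sum(np.isclose(eig, a)))
```

Participation of the grid state in the repeated modes is zero, which is exactly the situation where conventional participation factors become ambiguous and a group-level notion helps.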
Purpose/Objective: Brain tumors result in 20 years of lost life on average. Standard therapies induce complex structural changes in the brain that are monitored through MRI. Recent developments in artificial intelligence (AI) enable conditional multimodal image generation from clinical data. In this study, we investigate AI-driven generation of follow-up MRI in patients with in- tracranial tumors through conditional image generation. This approach enables realistic modeling of post-radiotherapy changes, allowing for treatment optimization. Material/Methods: The public SAILOR dataset of 25 patients was used to create a 2D rectified flow model conditioned on axial slices of pre-treatment MRI and RT dose maps. Cross-attention conditioning was used to incorporate temporal and chemotherapy data. The resulting images were validated with structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Dice scores and Jacobian determinants. Results: The resulting model generates realistic follow-up MRI for any time point, while integrating treatment information. Comparing real versus predicted images, SSIM is 0.88, and PSNR is 22.82. Tissue segmentations from real versus predicted MRI result in a mean Dice-Sørensen coefficient (DSC) of 0.91. The rectified flow (RF) model enables up to 250x faster inference than Denoising Diffusion Probabilistic Models (DDPM). Conclusion: The proposed model generates realistic follow-up MRI in real-time, preserving both semantic and visual fidelity as confirmed by image quality metrics and tissue segmentations. Conditional generation allows counterfactual simulations by varying treatment parameters, producing predicted morphological changes. This capability has potential to support adaptive treatment dose planning and personalized outcome prediction for patients with intracranial tumors.
In mixed near-field and far-field systems, the nonorthogonality between near-field and far-field channels may cause severe inter-user interference and hence degrade rate performance, when the analog beamforming is designed based on the low-complexity full-array maximum ratio transmission (MRT). To tackle this issue, we propose in this paper an antenna selection-based transmission framework to effectively suppress mixed-field interference without mechanically altering antenna structures. To this end, an optimization problem is formulated to maximize the sum-rate of mixed-field systems, by jointly designing antenna selection and power allocation under the MRT-based analog beamforming. As the problem is non-convex and generally difficult to solve optimally, we first consider a typical two-user scenario to obtain useful insights. Interestingly, we analytically show that the strong mixed-field interference can be substantially mitigated by deactivating only a small portion of antennas, without compromising array gains too much. Moreover, an inherent trade-off is revealed in antenna selection between interference suppression and array-gain enhancement, based on which a suboptimal number of deactivated antennas for achieving the maximum sum-rate is obtained. Next, for the general multi-user case, we develop an efficient penalty dual decomposition (PDD)-based two-layer framework to obtain a high-quality solution by using the block coordinate descent (BCD) and successive convex approximation (SCA) techniques. To further reduce the computational complexity, a low-complexity antenna deactivation strategy is proposed capitalizing on an interference suppression criterion. Finally, numerical results demonstrate that the proposed scheme achieves a favorable trade-off between interference suppression and array gain loss, hence achieving significant performance gains over various baseline schemes.
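The interference-suppression-by-deactivation idea can be illustrated with a toy greedy search (i.i.d. random channels and per-user MRT stand in for the paper's mixed-field channel model and PDD/BCD machinery):

```python
import numpy as np

def sum_rate(h_list, active, p, sigma2=1.0):
    # per-user MRT beamformers over the active antennas only
    w = [h[active] / np.linalg.norm(h[active]) for h in h_list]
    total = 0.0
    for k, hk in enumerate(h_list):
        sig = p[k] * abs(hk[active].conj() @ w[k]) ** 2
        intf = sum(p[j] * abs(hk[active].conj() @ w[j]) ** 2
                   for j in range(len(h_list)) if j != k)
        total += np.log2(1 + sig / (intf + sigma2))
    return total

rng = np.random.default_rng(1)
M, K = 32, 2
h = [rng.normal(size=M) + 1j * rng.normal(size=M) for _ in range(K)]
p = np.ones(K)

# greedy: deactivate one antenna at a time while it improves the sum rate
active = np.ones(M, bool)
improved = True
while improved and active.sum() > 1:
    improved = False
    base = sum_rate(h, active, p)
    for m in np.flatnonzero(active):
        trial = active.copy()
        trial[m] = False
        if trial.any() and sum_rate(h, trial, p) > base:
            active, improved = trial, True
            base = sum_rate(h, active, p)
```

By construction the search only accepts improvements, so the final sum rate is never below the full-array MRT baseline, mirroring the trade-off the paper analyzes between interference suppression and array-gain loss.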
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
Radio frequency fingerprint identification (RFFI) exploits device-specific hardware impairments for transmitter recognition, but its performance is highly vulnerable to receiver variations and changing wireless channels in cross-receiver deployment. To address both challenges, this paper proposes a novel cross-receiver RFFI framework with channel robustness. In the enrollment stage, a channel-robust preprocessing method is developed to construct denoised spectral quotient (DSQ) sequences, and a DSQ-based convolutional neural network (DSQCNN) is trained using data collected from the source receiver. In the cross-receiver deployment stage, a calibration dataset is built from signals captured by both the source and target receivers, and a trainable calibration neural network (TCNN) is designed to learn the nonlinear mapping between them. The cascaded TCNN-DSQCNN framework then enables robust transmitter classification on the target receiver under varying channel conditions. To the best of our knowledge, this is the first work to jointly address channel and receiver portability through combined channel suppression and nonlinear receiver calibration. Simulations with twelve WiFi transmitters and three receivers show that the proposed method achieves reliable cross-receiver classification, reaching over 90\% accuracy at an SNR of 24 dB.
This paper proposes a subspace fusion sensing algorithm for cooperative integrated sensing and communication. First, we stack the received signals from access points (APs) into a third-order tensor and construct the equivalent virtual antenna (EVA) array via tensor unfolding. Then, a data association-free subspace-based fusion sensing algorithm is developed utilizing the EVA arrays from distributed APs. A derivation of Cramer-Rao lower bound (CRLB) is also presented. Finally, simulation results validate the effectiveness of the proposed algorithm compared to traditional techniques.
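The tensor-unfolding step can be illustrated with a standard mode-n unfolding in NumPy (an illustrative sketch; the dimensions and the paper's exact EVA construction are stand-ins here):

```python
import numpy as np

def unfold(T, mode):
    # mode-n unfolding: move axis `mode` to the front, flatten the rest
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# stack received snapshots from several APs into a third-order tensor
T = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # (AP, antenna, sample)
V = unfold(T, 1)  # antennas on rows: a "virtual array" view across APs
```

Subspace methods (e.g. eigendecomposition of the sample covariance of such an unfolded matrix) then operate on the virtual array without per-AP data association.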
Connected autonomous vehicles (CAVs) require reliable and efficient communication frameworks to support safety-critical and task-oriented applications such as collision avoidance, cooperative perception, and traffic risk assessment. Traditional communication paradigms, which focus on transmitting raw bits, often incur excessive bandwidth consumption and fail to preserve the semantic relevance of transmitted information. To bridge this gap, we propose a Graph-Based Semantic Encoder-Decoder (GBSED) architecture tailored for task-oriented communications in CAV networks. The encoder leverages scene graphs to capture spatial and semantic relationships among road entities, combined with a semantic compression algorithm that reduces the size of the extracted graph-based representations by up to 99% compared to raw images, while the decoder reconstructs task-relevant representations rather than raw data. This design enables a significant reduction in communication overhead while maintaining high semantic fidelity, exceeding 0.9 at SNR levels above 10 dB, for downstream vehicular tasks. We evaluate the proposed framework through simulations in autonomous driving scenarios, where the semantic encoder and decoder are integrated into a MIMO OFDM physical layer system. The results demonstrate high prediction success rates for risk assessment, improved robustness under the 3GPP CDL channel, and significant compression gains, confirming that the proposed semantic communication framework is a promising solution for future 6G systems.
Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Such methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an underlying Lagrangian structure, to train the dynamics model within a Dyna-based MBRL framework. Furthermore, we train the LNN using stochastic gradient-based and state-estimation-based optimizers to learn the network's weights. The state-estimation-based method converges faster than the stochastic gradient-based method during neural network training. Simulation results are provided to illustrate the effectiveness of the proposed LNN-based Dyna framework for MBRL.
This letter investigates multi-mode pinching antenna systems (PASS), where signals of multiple orthogonal modes can be transmitted within a dielectric waveguide and radiated by pinching antennas (PAs). This enables mode-domain multiplexing for efficient multi-user communications using a single waveguide. In particular, two operating protocols are proposed, namely mode selection and mode combining. Mode selection enforces each PA to predominantly radiate signal power of one single mode, while mode combining allows each PA to flexibly radiate power of multiple modes. Based on the two protocols, a sum rate maximization problem is formulated for multi-mode PASS-enabled multi-user downlink communications, where the transmit beamforming, PA positions, and PA propagation constants are jointly optimized. To address this rapidly oscillating and highly nonconvex problem, a particle swarm optimization (PSO) based Karush-Kuhn-Tucker (KKT)-parameterized beamforming (PSO-KPBF) algorithm is proposed. KKT-conditioned solutions are exploited to guide the swarm search, thus reducing the search space and achieving fast convergence. Numerical results demonstrate that: 1) Even using a simple uniform mode-combining design, the multi-mode PASS significantly outperform conventional single-mode PASS and hybrid beamforming systems; and 2) Mode combining achieves high spectral efficiency, while mode selection approximates its performance with a lower hardware complexity. Code is released at this https URL
This paper presents positive initial evidence that generative agents can relax the rigidity of traditional mathematical models for human decision-making in power dispatch and auction settings. We design two proof-of-concept energy experiments with generative agents powered by a large language model (LLM). First, we construct a home battery management testbed with stochastic electricity prices and blackout interventions, and benchmark LLM decisions against dynamic programming. By incorporating an in-context learning (ICL) module, we show that behavioral patterns discovered by a stronger reasoning model can be transferred to a smaller LLM via example-based prompting, leading agents to prioritize post-blackout energy reserves over short-term profit. Second, we study LLM agents in simultaneous ascending auctions (SAA) for power network access, comparing their behavior with an optimization benchmark, the straightforward bidding strategy. By designing ICL prompts with rule-based, myopic, and strategic objectives, we find that structured prompting combined with ICL enables LLM agents to both reproduce economically rational strategies and exhibit systematic behavioral deviations. Overall, these results suggest that LLM-powered agents provide a flexible and expressive testbed for modeling human decision-making in power system applications.
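The dynamic-programming benchmark for such a battery testbed can be sketched as a finite-horizon recursion (the capacities, horizon, and price scenarios below are invented for illustration; blackout interventions are omitted):

```python
# finite-horizon DP for a toy home battery arbitrage benchmark
# state: battery level s in {0..C}; action a: +1 sell, 0 idle, -1 buy one unit
C, T = 4, 6
prices = [[2, 6], [3, 5], [1, 9], [4, 4], [2, 8], [5, 5]]  # equiprobable scenarios per step

V = [0.0] * (C + 1)                      # terminal value: leftover energy worth 0
for t in reversed(range(T)):
    newV = []
    for s in range(C + 1):
        ev = 0.0
        for price in prices[t]:          # price observed before acting
            ev += max(a * price + V[s - a]
                      for a in (-1, 0, 1) if 0 <= s - a <= C)
        newV.append(ev / len(prices[t]))
    V = newV                             # V[s]: optimal expected profit from level s
```

Comparing an LLM agent's realized profit (and its post-blackout reserve behavior) against such a DP optimum is one concrete way to quantify the behavioral deviations the paper studies.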
We introduce a task-relative taxonomy of actuator inputs for nonlinear systems within the input-output feedback-linearization framework. Given a flat output specifying the task, inputs are classified as essential, redundant, or dexterity: essential inputs are required for exact linearization, redundant inputs can be removed without effect, and dexterity inputs can be deactivated while preserving exact linearization of a reduced task. We show that a subset is dexterity if and only if, under a suitable dynamic prolongation, it can appear as additional output channels (flat-input complement) on a common validity set. Whenever a family of systems obtained by (de)activating dexterity inputs admits a common prolongation, the family can be interpreted as a single prolonged system endowed with different output selections. This enables a unified linearizing controller that negotiates between full and reduced tasks without transients on shared outputs under compatibility and dwell-time conditions. Simulations on a fully actuated aerial platform illustrate graceful task downgrades from six-dimensional pose tracking as lateral-force channels are deactivated.
We investigate the impact of mode-dependent loss (MDL) on the statistics of the signal-to-noise ratio (SNR) in coupled-core multi-core fiber (CC-MCF) systems. Through numerical simulations and theoretical analysis, we present an in-depth study of the impact of MDL on received amplified spontaneous emission (ASE) noise and nonlinear interference (NLI), as well as their joint contribution to the SNR. We show that MDL induces different statistics on the two noises and discuss the differences with single-mode polarization-dependent loss. Moreover, we investigate the impact of spatial mode dispersion (SMD) on the MDL-induced impairment, offering insights into their joint effects on ASE and NLI.
We propose a reachability-based framework for reliable LLM-guided human-autonomy teaming (HAT) using signal temporal logic (STL). In the proposed framework, an LLM is leveraged as a translator that converts natural-language commands given by a human operator into corresponding STL specifications, and vice versa. An STL feasibility filter (SFF) is proposed to check the feasibility of the generated STL. The SFF first decomposes the complex and nested LLM translation into a set of simpler subformulas for parallelization and informative feedback generation. The reachability analysis method is then applied to verify whether each subformula is feasible for a target dynamical system: if feasible, mission planning proceeds; otherwise, the subformula is rejected. The proposed SFF can identify infeasible subformulas, rather than simply providing a Boolean verification result for the whole STL formula, thereby enabling the LLM to generate feedback requesting that the human operator modify the command. Consequently, the proposed framework allows more reliable HAT by enabling safe and informative communication between the human operator and the autonomous agent. Our experiments demonstrate that the proposed framework can successfully filter out infeasible subformulas and generate informative feedback based on such information.
Tackling climate change requires the rapid and deep decarbonization of electric power systems. While energy management systems (EMSs) play a central role in this transition, conventional EMSs focus mainly on economic efficiency and often overlook the environmental impact of operational decisions. To address this gap, this paper proposes a unified, real-time building-level carbon-aware EMS (CAEMS) capable of co-optimizing grid imports, energy storage, and flexible demand within a single integrated framework. We formulate a mixed-integer linear program (MILP) model that directly integrates time-varying marginal carbon intensity signals into the EMS objective for coordinated participation in both day-ahead (DA) and real-time (RT) markets. To relax the unrealistic assumption of perfect foresight, we incorporate a model predictive control (MPC) extension driven by a Transformer-based forecaster that jointly predicts electricity prices and carbon intensity. The proposed CAEMS is validated using real-world data from the PJM electricity market. Simulation results demonstrate that modest carbon prices can achieve a significant 22.5% reduction in emissions with only a 1.7% increase in cost.
Aggregate size and shape are key properties for determining quality of aggregate materials used in road construction and transportation geotechnics applications. The composition and packing, layer stiffness, and load response are all influenced by these morphological characteristics of aggregates. Many aggregate imaging systems developed to date only focus on analyses of individual or manually separated aggregate particles. There is a need to develop a convenient and affordable system for acquiring 3D aggregate information from stockpiles in the field. This paper presents an innovative 3D imaging approach for potential field evaluation of large-sized aggregates, whereby engineers can perform inspection by taking videos/images with mobile devices such as smartphone cameras. The approach leverages Structure-from-Motion (SfM) techniques to reconstruct the stockpile surface as 3D spatial data, i.e., a point cloud, and uses a 3D segmentation algorithm to separate and extract individual aggregates from the reconstructed stockpile. The preliminary results presented in this paper demonstrate the future potential of using 3D aggregate size and shape information for onsite Quality Assurance/Quality Control (QA/QC) tasks.
Safe navigation of autonomous robots remains one of the core challenges in the field, especially in dynamic and uncertain environments. One of the prevalent approaches is safety filtering based on control barrier functions (CBFs), which are easy to deploy but difficult to design. Motivated by the shortcomings of existing learning- and model-based methods, we propose a simple yet effective neural CBF design method for safe robot navigation in dynamic environments. We employ the idea of a composite CBF, where multiple neural CBFs are combined into a single CBF. The individual CBFs are trained via the Hamilton-Jacobi reachability framework to approximate the optimal safe set for single moving obstacles. Additionally, we use the residual neural architecture, which guarantees that the estimated safe set does not intersect with the corresponding failure set. The method is extensively evaluated in simulation experiments for a ground robot and a quadrotor, comparing it against several baseline methods. The results show improved success rates of up to 18\% compared to the best baseline, without increasing the conservativeness of the motion. Also, the method is demonstrated in hardware experiments for both types of robots.
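The composite-CBF safety filter can be sketched for a single-integrator robot (hand-coded distance CBFs replace the paper's HJ-trained neural CBFs; min-composition picks the most-constraining obstacle, and the filter is the closed-form solution of the one-constraint CBF-QP):

```python
import numpy as np

def h_obs(x, o, r):
    # circular-obstacle CBF candidate: positive outside the obstacle
    return np.dot(x - o, x - o) - r * r

def safe_filter(x, u_nom, obstacles, r=1.0, alpha=1.0):
    # composite CBF: take the most-constraining (minimum) obstacle CBF,
    # then minimally modify u_nom so that  dh/dt >= -alpha * h  holds for it
    worst = min(obstacles, key=lambda o: h_obs(x, o, r))
    g = 2 * (x - worst)                  # gradient of h at x
    slack = g @ u_nom + alpha * h_obs(x, worst, r)
    if slack >= 0:
        return u_nom                     # nominal input already safe
    return u_nom - slack * g / (g @ g)   # closed-form QP projection

x = np.array([2.0, 0.0])
u = safe_filter(x, np.array([-1.0, 0.0]), [np.array([0.0, 0.0])])
```

For higher-order dynamics and learned safe sets the per-obstacle CBFs come from a neural network, but the filtering step retains this minimally invasive structure.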
Memory disaggregation via Compute Express Link (CXL) enables multiple hosts to share remote memory, improving utilization for data-intensive workloads. Today, virtual memory enables process-level isolation on a host and CXL enables host-level isolation. This creates a critical security gap: the absence of process-level memory isolation in shared disaggregated memory. We present Space-Control, a hardware-software co-design that provides fine-grained, process-level isolation for shared disaggregated memory. Space-Control authenticates the execution context in hardware, enforces access control on every memory access, and amortizes lookup times with a small cache. Our design allows up to 127 processes. On a Structural Simulation Toolkit (SST)-based CXL model, Space-Control incurs a minimal performance overhead of 3.3%, making shared disaggregated memory isolation practical.
This tutorial provides a critical review of the practical application of Control Barrier Functions (CBFs) in robotic safety. While the theoretical foundations of CBFs are well-established, I identify a recurring gap between the mathematical assumption of a safe controller's existence and its constructive realization in systems with input constraints. I highlight the distinction between candidate and valid CBFs by analyzing the interplay of system dynamics, actuation limits, and class-K functions. I further show that some purported demonstrations of safe robot policies or controllers are limited to passively safe systems, such as single integrators or kinematic manipulators, where safety is already inherited from the underlying physics and even naive geometric hard constraints suffice to prevent collisions. By revisiting simple low-dimensional examples, I show when CBF formulations provide valid safety guarantees and when they fail due to common misuses. I then provide practical guidelines for constructing realizable safety arguments for systems without such passive safety. The goal of this tutorial is to bridge the gap between theoretical guarantees and actual implementation, supported by an open-source interactive web demonstration that visualizes these concepts intuitively.
Collaborative aerial transportation of tethered payloads is fundamentally limited by space, power, and weight constraints. Conventional approaches rely on static equilibrium conditions, where each vehicle tilts to generate the horizontal forces needed to maintain a formation geometry that avoids aerodynamic interactions and collisions. This horizontal thrust component represents a significant energy penalty compared to the ideal case in which each vehicle produces purely vertical thrust to lift the payload. Operating in tighter tether configurations can minimize this effect, but at the cost of either having to fly the vehicles in closer proximity, which risks collision, or significantly increasing the length of the tether, which increases complexity and reduces potential use-cases. We propose operating the tether-suspended flying system at a rotating equilibrium. By maintaining steady circular motion, centrifugal forces provide the necessary horizontal tether tension, allowing each quadrotor to generate purely vertical thrust and thus reducing the total force (and power) required compared to an equilibrium where the thrusts are not vertical. It also allows for a wider range of tether configurations to be used without sacrificing efficiency. Results demonstrate that rotating equilibria can reduce power consumption relative to static lifting by up to 20%, making collaborative aerial solutions more practically relevant.
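The power argument can be made concrete with a simplified point-mass balance (an illustrative sketch under symmetric, idealized assumptions, not the paper's full model). Let $n$ quadrotors of mass $m_q$ share a payload of mass $m_p$ hanging at the center while each vehicle circles at radius $r_q$ and rate $\omega$. The inward tether tension then supplies each vehicle's centripetal force, so its thrust can stay purely vertical:

```latex
T_v = \frac{m_p g}{n}, \qquad T_h = m_q\,\omega^{2} r_q, \qquad
f_{\mathrm{rot}} = m_q g + T_v .
```

In the corresponding static equilibrium the same horizontal tension must instead be cancelled by tilting the thrust vector, giving $f_{\mathrm{static}} = \sqrt{(m_q g + T_v)^2 + T_h^2} > f_{\mathrm{rot}}$ whenever $T_h \neq 0$, which is the source of the reported power savings.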
Scientists face significant visualization challenges as time-varying datasets grow in speed and volume, often requiring specialized infrastructure and expertise to handle massive datasets. Petascale climate models generated in NASA laboratories require a dedicated group of graphics and media experts and access to high-performance computing resources. Scientists may need to share scientific results with the community iteratively and quickly. However, the time-consuming trial-and-error process incurs significant data transfer overhead and far exceeds the time and resources allocated for typical post-analysis visualization tasks, disrupting the production workflow. Our paper introduces a user-friendly framework for creating 3D animations of petascale, time-varying data on a commodity workstation. Our contributions are: (i) a Generalized Animation Descriptor (GAD), a keyframe-based adaptable abstraction for animation; (ii) efficient data access from cloud-hosted repositories to reduce data management overhead; (iii) a tailored rendering system; and (iv) an LLM-assisted conversational interface as a scripting module that allows domain scientists with no visualization expertise to create animations of their region of interest. We demonstrate the framework's effectiveness with two case studies: first, by generating animations in which sampling criteria are specified based on prior knowledge, and second, by generating AI-assisted animations in which sampling parameters are derived from natural-language user prompts. In all cases, we use large-scale NASA climate-oceanographic datasets that exceed 1 PB in size yet achieve a fast turnaround time of 1 minute to 2 hours. Users can generate a rough draft of the animation within minutes, then seamlessly incorporate as much high-resolution data as needed for the final version.
Intelligent fault diagnosis (IFD) has emerged as a powerful paradigm for ensuring the safety and reliability of industrial machinery. However, traditional IFD methods rely heavily on abundant labeled data for training, which is often difficult to obtain in practical industrial environments. Constructing a digital twin (DT) of the physical asset to obtain simulation data has therefore become a promising alternative. Nevertheless, existing DT-assisted diagnosis methods mainly transfer diagnostic knowledge through domain adaptation techniques, which still require a considerable amount of unlabeled data from the target asset. To address the challenges in few-shot scenarios where only extremely limited samples are available, a bi-directional DT prototype anchoring method with multi-periodicity learning is proposed. Specifically, a framework involving meta-training in the DT virtual space and test-time adaptation in the physical space is constructed for reliable few-shot model adaptation to the target asset. A bi-directional twin-domain prototype anchoring strategy with covariance-guided augmentation for adaptation is further developed to improve the robustness of prototype estimation. In addition, a multi-periodicity feature learning module is designed to capture the intrinsic periodic characteristics within current signals. A DT of an asynchronous motor is built based on the finite element method, and experiments are conducted under multiple few-shot settings and three working conditions. Comparative and ablation studies demonstrate the superiority and effectiveness of the proposed method for few-shot fault diagnosis.
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars (raised parallel strips used for continuous guidance along sidewalks). This narrow focus overlooks truncated domes (rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges). As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.
Collaborative transportation of heavy payloads via loco-manipulation is a challenging yet essential capability for legged robots operating in complex, unstructured environments. Centralized planning methods, e.g., holistic trajectory optimization, capture dynamic coupling among robots and payloads but scale poorly with system size, limiting real-time applicability. In contrast, hierarchical and fully decentralized approaches often neglect force and dynamic interactions, leading to conservative behavior. This study proposes an Alternating Direction Method of Multipliers (ADMM)-based distributed model predictive control framework for collaborative loco-manipulation by a team of quadruped robots equipped with manipulators. By exploiting the payload-induced coupling structure, the global optimal control problem is decomposed into parallel individual-robot-level subproblems with consensus constraints. The distributed planner operates in a receding-horizon fashion and achieves fast convergence, requiring only a few ADMM iterations per planning cycle. A wrench-aware whole-body controller executes the planned trajectories, tracking both motion and interaction wrenches. Extensive simulations with up to four robots demonstrate scalability, real-time performance, and robustness to model uncertainty.
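The consensus decomposition can be sketched on a toy problem: two agents with private quadratic costs reach agreement on a shared variable via consensus ADMM. This is a stand-in for the paper's robot-level MPC subproblems; all names and constants here are illustrative.

```python
# Toy consensus ADMM: each "robot" minimizes a private quadratic cost
# f_i(x) = (x - a_i)^2 while all agree on a shared payload variable z.
def consensus_admm(a, rho=1.0, iters=100):
    n = len(a)
    z = 0.0
    u = [0.0] * n                        # scaled dual variables
    for _ in range(iters):
        # Local updates (parallelizable across robots):
        # argmin_x f_i(x) + (rho/2)(x - z + u_i)^2 has a closed form.
        x = [(2 * ai + rho * (z - ui)) / (2 + rho) for ai, ui in zip(a, u)]
        z = sum(xi + ui for xi, ui in zip(x, u)) / n   # consensus step
        u = [ui + xi - z for ui, xi in zip(u, x)]      # dual update
    return z

print(consensus_admm([0.0, 4.0]))  # converges to the consensus optimum 2.0
```

For quadratic local costs the consensus iterate contracts geometrically, which mirrors the paper's observation that only a few ADMM iterations per planning cycle suffice.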
We introduce a multimodal industrial fault analysis dataset collected from a single-speed chain conveyor (SSCC) system, targeting system-level fault detection in production lines. The dataset consists of multimodal signals, including three audio and four vibration channels. It covers normal operation and four representative fault types under multiple speeds, loads, and both clean and realistic factory-noise conditions reproduced on-site. It is explicitly designed to support channel-wise analysis and multimodal fusion research. We establish standardized evaluation protocols for unsupervised fault detection with normal-only training and supervised fault classification with balanced dataset splits across different operating conditions and fault types. A unified channel-wise kNN baseline is provided to enable fair comparison of representation quality without task-specific training. The dataset offers a practical and extensible benchmark for robust multimodal industrial fault analysis.
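A channel-wise kNN baseline of the kind described, trained on normal data only and scoring test samples by distance to their nearest normal neighbors, can be sketched as follows. Synthetic 2-D features stand in for whatever representation a channel provides; the function name and constants are illustrative.

```python
import math

def knn_score(train, x, k=3):
    """Anomaly score of x: mean distance to its k nearest normal samples."""
    d = sorted(math.dist(x, t) for t in train)
    return sum(d[:k]) / k

# Normal-only "training" set for one channel (synthetic stand-in features).
normal = [(0.1 * i, 0.1 * (i % 5)) for i in range(50)]

s_norm = knn_score(normal, (2.0, 0.2))    # near the normal manifold
s_anom = knn_score(normal, (10.0, 10.0))  # far from it
print(s_norm, s_anom)
```

No task-specific training is involved, so the score compares representation quality across channels and modalities on equal footing, which is the role the baseline plays in the benchmark.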
This tutorial presents a control-oriented introduction to aided inertial navigation systems using a Lie-group formulation centered on the extended Special Euclidean group SE_2(3). The focus is on developing a clear and implementation-oriented geometric framework for fusing inertial measurements with aiding information, while making the role of invariance and symmetry explicit. Recent extensions, including higher-order state representations, synchronous observer designs, and equivariant filtering methods, are discussed as natural continuations of the same underlying principles. The goal is to provide readers with a coherent system-theoretic perspective that supports both understanding and practical use of modern aided inertial navigation methods.
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable automated signal analysis, offering the potential to provide clinicians with both objective and quantitative feedback on bowel activity. This study presents an automated pipeline for bowel sound segmentation and classification using a wearable acoustic SonicGuard sensor. BS signals from 83 subjects were recorded using this sensor. Data from 40 subjects were manually annotated by clinical experts and used to train an automatic annotation algorithm, while the remaining subjects were used for further model evaluation. An energy-based event detection algorithm was developed to detect BS events. Detected sound segments were then classified into BS patterns using a pretrained Audio Spectrogram Transformer (AST) model. Model performance was evaluated separately for healthy individuals and patients. The best configuration used two specialized models, one trained on healthy subjects and one on patients, achieving (accuracy: 0.97, AUROC: 0.98) for the healthy group and (accuracy: 0.96, AUROC: 0.98) for the patient group. The auto-annotation method reduced manual labeling time by approximately 70%, and expert review showed that less than 12% of automatically detected segments required correction. The proposed automated segmentation and classification system enables quantitative assessment of bowel activity, providing clinicians with an objective diagnostic tool that may improve the diagnosis of gastrointestinal function and support the annotation of large-scale datasets.
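An energy-based event detector in the spirit of the described pipeline can be sketched in a few lines. The frame length and threshold below are illustrative placeholders, not the study's tuned values.

```python
def detect_events(x, frame=64, thresh=0.01):
    """Return (start, end) sample indices of segments whose short-time
    energy exceeds a threshold: a minimal energy-based event detector."""
    events, start = [], None
    for i in range(0, len(x) - frame + 1, frame):
        e = sum(v * v for v in x[i:i + frame]) / frame
        if e > thresh and start is None:
            start = i                       # event onset
        elif e <= thresh and start is not None:
            events.append((start, i))       # event offset
            start = None
    if start is not None:
        events.append((start, len(x)))
    return events

# Synthetic signal: silence with one short burst at samples 256-384.
sig = [0.0] * 1024
for i in range(256, 384):
    sig[i] = 0.5
print(detect_events(sig))  # -> [(256, 384)]
```

In the actual pipeline the detected segments would then be passed to the AST classifier; here the detector alone is shown.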
Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding that captures shared, robust acoustic signatures such as global energy dynamics. These findings suggest that massive S3Ms can internalize multiple layers of language history, providing a promising perspective for computational phylogenetics and the study of language contact.
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking rich context present in the video, such as the speaking scene and on-screen text. To tackle this CAVSR task (AVSR with rich visual Context), we propose VASR, a model designed to "see" and reason about the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. In addition, to address data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates single-modality dominance, achieving state-of-the-art performance in CAVSR. The project is open-sourced.
Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.
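The path-map association step can be sketched with a minimal 2-D k-d tree: detected centroids snap to indexed path points, and speed falls out of the rate of change of the path index. The geometry, spacing, and names below are invented for illustration; a production system would use a library k-d tree.

```python
import math

def build_kd(pts, depth=0):
    """Minimal 2-D k-d tree over (index, (x, y)) path points."""
    if not pts:
        return None
    axis = depth % 2
    pts = sorted(pts, key=lambda p: p[1][axis])
    m = len(pts) // 2
    return (pts[m], build_kd(pts[:m], depth + 1),
            build_kd(pts[m + 1:], depth + 1), axis)

def nearest(node, q, best=None):
    """Exact nearest-neighbor query with standard hyperplane pruning."""
    if node is None:
        return best
    (idx, p), left, right, axis = node
    if best is None or math.dist(q, p) < math.dist(q, best[1]):
        best = (idx, p)
    near, far = (left, right) if q[axis] < p[axis] else (right, left)
    best = nearest(near, q, best)
    if abs(q[axis] - p[axis]) < math.dist(q, best[1]):
        best = nearest(far, q, best)
    return best

# Offline path map: indexed centreline points at 1 m spacing.
path = [(i, (float(i), 0.0)) for i in range(100)]
tree = build_kd(path)

# Online: noisy detections of one vehicle, one per second.
det = [(3.1, 0.2), (5.0, -0.1), (7.2, 0.1)]
idxs = [nearest(tree, q)[0] for q in det]
speed = (idxs[-1] - idxs[0]) / 2.0   # metres per second over 2 s
print(idxs, speed)
```

Predicted future positions then follow by extrapolating the path index at the estimated speed, and collision checks compare the predicted index windows of different vehicles on intersecting segments.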
Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long-horizon exploration. Yet, despite rapid advances in learning-based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth-limited communication, and energy scarcity are not independent challenges; they interact within the closed perception-planning-control loop and often amplify one another over time. This Review develops a constraint-coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief-aware planning, hybrid control, multi-robot coordination, and foundation-model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross-layer coupling. To unify these observations, we introduce a cross-layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics-grounded world models, certifiable learning-enabled control, communication-aware coordination, and deployment-aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance-driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.
We study the problem of online tracking in unknown nonlinear dynamical systems, where only short-horizon predictions of future target states are available. This setting arises in practical scenarios where full future information and exact system dynamics are unavailable. We focus on a class of nonlinear systems that admit a Koopman linear embedding, enabling the dynamics to evolve linearly in a lifted space. Exploiting this structure, we analyze a model-free predictive tracking algorithm based on Willems' fundamental lemma, which imposes dynamic constraints using only past data within a receding-horizon control framework. We show that, for Koopman-linearizable systems, the cumulative cost and dynamic regret of the nonlinear tracking problem coincide with those of the lifted linear counterpart. Moreover, we prove that the dynamic regret of our algorithm decays exponentially with the prediction horizon, as validated by numerical experiments.
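The data-driven prediction step based on Willems' fundamental lemma can be sketched for a scalar linear system standing in for the lifted Koopman dynamics. The horizons and the system below are illustrative; the point is that stacked Hankel matrices of past data impose the dynamic constraints without a model.

```python
import numpy as np

rng = np.random.default_rng(0)

# One persistently exciting trajectory of the (unknown) system
# x+ = 0.9 x + u with full-state output y = x.
T, a = 60, 0.9
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(T - 1):
    y[t + 1] = a * y[t] + u[t]

def hankel(w, L):
    return np.column_stack([w[i:i + L] for i in range(len(w) - L + 1)])

Tini, N = 2, 5                                   # past window, horizon
Up, Uf = np.vsplit(hankel(u, Tini + N), [Tini])
Yp, Yf = np.vsplit(hankel(y, Tini + N), [Tini])

# Any g consistent with the recent past and the planned inputs
# yields the (unique) predicted output Yf @ g.
u_ini, y_ini = u[40:42], y[40:42]
u_f = 0.5 * np.ones(N)
g, *_ = np.linalg.lstsq(np.vstack([Up, Yp, Uf]),
                        np.concatenate([u_ini, y_ini, u_f]), rcond=None)
y_pred = Yf @ g

# Ground truth by rolling the true dynamics (hidden from the predictor).
x_true, y_true = a * y_ini[1] + u_ini[1], []
for k in range(N):
    y_true.append(x_true)
    x_true = a * x_true + u_f[k]
print(np.max(np.abs(y_pred - np.array(y_true))))  # ~ machine precision
```

In the noise-free linear case the prediction is exact, which is the mechanism behind the paper's claim that the lifted linear problem and the nonlinear tracking problem share the same cumulative cost and regret.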
We are concerned with the challenge of reliably classifying and assessing intracranial aneurysms using deep learning without compromising clinical transparency. While traditional black-box models achieve high predictive accuracy, their lack of inherent interpretability remains a significant barrier to clinical adoption and regulatory approval. Explainability is paramount in medical modeling to ensure that AI-driven diagnoses align with established neurosurgical principles. Unlike traditional eXplainable AI (XAI) methods -- such as saliency maps, which often provide post-hoc, non-causal visual correlations -- Concept Bottleneck Models (CBMs) offer a robust alternative by constraining the model's internal logic to human-understandable clinical indices. In this article, we propose an end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification. We implemented this pipeline using a pre-trained 3D ResNet-34 backbone and a 3D DenseNet-121 to extract features from CTA volumes, which were subsequently processed through a soft bottleneck layer representing human-interpretable clinical concepts. The model was optimized using a joint-loss function to balance diagnostic focal loss and concept mean squared error (MSE), validated via stratified five-fold cross-validation. Our results demonstrate a peak task classification accuracy of 93.33% +/- 4.5% for the ResNet-34 architecture and 91.43% +/- 5.8% for the DenseNet-121 model. Furthermore, the implementation of 8-pass Test-Time Augmentation (TTA) yielded a robust mean accuracy of 88.31%, ensuring diagnostic stability during inference. By maintaining an accuracy-generalization gap of less than 0.04, this framework proves that high predictive performance can be achieved without sacrificing interpretability.
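The joint objective, diagnostic focal loss plus mean squared error on the concept bottleneck, can be sketched for a single sample. The probabilities, concept values, and the weight `lam` below are illustrative, not the paper's settings.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on predicted probability p for label y in {0, 1}."""
    pt = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))

def joint_loss(p, y, c_hat, c, lam=0.5):
    """Diagnostic focal loss + lam * concept MSE (bottleneck supervision)."""
    mse = sum((a - b) ** 2 for a, b in zip(c_hat, c)) / len(c)
    return focal_loss(p, y) + lam * mse

# Hypothetical normalized clinical concepts (e.g. size, aspect ratio, ...).
concepts = [0.8, 0.2, 0.5]
good = joint_loss(0.95, 1, [0.78, 0.22, 0.50], concepts)  # accurate concepts
bad = joint_loss(0.40, 1, [0.10, 0.90, 0.90], concepts)   # wrong concepts
print(good, bad)
```

Penalizing concept error alongside the diagnosis forces the bottleneck activations to track the human-interpretable indices, rather than drifting into an unconstrained latent code.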
We study the problem of state representation learning for control from partial and potentially high-dimensional observations. We approach this problem via cost-driven state representation learning, in which we learn a dynamical model in a latent state space by predicting cumulative costs. In particular, we establish finite-sample guarantees on finding a near-optimal representation function and a near-optimal controller using the learned latent model for infinite-horizon time-invariant Linear Quadratic Gaussian (LQG) control. We study two approaches to cost-driven representation learning, which differ in whether the transition function of the latent state is learned explicitly or implicitly. The first approach has also been investigated in Part I of this work, for finite-horizon time-varying LQG control. The second approach closely resembles MuZero, a recent breakthrough in empirical reinforcement learning, in that it learns latent dynamics implicitly by predicting cumulative costs. A key technical contribution of this Part II is to prove persistency of excitation for a new stochastic process that arises from the analysis of quadratic regression in our approach, and may be of independent interest.
Automatic detection of Parkinson's disease (PD) from speech is a promising non-invasive diagnostic tool, but it raises significant privacy concerns. Speaker anonymization mitigates these risks, but it may suppress the pathological information necessary for PD detection. We assess the trade-off between privacy and PD detection for two anonymizers (STT-TTS and kNN-VC) using two Spanish datasets. STT-TTS provides better privacy but severely degrades PD detection by eradicating prosodic information. kNN-VC preserves macro-prosodic features such as duration and F0 contours, achieving F1 scores only 3-7% lower than original baselines, demonstrating that privacy-preserving PD detection is viable when using appropriate anonymization. Finally, an acoustic distortion analysis characterizes specific weaknesses in kNN-VC, offering insights for designing anonymizers that better preserve PD information.
Computational engine sound modeling is central to the automotive audio industry, particularly for active sound design, virtual prototyping, and emerging data-driven engine sound synthesis methods. These applications require large volumes of standardized, clean audio recordings with precisely time-aligned operating-state annotations: data that is difficult to obtain due to high costs, specialized measurement equipment requirements, and inevitable noise contamination. We present an analysis-driven framework for generating engine audio with sample-accurate control annotations. The method extracts harmonic structures from real recordings through pitch-adaptive spectral analysis, which then drive an extended parametric harmonic-plus-noise synthesizer. With this framework, we generate the Procedural Engine Sounds Dataset (19 hours, 5,935 files), a set of engine audio signals with sample-accurate RPM and torque annotations, spanning a wide range of operating conditions, signal complexities, and harmonic profiles. Comparison against real recordings validates that the synthesized data preserves characteristic harmonic structures, and baseline experiments confirm its suitability for learning-based parameter estimation and synthesis tasks. The dataset is released publicly to support research on engine timbre analysis, control parameter estimation, acoustic modeling and neural generative networks.
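A stripped-down harmonic-plus-noise synthesizer with an RPM-derived fundamental might look as follows. The 4-stroke firing-frequency rule and the 1/k harmonic decay are textbook simplifications used here for illustration, not the paper's extended parametric model.

```python
import math, random

def engine_tone(rpm, dur=0.5, sr=16000, n_harm=8, noise=0.01, seed=0):
    """Harmonic-plus-noise sketch.  For a 4-stroke, 4-cylinder engine the
    firing fundamental is rpm/60 * 2; harmonic amplitudes decay as 1/k,
    and a small Gaussian noise floor models the stochastic component."""
    rng = random.Random(seed)
    f0 = rpm / 60.0 * 2.0
    n = int(dur * sr)
    out = []
    for t in range(n):
        s = sum(math.sin(2 * math.pi * k * f0 * t / sr) / k
                for k in range(1, n_harm + 1))
        out.append(s + noise * rng.gauss(0.0, 1.0))
    return out

sig = engine_tone(3000)   # 100 Hz firing fundamental
print(len(sig))           # -> 8000 samples at 16 kHz for 0.5 s
```

Because the RPM value drives the synthesis directly, every sample carries an exact operating-state annotation for free, which is the property the dataset exploits.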
This paper introduces a novel class of model-driven evolutionary frameworks for near-field multi-source localization, addressing the major limitations of grid-based subspace methods such as MUSIC and data-dependent deep learning approaches. To this end, we develop two complementary evolutionary localization frameworks that operate directly on the continuous spherical-wave signal model and support arbitrary array geometries without requiring labeled data, discretized angle--range grids, or architectural constraints. The first framework, termed NEar-field MultimOdal DE (NEMO-DE), associates each individual in the evolutionary population to a single source and optimizes a residual least-squares objective in a sequential manner, updating the data residual and enforcing spatial separation to estimate multiple source locations. To overcome the limitation of NEMO-DE under large power imbalances among the sources, we propose the second framework, named NEar-field Eigen-subspace Fitting DE (NEEF-DE), which jointly encodes all source locations and minimizes a subspace-fitting criterion that aligns a model-based array response subspace with the received signal subspace. Although the proposed frameworks are algorithm-agnostic and compatible with various evolutionary optimizers, differential evolution (DE) is adopted in this work as a representative search strategy due to its simplicity, robustness, and strong empirical performance. We provide extensive numerical experiments to evaluate the performance of the proposed frameworks under different system configurations. This work establishes evolutionary computation as a powerful and flexible paradigm for model-based near-field localization, paving the way for future innovations in this domain.
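A minimal DE/rand/1/bin optimizer of the kind adopted as the representative search strategy can be sketched as follows, here minimizing a toy localization residual (distance to a hidden source) rather than the paper's spherical-wave objectives. All names and constants are illustrative.

```python
import random

def de_minimize(f, bounds, pop=20, gens=150, F=0.7, CR=0.9, seed=1):
    """Minimal DE/rand/1/bin: mutate with a scaled difference of two
    random members, binomially cross over, keep the trial if no worse."""
    rng = random.Random(seed)
    d = len(bounds)
    X = [[rng.uniform(*b) for b in bounds] for _ in range(pop)]
    cost = [f(x) for x in X]
    for _ in range(gens):
        for i in range(pop):
            a, b, c = rng.sample([j for j in range(pop) if j != i], 3)
            jr = rng.randrange(d)                   # guaranteed crossover dim
            trial = [X[a][k] + F * (X[b][k] - X[c][k])
                     if (rng.random() < CR or k == jr) else X[i][k]
                     for k in range(d)]
            trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
            fc = f(trial)
            if fc <= cost[i]:                       # greedy selection
                X[i], cost[i] = trial, fc
    best = min(range(pop), key=cost.__getitem__)
    return X[best], cost[best]

# Toy continuous "residual": squared distance to a hidden source.
src = (1.0, 2.5)
x, c = de_minimize(lambda p: (p[0] - src[0]) ** 2 + (p[1] - src[1]) ** 2,
                   [(-5, 5), (-5, 5)])
print(x, c)
```

Because the search operates on continuous coordinates, no angle-range grid is ever discretized, which is the property the frameworks exploit.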
Brand advertising plays a critical role in building long-term consumer awareness and loyalty, making it a key objective for advertisers across digital platforms. Although real-time bidding has been extensively studied, there is limited literature on algorithms specifically tailored for brand auction ads that fully leverage their unique characteristics. In this paper, we propose a lightweight Model Predictive Control (MPC) framework designed for brand advertising campaigns, exploiting the inherent attributes of brand ads -- such as stable user engagement patterns and fast feedback loops -- to simplify modeling and improve efficiency. Our approach utilizes online isotonic regression to construct monotonic bid-to-spend and bid-to-conversion models directly from streaming data, eliminating the need for complex machine learning models. The algorithm operates fully online with low computational overhead, making it highly practical for real-world deployment. Simulation results demonstrate that our approach significantly improves spend efficiency and cost control compared to baseline strategies, providing a scalable and easily implementable solution for modern brand advertising platforms.
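The monotone bid-to-spend fit can be computed with the classic pool-adjacent-violators (PAV) algorithm. Below is a batch sketch on invented data; the paper's variant is online, so treat this only as the core monotone-regression step.

```python
def pav(y, w=None):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y.
    Adjacent blocks whose means violate monotonicity are merged into
    their weighted mean until the sequence of block means is monotone."""
    w = w or [1.0] * len(y)
    out = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        out.append([yi, wi, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2, n2 = out.pop()
            m1, w1, n1 = out.pop()
            ws = w1 + w2
            out.append([(m1 * w1 + m2 * w2) / ws, ws, n1 + n2])
    fit = []
    for m, _, n in out:
        fit.extend([m] * n)
    return fit

# Noisy spend observations at ascending bid levels (invented numbers).
spend = [1.0, 0.8, 1.5, 1.4, 2.0]
fit = pav(spend)
print(fit)  # a nondecreasing fit: [0.9, 0.9, 1.45, 1.45, 2.0]
```

The fitted monotone curve can then be inverted to pick the smallest bid reaching a target spend, which is how an MPC step would use it.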
Non-orthogonal multiple access (NOMA) systems, which allow multiple users to share the same resource block, offer significant gains in spectral efficiency and can enable the massive access required in future wireless systems. However, they face several challenges due to their sensitivity to power allocation coefficients, fading effects, and imperfect channel state information (CSI). To address these limitations, this paper proposes Hadamard-NOMA, an approach leveraging the Hadamard Transform (HT) at the source level prior to modulation. By introducing HT, the system mitigates the adverse impact of fading and CSI imperfections, reducing bit error rates (BER) and enhancing overall system reliability. Theoretical analysis and Monte Carlo simulations validate the effectiveness of this technique, demonstrating robust NOMA transmission in dynamic wireless environments. The proposed method offers a promising solution for next-generation wireless networks, ensuring more reliable performance under diverse transmission conditions. Simulation results confirm analytical predictions, demonstrating significant performance improvements over state-of-the-art T-NOMA and Usman-NOMA schemes. Specifically, for the Near user, a gain of 15 dB is achieved at a Bit Error Rate (BER) of $10^{-2}$, while the Far user benefits from a 10 dB gain at a BER of $10^{-1}$. Compared to Usman-NOMA, the proposed method provides an improvement of 15 dB for the Far user at BER $10^{-1}$. Additionally, in a two-user scenario with imperfect Successive Interference Cancellation (SIC), user 1 requires an SNR at least 14 dB lower than user 2 to achieve a BER of $10^{-3}$. These findings highlight the effectiveness of applying HT at the source stage, significantly mitigating CSI errors and making NOMA more resilient for next-generation wireless networks.
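Applying the Hadamard transform at the source amounts to spreading each symbol's energy across all chips with the self-inverse fast Walsh-Hadamard transform (FWHT), so that a deep fade on one chip degrades every symbol a little instead of destroying one symbol entirely. A minimal sketch of the transform pair (the BPSK example is illustrative):

```python
def fwht(a):
    """In-place fast Walsh-Hadamard transform (unnormalized).
    Length must be a power of two; applying it twice and dividing
    by the length recovers the input, since H * H = n * I."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

bits = [1, -1, 1, 1]                              # BPSK symbols at the source
spread = fwht(bits)                               # energy spread across chips
recovered = [v / len(bits) for v in fwht(spread)] # HT is self-inverse
print(spread, recovered)
```

In the actual scheme the spread chips are what gets power-allocated and superimposed in the NOMA downlink; the receiver applies the inverse transform after SIC.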
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only ${\sim}$1K entries while preserving or improving perceptual quality.
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
Coordinated audio generation based on video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments must correspond to those of the video frames. Previous studies leverage a two-stage design in which the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective at aligning overall AV semantics but provide limited temporal rhythmic synchronization. In this work, we propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training, where masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audio that is both semantically and rhythmically coherent with various video sequences.
The autocovariance least squares (ALS) method is a computationally efficient approach for estimating noise covariances in Kalman filters without requiring specific noise models. However, conventional ALS and its variants rely on the classic least mean squares (LMS) criterion, making them highly sensitive to measurement outliers and prone to severe performance degradation. To overcome this limitation, this paper proposes a novel outlier-robust ALS algorithm, termed ALS-IRLS, based on the iteratively reweighted least squares (IRLS) framework. Specifically, the proposed approach introduces a two-tier robustification strategy. First, an innovation-level adaptive thresholding mechanism is employed to filter out heavily contaminated data. Second, the outlier-contaminated autocovariance is formulated using an $\epsilon$-contamination model, where the standard LMS criterion is replaced by the Huber cost function. The IRLS method is then utilized to iteratively adjust data weights based on estimation deviations, effectively mitigating the influence of residual outliers. Comparative simulations demonstrate that ALS-IRLS reduces the root-mean-square error (RMSE) of noise covariance estimates by over two orders of magnitude compared to standard ALS. Furthermore, it significantly enhances downstream state estimation accuracy, outperforming existing outlier-robust Kalman filters and achieving performance nearly equivalent to the ideal Oracle lower bound in the presence of noisy and anomalous data.
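The IRLS core of such a scheme can be illustrated in isolation. The sketch below is a generic Huber-weighted IRLS solver for a linear model, not the authors' exact ALS-IRLS formulation; the threshold `delta` and the linear problem `A x ≈ b` are illustrative placeholders:

```python
import numpy as np

def huber_weights(residuals, delta):
    """IRLS weights for the Huber cost: quadratic for |r| <= delta, linear beyond."""
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > delta
    w[mask] = delta / r[mask]
    return w

def irls_huber(A, b, delta=1.0, iters=50):
    """Minimize sum_i huber(b_i - (A x)_i) by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # ordinary LMS initialization
    for _ in range(iters):
        r = b - A @ x
        w = huber_weights(r, delta)
        sw = np.sqrt(w)
        # Weighted least-squares step: scale rows by sqrt of the weights.
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x
```

Each iteration solves a weighted least-squares problem in which residuals beyond `delta` receive weight `delta/|r|`, so gross outliers are progressively downweighted rather than squared.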
Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models (LALMs) show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.
Real-world optimization problems are often constrained by complex physical laws that limit computational scalability. These constraints are inherently tied to complex regions, and thus learning models that incorporate physical and geometric knowledge, i.e., physics-informed machine learning (PIML), offer a promising pathway for efficient solution. Here, we introduce PolyFormer, which opens a new direction for PIML in prescriptive optimization tasks, where physical and geometric knowledge is not merely used to regularize learning models, but to simplify the problems themselves. PolyFormer captures geometric structures behind constraints and transforms them into efficient polytopic reformulations, thereby decoupling problem complexity from solution difficulty and enabling off-the-shelf optimization solvers to efficiently produce feasible solutions with acceptable optimality loss. Through evaluations across three important problems (large-scale resource aggregation, network-constrained optimization, and optimization under uncertainty), PolyFormer achieves computational speedups up to 6,400-fold and memory reductions up to 99.87%, while maintaining solution quality competitive with or superior to state-of-the-art methods. These results demonstrate that PolyFormer provides an efficient and reliable solution for scalable constrained optimization, expanding the scope of PIML to prescriptive tasks in scientific discovery and engineering applications.
Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles -- principles broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.
This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with normalization of 130+ aliases and automatic error correction. We present experimental results on file operation benchmarks achieving 100% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.
Robust single-vessel tracking from fixed coastal platforms is hindered by modality-specific degradations: cameras suffer from illumination and visual clutter, while LiDAR performance drops with range and intermittent returns. We present a heterogeneous multi-sensor fusion particle-filter tracker that incorporates an information-gain (entropy-reduction) adaptive sensing policy to select the most informative configuration at each fusion time bin. The approach is validated in a real maritime deployment at the CMMI Smart Marina Testbed (Ayia Napa Marina, Cyprus), using a shore-mounted 3D LiDAR and an elevated fixed camera to track a rigid inflatable boat with onboard GNSS ground truth. We compare LiDAR-only, camera-only, all-sensors, and adaptive configurations. Results show LiDAR dominates near-field accuracy, the camera sustains longer-range coverage when LiDAR becomes unavailable, and the adaptive policy achieves a favorable accuracy-continuity trade-off by switching modalities based on information gain. By avoiding continuous multi-stream processing, the adaptive configuration provides a practical baseline for resilient and resource-aware maritime surveillance.
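A minimal version of an entropy-reduction sensor-selection rule for a particle filter can be sketched as follows. This is an illustrative simplification, not the deployed tracker; the Gaussian measurement models and the use of the weighted-mean prediction as the hypothetical observation are assumptions:

```python
import numpy as np

def weight_entropy(w):
    """Shannon entropy of normalized particle weights."""
    w = w / w.sum()
    return -np.sum(w * np.log(w + 1e-12))

def expected_info_gain(particles, weights, meas_fn, noise_std, z_pred):
    """Entropy reduction of the particle weights if z_pred were fused."""
    lik = np.exp(-0.5 * ((meas_fn(particles) - z_pred) / noise_std) ** 2)
    w_post = weights * lik
    if w_post.sum() <= 0:
        return 0.0
    return weight_entropy(weights) - weight_entropy(w_post)

def select_sensor(particles, weights, sensors):
    """Pick the sensor whose hypothetical update yields the largest entropy reduction."""
    best, best_gain = None, -np.inf
    for name, (meas_fn, noise_std) in sensors.items():
        # Hypothetical observation: the weighted-mean predicted measurement.
        z_pred = np.average(meas_fn(particles), weights=weights)
        gain = expected_info_gain(particles, weights, meas_fn, noise_std, z_pred)
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```

A low-noise sensor sharpens the posterior weights more, so it wins the selection whenever it is available; processing only the selected stream is what avoids continuous multi-stream computation.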
Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at this https URL.
Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174 to 2,234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Spherical Lift Module (SLM) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at this https URL.
Stability of economic model predictive control can be proven under the assumption that a strict dissipativity condition holds. This assumption has a clear interpretation in terms of the so-called rotated stage cost, which must attain its minimum at the optimal steady state. However, contrary to dissipativity, for strict dissipativity the storage function cannot be immediately related to the value function of an optimal control problem formulated with the economic stage cost. We propose the novel concept of two-storage strict dissipativity, which requires two storage functions that satisfy dissipativity and are separated by a positive definite function. This new condition can be immediately related to optimal control by means of value functions and may be easier to verify than strict dissipativity. Furthermore, we prove that two-storage strict dissipativity is necessary and sufficient for asymptotic stability, and we relate it both to strict dissipativity and to alternative approaches relying on the so-called cost-to-travel. Finally, we discuss commonly used and new terminal cost designs that guarantee asymptotic stability in the finite-horizon case.
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full-resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
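The vocabulary-scaling argument can be made concrete with a toy byte-level tokenizer. This is a generic sketch of the idea, not Trilobyte's actual schema: each b-bit sample is split into b/8 byte tokens, so the vocabulary stays at 256 symbols regardless of bit depth, and the mapping is losslessly invertible:

```python
def tokenize_bytes(samples, bit_depth):
    """Split each integer sample into big-endian bytes.
    Vocabulary is always 256 tokens (O(1)), versus 2**bit_depth
    for sample-level tokenization."""
    n_bytes = bit_depth // 8
    tokens = []
    for s in samples:
        for shift in range((n_bytes - 1) * 8, -1, -8):
            tokens.append((s >> shift) & 0xFF)
    return tokens

def detokenize_bytes(tokens, bit_depth):
    """Invert tokenize_bytes exactly (lossless round trip)."""
    n_bytes = bit_depth // 8
    samples = []
    for i in range(0, len(tokens), n_bytes):
        s = 0
        for t in tokens[i:i + n_bytes]:
            s = (s << 8) | t
        samples.append(s)
    return samples
```

The trade-off is sequence length: a 24-bit stream becomes three times as many tokens, which the LM must model instead of a larger vocabulary.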
A single-particle cryo-electron microscopy (cryo-EM) measurement, called a micrograph, consists of multiple two-dimensional tomographic projections of a three-dimensional (3-D) molecular structure at unknown locations, taken under unknown viewing directions. All existing cryo-EM algorithmic pipelines first locate and extract the projection images, and then reconstruct the structure from the extracted images. However, if the molecular structure is small, the signal-to-noise ratio (SNR) of the data is very low, making it challenging to accurately detect projection images within the micrograph. Consequently, all standard techniques fail in low-SNR regimes. To recover molecular structures from measurements of low SNR, and in particular small molecular structures, we devise an approximate expectation-maximization algorithm to estimate the 3-D structure directly from the micrograph, bypassing the need to locate the projection images. We corroborate our computational scheme with numerical experiments and present successful structure recoveries from simulated noisy measurements.
This paper presents a novel technique for building squint-free massive phased arrays. This is accomplished by explicitly implementing a spatial IDFT that cancels the DFT imposed by the array's nature, which causes beam squint. In addition, the paper analyzes the beam-squint issue, which arises from two mechanisms: coherent-bandwidth limitations and the systematic delay spread across the array. These mechanisms reduce the signal-to-noise ratio and cause inter-symbol interference. This work also highlights the importance of OFDM modulation for enhancing signal quality by mitigating the self-interference issue. A numerical solver is used to simulate and verify the squint-free IDFT implementation and to estimate the signal-quality limitations in massive arrays.
The paper studies the robustness properties of discrete-time stochastic optimal control under Wasserstein model approximation for both discounted-cost and average-cost criteria. Specifically, we study the performance loss when applying an optimal policy designed for an approximate model to the true dynamics compared with the optimal cost for the true model under the sup-norm-induced metric, and relate it to the Wasserstein-1 distance between the approximate and true transition kernels. A primary motivation of this analysis is empirical model learning, as well as empirical noise distribution learning, where Wasserstein convergence holds under mild conditions but stronger convergence criteria, such as total variation, may not. We discuss applications of the results to the disturbance estimation problem, where sample complexity bounds are given, and also to a general empirical model learning approach, obtained under either Markov or i.i.d. learning settings.
Power system resilience is vital to modern society, as outages caused by extreme weather can severely disrupt communities. Existing statistical and simulation-based methods for resilience quantification are either retrospective or rely on simplified physical models, limiting their applicability. This paper proposes a deep learning-based framework that integrates historical outage and weather data to predict event-level resilience, measured using the resilience trapezoid method. The trained model is then applied to a benchmark weather dataset to estimate regional resilience, with optional socioeconomic and demographic factors incorporated as weighting terms when policymakers wish to emphasize the needs of specific population groups. The effectiveness of the framework is first validated on simulated outage records, showing strong agreement between predicted and simulated resilience values. It is then applied to real historical outage data to assess the resilience of actual power systems. Beyond evaluation, the results can guide targeted investments in distributed energy resources to improve resilience in vulnerable regions.
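The resilience trapezoid metric mentioned above reduces, in its simplest form, to the area between the nominal performance level and the degraded performance curve over the event timeline. A minimal sketch (the normalized performance curve and unit nominal level are illustrative assumptions, not the paper's exact definition):

```python
def resilience_trapezoid(t, performance, nominal=1.0):
    """Area between the nominal level and the performance curve over the event,
    via trapezoidal integration. A larger area means a worse (less resilient)
    response: deeper degradation and/or slower restoration."""
    loss = [max(nominal - p, 0.0) for p in performance]
    area = 0.0
    for i in range(1, len(t)):
        area += 0.5 * (loss[i - 1] + loss[i]) * (t[i] - t[i - 1])
    return area
```

The three phases of the trapezoid (degradation, degraded plateau, restoration) all contribute to the area, so the metric rewards both limiting the outage depth and restoring service quickly.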
Diffusion models have shown great promise in solving inverse problems in image processing. In this paper, we propose a novel, problem-agnostic diffusion-based approach for inverse problems: the maximum a posteriori (MAP)-based guided term estimation method. To leverage unconditionally pretrained diffusion models for conditional generation tasks, we divide the conditional score function into two terms according to Bayes' rule: an unconditional score function (approximated by a pretrained score network) and a guided term, which is estimated using a novel MAP-based method that incorporates a Gaussian-type prior on natural images. This innovation allows us to better capture the intrinsic properties of the data, leading to improved performance. Numerical results demonstrate that our method preserves content more effectively than state-of-the-art methods -- for example, maintaining the structure of glasses in super-resolution tasks and producing more coherent results in the neighborhood of masked regions during inpainting.
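The Bayes-rule decomposition described above can be written explicitly. In standard diffusion-posterior notation (a generic formulation, not necessarily the paper's exact parameterization), the conditional score at noise level $t$ splits as

```latex
\nabla_{x_t}\log p_t(x_t \mid y)
  = \underbrace{\nabla_{x_t}\log p_t(x_t)}_{\approx\, s_\theta(x_t,\, t)}
  + \underbrace{\nabla_{x_t}\log p_t(y \mid x_t)}_{\text{guided term}},
```

where $s_\theta$ is the pretrained unconditional score network and the second term is what the MAP-based estimation targets, e.g. under a Gaussian observation model $y = \mathcal{A}(x_0) + n$ with $n \sim \mathcal{N}(0, \sigma^2 I)$.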
Alzheimer's disease (AD) is a major neurodegenerative condition that affects millions around the world. As one of the main biomarkers in the AD diagnosis procedure, brain amyloid positivity is typically identified by positron emission tomography (PET), which is costly and invasive. Brain structural magnetic resonance imaging (sMRI) may provide a safer and more convenient alternative for AD diagnosis. Recent advances in geometric deep learning have facilitated sMRI analysis and early diagnosis of AD. However, determining AD pathology, such as brain amyloid deposition, in the preclinical stage remains challenging, as the observable morphological changes are subtle. As a result, few AD classification models generalize to the brain amyloid positivity classification task. Blood-based biomarkers (BBBMs), on the other hand, have recently achieved remarkable success in predicting brain amyloid positivity and identifying individuals at high risk of being brain amyloid positive. However, individuals in the medium-risk group still require gold-standard tests such as amyloid PET for further evaluation. Inspired by the recent success of transformer architectures, we propose a transformer-based geometric deep learning model that is both scalable and robust to variations in input volumetric mesh size. Our work introduces a novel tokenization scheme for tetrahedral meshes, incorporating anatomical landmarks generated by a pre-trained Gaussian process model. Our model achieves superior performance on the AD classification task. In addition, we show that the model also generalizes to brain amyloid positivity prediction for individuals in the medium-risk class, where BBBMs alone cannot achieve a clear classification. Our work may enrich geometric deep learning research and improve AD diagnosis accuracy without using expensive and invasive PET scans.
This paper proposes a coordinated routing approach that investigates the use of connected and automated vehicles (CAVs) in dedicated bus lanes. The aim is to improve bus schedule adherence while enhancing the travel efficiency of CAVs during the transitional phase of mixed traffic environments. Our approach utilizes real-time traffic data to dynamically reroute CAVs in anticipation of congestion. By continuously monitoring traffic conditions on dedicated lanes and tracking the real-time positions of buses, the system adjusts CAV routes in advance to avoid potential interference with operating buses. This cooperation reduces CAV travel times and minimizes delays that impact transit services. The proposed strategy is validated using microscopic traffic simulations in SUMO. The results demonstrate significant improvements in both transit on-time performance and CAV travel efficiency across a range of traffic conditions.
Cone-beam computed tomography (CBCT) is a critical 3D imaging technology in the medical field, while the high radiation exposure required for high-quality imaging raises significant concerns, particularly for vulnerable populations. Sparse-view reconstruction reduces radiation by using fewer X-ray projections while maintaining image quality, yet existing methods face challenges such as high computational demands and poor generalizability to different datasets. To overcome these limitations, we propose DeepSparse, the first foundation model for sparse-view CBCT reconstruction, featuring DiCE (Dual-Dimensional Cross-Scale Embedding), a novel network that integrates multi-view 2D features and multi-scale 3D features. Additionally, we introduce the HyViP (Hybrid View Sampling Pretraining) framework, which pretrains the model on large datasets with both sparse-view and dense-view projections, and a two-step finetuning strategy to adapt and refine the model for new datasets. Extensive experiments and ablation studies demonstrate that our proposed DeepSparse achieves superior reconstruction quality compared to state-of-the-art methods, paving the way for safer and more efficient CBCT imaging.
This study further explores the reformulation of power flow (PF) analysis as a discrete combinatorial optimization problem, an approach proposed in our earlier study introducing the Adiabatic Quantum Power Flow (AQPF) algorithm, which can be executed on Ising machines, including quantum and quantum-inspired hardware. This approach provides a new representation of the underlying equations, analogous to how neural networks approximate complex functions using simple operations. While the resulting combinatorial optimization problem is NP-hard, it is compatible with emerging quantum hardware designed to address such complexity. We introduce the Adiabatic Quantum Optimal Power Flow (AQOPF) algorithm, which transforms the classical optimal power flow (OPF) equations into quadratic unconstrained binary optimization (QUBO) models. Furthermore, the AQPF and AQOPF algorithms are evaluated on standard test cases ranging from 4- to 1354-bus systems using D-Wave's Advantage\texttrademark\ system (QA), its hybrid quantum-classical solver (HA), and Fujitsu's third-generation Digital Annealer (DAv3) and Quantum-Inspired Integrated Optimization (QIIO) platform. Both full and partitioned formulations are investigated, with particular attention to scalability and robustness in ill-conditioned scenarios. The results demonstrate that the algorithms can reproduce feasible PF and OPF solutions and exhibit promising computational scalability when supported by scalable hardware.
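The QUBO construction rests on encoding each continuous variable with a fixed-point binary expansion and expanding a squared residual over those bits. The toy sketch below applies the pattern to a scalar residual $(a x - b)^2$, not the actual PF/OPF equations; the bit width and variable bounds are illustrative:

```python
import itertools
import numpy as np

def real_to_bits_encoding(n_bits, lo, hi):
    """Fixed-point encoding x = lo + sum_k c_k q_k with q_k in {0, 1}."""
    step = (hi - lo) / (2 ** n_bits - 1)
    return np.array([step * 2 ** k for k in range(n_bits)])

def quadratic_residual_qubo(a, b, coeffs, lo):
    """QUBO matrix for minimizing (a*x - b)^2 with x = lo + coeffs . q.
    Uses q_k^2 = q_k to fold linear terms into the diagonal; the
    constant term is dropped since it does not affect the argmin."""
    r = a * lo - b
    Q = np.outer(a * coeffs, a * coeffs)
    Q[np.diag_indices_from(Q)] += 2 * r * a * coeffs
    return Q

def brute_force_qubo(Q):
    """Exhaustive ground-state search (stand-in for an annealer on toy sizes)."""
    n = Q.shape[0]
    best_q, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        q = np.array(bits)
        e = q @ Q @ q
        if e < best_e:
            best_q, best_e = q, e
    return best_q
```

On hardware, the brute-force search is replaced by quantum or digital annealing; the encoding step determines the achievable numerical resolution of the recovered solution.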
Extremely Large-Scale (XL) multiple-input multiple-output (MIMO) antenna systems combined with ultra-wide signal bandwidth (BW) offer the potential for ultra-high-resolution sensing in frequency-modulated continuous-wave (FMCW) radars. However, the use of ultra-wide BW results in significant spatial delays across the array aperture, comparable to the range resolution, leading to the spatial wideband effect (SWE). SWE introduces coupling between the range and angle domains, rendering conventional narrowband signal processing techniques ineffective for target signature estimation. In this paper, we propose an efficient super-resolution signature estimation technique for XL-MIMO FMCW radars operating under SWE, leveraging compressive sensing (CS) methods. The proposed 2D CS-based approach offers low computational complexity, making it highly suitable for real-time applications in large-scale radar systems. Numerical simulation results validate the superior performance of the proposed method compared to existing wideband and narrowband estimation techniques.
The overexpression of the human epidermal growth factor receptor 2 (HER2) in breast cells is a key driver of HER2-positive breast cancer, a highly aggressive subtype requiring precise diagnosis and targeted therapy. Immunohistochemistry (IHC) is the standard technique for HER2 assessment but is costly, labor-intensive, and highly dependent on antibody selection. In contrast, hematoxylin and eosin (H&E) staining, a routine histopathological procedure, offers broader accessibility but lacks HER2 specificity. This study proposes an advanced deep learning-based image translation framework to generate high-fidelity IHC images from H&E-stained tissue samples, enabling cost-effective and scalable HER2 assessment. By modifying the loss function of pyramid pix2pix, we mitigate mode collapse, a fundamental limitation in generative adversarial networks (GANs), and introduce a novel variance-based penalty that enforces structural diversity in generated images. Our model particularly excels in translating HER2-positive (IHC 3+) images, which have remained challenging for existing methods. Quantitative evaluations on the overall BCI dataset reveal that our approach outperforms baseline models, achieving a peak signal-to-noise ratio (PSNR) of 22.16, a structural similarity index (SSIM) of 0.47, and a Fréchet Inception Distance (FID) of 346.37. In comparison, the pyramid pix2pix baseline attained PSNR 21.15, SSIM 0.43, and FID 516.75, while the standard pix2pix model yielded PSNR 20.74, SSIM 0.44, and FID 472.6. These results affirm the superior fidelity and realism of our generated IHC images. Beyond medical imaging, our model exhibits superior performance in general image-to-image translation tasks, showcasing its potential across multiple domains. This work marks a significant step toward AI-driven precision oncology, offering a reliable and efficient alternative to traditional HER2 diagnostics.
This paper presents a unified framework based on Davis-Wielandt (DW) shell for graphical stability analysis of multi-input and multi-output linear time-invariant feedback systems. Connections between DW shells and various graphical representations, as well as gain and phase measures, are established through an intuitive geometric perspective. Within this framework, we map the relationships and relative conservatism among various separation conditions. A rotated scaled relative graph ($\theta$-SRG) concept is proposed as a mixed gain-phase representation, from which a closed-loop stability criterion is derived and shown to be the least conservative among the existing 2-D graphical conditions for bi-component feedback loops. We also propose a reliable and generalizable algorithm for visualizing the $\theta$-SRGs and include a system example to demonstrate the reduced conservatism of the proposed condition.
We present a novel and general approach for easily computing the Cramér-Rao Lower Bounds (CRLBs) of the rigid body localization (RBL) problem using arbitrary types of information. To that end, we adopt an information-centric construction of the Fisher information matrix (FIM), which captures the contribution of each measurement to the FIM explicitly, both in terms of the input measurement types and of their error distributions. Taking advantage of this approach, we derive a generic framework for evaluating the CRLB that is applicable to arbitrary rigid body localization scenarios and that, unlike the FIM formulation commonly used in point-target localization, is better suited to RBL problems, as it explicitly captures the precision of both the translation vector and the rotation matrix (or, alternatively, the rotation angles) of the rigid body with respect to a reference. Examples of CRLBs obtained via the proposed approach are given in closed form, including a bound incorporating an orthonormality constraint on the rotation matrix; the additive construction enables a straightforward adjustment of the derived bound when new measurements are added or removed. Numerical results illustrate that the derived expression correctly lower-bounds the errors of the localization parameters estimated via various related state-of-the-art (SotA) estimators, revealing their accuracies and suggesting that SotA RBL algorithms can still be improved.
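For independent measurements with additive Gaussian errors (a standard setting; the paper's framework is more general), the per-measurement, additive construction of the FIM takes the familiar form

```latex
\mathbf{J}(\boldsymbol{\theta})
  = \sum_{i=1}^{N}
    \left(\frac{\partial \boldsymbol{\mu}_i}{\partial \boldsymbol{\theta}}\right)^{\!\top}
    \mathbf{R}_i^{-1}
    \left(\frac{\partial \boldsymbol{\mu}_i}{\partial \boldsymbol{\theta}}\right),
  \qquad
  \operatorname{CRLB}(\boldsymbol{\theta}) = \mathbf{J}(\boldsymbol{\theta})^{-1},
```

where $\boldsymbol{\mu}_i$ is the noiseless value of measurement $i$ and $\mathbf{R}_i$ its error covariance. Each summand is the explicit contribution of one measurement, which is what makes adding or removing measurements a matter of adding or dropping terms in the sum.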
Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we employ the TransUNet architecture, a hybrid framework that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net structure. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution. We trained the model on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset using a robust augmentation pipeline and a hybrid loss function to mitigate class imbalance. On the internal validation set, the model achieved a Dice Similarity Coefficient (F1-score) of 0.8886 using an optimized threshold of 0.4843. Crucially, to assess generalizability, we performed external validation on two independent datasets: the AZH Wound Care Center dataset (n=278) and the Medetec dataset (n=152). Without any retraining, the model achieved Dice scores of 0.6209 and 0.7850, respectively, demonstrating robust zero-shot transferability to unseen clinical domains. Furthermore, clinical utility analysis revealed a strong correlation (Pearson r = 0.9749) between predicted and ground-truth wound areas. These outcomes demonstrate that our approach effectively integrates global and local feature extraction, offering a reliable, effective, and explainable solution for automated foot ulcer assessment.
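A hybrid loss of the kind used here typically combines binary cross-entropy with a soft-Dice term, so that per-pixel accuracy and region overlap are optimized jointly under class imbalance. The sketch below is a generic NumPy illustration; the weighting `alpha` and the exact combination are assumptions, not the paper's specification:

```python
import numpy as np

def dice_bce_loss(pred, target, alpha=0.5, eps=1e-6):
    """Hybrid loss: alpha * BCE + (1 - alpha) * soft-Dice loss.
    pred: predicted probabilities in (0, 1); target: binary mask."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    dice = (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return alpha * bce + (1 - alpha) * (1 - dice)
```

The Dice term is computed over the whole mask rather than per pixel, so a small ulcer region still dominates its gradient even when background pixels vastly outnumber foreground ones.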
Learning physics-constrained inverse operators -- rather than post-processing physics-based reconstructions -- is a broadly applicable strategy for problems with expensive forward models. We demonstrate this principle in three-dimensional photoacoustic computed tomography (3D PACT), where current systems demand dense transducer arrays and prolonged scans, restricting clinical translation. We introduce PANO (PACT imaging neural operator), an end-to-end physics-aware neural operator -- a deep learning architecture that generalizes across input sampling densities without retraining -- that directly learns the inverse mapping from raw sensor measurements to a 3D volumetric image. Unlike two-step methods that reconstruct then denoise, PANO performs direct inversion in a single pass, jointly embedding physics and data priors. It employs spherical discrete-continuous convolutions to respect hemispherical sensor geometry and Helmholtz equation constraints to ensure physical consistency. PANO reconstructs high-quality images from both simulated and real data across diverse sparse acquisition settings, achieves real-time inference and outperforms the widely-used UBP algorithm by approximately 33 percentage points in cosine similarity on simulated data and 14 percentage points on real phantom data. These results establish a pathway toward more accessible 3D PACT systems for preclinical research, and motivate future in-vivo validation for clinical translation.
Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks, and the resulting models require substantial computational overhead. To address this, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and code are available at this https URL
Infrastructure-mounted sensors can capture rich environmental information to enhance communications and facilitate beamforming in millimeter-wave systems. This work presents an efficient sensing-assisted long-term beam tracking framework that selects optimal beams from a codebook for the current and multiple future time slots. We first design a large attention-enhanced neural network (NN) to fully exploit past visual observations for beam tracking. A convolutional NN extracts compact image features, while gated recurrent units with attention capture the temporal dependencies within sequences. The large NN then acts as the teacher to guide the training of a lightweight student NN via knowledge distillation. The student requires shorter input sequences yet preserves long-term beam prediction ability. Numerical results demonstrate that the teacher achieves Top-5 accuracies exceeding 93% for the current and six future time slots, approaching state-of-the-art performance with a 90% reduction in model parameters. The student closely matches the teacher's performance while reducing the number of model parameters by a factor of more than 16 and the computational complexity by a factor of more than 4.5, despite operating with 60% shorter input sequences. This improvement significantly enhances data efficiency, reduces latency, and lowers power consumption in sensing and processing.
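The teacher-to-student distillation step can be sketched as a standard soft-target loss; the temperature, weighting, and all names below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha-weighted mix of soft KL(teacher || student) at temperature T
    and hard cross-entropy against the true beam labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * (T * T) * kl + (1 - alpha) * hard))
```

The T*T factor keeps the soft-target gradient magnitude comparable across temperatures, a common convention in distillation.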
This paper proposes DroFiT (Drone Frequency lightweight Transformer for speech enhancement), a single-microphone speech enhancement network for severe drone self-noise. DroFiT integrates a frequency-wise Transformer with a full/sub-band hybrid encoder-decoder and a TCN back-end for memory-efficient streaming. A learnable skip-and-gate fusion with a combined spectral-temporal loss further refines reconstruction. The model is trained on VoiceBank-DEMAND mixed with recorded drone noise (-5 to -25 dB SNR) and evaluated using standard speech enhancement metrics and computational efficiency measures. Experimental results show that DroFiT achieves competitive enhancement performance while significantly reducing computational and memory demands, paving the way for real-time processing on resource-constrained UAV platforms. Audio demo samples are available on our demo page.
Indoor localization systems with absolute millimeter-level positioning accuracy are currently expensive. They include laser trackers, total stations, and motion capture systems relying on multiple high-end cameras. In this work, we introduce a high-accuracy, planar indoor localization system named GroundGazer (GG) for autonomous mobile robots (AMRs). GG estimates the AMR's planar position with millimeter-level accuracy and its heading with sub-degree accuracy. The system requires only a monocular (fisheye) camera, a chessboard floor, and an optional laser diode. Owing to the chessboard floor, our system is simple and low-cost; it is also robust, scalable to multiple robots, and extendable to 3D position and orientation estimation.
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero-audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). Using AudioMCQ, we achieve first place in the DCASE 2025 Audio-Question-Answering challenge. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance.
Large-scale self-supervised Pre-Trained Models (PTMs) have shown significant improvements in the speaker verification (SV) task by providing rich feature representations. In this paper, we utilize w2v-BERT 2.0, a model with approximately 600 million parameters trained on 4.5 million hours of unlabeled data across 143 languages, for the SV task. The MFA structure with a Layer Adapter is employed to process the multi-layer feature outputs from the PTM and extract speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results with 0.12% and 0.55% EER on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge-distillation-guided structured pruning, reducing the model size by 80% while incurring only a 0.04% EER degradation. Source code and models are released at this https URL.
This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in the offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding strong performance in both online and offline settings.
Pinching-antenna systems have emerged as a novel and transformative flexible-antenna architecture for next-generation wireless networks. They offer unprecedented flexibility and spatial reconfigurability by enabling dynamic positioning and activation of radiating elements along a signal-guiding medium (e.g., dielectric waveguides), which is not possible with conventional fixed antenna systems. In this paper, we introduce the concept of generalized pinching antenna systems, which retain the core principle of creating localized radiation points on demand, but can be physically realized in a variety of settings. These include implementations based on dielectric waveguides, leaky coaxial cables, surface-wave guiding structures, and other types of media, employing different feeding methods and activation mechanisms (e.g., mechanical, electronic, or hybrid). Despite differences in their physical realizations, they all share the same inherent ability to form, reposition, or deactivate radiation sites as needed, enabling user-centric and dynamic coverage. We first describe the underlying physical mechanisms of representative generalized pinching-antenna realizations and their associated wireless channel models, highlighting their unique propagation and reconfigurability characteristics compared with conventional antennas. Then, we review several representative pinching-antenna system architectures, ranging from single- to multiple-waveguide configurations, and discuss advanced design strategies tailored to these flexible deployments. Furthermore, we examine their integration with emerging wireless technologies to enable synergistic, user-centric solutions. Finally, we identify key open research challenges and outline future directions, charting a pathway toward the practical deployment of generalized pinching antennas in next-generation wireless networks.
From organisms to machines, autonomous systems rely on measured sensory cues to estimate unknown information about themselves or their environment. For nonlinear systems, strategic sensor motion can be leveraged to extract otherwise inaccessible information. This principle, known as active sensing, is widespread in biology yet difficult to study, and remains underutilized in engineered systems due to the challenge of systematically designing active sensing motifs. Here, we introduce the method "BOUNDS: Bounding Observability for Uncertain Nonlinear Dynamic Systems" and the Python package pybounds, which can discover movement motifs that increase the information encoded in sensory cues. To exploit sporadic estimates from bouts of active sensing, we further introduce the Augmented Information Kalman Filter (AI-KF). The AI-KF uses insight from BOUNDS to dynamically fuse neural network and model-based estimation. We demonstrate BOUNDS and the AI-KF on a flying agent model and experimental GPS-denied data from a quadcopter, revealing how specific active movements improve estimates of ground speed, altitude, and wind direction. Altogether, our work will prove useful for designing sensor-minimal autonomous systems and investigating active sensing in living organisms.
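The mechanism of folding sporadic estimates into a recursive filter can be illustrated with a scalar toy; the AI-KF itself fuses neural-network and model-based estimates, whereas this sketch (all parameters assumed) only shows how intermittent auxiliary measurements enter sequential Kalman updates:

```python
import numpy as np

def fuse_kalman(z_seq, aux_seq, q=0.01, r_sensor=1.0, r_aux=0.25):
    """Scalar random-walk Kalman filter over a continuous sensor stream z_seq.
    aux_seq holds None except at epochs where a sporadic auxiliary estimate
    (e.g., from a bout of active sensing) arrives as an extra measurement."""
    x, p = 0.0, 10.0                  # initial state estimate and variance
    estimates = []
    for z, aux in zip(z_seq, aux_seq):
        p += q                        # predict (random-walk process noise)
        updates = [(z, r_sensor)]
        if aux is not None:
            updates.append((aux, r_aux))
        for meas, r in updates:       # sequential measurement updates
            k = p / (p + r)           # Kalman gain
            x += k * (meas - x)
            p *= 1.0 - k
        estimates.append(x)
    return estimates
```

A lower r_aux than r_sensor encodes the assumption that active-sensing bouts yield higher-confidence estimates.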
Spectral super-resolution (SSR) aims to reconstruct hyperspectral images (HSIs) from multispectral observations, with broad applications in computer vision and remote sensing. Deep learning-based methods have been widely used, but they often treat spectra as discrete vectors learned from data, rather than continuous curves constrained by physical principles, leading to unrealistic predictions and limited applicability. To address this challenge, we propose the Radiative-Structured Neural Operator (RSNO), which learns a continuous mapping for spectral super-resolution while enforcing physical consistency under the radiative prior. The proposed RSNO consists of three stages: upsampling, reconstruction, and refinement. In the upsampling stage, we leverage prior information to expand the input multispectral image, producing a physically plausible hyperspectral estimate. Subsequently, we adopt a neural operator backbone in the reconstruction stage to learn a continuous mapping across the spectral domain. Finally, the refinement stage imposes a hard constraint on the output HSI to eliminate color distortion. The upsampling and refinement stages are implemented via the proposed angular-consistent projection (ACP), which is derived from a non-convex optimization problem. Moreover, we theoretically demonstrate the optimality of ACP via null-space decomposition. Extensive experiments validate the effectiveness of the proposed approach across conventional spectral super-resolution, continuous spectral reconstruction, and infrared extrapolation.
This paper presents a framework for target detection and downlink data transmission in a repeater-assisted bi-static integrated sensing and communication system. A repeater is an active scatterer that retransmits incoming signals with a complex gain almost instantaneously, thereby enhancing sensing performance by amplifying the echoes reflected by the targets. The same mechanism can also improve downlink communication by mitigating coverage holes. However, the repeater introduces noise and increases interference at the sensing receiver, while also amplifying the interference from target detection signals at the downlink users. The proposed framework accounts for these sensing-communication trade-offs and demonstrates the potential benefits achievable through a carefully designed precoder at the transmitting base station. In particular, our finding is that a higher value of probability of detection can be attained with considerably lower target radar-cross-section variance by deploying repeaters in the target hot-spot areas.
Roll-to-roll (R2R) manufacturing requires precise tension and velocity control to ensure product quality, yet controller commissioning and adaptation remain time-intensive processes dependent on expert knowledge. This paper presents an LLM-assisted multi-agent framework that automates control system design and adaptation for R2R systems while maintaining safety. The framework operates through five phases: system identification from operational data, automated controller selection and tuning, sim-to-real adaptation with safety verification, continuous monitoring with diagnostic capabilities, and periodic model refinement. Experimental validation on an R2R system demonstrates successful tension regulation and velocity tracking under significant model uncertainty, with the framework achieving performance convergence through iterative adaptation. The approach reduces manual tuning effort while providing transparent diagnostic information for maintenance planning, offering a practical pathway for integrating AI-assisted automation in manufacturing control systems.
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio's unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties when involving empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning enables us to obtain few-step (e.g., 1/2/4 steps) generators that produce high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving highly favorable quality-efficiency trade-offs compared to existing state-of-the-art GAN-based and Flow Matching-based methods. Online demo samples are available at this https URL, and the source code is released at this https URL.
NashOpt is an open-source Python library for computing and designing generalized Nash equilibria (GNEs) in noncooperative games with shared constraints and real-valued decision variables. The library exploits the joint Karush-Kuhn-Tucker (KKT) conditions of all players to handle both general nonlinear GNEs and linear-quadratic games, including their variational versions. Nonlinear games are solved via nonlinear least-squares formulations, relying on JAX for automatic differentiation. Linear-quadratic GNEs are reformulated as mixed-integer linear programs, enabling efficient computation of multiple equilibria. The framework also supports inverse-game and Stackelberg game-design problems. The capabilities of NashOpt are demonstrated through several examples, including noncooperative game-theoretic control problems of linear quadratic regulation and model predictive control. The library is available at this https URL
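The joint-KKT idea underlying the library can be illustrated on a tiny unconstrained two-player linear-quadratic game; this sketch is not NashOpt's API, and all names and coefficients are assumptions:

```python
import numpy as np

# Player i minimizes J_i(x) = 0.5*a[i]*x[i]**2 + b[i]*x[i]*x[1-i] + c[i]*x[i]
# over its own variable x[i].  Stacking each player's stationarity condition
#   dJ_i/dx_i = a[i]*x[i] + b[i]*x[1-i] + c[i] = 0
# gives a joint linear KKT system whose solution is the Nash equilibrium.
def two_player_lq_ne(a, b, c):
    A = np.array([[a[0], b[0]],
                  [b[1], a[1]]], dtype=float)
    return np.linalg.solve(A, -np.asarray(c, dtype=float))
```

With shared constraints and multipliers added, the same stacked-KKT view leads to the mixed-integer reformulations the library uses for linear-quadratic GNEs.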
Effective data center cooling is crucial for reliable operation; however, cooling systems often exhibit inefficiencies that result in excessive energy consumption. This paper presents a three-stage, physics-guided machine learning framework for identifying and reducing cooling energy waste in high-performance computing facilities. Using one year of 10-minute resolution operational data from the Frontier exascale supercomputer, we first train a monotonicity-constrained gradient boosting surrogate that predicts facility accessory power from coolant flow rates, temperatures, and server power. The surrogate achieves a mean absolute error of 0.026 MW and predicts power usage effectiveness within 0.01 of measured values for 98.7% of test samples. In the second stage, the surrogate serves as a physics-consistent baseline to quantify excess cooling energy, revealing approximately 85 MWh of annual inefficiency concentrated in specific months, hours, and operating regimes. The third stage evaluates guardrail-constrained counterfactual adjustments to supply temperature and subloop flows, demonstrating that up to 96% of identified excess can be recovered through small, safe setpoint changes while respecting thermal limits and operational constraints. The framework yields interpretable recommendations, supports counterfactual analyses such as flow reduction during low-load periods and redistribution of thermal duty across cooling loops, and provides a practical pathway toward quantifiable reductions in accessory power. The developed framework is readily compatible with model predictive control and provides a template that, with site-specific recalibration, could be adapted to other liquid-cooled data centers with different configurations and cooling requirements.
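The second-stage excess-energy accounting amounts to integrating positive residuals of measured accessory power above the surrogate baseline; a minimal sketch, with illustrative names and a simplified 10-minute-to-hours conversion (not the paper's implementation):

```python
import numpy as np

def excess_cooling_energy(actual_kw, baseline_kw, dt_hours=10 / 60):
    """Excess energy (kWh) of measured accessory power above a physics-
    consistent baseline sampled every 10 minutes; epochs where actual power
    falls at or below the baseline contribute nothing to the total."""
    residual_kw = (np.asarray(actual_kw, dtype=float)
                   - np.asarray(baseline_kw, dtype=float))
    return float(np.clip(residual_kw, 0.0, None).sum() * dt_hours)
```

Grouping these residuals by month, hour, or operating regime yields the concentration analysis described above.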
Learning-enabled control systems must maintain safety when system dynamics and sensing conditions change abruptly. Although stochastic latent-state models enable uncertainty-aware control, most existing approaches rely on fixed internal representations and can degrade significantly under distributional shift. This letter proposes a cognitive-flexible control framework in which latent belief representations adapt online, while the control law remains explicit and safety-certified. We introduce a Cognitive-Flexible Deep Stochastic State-Space Model (CF-DeepSSSM) that reorganizes latent representations subject to a bounded Cognitive Flexibility Index (CFI), and embeds the adapted model within a Bayesian model predictive control (MPC) scheme. We establish guarantees on bounded posterior drift, recursive feasibility, and closed-loop stability. Simulation results under abrupt changes in system dynamics and observations demonstrate safe representation adaptation with rapid performance recovery, highlighting the benefits of learning-enabled, rather than learning-based, control for nonstationary cyber-physical systems.
High-resolution radar sensors are critical for autonomous systems but pose significant challenges to traditional tracking algorithms due to the generation of multiple measurements per object and the presence of multipath effects. Existing solutions often rely on the point target assumption or treat multipath measurements as clutter, whereas current extended target trackers often lack the capability to maintain trajectory continuity in complex multipath environments. To address these limitations, this paper proposes the multipath extended target generalized labeled multi-Bernoulli (MPET-GLMB) filter. A unified Bayesian framework based on labeled random finite set theory is derived to jointly model target existence, measurement partitioning, and the association between measurements, targets, and propagation paths. This formulation enables simultaneous trajectory estimation for both targets and reflectors without requiring heuristic post-processing. To enhance computational efficiency, a joint prediction and update implementation based on Gibbs sampling is developed. Furthermore, a measurement-driven adaptive birth model is introduced to initialize tracks without prior knowledge of target positions. Experimental results from simulated scenarios and real-world automotive radar data demonstrate that the proposed filter outperforms state-of-the-art methods, achieving superior state estimation accuracy and robust trajectory maintenance in dynamic multipath environments.
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
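The retrieve-then-answer flow over an SQL event store can be sketched with Python's built-in sqlite3; the schema, labels, and query below are illustrative assumptions, not the paper's:

```python
import sqlite3

# Timestamped acoustic-event detections stored as structured records;
# a counting query over a time window mirrors the constrained-evidence
# retrieval that grounds the LLM's answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (t_start REAL, t_end REAL, label TEXT, conf REAL)")
rows = [(12.0, 14.5, "dog_bark", 0.91),
        (3600.5, 3602.0, "siren", 0.88),
        (7210.0, 7211.2, "dog_bark", 0.77)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

def count_events(label, t0, t1, min_conf=0.5):
    """How many events of a class fall entirely inside [t0, t1] seconds."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE label = ? AND t_start >= ? AND t_end <= ? AND conf >= ?",
        (label, t0, t1, min_conf))
    return cur.fetchone()[0]
```

At inference time, the resolved time reference and classified intent would parameterize queries like this one, and only the returned rows reach the language model.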
This paper develops a new control framework for linear parameter-varying (LPV) systems with time-varying state delays by integrating parameter-dependent Lyapunov functions with integral quadratic constraints (IQCs). A novel delay-dependent state-feedback controller structure is proposed, consisting of a linear state-feedback law augmented with an additional term that captures the delay-dependent dynamics of the plant. Closed-loop stability and $\mathcal{L}_2$-gain performance are analyzed using dynamic IQCs and parameter-dependent quadratic Lyapunov functions, leading to convex synthesis conditions that guarantee performance in terms of parameter-dependent linear matrix inequalities (LMIs). Unlike traditional delay control approaches, the proposed IQC-based framework provides a flexible and systematic methodology for handling delay effects, enabling enhanced control capability, reduced conservatism, and improved closed-loop performance.
Active noise control (ANC) is an effective approach to noise suppression, and the filtered-reference least mean square (FxLMS) algorithm is a widely adopted method in ANC systems, owing to its computational efficiency and stable performance. However, its convergence speed and noise reduction performance are highly dependent on the step-size parameter. Common step-size algorithms, such as normalized and variable step-size variants, require additional computational resources and exhibit limited adaptability under varying environmental conditions. To address this challenge, a novel Monte Carlo gradient meta-learning (MCGM) approach is proposed herein to determine an appropriate step size, into which a forgetting factor is incorporated to mitigate the impact of the initial zero effect. Compared to other algorithms, the proposed method imposes no additional computational burden on FxLMS operations. Numerical simulations involving real-world acoustic paths and noise signals further confirm its effectiveness and robustness.
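For context, the baseline recursion that the step size mu enters can be sketched as a textbook single-channel FxLMS with a known secondary-path FIR (this is the standard algorithm, not the paper's MCGM step-size method; filter length and mu are assumed):

```python
import numpy as np

def fxlms(x, d, s, L=16, mu=0.01):
    """Textbook single-channel FxLMS with known secondary-path FIR s.
    Adapts the control filter w so the secondary-path output cancels the
    disturbance d; returns the residual error signal e."""
    w = np.zeros(L)
    xbuf = np.zeros(L)                # reference history for the control filter
    ybuf = np.zeros(len(s))           # control-output history through s
    xf = np.convolve(x, s)[:len(x)]   # filtered reference x'
    xfbuf = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(len(x)):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x[n]
        y = w @ xbuf                  # anti-noise sample
        ybuf = np.roll(ybuf, 1)
        ybuf[0] = y
        e[n] = d[n] - s @ ybuf        # residual at the error microphone
        xfbuf = np.roll(xfbuf, 1)
        xfbuf[0] = xf[n]
        w += mu * e[n] * xfbuf        # LMS update on the filtered reference
    return e
```

Too small a mu slows convergence, too large a mu destabilizes the loop, which is exactly the trade-off the meta-learned step size targets.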
Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the zero-shot model. Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
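WER, the metric reported above, is the word-level Levenshtein distance divided by the reference length; a minimal sketch (the text normalization applied before scoring is omitted here):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))       # DP row for the empty reference prefix
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1,                          # deletion
                      d[j - 1] + 1,                      # insertion
                      prev + (r[i - 1] != h[j - 1]))     # substitution / match
            prev, d[j] = d[j], cur
    return d[-1] / max(len(r), 1)
```

Scoring at the lyric-line level, as in the dataset above, simply averages or pools these counts over segments.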
Data center cooling systems consume significant auxiliary energy, yet optimization studies rarely quantify the gap between theoretically optimal and operationally deployable control strategies. This paper develops a digital twin of the liquid cooling infrastructure at the Frontier exascale supercomputer, in which a high-temperature water system comprises three parallel subloops, each serving dedicated coolant distribution unit clusters through plate heat exchangers and variable-speed pumps. The surrogate model is built in Modelica and validated against one full calendar year of 10-minute operational data following ASHRAE Guideline 14. The model achieves a subloop coefficient of variation of the root mean square error below 2.7% and a normalized mean bias error within 2.5%. Using this validated surrogate model, a layered optimization framework evaluates three progressively constrained strategies: an analytical flow-only optimization achieves 20.4% total energy savings, unconstrained joint optimization of flow rate and supply temperature demonstrates 30.1% total energy savings, and ramp-constrained optimization of flow rate and supply temperature, enforcing actuator rate limits, reaches 27.8% total energy savings. The analysis reveals that the baseline system operates at 2.9 times the minimum thermally safe flow rate, and that co-optimizing supply temperature with flow rate nearly doubles the savings achievable by flow reduction alone.
This report presents the TCG CREST system description for Track 1 (Speaker Diarization) of the DISPLACE-M challenge, focusing on naturalistic medical conversations in noisy rural-healthcare scenarios. Our study evaluates the impact of various voice activity detection (VAD) methods and advanced clustering algorithms on overall speaker diarization (SD) performance. We compare and analyze two SD frameworks: a modular pipeline utilizing SpeechBrain with ECAPA-TDNN embeddings, and a state-of-the-art (SOTA) hybrid end-to-end neural diarization system, Diarizen, built on top of a pre-trained WavLM. With these frameworks, we explore diverse clustering techniques, including agglomerative hierarchical clustering (AHC) and multiple novel variants of spectral clustering, such as SC-adapt, SC-PNA, and SC-MK. Experimental results demonstrate that the Diarizen system provides an approximate 39% relative improvement in the diarization error rate (DER) in the post-evaluation analysis of Phase I compared to the SpeechBrain baseline. Our best-performing submitted system, the Diarizen baseline with AHC and median filtering with a larger context window of 29, achieved a DER of 10.37% on the development set and 9.21% on the evaluation set. Our team ranked fifth out of the 11 participating teams after the Phase I evaluation.
This paper studies the safe and resilient control of Connected and Automated Vehicles (CAVs) operating in mixed traffic environments where they must interact with Human-Driven Vehicles (HDVs) under uncertain dynamics and exponentially unbounded false data injection (EU-FDI) attacks. These attacks pose serious threats to safety-critical applications. While resilient control strategies can mitigate adversarial effects, they often overlook collision avoidance requirements. Conversely, safety-critical approaches tend to assume nominal operating conditions and lack resilience to adversarial inputs. To address these challenges, we propose an event-driven safe and resilient (EDSR) control framework that integrates event-driven Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) with adaptive attack-resilient control. The framework further incorporates data-driven estimation of HDV behaviors to ensure safety and resilience against EU-FDI attacks. Specifically, we focus on the lane-changing maneuver of CAVs in the presence of unpredictable HDVs and EU-FDI attacks on acceleration inputs. The event-driven approach reduces computational load while maintaining real-time safety guarantees. Simulation results, including comparisons with conventional safety-critical control methods that lack resilience, validate the effectiveness and robustness of the proposed EDSR framework in achieving collision-free maneuvers, stable velocity regulation, and resilient operation under adversarial conditions.
Radio Frequency Interference (RFI) is a growing concern for Global Navigation Satellite System (GNSS) reliability. The Cyclone GNSS (CYGNSS) constellation, designed for ocean wind retrieval via GNSS reflectometry (GNSS-R), provides Delay-Doppler Maps (DDMs) with noise floor metrics exploitable for spaceborne RFI detection. This study proposes a maximum-based DDM noise floor strategy that selects the highest noise floor value among four simultaneous GNSS reflections at each 0.5-second epoch, rather than their mean, preventing dilution of anomalous signals by unaffected channels. To suppress false alarms, a two-tier verification framework is introduced: (1) multi-satellite concurrent detection, confirming RFI when two or more CYGNSS satellites independently flag the same geographic region, and (2) temporal persistence verification, confirming a single-satellite detection only if threshold exceedance persists over a 10-second window. The physical basis for this criterion is established through slant-range geometry analysis between a ground-based jammer and the orbiting satellite. Performance is evaluated using CYGNSS Level 1 data from May 2025 in two regions: White Sands Missile Range, where NOTAM-announced GPS jamming tests were conducted, and the Middle East, where persistent RFI has been documented. The proposed method is compared against NASA's kurtosis-based RFI flags and a mean-based noise floor method. Results show that it detected RFI on three dates where the other methods produced negligible detections, and flagged 62% of total epochs in the Middle East compared to 46% (mean-based) and 33% (kurtosis-based). It also demonstrated capability to detect the early onset of gradually intensifying interference and atypical abnormal patterns not previously reported, highlighting the potential of maximum-based DDM noise floor analysis for sensitive and reliable spaceborne RFI detection.
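The maximum-based flagging with temporal-persistence verification can be sketched as follows; the threshold value, array layout, and epoch counts below are illustrative assumptions (20 consecutive 0.5-s epochs corresponds to the 10-second window):

```python
import numpy as np

def rfi_flags(noise_floors, threshold, persist_epochs=20):
    """noise_floors: (T, 4) array of per-epoch noise floors for the four
    simultaneous DDM reflections.  An epoch is flagged only when the maximum
    over channels exceeds the threshold AND the exceedance persists for
    persist_epochs consecutive 0.5-s epochs (the 10-s persistence check)."""
    exceed = noise_floors.max(axis=1) > threshold   # max, not mean, per epoch
    flags = np.zeros(len(exceed), dtype=bool)
    run = 0
    for t, e in enumerate(exceed):
        run = run + 1 if e else 0
        if run >= persist_epochs:                   # confirm the whole run
            flags[t - persist_epochs + 1: t + 1] = True
    return flags
```

Taking the maximum rather than the mean keeps one affected channel from being diluted by three unaffected ones, which is the core of the detection strategy; the multi-satellite concurrence tier would then cross-check flags across spacecraft.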
We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a cost-driven approach, where a dynamic model in some latent state space is learned by predicting the costs without predicting the observations or actions. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model, for finite-horizon time-varying LQG control problems. To the best of our knowledge, despite various empirical successes, finite-sample guarantees of such a cost-driven approach remain elusive. Our result underscores the value of predicting multi-step costs, an idea that is key to our theory and notably also known to be empirically valuable for learning state representations. A second part of this work, to appear as Part II, addresses the infinite-horizon linear time-invariant setting; it also extends the results to an approach that implicitly learns the latent dynamics, inspired by the recent empirical breakthrough of MuZero in model-based reinforcement learning.
Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in artificial intelligence (AI) agent domains such as robotics. From the computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed, process-based understanding for specifying the corresponding computational models of representations, mechanisms, and processes. For decision-making and learning in multi-agent/robot systems (MAS/MRS) in particular, a suitable cognitive model can guide agents in choosing reasonable strategies to achieve their current needs and in learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as studies on motivations as artificial value systems and utility-theory-based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support robotic applications from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature on value systems in relevant fields and propose several promising research directions, along with open problems that we deem necessary for further investigation.
We consider the problem of operating a battery in a grid-connected home to minimize electricity cost, which includes an energy charge and a tiered peak power charge based on the average of the $N$ largest daily peak powers over a month. With perfect foresight of loads and prices, the minimum cost can be found by solving a mixed-integer linear program (MILP), which provides a lower bound on achievable cost. We propose a model predictive control (MPC) policy that uses simple forecasts of prices and loads and solves a small MILP at each time step. Numerical experiments on data from a home in Trondheim, Norway, show that the MPC policy achieves a cost within $1.7\%$ of the prescient bound.
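The cost structure being minimized can be made concrete. A sketch of the monthly bill, assuming for illustration a single peak rate applied to the average of the $N$ largest daily peaks; the rates, $N$, and function names are placeholders, not the Trondheim tariff:

```python
import numpy as np

def monthly_cost(power, prices, steps_per_day, peak_rate, n_peaks=3):
    """Electricity cost sketch: per-step energy charge plus a peak
    charge on the average of the n_peaks largest daily peak powers.
    `power` is the grid draw per step (kW), `prices` the per-step
    energy price; values here are illustrative."""
    power = np.asarray(power, dtype=float)
    energy_charge = float(np.sum(power * np.asarray(prices, dtype=float)))
    daily_peaks = power.reshape(-1, steps_per_day).max(axis=1)
    top_n = np.sort(daily_peaks)[-n_peaks:]
    return energy_charge + peak_rate * float(top_n.mean())
```

Because the peak term depends on the $N$ largest of the daily maxima, selecting which days count requires binary variables, which is why the exact-foresight benchmark is a MILP rather than a plain LP.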
5G New Radio (NR) Sidelink (SL) Mode 2 has enabled decentralized, infrastructure-less direct communications, which is evolving to serve reliability-critical services in 6G SL. In particular, channel access in NR SL Mode 2 relies on Sensing-based Semi-Persistent Scheduling (SPS), whose key features significantly influence the packet reception ratio (PRR). While SPS has been widely studied, existing analytical models typically abstract or omit several NR-specific SPS features standardized by the 3rd Generation Partnership Project (3GPP), limiting their ability to explain how SPS parameters shape MAC collision dynamics and PRR. This paper develops an analytical MAC-layer PRR model for broadcast NR SL Mode 2 by explicitly modeling SPS-driven MAC collision events. The model captures (i) collisions caused by simultaneous resource reselection and (ii) persistent collisions induced by resource keeping across resource reservation intervals (RRIs). Based on this event-level characterization, we derive closed-form expressions for the steady-state MAC collision probability and PRR. We further extend the analysis to incorporate under-explored SPS features, including duplicate transmissions per RRI and the minimum resource-availability requirement for reselection, and quantify their impact on PRR in under-saturated regimes. The analytical results are validated using ns-3 simulations based on the 5G-LENA framework, showing close agreement below saturation and revealing deviations as the system approaches saturation. The proposed model provides mechanistic insight and design guidance for tuning SPS parameters to improve 6G SL reliability.
The computational complexity of deep learning algorithms has given rise to significant speed and memory challenges for the execution hardware. In energy-limited portable devices, highly efficient processing platforms are indispensable for reproducing the prowess afforded by much bulkier processing platforms. In this work, we present a low-power Leaky Integrate-and-Fire (LIF) neuron design fabricated in TSMC's 28 nm CMOS technology as a proof of concept toward an energy-efficient mixed-signal Neuromorphic System-on-Chip (NeuroSoC). The fabricated neuron consumes 1.61 fJ/spike and occupies an active area of 34 $\mu m^{2}$, achieving a maximum spiking frequency of 300 kHz at a 250 mV power supply. These measured characteristics are used in a software model to emulate the dynamics of a Spiking Neural Network (SNN). Employing supervised backpropagation with a surrogate gradient technique, the resulting accuracy on the MNIST dataset, using 4-bit post-training quantization, stands at 82.5\%. This approach underscores the potential of such ASIC implementations of quantized SNNs to deliver high-performance, energy-efficient solutions for various embedded machine-learning applications.
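The LIF dynamics emulated by the software model can be sketched in a few lines. This is a generic textbook LIF discretization, not the fabricated circuit's transfer function; the time constant, threshold, and resistance below are illustrative placeholders:

```python
def lif_step(v, i_in, dt=1e-6, tau=1e-3, r=1e6, v_th=0.25, v_reset=0.0):
    """One Euler step of a leaky integrate-and-fire neuron:
    tau * dv/dt = -v + R * i_in, with spike-and-reset at v_th.
    All parameters are illustrative, not the chip's values.
    Returns (new_v, spiked)."""
    v = v + (dt / tau) * (-v + r * i_in)
    if v >= v_th:
        return v_reset, True
    return v, False

def run_lif(current, n_steps, **kw):
    """Count spikes for a constant input current."""
    v, spikes = 0.0, 0
    for _ in range(n_steps):
        v, s = lif_step(v, current, **kw)
        spikes += s
    return spikes
```

With these placeholder values, a 1 µA input drives the membrane toward $R \cdot i = 1$ V, well above the 0.25 V threshold, so the neuron spikes periodically, while zero input never fires.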
Along with the prosperity of generative artificial intelligence (AI), its potential for solving conventional challenges in wireless communications has also surfaced. Inspired by this trend, we investigate the application of advanced diffusion models (DMs), a representative class of generative AI models, to high-dimensional wireless channel estimation. By capturing the structure of multiple-input multiple-output (MIMO) wireless channels via a deep generative prior encoded by DMs, we develop a novel posterior inference method for channel reconstruction. We further adapt the proposed method to recover channel information from low-resolution quantized measurements. Additionally, to enhance over-the-air viability, we integrate the DM with the unsupervised Stein's unbiased risk estimator to enable learning from noisy observations and circumvent the requirement for ground-truth channel data, which are hardly available in practice. Results reveal that the proposed estimator achieves high-fidelity channel recovery while reducing estimation latency by a factor of 10 compared to state-of-the-art schemes, facilitating real-time implementation. Moreover, our method outperforms existing estimators while halving the pilot overhead, showcasing its scalability to ultra-massive antenna arrays.
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword-spotting method to analyze both modalities, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key segment extraction from the raw laryngeal videos, MLVAS generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient's voice into audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds from the estimated glottal midline on the segmented glottis masks. To obtain better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modality in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of the proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable and objective metrics, as well as visualizations, for assisted clinical diagnosis.
Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from the control theory literature, such as potential-energy shaping, in combination with learned dynamics. We identify three fundamental shortcomings of existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently preserve the stability properties of the real system, and (iii) they lack an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all of these issues. More specifically, (i) we show analytically that CON is a Lagrangian system, i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide a formal proof of global Input-to-State stability using Lyapunov arguments. On the experimental side, we demonstrate that CON reaches state-of-the-art performance when learning complex nonlinear dynamics of mechanical systems directly from images; a methodological innovation contributing to this result is an approximated closed-form solution for integrating the network dynamics, which enables efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder trained to reconstruct the input from the encoded latent-space force. Finally, we show how these properties enable latent-space control: we use an integral-saturated PID controller with potential-force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
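A toy integration step conveys the flavor of such a model. The sketch below integrates coupled damped oscillators with a hyperbolic-tangent coupling term, loosely in the spirit of a coupled oscillator network; the specific equation form, matrices, and the plain semi-implicit Euler scheme (rather than the paper's approximated closed-form integrator) are illustrative assumptions:

```python
import numpy as np

def con_step(x, v, dt, K, D, W, b, u=None):
    """Semi-implicit Euler step of coupled damped oscillators of the
    form x'' = -K x - D x' - tanh(W x + b) + u.  A loose sketch only:
    the matrices are placeholders and the paper's closed-form
    integrator is not reproduced."""
    u = np.zeros_like(x) if u is None else u
    a = -K @ x - D @ v - np.tanh(W @ x + b) + u   # acceleration
    v = v + dt * a                                # update velocity first
    x = x + dt * v                                # then position
    return x, v
```

With positive-definite stiffness `K` and damping `D` and no forcing, trajectories decay toward the origin, which is the kind of built-in stability property the abstract argues latent models should inherit.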
Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamically adapting the target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present PiVOT, a new visual prompting mechanism for generic object tracking. PiVOT leverages the pretrained foundation model CLIP to automatically generate and refine visual prompts online, transferring contrastive knowledge from the foundation model to the tracker and enabling it to suppress distractors through contrastive guidance. Specifically, PiVOT employs a prompt initialization mechanism that produces an initial visual prompt highlighting potential target locations. The foundation model then refines the prompt based on appearance similarities between candidate objects and reference templates across potential targets. After refinement, the visual prompt better highlights potential target locations and carries less irrelevant information. With the proposed prompting mechanism, the tracker generates instance-aware feature maps guided by visual prompts that are incrementally and automatically updated during tracking, thereby effectively suppressing distractors. Extensive experiments across multiple benchmarks indicate that PiVOT, with the proposed prompting mechanism, suppresses distracting objects and improves tracking performance.
Existing gesture generation methods primarily produce upper-body gestures from audio features, neglecting speech content, emotion, and locomotion. These limitations result in stiff, mechanical gestures that fail to convey the true meaning of the audio content. We introduce ExpGest, a novel framework that leverages synchronized text and audio information to generate expressive full-body gestures. Unlike AdaIN- or one-hot-encoding-based methods, we design a noise emotion classifier that optimizes adversarial direction noise, avoiding melody distortion and guiding results toward specified emotions. Moreover, aligning semantics and gestures in the latent space provides better generalization capability. ExpGest, a diffusion-model-based gesture generation framework, is the first to offer mixed generation modes, including audio-driven gestures and text-shaped motion. Experiments show that our framework effectively learns from combined text-driven motion and audio-induced gesture datasets, and preliminary results demonstrate that ExpGest achieves more expressive, natural, and controllable global motion for speakers compared to state-of-the-art models.
Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
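The linear-Gaussian baseline against which the transformer is compared is the standard Kalman filter. A minimal predict/update cycle, written generically (the system matrices and shapes here are illustrative, not tied to the paper's experiments):

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One update/predict cycle of the Kalman filter for
    x_{t+1} = A x_t + w,  y_t = C x_t + v,  w ~ N(0, Q), v ~ N(0, R).
    Returns the predicted state/covariance and the one-step output
    prediction -- the quantity the in-context transformer is asked
    to match in the linear-Gaussian regime."""
    # Measurement update with current output y
    S = C @ P @ C.T + R
    K = P @ C.T @ np.linalg.inv(S)
    x = x + K @ (y - C @ x)
    P = (np.eye(len(x)) - K @ C) @ P
    # Time update (prediction)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    y_pred = C @ x_pred
    return x_pred, P_pred, y_pred
```

In a nearly noiseless scalar system the predicted output tracks the measurement almost exactly, which is the sanity check used below.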
Explicit solutions to optimal control problems are rarely obtainable. Of particular interest are the explicit solutions derived for minimax problems, providing a framework to address adversarial conditions and uncertainty. This work considers a multi-disturbance minimax Linear Regulator (LR) framework for positive linear time-invariant systems in continuous time, which, analogous to the Linear-Quadratic Regulator (LQR) problem, can be utilized for the stabilization of positive systems. The problem is studied for nonnegative and state-bounded disturbances. Dynamic programming theory is leveraged to derive explicit solutions to the minimax LR problem for both finite and infinite time horizons. In addition, a fixed-point method is proposed that computes the solution for the infinite horizon case, and the minimum L1-induced gain of the system is studied. We motivate the prospective scalability properties of our framework with a large-scale water management network.
A payment channel network is a blockchain-based overlay mechanism that allows parties to transact more efficiently than directly using the blockchain. These networks are composed of payment channels that carry transactions between pairs of users. Due to its design, a payment channel cannot sustain a net flow of money in either direction indefinitely. Therefore, a payment channel network cannot serve transaction requests arbitrarily over a long period of time. We introduce DEBT control, a joint routing and flow-control protocol that guides a payment channel network towards an optimal operating state for any steady-state demand. In this protocol, each channel sets a price for routing transactions through it. Transacting users make flow-control and routing decisions by responding to these prices. A channel updates its price based on the net flow of money through it. The protocol is developed by formulating a network utility maximization problem and solving its dual through gradient descent. We provide convergence guarantees for the protocol and also illustrate its behavior through simulations.
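The price-response loop described above can be sketched compactly. The update rule and step size below are illustrative stand-ins for the dual gradient-descent iteration, not the DEBT protocol's exact specification:

```python
def update_price(price, net_flow, step=0.1):
    """Dual-ascent-style price update sketch: a channel raises its
    price when the net flow through it is positive (it is being
    depleted in one direction) and lowers it otherwise, clipped to
    stay nonnegative.  Step size is illustrative."""
    return max(0.0, price + step * net_flow)

def cheapest_route(routes, prices):
    """Users respond to prices by choosing the candidate route whose
    channels have the smallest total price."""
    return min(routes, key=lambda route: sum(prices[c] for c in route))
```

At an equilibrium of this loop every used channel has zero net flow, which is exactly the balance condition a payment channel needs to serve demand indefinitely.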
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. The task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which is crucial for enabling AI agents to perceive, reason about, and interact with the world effectively.
This paper presents BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation, with a focus on a systematic evaluation of discriminator combination strategies. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing the traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we propose, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD), Multi-Resolution STFT (M-STFT), and Periodicity error) and subjective evaluations (MOS, SMOS). To support reproducibility, we provide detailed architectural descriptions, training configurations, and complete implementation details. The code, pre-trained models, and audio demo samples are available at: this https URL.
Pressure sensors are an integrated component of modern Heating, Ventilation, and Air Conditioning (HVAC) systems. Because these pressure sensors operate within the 0-10 Pa range, support high sampling frequencies of 0.5-2 kHz, and are often placed in close proximity to humans, they can be used to eavesdrop on confidential speech: human speech spans a similar 0-10 Pa range and needs only a 4 kHz bandwidth for intelligible quality. This paper presents WaLi, which reconstructs intelligible speech from low-resolution and noisy pressure sensor data with the following technical contributions: (i) WaLi reconstructs intelligible speech from pressure sensor sampling frequencies as low as 0.5 kHz, whereas previous work could only detect hot words/phrases. WaLi uses a complex-valued conformer and a Complex Global Attention Block (CGAB) to capture the inter-phoneme and intra-phoneme dependencies that exist in low-resolution pressure sensor data. (ii) WaLi handles the transient noise injected by HVAC fans and duct vibrations by reconstructing both the clean magnitude and phase of the frequencies missing from the low-frequency aliased components. We evaluate the attack on practical HVAC systems located in two anonymized industrial facilities. Extensive studies on real-world pressure sensors show an LSD of 1.24 and an NISQA-MOS of 1.78 for 0.5 kHz to 8 kHz upsampling. We believe that this level of accuracy poses a significant privacy threat that has not been addressed before for pressure sensors. We also provide defenses against the attack.
Hearables are wearable computers worn on the ear. Bone conduction microphones (BCMs) are used alongside air conduction microphones (ACMs) in hearables as a supporting modality for multimodal speech enhancement (SE) in noisy conditions. However, existing works do not consider the following practical aspects of low-power implementations on hearables: (i) they do not explore how lowering the sampling frequencies and bit resolutions of the analog-to-digital converters (ADCs) in hearables jointly impacts low-power processing and multimodal SE in terms of speech quality and intelligibility; and (ii) they do not process signals from ACMs/BCMs at sub-Nyquist sampling rates because their frameworks lack a methodology for reconstructing the wideband signal from its narrowband parts. We propose SUBARU (\textbf{Sub}-Nyquist \textbf{A}udio \textbf{R}esolution \textbf{U}psampling), which (i) intentionally uses sub-Nyquist sampling and low bit resolution in ADCs, achieving a 3.31x reduction in power consumption; and (ii) achieves streaming operation on mobile platforms and SE in in-the-wild noisy conditions with an inference time of 1.74 ms and a memory footprint of less than 13.77 MB.
We aim to achieve keyless covert communication with a positive rate in Rayleigh block-fading channels. Specifically, the transmitter and the legitimate receiver are assumed to have either causal or non-causal knowledge of the channel state information (CSI) for both the legitimate and the warden channels, while the warden only knows the statistical distribution of the CSI. Two problem formulations are considered in this work: (a) power allocation: maximizing the sum covert rate subject to a maximum power constraint, and (b) rate allocation: minimizing the power consumption subject to a minimum covert rate constraint. Both problems are formulated based on recent information-theoretic results on covert communication over state-dependent channels. When the CSI of each fading block is known non-causally, we propose a novel three-step method to solve both the power and rate allocation problems. When the CSI is known causally, the power allocation problem can be formulated as a Markov decision process (MDP) and solved using a double deep Q-network (DDQN) approach. Although the rate allocation problem under causal CSI does not directly conform to an MDP structure, it can be approximately solved using the DDQN trained for power allocation. Simulation results demonstrate the effectiveness of the proposed power and rate allocation methods and provide comprehensive performance comparisons across the different allocation schemes.
Pre-trained foundation models have demonstrated remarkable success in audio, vision, and language, yet their potential for general machine-signal modeling at arbitrary sampling rates, covering acoustic, vibration, and other industrial sensor data, remains under-explored. In this work, we propose ECHO, a novel foundation model that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on a variety of machine-signal datasets, including previous DCASE Task 2 challenges (2020-2025) and widely used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine-signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. We open-source ECHO at this https URL.
We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.
Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
Quadruped robots are designed to achieve agile and robust locomotion by drawing inspiration from legged animals. However, most existing control methods for quadruped robots lack a key capacity observed in animals: the ability to exhibit diverse compliance behaviors while ensuring stability under external forces. In particular, achieving adjustable compliance while maintaining robust safety under force disturbances remains a significant challenge. In this work, we propose a safety-aware compliant locomotion framework that integrates adjustable disturbance compliance with robust failure prevention. We first train a force-compliant policy with adjustable compliance levels using a teacher-student reinforcement learning framework, allowing deployment without explicit force sensing. To handle disturbances beyond the limits of compliant control, we develop a safety-oriented policy for rapid recovery and stabilization. Finally, we introduce a learned safety critic that monitors the robot's safety in real time and coordinates between compliant locomotion and recovery behaviors. Together, this framework enables quadruped robots to achieve smooth force compliance and robust safety under a wide range of external force disturbances.
Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.
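The parabolic safe-set test can be sketched numerically. In the toy check below, the frame alignment with the relative velocity follows the description above, but the specific scaling of the vertex and curvature with distance and relative speed is a guessed placeholder, not the paper's adaptation law:

```python
import numpy as np

def dpcbf_value(p_rel, v_rel, k0=1.0, c0=0.5):
    """Illustrative parabolic barrier value h: in a frame aligned with
    the relative velocity, an obstacle directly ahead of the parabola's
    vertex is unsafe (h < 0), while one far to the side is safe
    (h >= 0).  The vertex/curvature laws here are assumptions."""
    speed = np.linalg.norm(v_rel) + 1e-9
    e_x = v_rel / speed                          # along relative velocity
    x = float(p_rel @ e_x)                       # longitudinal coordinate
    y = float(np.linalg.norm(p_rel - x * e_x))   # lateral coordinate
    vertex = k0 * speed                          # vertex recedes with speed
    curv = c0 / speed                            # parabola opens with speed
    return curv * y**2 - (x - vertex)            # h >= 0 means safe
```

Because the boundary is a parabola in position rather than a cone over the velocity angle alone, an obstacle with a large lateral offset stays safe even when the relative velocity points roughly toward it, which is what relaxes the constraint relative to collision-cone formulations.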
Recently, diffusion models have gained popularity and attention in trajectory optimization due to their capability of modeling multi-modal probability distributions. However, addressing nonlinear equality constraints, i.e., dynamic feasibility, remains a great challenge in diffusion-based trajectory optimization. Recent diffusion-based trajectory optimization frameworks rely on a single-shooting-style approach in which the denoised control sequence is applied to forward-propagate the dynamical system; this cannot explicitly enforce constraints on the states and frequently leads to sub-optimal solutions. In this work, we propose a novel direct trajectory optimization approach via model-based diffusion that directly generates a sequence of states. To ensure dynamic feasibility, we propose a gradient-free projection mechanism that is incorporated into the reverse diffusion process. Our results show that, compared to a recent state-of-the-art baseline, our approach achieves zero dynamic-feasibility error and an approximately 4x higher success rate in a quadrotor waypoint navigation scenario involving dense static obstacles.
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework that breaks LMMs into modular "bricks" (vision, language, audio, etc.), maps each to the accelerator best suited to it, and performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3\% and GPU memory usage by 11.2\%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.
We propose enforcing constraints on Model-Based Diffusion by introducing emerging barrier functions inspired by interior point methods. We demonstrate that the standard Model-Based Diffusion algorithm can suffer catastrophic performance degradation in highly constrained environments, even on simple 2D systems, due to sample inefficiency in the Monte Carlo approximation of the score function. We introduce Emerging-Barrier Model-Based Diffusion (EB-MBD), which uses progressively introduced barrier constraints to avoid these problems, significantly improving solution quality without expensive operations such as projections. We analyze the liveliness of samples at each iteration to inform the choice of barrier parameter schedule. We demonstrate results for 2D collision avoidance and a 3D underwater manipulator system, and show that our method achieves lower-cost solutions than Model-Based Diffusion while requiring orders of magnitude less computation time than projection-based methods.
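One plausible reading of the interior-point inspiration is a log-barrier term whose coefficient is scheduled across diffusion iterations, so the constraint is introduced progressively rather than all at once. The geometric schedule and the specific barrier form below are assumptions for illustration; the actual EB-MBD schedule may differ.

```python
import numpy as np

def barrier_cost(x, constraint, mu):
    """Log-barrier penalty with coefficient mu; g(x) < 0 means feasible."""
    g = constraint(x)
    if np.any(g >= 0):
        return np.inf  # infeasible points are rejected outright
    return -mu * float(np.sum(np.log(-g)))

def augmented_cost(x, cost, constraint, iteration, n_iters,
                   mu_start=10.0, mu_end=0.01):
    """Cost plus an emerging barrier: mu decays geometrically across the
    reverse-diffusion iterations, interior-point style, so early iterations
    keep samples well inside the feasible set and later ones let them
    approach the boundary."""
    frac = iteration / max(n_iters - 1, 1)
    mu = mu_start * (mu_end / mu_start) ** frac
    return cost(x) + barrier_cost(x, constraint, mu)
```

In a Monte Carlo score approximation, samples would be weighted by `exp(-augmented_cost(...))`, so the barrier reshapes the sampling distribution without any projection step.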
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: this https URL
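The planner described above can be sketched as sampling-based MPC in a learned latent space, with a surrogate value scoring each rollout's terminal latent state. The `rollout` world model, `value_fn`, and all dimensions below are hypothetical placeholders, not the paper's components.

```python
import numpy as np

def sample_mpc(z, rollout, value_fn, horizon=10, n_samples=64,
               action_dim=4, rng=None):
    """Sampling-based MPC: roll out random action sequences with a latent
    world model, score the terminal latent with a surrogate value, and
    return the first action of the best sequence (replanning each step)."""
    rng = rng or np.random.default_rng()
    actions = rng.normal(size=(n_samples, horizon, action_dim))
    scores = np.empty(n_samples)
    for i in range(n_samples):
        z_t = z
        for a in actions[i]:
            z_t = rollout(z_t, a)    # learned latent dynamics (placeholder)
        scores[i] = value_fn(z_t)    # dense surrogate value at the horizon
    best = int(np.argmax(scores))
    return actions[best, 0]
```

The surrogate value at the horizon is what replaces sparse contact rewards with a dense planning signal in the abstract's framing.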
This paper introduces AirCNN, a novel paradigm for implementing convolutional neural networks (CNNs) via over-the-air (OTA) analog computation. By leveraging multiple reconfigurable intelligent surfaces (RISs) and transceiver designs, we engineer the ambient wireless propagation environment to emulate the operations of a CNN layer. To comprehensively evaluate AirCNN, we consider two types of CNNs, namely classic two-dimensional (2D) convolution (Conv2d) and lightweight convolution, i.e., depthwise separable convolution (ConvSD). For Conv2d realization via OTA computation, we propose and analyze two RIS-aided transmission architectures: multiple-input multiple-output (MIMO) and multiple-input single-output (MISO), balancing transmission overhead and emulation performance. We jointly optimize all parameters, including the transmitter precoder, receiver combiner, and RIS phase shifts, under practical constraints such as transmit power budget and unit-modulus phase shift requirements. We further extend the framework to ConvSD, which requires distinct transmission strategies for depthwise and pointwise convolutions. Simulation results demonstrate that the proposed AirCNN architectures can achieve satisfactory classification performance. Notably, Conv2d MISO consistently outperforms Conv2d MIMO across various settings, while for ConvSD, MISO is superior only under poor channel conditions. Moreover, employing multiple RISs significantly enhances performance compared to a single RIS, especially in line-of-sight (LoS)-dominated wireless environments.
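For readers unfamiliar with why ConvSD is the "lightweight" variant: it factors a standard convolution into a per-channel depthwise pass and a 1x1 pointwise channel mix, which is why it needs distinct OTA transmission strategies for the two stages. The naive reference implementation below (valid padding, stride 1) is a digital sketch of the operation being emulated, not the OTA realization.

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_weights):
    """Naive ConvSD: per-channel depthwise conv, then 1x1 pointwise mix.
    x: (C_in, H, W); dw_filters: (C_in, k, k); pw_weights: (C_out, C_in)."""
    c_in, h, w = x.shape
    k = dw_filters.shape[1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c_in, oh, ow))
    for c in range(c_in):           # depthwise: one k x k filter per channel
        for i in range(oh):
            for j in range(ow):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_filters[c])
    # Pointwise: linear mix across channels at each spatial location.
    return np.einsum('oc,chw->ohw', pw_weights, dw)
```

The parameter saving is the point: a full Conv2d with C_in=3, C_out=5, k=3 needs 3*5*3*3 = 135 weights, while ConvSD needs 3*3*3 + 3*5 = 42, which directly reduces the number of coefficients the wireless channel must realize.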
The transition to prescriptive maintenance (PsM) in manufacturing is critically constrained by a dependence on predictive models. Such purely predictive models tend to capture statistical associations in the data without identifying the underlying causal drivers of failure, which can lead to costly misdiagnoses and ineffective measures. This fundamental limitation results in a key challenge: while we can predict that a failure may occur, we lack a systematic method to understand why a failure occurs. This paper proposes a model based on causal machine learning to bridge this gap. Our objective is to move beyond diagnosis to active prescription by simulating and evaluating potential fixes to optimise KPIs such as Overall Equipment Effectiveness (OEE). For this purpose, a pre-trained causal foundation model is used as a ``what-if'' simulator to estimate the effects of potential fixes. By estimating the causal effect of each intervention on system-level KPIs, specific actions can be recommended for the production line. This can help identify plausible root causes and quantify their operational impact. The model is evaluated using semi-synthetic manufacturing data and compared with non-causal and causal baseline machine learning models. This paper provides a technical basis for a human-centred approach, allowing engineers to test potential solutions in a causal environment to make more effective operational decisions and reduce costly downtimes.
The evolution of Omni-Modal Large Language Models~(Omni-LLMs) has revolutionized human--computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often leading to superficial understanding and contextually mismatched emotional responses. This issue is further intensified by Omni-LLMs' Thinker-Talker architectures, which are implicitly connected through hidden states, leading to the loss of emotional details. In this work, we present EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought~(E-CoT), which enforces reasoning from fine-grained multimodal perception to textual response. Moreover, we explicitly treat E-CoT as high-level emotional instructions that guide the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to obtain real-world annotated dialogue data and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of multimodal emotional dialogue tasks. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking under the same talker.
High-resolution structure determination by cryo-electron microscopy (cryo-EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present this http URL, an end-to-end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one-step diffusion model that integrates a density-aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. this http URL provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA-protein complexes. In benchmarks against Phenix.real_space_refine, this http URL consistently achieves substantial improvements in both model-map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, this http URL aims to serve as an essential tool for next-generation cryo-EM structure refinement. Web server: this https URL Source code: this https URL.
The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins at the Australian Grand Prix, 8 March 2026.
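The HMM belief state that feeds the DQN is maintained by standard forward filtering: predict the rival's hidden state through the transition matrix, then reweight by the likelihood of the observed telemetry symbol. The toy 2-state matrices below are illustrative; the paper's model has 30 hidden states and five observation channels.

```python
import numpy as np

def belief_update(belief, A, B, obs):
    """One HMM filtering step over a rival's hidden state.
    belief: (S,) current distribution; A: (S, S) transitions A[i, j] =
    P(j | i); B: (S, O) emissions; obs: observed symbol index."""
    predicted = A.T @ belief           # predict through the transition model
    posterior = predicted * B[:, obs]  # correct with observation likelihood
    return posterior / posterior.sum()
```

The resulting belief vector is exactly the kind of input the abstract's DQN layer consumes: a distribution over the rival's ERS/Override/tyre state rather than a point estimate, which is what makes counter-harvest traps detectable.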