Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.
This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
We introduce Echo-POSED, a self-supervised framework for real-time transthoracic echocardiography (TTE) guidance that recommends probe adjustments directly from 2D ultrasound images, without the need for expert-labelled views or tracked probe trajectories. Instead, it trains on 2D views sliced from routinely acquired 3D echocardiography volumes, enforcing equivariance to probe motions while remaining invariant to cardiac phase, yielding a pose representation on $\mathrm{SO}(3)\times\mathrm{SO}(3)$. Across a held-out split and public external 3D--TTE datasets (including vendor shift), Echo-POSED maintains geometric consistency under virtual perturbations and enables intra- and inter-patient guidance simulations, achieving a combined mean angular error of 8.2 degrees between the guided and target views in intra-patient simulations with cardiac motion.
We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: this https URL.
Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting. Code link: this https URL.
Advanced Air Mobility (AAM) uses electric vertical take-off and landing (eVTOL) vehicles to address urban congestion and emissions. However, corridor design, operation management, and separation standards remain underexamined for safe high-density operations. This paper applies the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to systematically review relevant literature from IEEE Xplore and Web of Science, focusing on publications from 2010 to 2024. A Context, Intervention, Mechanism, and Outcome (CIMO) framework guided the development of research questions. After screening 2,039 journal and conference papers, 62 articles met the inclusion criteria. The findings reveal a lack of integrated corridor design approaches, limited operational strategies, and reliance on standards originally designed for conventional aviation. A unified corridor design and separation definition frameworks and taxonomies are proposed to address these shortcomings, informing future investigations and operational frameworks for safe, efficient eVTOL operation deployment in urban settings.
In this study, we address the IMU-based heave motion estimation problem for inertial navigation systems. Unlike existing approaches, we propose a data-driven framework in which a bank of IIR filters, each associated with a specific frequency range, is optimized using a synthetically generated dataset of realistic heave-acceleration tuples. The synthetic heave signal generation pipeline starts by synthesizing random wave signals from established wave energy spectra and then processing them through heave response amplitude operators reported in the literature. The corresponding vertical acceleration measurements are obtained by double-differentiating the heave signals and corrupting them with realistic low- and high-frequency disturbances observed in real IMU recordings. A Fourier-transform-based method is used to estimate the mean peak period and select the appropriate filter. Simulation results from both offline and real-time tests demonstrate that the proposed method is robust to varying sea regimes and provides accurate heave estimation, with a maximum RMSE not exceeding the larger of 5 cm or 5% of the significant heave height.
Ultrasound localization microscopy (ULM) enables micrometer-scale microvascular imaging by localizing and tracking intravascular microbubbles, but its dependence on exogenous contrast agents and long acquisition times limits clinical translation. This study presents a high-frame-rate contrast-free super-resolution ultrasound microvascular imaging method based on high-frequency ultrafast ultrasound and nonlinear beamforming of backscatter signals from native blood flow. Using only 125 milliseconds of in vivo ultrafast data per image, the proposed method achieved an imaging frame rate of 8 frames/s in a rabbit kidney model. The reconstructed microvascular images resolved vessels with a global spatial resolution of 22.2 um over a field of view of 23.04 x 15.18 mm2, where the wavelength of ultrasound was 67.5 um. This corresponds to a three-fold improvement over conventional power Doppler imaging under the same acquisition duration. Compared with conventional flow imaging, the proposed method provided improved microvascular contrast and finer vessel delineation without microbubble injection. These results demonstrate a practical pathway toward high frame rate, contrast-free super-resolution ultrasound imaging for microvascular assessment.
We show that incorporating fairness constraints into virtual power plant (VPP) operations can incentivize consumer participation and thus improve the aggregator's long-run profitability. VPPs rely on sustained participation from heterogeneous consumers to provide a variety of grid services whose timing and frequency are often uncertain. As a result, consumers' willingness and ability to provide flexibility evolve over time, creating a dynamic link between past participation and future resource availability. We develop a dynamic aggregation framework to study how fairness in service allocation affects future participation and long-run profitability. By linking current dispatch decisions to future resource availability, we show that fairer allocations can strengthen consumer engagement, expand aggregate availability, and create additional value during high-price and high-demand events. To balance fairness and operational efficiency, we introduce a slack-augmented allocation mechanism that preserves most of the participation benefits from fairness while avoiding unnecessary reductions in service procurement. We derive conditions under which the resulting availability gains outweigh the short-run cost of redistribution and validate the approach using real-world consumer behavior and electricity market data from Norway.
A unified, closed-form analytical PI/PID tuning method is presented for all-pole plants up to third order that yields a strictly monotonic (zero-overshoot) step response with minimum settling time. The design target is the binomial closed loop p^n/(s+p)^n, which is monotonic with robustness depending only on the order n. Because adding a left-half-plane zero to a fixed pole pattern only slows the response, the minimum-settling solution requires the controller zeros to be cancelled, which forces the controller numerator to divide the plant denominator. Carrying this principle through shows that an exact, real-gained solution exists for any stable plant precisely up to second order with a PI controller and third order with a PID controller; the residual binomial factor acquires a complex pair beyond that, which a generic plant does not contain. Explicit gains are derived for first-order plants (PI), second-order plants with real and complex poles (PI and PID), and third-order plants with three real poles and with one real pole plus a complex pair (PID). The second-order PI case is treated in full as the lowest-order instance. Monotonicity guarantees Mt = 1, hence Ms less then 2, phase margin above 60 degree, and gain margin above 6 dB, tightening to universal constants for the binomial family. Numerical verification confirms the results.
This paper considers the problem of simultaneously controlling an interceptor's impact time and impact angle using its lateral acceleration as the sole control input. With a single control input, the nonlinear engagement kinematics is inherently underactuated, which complicates guidance law synthesis. To overcome this challenge, a hierarchical sliding mode-based guidance law is developed to concurrently regulate the two terminal constraints. The proposed architecture consists of a two-layer sliding manifold. The first layer comprises two sub-sliding surfaces corresponding to the impact time and impact angle error dynamics, respectively, while the second layer introduces a composite sliding manifold that combines the two individual sub-surfaces. Then, a variable-gain adaptive guidance law is designed to ensure time and angle-constrained interception against a stationary target, which is further extended to intercept a constant velocity target. Simulations are conducted for various engagement scenarios to attest to the efficacy of the proposed approach.
A novel power delivery framework, comprising a package-embedded inductor topology and an inductance-island methodology, is introduced to maximize both inductance and current densities in vertical power delivery (VPD). The framework leverages multiple multi-phase converters, a common strategy in high-performance computing systems, to enhance efficiency and scalability. The proposed topology employs an array of tightly coupled spiral square inductors sharing a common magnetic rod, serving multiple converters operating in the same conversion phase. The array is optimized to maximize coupling and minimize conversion losses, achieving superior inductance and current densities of 250 nH/mm^2 and 10 A/mm^2, respectively. At the system level, the inductance-island methodology partitions the power delivery network into multiple islands, each dedicated to a converter phase and supplying a portion of the load current, thereby enabling scalable and efficient distribution. To validate the framework, the inductor array is designed and simulated in ANSYS Maxwell 3D and Mechanical, exhibiting an average quality factor of 23.6 and efficiency of 97.4% at 2 A load current, 6 V input, and 10 MHz switching frequency. The inductor array netlist is extracted from ANSYS and co-designed in Cadence Virtuoso with a distributed dual-phase power conversion system, ensuring joint optimization of passive and active components. The co-designed converter achieves a significant efficiency gain of 5.65% on average and up to 11.04% at 40 A load over a similar converter with uncoupled inductors, demonstrating the practical benefits of the approach.
Over-the-air computation (AirComp) is widely used for model aggregation in wireless distributed learning. Although it enhances communication efficiency, we believe the AirComp aggregation has limited effectiveness due to the difference between its target problem and that of distributed learning. In this paper, we develop a rigorous formulation for optimal model aggregation in wireless distributed learning. Using this formulation, we show that AirComp aggregation generally assumes a mismatched statistical model for local parameters. We then propose a statistical framework for model aggregation, called global unknown estimation (GUE). It captures the statistical relation between the local and global model parameters, allowing to interpret model aggregation as an inference task. We validate the efficiency of GUE through numerical experiments. Our results show that, in the low SNR regime, GUE can reduce the required power for model aggregation by approximately 15 dB compared to AirComp aggregation. Remarkably, this gain is obtained without additional computational overhead
We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depth using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only 4 mm baseline and 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This has surpassed the reported accuracy of certain commercially available stereo cameras with much larger form factors.
In this study, we conduct a comprehensive comparative analysis of generative and discriminative deep learning-based speech enhancement methods, specifically in noise reduction tasks. Our investigation focuses on evaluating their effectiveness under high and low signal-to-noise ratio conditions, considering both matched and mismatched training scenarios. We further investigate the impact of training data volume, model convergence speed, and interpret the performance differences in terms of objective results for the considered training paradigms. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To further strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights derived from this study provide empirical evidence to assist researchers and practitioners in understanding whether the perceptual gains of different approaches justify their computational cost in practical applications.
Control barrier functions (CBFs) have become a standard tool in safety critical-control systems. CBFs convert state constraints into real time control conditions that certify forward invariance (meaning that once the system starts in a safe region, it remains there for all future times) and minimally modify a nominal controller only when safety is at risk. In power systems, CBF based methods have been proposed for frequency and voltage safety, but they largely remain disconnected from three key features that are central to power system operation: differential algebraic equation (DAE) models that capture network power flow constraints, safety specifications involving algebraic variables such as bus voltages, and formal verification of the resulting closed loop system. This paper closes this gap by developing a CBF framework for power system DAE models that supports safety constraints on both dynamic and algebraic variables. The framework provides real time safety filtering through an optimization layer that wraps around an existing controller and minimally modifies its command to enforce safety. In addition, it provides formal verification (i.e., a mathematical guarantee that all admissible trajectories satisfy the prescribed safety constraints) through an offline reachability based certificate of safe operation. The result is a unified filter and verify methodology for enforcing and certifying frequency and voltage safety in power systems while preserving the DAE structure of the underlying model.
Splatting (GS)-based shared geometry framework adopts a two-stage training strategy, in which an explicit, subject-specific Gaussian scaffold encoding anatomical geometry is first learned from the isotropic structural scan and then reused to fit appearance for target modalities acquired with sparse slices. Experiments on the UK Biobank, GBM, and ABCD datasets for through-plane super-resolution across multiple modalities (T2-weighted, FLAIR, DWI, ASL), degradation factors ($\times 3$, $\times 5$, $\times 7$), and pathological abnormalities (glioblastoma) demonstrate state-of-the-art reconstruction fidelity. The shared Gaussian geometry enables arbitrary-view generation for target modalities with strong structural consistency and further shows potential for self-supervised in-plane super-resolution. This work establishes explicit geometry-guided representations as a novel, flexible, and interpretable pathway toward retrospective multi-contrast MRI harmonization and reliable clinical reference construction. Source code is available at: this https URL
Reconfigurable holographic surface (RHS)-aided integrated sensing and communication (ISAC) systems hold great promise for achieving both sensing and communication with low hardware costs and high energy efficiency. However, existing works largely overlook practical hardware impairments in RHSs, particularly faulty RHS elements with uncontrollable amplitudes, which degrade system performance if left unaddressed. This work aims to fill the gap by i) quantifying the impact of faulty RHS elements on ISAC performance and ii) optimizing the functional RHS elements to preserve the ISAC performance. Specifically, we derive the misspecified Cramer-Rao bound (MCRB) for sensing and the signal-to-interference-and-noise ratio (SINR) for communication to measure the performance loss caused by faulty elements. We then formulate an optimization problem that minimizes MCRB, subject to constraints on SINR, transmit power budget, and RHS amplitude. The high non-convexity of the formulated problem poses a significant challenge, which we address by reformulating and proposing a block coordinate descent-based solution incorporating majorization-minimization and successive convex approximation techniques. Simulation results verify that the proposed approach achieves an average 13.7% performance gain compared to the fault-unaware benchmark.
This paper investigates the stabilization of linear systems subject to simultaneous, mismatched time delays in both the control input and system output vectors. The proposed control framework is developed in two primary stages. First, an asymptotically stabilizing delayed state-feedback controller is synthesized by leveraging recent advancements in Linear Matrix Inequality (LMI) techniques. Second, this controller is realized using novel time-delay compensators \cite{trinhnam26}. This architecture successfully accommodates an output measurement delay $\tau_y$ that is independent of the input delay $\tau_u$, enabling direct estimation of the delayed state-feedback control law. The proposed methodology is then extended to target output controllers to account for simultaneous, mismatched time delays in both the control input and system output vectors.
With the growing integration of renewable energy sources into power grids, the risks of oscillation caused by interactions between grid-tied inverters and the grids are becoming increasingly prominent. Although existing studies have made significant progress in inverter modeling and oscillatory stability analysis, most of them do not sufficiently consider complex mirror frequency coupling effects (MFCE) under unbalanced operating conditions, leading to unreliable models and erroneous stability analysis results. To address this inadequacy, this work develops a novel sequence impedance modeling scheme that can be widely applied to unbalanced operating conditions. In particular, taking a representative type of grid-forming inverter for instance, i.e., droop-controlled inverter (DCI), a single-input single-output sequence impedance modeling method based on harmonic linearization (HL) is proposed to comprehensively model both a given DCI and the connected grid. By accounting for multi-frequency interactions within the DCI, this method captures MFCE and unbalanced factors, leading to a more accurate impedance model. Further, the dominant factors influencing system stability are identified with a combination of normalized sensitivity analysis and proportional weighting. Finally, the detailed impacts of these dominant factors on system stability margin under three typical unbalanced operating conditions are analyzed through the Bode criterion. The effectiveness and reliability of the whole scheme proposed in this work are validated on the constructed grid-connected droop-controlled experimental platform.
This paper develops an impulse-supervised confined exploration framework for learning local optimal controller for a class of nonlinear systems. The proposed approach combines continuous-time approximate dynamic programming (ADP) with an impulsive supervisory layer, where impulsive braking confines the state within a prescribed region in which a local linear approximation of the nonlinear system is valid. This enables desired persistent excitation required for parameter convergence while preventing large state deviations that invalidate local optimality. The resulting hybrid closed-loop system enforces invariance of the exploration region through state-triggered braking inputs. Simulation results on a nonlinear mechanical system demonstrate effectiveness of the proposed approach.
The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.
A thermomagnetic pendulum is introduced as a coupled thermo-magnetic-mechanical system consisting of a ferromagnetic bob under gravity and an offset permanent magnet. Heating drives the bob temperature above and below the Curie point, causing magnetic attraction to vanish and recover as the bob moves and cools. A multiphysics model is developed in which the magnetic torque depends nonlinearly on the bob temperature field and pendulum configuration. The formulation couples transient three-dimensional heat transfer, a temperature-dependent magnetization law, and pendulum dynamics. Simulations show angular torque asymmetry, rapid force reduction near the Curie point, and sustained oscillations.
The increasing penetration of electric vehicles (EVs) introduces new challenges for emergency evacuation planning due to limited driving range, long charging times, and constrained charging infrastructure, particularly under disaster induced disruptions. This paper proposes a novel optimization based evacuation framework for EVs using Equivalent Circuit Models (ECMs) to jointly address routing, charging, and congestion management. By leveraging electrical analogies, traffic flow is modeled as electrical current, travel time as resistance, and driving range as voltage, enabling the use of Kirchhoff laws to enforce flow balance and energy feasibility constraints. The proposed controllable ECM incorporates binary switches to regulate route selection and explicitly models charging delays and range replenishment at both Fixed Charging Stations (FCSs) and Mobile Charging Stations (MCSs). The resulting formulation leads to an integer programming problem that determines optimal evacuation routes, charging durations, and the placement and number of MCSs to minimize evacuation time. The framework is extended to multiple origin destination pairs using the principle of superposition and supports fairness aware performance metrics, including worst case, average, and variance based evacuation times. Simulation studies on large scale transportation networks in California demonstrate that the proposed approach significantly improves evacuation efficiency and robustness, particularly in scenarios with limited charging access, highlighting the critical role of MCSs in EV based emergency evacuations.
Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the "-1M" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.
Graph signal denoising is a fundamental task in graph signal processing. While the node-oriented filtering approach enhances spatial adaptability, it suffers from spectral rigidity due to its reliance on the graph Fourier transform. Conversely, emerging fractional-domain transforms provide crucial spectral flexibility but are fundamentally limited by their globally shared filtering paradigm, failing to accommodate localized topological variations. To bridge this gap, this paper proposes a generalized node-oriented fractional filtering (NOFF) framework that seamlessly integrates localized spatial adaptability with proactive spectral modulation across various fractional transforms. However, straightforwardly assigning independent full-rank filters to all vertices incurs a prohibitive parameter space, leading to severe overfitting on random noise. To mitigate this, we introduce the low-rank NOFF (LRNOFF) architecture. By imposing a strict low-rank constraint, LRNOFF inherently acts as a powerful implicit regularizer, preventing noise memorization and ensuring the extraction of robust spectral bases. Furthermore, we develop an efficient computational implementation termed LRNOFF-Fast, which drastically reduces computational and memory overhead while preserving theoretical optimality. Experiments on real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance.
Large-scale digital transformation programs must simultaneously sustain existing operations and navigate deep unknowns emerging from IT-business-operations interactions -a challenge conventional project governance frameworks inadequately address. Based on a longitudinal case study of a transformation program, we investigate the ''Sirius Days,'' a monthly senior management retreat identified as a critical success factor. We show that this framework constitutes a pilot-organization: an organizational 'dispositif' (or apparatus) that deconstructs established knowledge or assumptions, formulates rigorous conjectures, and tests them in real conditions. It generated five resilience levers -systemic characterization of unknowns, early anomaly discernment, expansion of performance norms, social capital creation through a community of inquiry, and expansion of organizational agility across scales -revealing a model of an organizational 'dispositif' that operationalizes navigating unknowns across cognitive, social, and normative dimensions.
Multimodal multi-object tracking (MOT) under complex illumination remains challenging due to insufficient joint modeling of spatial and modal features and the limited adaptability of fixed fusion strategies. To address these issues, this paper proposes a spatial-modal convolution fusion and distillation-prompt-based multimodal MOT framework. A spatial-modal fusion backbone is first constructed, where a Basic module performs spatial feature extraction and modal interaction via decoupled 3D convolution, while a Mixed module models nonlinear cross-modal correlations through amplitude-phase decomposition. In addition, a representation collapse network is designed for adaptive multimodal fusion. A Distillation Prompt Guidance (DPG) module generates dynamic modal weights under teacher supervision, and a Global Modal Difference Aggregation (GMDA) module preserves discriminative information during multimodal representation collapse. Extensive experiments on the UniRTL dataset demonstrate the effectiveness of the proposed method. The proposed tracker achieves 63.31 HOTA and 79.21 MOTA on the RNT modality, outperforming several state-of-the-art methods while maintaining favorable inference efficiency. The source code and pretrained models are publicly available at this https URL.
To ensure worst-case physical layer security, this paper proposes a robust beamforming framework for secure integrated sensing and communication (ISAC) systems. Different from conventional designs that focus on maximizing the ergodic secrecy rate, the proposed method aims to minimize instantaneous information leakage risk. We formulate a multi-objective optimization problem that jointly suppresses the worst-case eavesdropper signal-to-interference-plus-noise ratio (SINR), improving sensing accuracy, and ensuring the quality of service (QoS) for legitimate users. To address the resulting non-convex problem, we develop a hierarchical iterative algorithm, in which the outer loop refines the continuous uncertainty regions based on the updated sensing performance, and the inner loop optimizes beamforming under the refined uncertainty regions. Theoretical analysis and simulation results demonstrate that the proposed method achieves per-transmission security guarantees with practical complexity.
Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon the flow matching with Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
Recently, industrial pioneers like Amazon, Tencent, ByteDance, and Huawei have been adopting BBR as their congestion control algorithm for live-streaming applications, including TikTok Live. However, BBR, originally crafted for bulk data transmission, faces multiple challenges in live-streaming scenarios. In this paper, we first explore two key issues associated with BBR due to inaccurate bandwidth estimation in live-streaming scenarios: (i) BBR cannot easily exit its startup phase, resulting in a fierce self-inflicted loss. (ii) BBR sends data at a lower rate than the available bandwidth during its stable phase. We then propose BBR-Copilot, an auxiliary congestion control component that cooperates with BBR, making BBR better adapt to live-streaming scenarios. BBR-Copilot allows for proactively generating accurate bandwidth measurement samples by smartly creating and sending extra data. We implement the BBR-Copilot prototype upon QUIC and evaluate it via testbed. Experimental evaluation results show that BBR-Copilot effectively enhances BBR's performance in live-streaming scenarios.
Cross-border electricity exchanges are crucial for operating and planning highly renewable power systems. Many studies reduce spatial granularity to keep the models tractable and prescribe cross-border exchanges exogenously, often by reusing historical import/export time series. Such assumptions become inconsistent as renewable penetration changes the magnitude and timing of flows. This paper proposes a machine-learning (ML) surrogate framework that maps available nodal time series data (e.g., hourly demand and renewable generation) to synthetic, interconnector-level flow time series. The goal is to provide consistent flow profiles that are used as fixed boundary conditions in reduced power system optimization models (PSOMs). To improve downstream feasibility when surrogate flows are imposed in optimization, we further introduce a custom loss for the neural-network surrogate that penalizes physically impossible flow patterns. We demonstrate the framework on a pan-European single-node per country DC optimal power flow setting using the open-source LEGO PSOM with ENTSO-E TYNDP 2024 National Trends assumptions for 2030. We assess two model classes: k-nearest neighbors (KNN) and feedforward neural networks (SQU), using both full and reduced feature sets. The SQU models generalize more robustly than KNN to unseen climate years and substantially improve upon scaled historical benchmarks in terms of predictive accuracy. When imposed as fixed boundary flows in single-node PSOMs, the ML-generated profiles produce outcomes that closely match the results of the full European simulation, while delivering substantial runtime reductions (up to ~500x). These results indicate that ML-based flow surrogates can provide decision-relevant interconnector flows for tractable reduced studies in high-renewable systems.
While Renewable Energy Communities (RECs) and Collective Self-Consumption (CSC) schemes have emerged as promising tools to accelerate renewable energy adoption and support the net-zero transition, their full potential can only be realised when complemented by demand-side flexibility that aligns consumption with renewable generation. Water storage heaters can function as distributed thermal storage, absorbing excess renewable energy at the community level. This work quantifies both the benefits of water storage heaters flexibility for energy consumers in a CSC community in France (such as energy bill reduction, increase of self-consumption), and the challenges related to the implementation and user acceptance. At the first stage, an annual simulation analysis is performed on a community of 41 households and a large solar PV plant, comparing a scenario without a CSC community, a scenario with a standard CSC community, and a scenario with a CSC community with flexibility from water storage heaters, which showed that an average benefit of 70euro/year per household can be achieved due to flexibility and an increase of 6% and 22% of solar PV community self-consumption and self-production respectively. In the second stage, we present the results of the real-world deployment in the community, analysing its technical performance and user reception, and examine the main factors shaping user engagement and satisfaction.
Recent small-signal stability studies of AC grids have shifted towards analysing power systems as interconnections of subsystems and leveraging their input-output properties to derive scalable stability certificates. Two subsystem representations appear frequently in the literature: the PQ model, coupling powers to phase angle and voltage magnitude, and the IV model, coupling currents to voltages. In this paper, we derive both models without simplifying the bus or line dynamics and show that a loop transformation relates the two. One of the main results in the paper is to then show analytically that each representation may exhibit unstable poles depending primarily on the operating point (IV model) or the presence of high-frequency passive dynamics (PQ model). In particular, such unstable poles in the subsystems can occur even when the aggregate interconnection is stable and well-behaved. These effects are validated numerically, including a case study using the full-order dynamics of a synchronous generator with an exciter and transformer. Our results highlight that care must be taken when choosing a subsystem representation, as neglecting high-frequency dynamics or device operating points may obscure unstable poles that must be stabilised by the network interconnection and must be accounted for in system identification.
Channel knowledge maps (CKMs) are designed to predict channel state information (CSI) from user locations, thereby enabling low-overhead CSI acquisition. However, existing CKM construction methods often require hours-to-days of training time and dense measurements, resulting in substantial deployment cost. In this paper, we propose Voxel-CKM, a novel voxelized radio frequency (RF) radiance field framework for fast and few-shot CKM construction. The core idea is to replace implicit neural representations with explicit voxel grids to efficiently capture the spatial variation of wireless channels. Building upon this, we further introduce a compact vector-matrix (VM) decomposition to parameterize these voxel grids using a small set of matrices and vectors, which significantly accelerates convergence and facilitates fast CKM construction. To enable few-shot learning, we incorporate a transmitter prior as an inductive bias to guide the learning process under sparse measurements. Additionally, a total-variation (TV) regularization loss is proposed to mitigate overfitting and stabilize optimization. Experiments show that Voxel-CKM substantially accelerates training convergence and improves performance in the few-shot regime.
To meet the stringent requirements of future motion systems exhibiting time-varying and/or position-dependent behavior, online data must be leveraged to improve control performance. This paper presents a recursive algorithm for simultaneous learning of feedforward and compliance compensation parameters. A multivariate regression formulation is proposed that jointly estimates friction, mass, jerk, and compliance compensation parameters while mitigating parameter coupling. Experimental results on a high-tech semiconductor metrology and inspection system demonstrate an order-of-magnitude improvement in servo performance.
The integration of offshore wind power plants (OW-PPs) into weak grids can pose stability challenges due to the interaction between inverter-based resources (IBRs), Flexible AC Transmission Systems (FACTS) and the grid. In this context, long HVAC transmission systems, relatively common for OWPPs, can exacerbate the stability challenges. Therefore, this paper introduces a novel methodology for estimating the equivalent short-circuit ratio (ESCR) at the offshore point of connection (PoC), combining analytical two-port network (TPN) modeling with electromagnetic transient (EMT) simulations. The approach derives the Thevenin equivalent impedance for passive and active components, enabling accurate ESCR computations without complex derivations. Limitations of traditional SCR metrics are addressed by incorporating the dynamics of the converters, such as static synchronous compensators (STATCOMs), into a hybrid EMT-TPN method for synthesizing equivalent impedances. The process is then verified on the CIGRE OWPP benchmark and is found to capture ESCR variations with cable lengths, shunt reactors, and grid strength. Additionally, the results emphasize the correlation between the ESCR and voltage stability, highlighting the role of STATCOMs in supporting voltage stability in weak grids. This modular framework aids in OWPP design and stability analysis for converter-dominated systems.
This paper introduces a systematic gray-box identification framework for voltage-source converter models based solely on terminal time-series data. The proposed approach combines a physically informed white-box standard model with iterative time-domain calibration to estimate controller parameters that mimic the behavior of the black-box model in electromagnetic transient (EMT) simulations. Unlike conventional frequency-domain identification methods, the framework leverages time-domain data more effectively to better constrain the surrogate model across a broader operating range and capture reference-signal dynamics. To evaluate the accuracy of the identified model, the paper presents additional frequency-domain validation metrics based on Nyquist analysis and singular value decomposition, allowing for both quantitative assessment of model divergence and qualitative classification of mismatch types. The methodology is tested on cases with increasing structural uncertainty, from exact parametric recovery to an actual detailed EMT black-box model. Results demonstrate that the proposed framework can accurately recover parameters when the internal structure is known, adjust for moderate structural mismatch with extra degrees of freedom, and offer a reliability measure for small-signal stability analysis of converter models protected by intellectual property
The modern power system, increasingly composed of Inverter-Based Resources (IBR) from multiple manufacturers, requires new study and design techniques that balance accuracy with the need to protect the Intellectual Property (IP) of various stakeholders. One possible method to support detailed electromagnetic transient (EMT) simulations is to convert the original equipment manufacturers (OEM) models into shareable black-box versions using dynamic link libraries (DLLs). This technique prevents IP violations while potentially maintaining simulation accuracy by embedding the original components within the shareable DLL. Thereby, this work aims explicitly to enhance simulation fidelity by translating full-switching models of offshore wind turbines (OWTs). In this context, the paper offers valuable recommendations, including how to convert interpolation-based elements, preserve simulation speed, recognize limitations, and outline future improvements
This paper analyzes and identifies a space-based Global Navigation Satellite System (GNSS) interference source that has caused scores of powerful transient wide-area interference events over continental Europe, Greenland, and Canada since 2019. While terrestrial or near-terrestrial sources are primarily responsible for the recent uptick in GNSS interference worldwide, space-based interferers are of special concern given their potential for vast geographic reach and their portent of a qualitative escalation in GNSS interference. Based on data collected between 2019 and 2026 from a network of terrestrial GNSS reference stations, this paper (1) develops a received-power-based detection framework; (2) details the spatial, temporal, and spectral patterns of wide-area interference events caused by the source; (3) presents and analyzes identification techniques that blend received-power and time-difference-of-arrival measurements; and (4) applies these techniques to confidently identify the GNSS interference source as a constellation of Russian early warning satellites in Molniya ("lightning") orbits.
Learning-based and physics-informed methods are increasingly used for inverse estimation in controlled nonlinear dynamical systems. However, in many such approaches, the theoretic requirements that make unknown-input reconstruction meaningful, namely well-posedness in the sense of Hadamard, are often disregarded or weakly addressed through generic regularization terms with no explicit guarantees. In this work, we adopt a complementary viewpoint in which these control-theoretic and structural conditions inform the estimator design and constrain its training. We thus develop a physics-informed input-state neural estimator for joint unknown-input and state estimation in nonlinear controlled systems with partial measurements. In the present work, this general framework is instantiated on a model of autonomic cardiac regulation, provides a concrete study case. The estimator is formulated as an inverse neural map conditioned on time and measured outputs, and is trained under data fidelity and dynamical consistency constraints. To ensure it complies with the same structural requirements imposed in robust estimation, we derive left-invertibility conditions by differential-algebraic elimination and embed the resulting constraints directly into the training objective. We further analyze a priori the stability of the inverse mapping to output perturbations and derive a conservative Lipschitz bound that guides the tuning of cost functional hyper-parameters. The framework is evaluated on simulated data, where ground truth data is available, and on two distinct datasets of real cardiovascular recordings. The results show that incorporating control-theoretic solvability constraints into physics-informed learning improves the reliability of inverse inference beyond forward consistency alone.
Data-driven approaches for learning power flow models suffer from weak generalization across varying network topologies and limited computational scalability. Existing methods typically rely on training over a large set of grid topologies, which becomes impractical for large networks. This paper proposes a scalable and computationally efficient framework for topology-adaptive learning of power flow solutions. We propose a modular architecture consisting of bus-level Gaussian Process (GP) models, where each GP collects local features based on bus-level \textit{egonet} definition. The localized bus-level feature includes first-order power and admittance sensitivities, nodal injections and node degree. In addition to the modular architecture, we propose using Random Fourier Features (RFF) for feature reduction, which further enhances the computational scalability. We evaluate the effectiveness of the proposed method by simulations across multiple benchmark networks under N-1, N-2, and N-3 contingencies. Results for the PEGASE 1354 bus system under N-3 contingencies demonstrate high predictive quality, with an $R^2$ score of 0.983 and a voltage-magnitude RMSE of 0.0023 p.u. The framework maintains recall rates exceeding 98\% for detecting voltage limit violations across all test cases. Furthermore, the approach exhibits scalability, completing training and testing for the PEGASE 1354 system in 116.47 seconds while outperforming existing benchmarks in zero-shot generalization without requiring additional training samples.
While the hybrid energy storage system (HESS) can theoretically mitigate battery degradation in electric vehicles, its practical implementation remains highly limited. To delineate the specific scenarios and application boundaries where supercapacitors remain feasible, this study proposes a multi-dimensional techno-economic feasibility evaluation framework. First, a cross-vehicle sizing method based on dynamic programming is established to quantify physical mass-volume packaging constraints and identify feasible supercapacitor candidates across different vehicle types. Building upon the optimal sizing parameters derived from the battery aging Pareto front, an expert-guided deep reinforcement learning energy management strategy is integrated to yield near-optimal online performance, ensuring a fair life-cycle economic assessment. Finally, a comprehensive feasibility matrix is constructed to systematically evaluate mass, volume, battery lifespan, additional supercapacitor costs, total cost of ownership, future energy storage prices, and the influence of emerging solid-state batteries. Results reveal that city buses remain the most promising vehicle type for HESS due to minimal additional costs and sufficient packaging space. Current mass-volume penalties and limited economic benefits hinder HESS application in passenger vehicles and heavy-duty trucks, respectively. This situation may only improve if supercapacitor prices drop significantly in the future. Beyond vehicle types, the HESS feasibility is governed by load-frequency characteristics. Furthermore, looking toward the 2030+ solid-state battery era, we highlight that integrating increasingly affordable supercapacitors can provide substantial asset protection leverage.
Audio-Visual Event Recognition (AVER) is essential for intelligent urban monitoring systems, where robust multimodal understanding of complex environments is required. This paper proposes a stable hybrid cross-attention fusion framework for audio-visual event recognition in smart urban environments. The proposed architecture combines pretrained Video Masked Autoencoder (VideoMAE) and Audio Spectrogram Transformer (AST) representations with FiLM-based audio conditioning, bidirectional cross-attention fusion, multimodal Transformer encoding, and modality-temporal attention. To improve computational efficiency and training stability, frozen pretrained backbones and cached feature extraction are employed. Extensive experiments on the AVE dataset show that the proposed framework achieves the highest average performance among the evaluated unimodal and multimodal baselines across multiple evaluation metrics, obtaining a best validation accuracy of 91.74% and a test accuracy of 83.85 plus/minus 1.40% over five independent runs. The results indicate that the proposed hybrid fusion strategy effectively captures complementary audio-visual information and provides robust multimodal representation learning for challenging realworld urban monitoring scenarios.
Reliable and affordable electricity supply remains a challenge for remote and regional communities, motivating the deployment of renewable-based microgrids supported by flexible storage and advanced planning methods. This paper proposes an integrated techno-economic framework for optimal microgrid design and robustness assessment, and applies it to a 1000-household residential community in Rockhampton, Queensland (Australia). The framework links time-series simulation, dispatch-based operation, and lifecycle costing to evaluate hybrid configurations comprising photovoltaic and wind generation, battery storage, diesel backup, grid exchange, and an optional hydrogen subsystem (electrolyzer--hydrogen storage--fuel cell). Key indicators include net present cost (NPC), cost of energy (COE), renewable penetration, energy purchased/sold, and emissions-related outcomes. To avoid conclusions that depend on a single set of assumptions, the study performs systematic sensitivity analysis across financial, technical and policy drivers: discount rate, technology capital costs, fuel price, load uncertainty, renewable resource variability, carbon pricing/emissions cost, and grid outage duration, supplemented by a no-hydrogen attribution case. The results demonstrate that several sensitivity dimensions induce nonlinear shifts in the optimal design, including breakpoints where capital-intensive renewable--storage expansion becomes economically preferable. The proposed framework enables transparent comparison of hydrogen-enabled and battery-centric solutions and provides planning guidance for resilient, low-emission community microgrids under Australian operating conditions.
Pinching antenna systems (PASS) have recently emerged as a promising architecture for flexible indoor wireless communications. However, most existing pinching antenna (PA) array designs for multi-user PASS either offer limited beam adaptation accuracy or require prohibitively high deployment cost. In this paper, we investigate a more practical constrained pinching antenna array (C-PAA)-assisted downlink PASS, where multiple PAs are grouped into a movable array and can be finely adjusted within the array at the wavelength scale. To improve the system spectral efficiency, a sum-rate maximization problem is formulated by jointly considering the array-center position and the fine-grained antenna distribution within the C-PAA. First, the structural properties of the C-PAA are characterized, and an explicit upper bound on the array aperture is derived. Then, tractable approximations for the effective channel gain and the achievable user rate are developed. Furthermore, the optimization problem of the multi-user sum-rate is analyzed, where the system sum-rate function is shown to exhibit a favorable unimodal behavior under practically relevant conditions, which enables an efficient one-dimensional search for the optimal C-PAA position. To further reduce the computational complexity, a closed-form approximate solution for the near-optimal array-center position is derived. Numerical results verify the accuracy of the developed analysis and demonstrate that the proposed C-PAA scheme closely approaches the ideal upper bound and significantly outperforms conventional fixed-spacing and existing PA array benchmarks.
Acoustic feedback limits the maximum gain in hearing aids. In addition to several approaches based on adaptive filtering, recently a deep-neural-network-based feedback cancellation (DFC) approach has been proposed, which is trained via an open-loop framework. Since open-loop-trained DFC (DFC-OL) can become unstable during inference at high gains, in this paper we propose an in-the-loop-trained DFC (DFC-IL) that integrates the DFC directly into the optimisation loop. This allows the model to be exposed to unstable conditions during training. A two-stage training strategy involving pre-training on stable systems and fine-tuning on a wider gain range enables DFC-IL to learn robust howling reduction. Experimental results on measured feedback paths demonstrate that in scenarios with small gains, the proposed DFC-IL performs similarly to DFC-OL, and both exceed the performance of adaptive filters. In scenarios with high amplification gains, DFC-IL clearly outperforms DFC-OL by maintaining system stability.
Computing the Lipschitz constant of the solution map of a multi-parametric quadratic program is important for the analysis of optimization-based control. This problem is governed by three factors: the parameter dimension, the number of decision variables, and the number of constraints. While empirical evidence has long suggested exponential complexity, a rigorous complexity-theoretic proof has been lacking. In this paper, we fill this gap by proving that this problem is not only NP-hard but also APX-hard. Furthermore, we reveal that: (a) the problem becomes polynomial-time solvable when the number of constraints or decision variables is fixed; and (b) both NP-hardness and APX-hardness persist even in the scalar parameter case. These results confirm that the complexity stems from the number of constraints and variables, rather than the parameter dimension. Numerical experiments further validate these theoretical findings.
Supervisory control of discrete-event systems provides formal guarantees of correctness with respect to a plant model and specification. However, these guarantees heavily rely on the plant model, which could deviate from nominal behavior due to modeling errors or faults. Recent notions of discrete robustness model deviations as a set of additional transitions that are added to the plant. The discrete robustness is defined as all sets of extra transitions for which the supervised system still guarantees a desired specification. However, this notion suffers from scalability due to the large solution space and conservatism since most deviations are infeasible in practice. This paper proposes to address these two issues using a neurosymbolic computing framework for discrete robustness analysis of safety properties. First, a neural reasoning layer based on Large Language Models infers a set of feasible deviation transitions from system models, specifications, and domain knowledge. Next, a symbolic layer computes the discrete robustness guarantees over the inferred deviation set. We evaluate our framework on three case studies, demonstrating that our method identifies a smaller set of feasible deviations while preserving robustness guarantees comparable to those of full transition-based analysis.
The growing penetration of distributed energy resources (DERs) intensifies congestion in distribution networks by introducing bidirectional power flows and increasing competition for limited network capacity, underscoring the need for effective and efficient congestion management, including flexible grid-access schemes. This paper proposes a bilevel optimization model for the dynamic allocation of connection capacity to DERs under non-firm connection agreements, aligning the objectives of distribution system operator (DSO) and DER owners. The upper-level problem, representing the DSO, determines the allocated connection capacity for all DERs, defined as maximum time-varying power limits, subject to distribution system constraints and the last-in-first-out (LIFO) allocation rule. The lower-level problem, representing DER owners, maximizes the profit of each DER within the allocated power limits. The proposed model is tested on a modified CIGRE medium-voltage (MV) network, demonstrating a balanced trade-off between grid utilization and economic efficiency. Furthermore, the model enhances DER integration, enforces transparent allocation rules, reduces variability in allocation patterns, and achieves up to an 80% reduction in total curtailment costs compared with benchmark methods.
In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at this https URL .
Motivated by recent developments in stochastic modeling of clock jitter in Analog-to-Digital Converters (ADCs) as autoregressive processes of order one (AR(1)), we study the density and stability properties of AR(1)-jittered sampling sets for Paley-Wiener signals. We show that, despite having the correct asymptotic density both on average and almost surely, such sets almost surely fail to be stable sampling sets. We complement this negative result with a finite-dimensional analysis, showing that the corresponding jittered sinc matrices are nonetheless well-conditioned with high probability.
Integrated sensing and communication (ISAC) enables simultaneous sensing and data transmission but exposes a critical vulnerability: probing signals may be intercepted, revealing both the transmitted information and the act of sensing itself. Existing physical layer security approaches mitigate interception yet operate with detectable signals, leaving sensing activity observable to a passive warden. This paper introduces sub-noise-floor pseudo-random probing (SNF-PRP), a covert sensing framework for OFDM-based ISAC systems under an energy-detection adversary model. SNF-PRP establishes an $\epsilon$-covertness guarantee via Kullback-Leibler (KL) divergence, exploits an $N_{\mathrm{sc}}$-fold spreading gain absent from prior wideband analyses, and derives in closed form the minimum integration length required to achieve a target Cramér-Rao bound (CRB). Simulations under 5G~NR n78 numerology confirm sub-0.5\,m range and sub-0.5\,m/s velocity accuracy with KL divergence $5.8\times$ below the covertness threshold, validating joint feasibility at $-12$\,dB and $-15$\,dB probing powers.
This paper presents a geometric adaptive control scheme for a quadrotor unmanned aerial vehicle, where the effects of unknown, unstructured disturbances are mitigated by a multilayer neural network that is adjusted online. The stability of the proposed controller is analyzed with Lyapunov stability theory on the special Euclidean group, and it is shown that the tracking errors are uniformly ultimately bounded with an ultimate bound that can be abridged arbitrarily. A mathematical model of wind disturbance on the quadrotor dynamics is presented, and it is shown that the proposed adaptive controller is capable of rejecting the effects of wind disturbances successfully. These are illustrated by numerical examples.
This paper proposes a geometric adaptive controller for a quadrotor unmanned aerial vehicle with artificial neural networks. It is assumed that the dynamics of a quadrotor is disturbed by arbitrary, unstructured forces and moments caused by wind. To address this, the proposed control system is augmented with multilayer neural networks, and the weights of neural networks are adjusted online according to an adaptive law. By utilizing the universal approximation theorem, it is shown that the effects of unknown disturbances can be mitigated. More specifically, under the proposed control system, the tracking errors in the position and the heading direction are uniformly ultimately bounded where the ultimate bound can be reduced arbitrarily. These are developed directly on the special Euclidean group to avoid complexities or singularities inherent to local parameterizations. The efficacy of the proposed control system is first illustrated by numerical examples. Then, several indoor flight experiments are presented to demonstrate that the proposed controller successfully rejects the effects of wind disturbances even for aggressive, agile maneuvers.
Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time- and resource-intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non-invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis-specific signal has been identified in ECGs, they can not currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework, allowing stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at this https URL.
Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (this https URL) for codes and more generated songs.
Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within the same modality can disagree with evidence from other sources. This work investigates how to adjust multimodal representations before they are merged by a downstream predictor. We develop a compact calibration module that compares each modality with the others at the summary level, extracts cues of cross-source support and conflict, and converts these cues into instance-wise and dimension-wise modulation signals. The calibration is applied to the original modality features rather than to already fused representations, enabling the model to suppress misleading components, preserve weak but useful evidence, and emphasize responses that are better supported by the current multimodal context. The module is designed as a plug-in component and can be attached to different fusion backbones without changing their prediction heads. Across five benchmarks covering sentiment understanding, action recognition, audio-visual event detection, and audio-visual emotion classification, the proposed pre-combination calibration strategy improves performance under both sequence-based and convolutional fusion settings. Additional analyses under modality removal, synthetic corruption, training dynamics, and feature-level visualization show that calibrating signals before fusion can reduce interference from unreliable modalities and produce more stable multimodal optimization.
Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose \textbf{EntangleCodec}, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to \textbf{+7.4\%} on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at \textit{0.6B} parameters, the model surpasses specialized continuous-representation LLMs with over \textit{13B} parameters across three benchmarks using \textbf{22$\times$} fewer parameters; scaling to \textit{8B} further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at this https URL.
Photonic graph states with advanced topologies can enable measurement-based quantum computing, distributed quantum sensing, and quantum interconnects. However, the efficient generation of photonic graph states is limited by the probabilistic nature of photonic entangling operations and the exponential dependence of generation rate on resource cost. In this work, we study photonic graph state synthesis as a cost-aware decomposition problem, exploiting local Clifford (LC) equivalence to identify more synthesis-friendly representations of the target graph state before decomposition. Specifically, we propose Cost-aware Fusion-based Decomposition (CFD), a three-stage heuristic framework that decomposes a target graph state into ring, star, and linear motifs, and assembles them via Type-I fusion operations to minimize fusion overhead and physical-qubit consumption. We further show that selecting the LC-equivalent graph state with the minimum number of edges provides a highly effective proxy for near-optimal synthesis: in many cases it matches the best generation rate observed within the LC equivalence class under CFD, and in most remaining cases it remains close to it. Numerical evaluations on graph state orbit data and 2D and 3D lattice graph states demonstrate that CFD achieves up to 84.6\% reduction in resource overhead compared to baseline constructions, and yields improvements in photonic generation rate spanning multiple orders of magnitude. These results suggest that combining structure-aware motif decomposition with LC equivalence is a practical and scalable strategy for photonic graph state synthesis.
This paper develops a data-driven realization of the generalized Koopman operator (GeKo), in which states and inputs are lifted independently and the dynamics are expressed as a tensor bilinear system. The first contribution is a time-sequenced multi-step Khatri-Rao kernel regression formulation that exposes the operator to evolved snapshots along trajectories rather than only single one-step pairs, which reduces compounded prediction error. Secondly, we develop a kernel- and input-agnostic structured SVD reduction that compresses the lifted state and input spaces while preserving the Khatri-Rao realization. We instantiate the framework with random Fourier features and describe a complete predictive-control pipeline, including a multi-step roll-out diagnostic that guides the choice of MPC horizon. The framework is validated on the chaotic Lorenz system, where the learned reduced-order GeKo model stabilizes an unstable equilibrium from a range of initial conditions.
Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.
Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or their immediate this http URL propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly-semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible? We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft. We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.
Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a consumer smartphone. We present CoughSense, a system that sorts cough recordings into five classes. These are healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia. We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and used the OpenAI Whisper encoder as a pretrained backbone for cough disease classification. The main contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder tokens. This avoids the silence-dilution problem that arises because a 3-second cough fills only 150 tokens of Whisper's 30-second input window. Other training parts handle the 19 to 1 class imbalance and the four-dataset domain shift. These include WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. A dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model through cross-attention. CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3 percent balanced accuracy on five-fold cross-validation (macro-F1 of 0.817, AUC of 0.941). It beat an ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT trained from scratch by 29.6 points. All five classes passed 74 percent recall and four of five passed 80 percent. The dual-encoder model reached 85.4 percent balanced accuracy. Active-frame pooling is the largest single contributor across all ablation components at 5.1 points, which should help any short-audio task using Whisper as a backbone.
Rotatable antenna (RA) technology has emerged as a promising solution to enhance spectrum efficiency by exploiting additional spatial degrees of freedom (DoFs) in multiple access networks. However, the relative performance superiority among different multiple access schemes remains largely unclear due to the unique capability of RA in reconfiguring the directional gain pattern. In this letter, we conduct a theoretical comparison between non-orthogonal multiple access (NOMA) and orthogonal multiple access (OMA) schemes in RA-assisted communication systems in terms of transmit power minimization, subject to constraints on antenna rotational range and users' target rates. To address the associated non-convex optimization problem, a particle swarm optimization (PSO) algorithm is employed to optimize the rotational angle. Simulation results demonstrate that RA-assisted schemes significantly reduce transmit power compared to fixed-antenna benchmarks. Furthermore, RA-assisted NOMA may perform worse than time-division multiple access (TDMA) for symmetric user deployments, while it exhibits superior robustness and energy efficiency in asymmetric scenarios.
Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.
Diffusion models achieve high-fidelity radio map construction through iterative denoising, yet their sampling cost limits practicality in dynamic wireless systems where radio maps must be refreshed repeatedly. Meanwhile, classical propagation models encode valuable scene-level knowledge that standard diffusion inference discards entirely by initializing from pure Gaussian noise. This paper bridges propagation priors and diffusion refinement through a mid-start sampling strategy. A matched propagation prior is perturbed to an intermediate diffusion timestep, and the pretrained diffusion backbone executes only the remaining reverse steps, focusing computation on multipath-aware refinement rather than full reconstruction from noise. We provide theoretical analysis establishing an upper bound on the initialization gap, a sufficient condition under which truncation improves reconstruction fidelity, and a formal characterization of prior-quality sensitivity under aggressive truncation. Experiments on IRT4HighRes show that, at $P_{\text{start}}=0.5$, the proposed method achieves a $2.01\times$ speedup while simultaneously improving NMSE, RMSE, SSIM, and PSNR over the full-step baseline. A prior-quality ablation across three propagation models of different fidelity confirms that reconstruction quality tracks prior quality, with the sensitivity amplified under shorter reverse trajectories, consistent with the theoretical predictions. These results also suggest that mid-start reconstruction quality can serve as a proxy for ranking the scene-level fidelity of different propagation models.
Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: this https URL.
Underwater manipulation often occurs under degraded visibility due to turbidity, glare, and gripper occlusion, limiting the reliability of vision-based perception during approach and grasping. In such settings, soft grippers are well suited for compliant interaction, but they typically lack an onboard pre-contact cue that can guide approach and closure when vision is unreliable. This extended abstract explores active electrosense as a lightweight sensing modality that can provide a proximity-like signal prior to contact by measuring perturbations of an applied electric field in conductive media. We instrument an octopus-inspired gripper with a discrete electrode layout and record multi-channel sensing voltages using off-the-shelf hardware. Simulation and tank experiments with a suspended conductive sphere show structured, object-dependent changes in the multi-electrode voltage readout relative to empty-water baselines, with detectability varying across excitation of 5 to 20 V and frequencies from 1 mHz to 1 kHz. These findings motivate systematic investigation of gripper-integrated electrosense as a complementary pre-contact cue for underwater soft manipulation.
Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30\% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X$\to$EN and EN$\to$X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's $\rho>0.80$) while cutting evaluation time by $\approx 2.5\times$. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment ($\rho \geq 0.90$). We release COMPASS as a foundation for domain-aware S2ST evaluation.
In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.
Artificial-intelligence surrogates can support second-by-second thermal-hydraulic forecasting, but models selected and frozen offline may become condition-locked once deployed outside their pretraining envelope. This study develops a guarded continual-adaptation framework for experimental thermal-hydraulic loop data in which role-separated agents - Monitor, Diagnosis, Adaptation, Safety-Auditor, and Orchestrator - diagnose error signatures, prioritize candidate model families, and review promotions, while deterministic champion-challenger gates and background shadow learning retain final authority over model replacement. Seven surrogate families were screened by blocked three-fold cross-validation, and a temporal Fourier neural operator was selected as the initial champion for 60-s-history-to-10-s-trajectory forecasting on two held-out transients, with three seeds per adaptive mode. Static deployment gave a channel-averaged MAE of 7.06 and a 56.8% warning-exceedance ratio; rule-based adaptation reduced MAE to 6.54, whereas shadow refresh alone remained close to Static. The MA-Full mode, in which the role-separated multi-agent council reviews every evaluated stream step, achieved the lowest mean error, 5.72, and 35.8% exceedance, corresponding to a 19.0% improvement over Static. Paired bootstrap intervals against Static excluded zero, although intervals among adaptive modes overlapped and the six paired units limit broad statistical claims. Validated promotions from the neural operator to Transformer and graph neural network indicate that logged, gate-controlled adaptation can support auditable surrogate evolution while deterministic gates retain deployment authority.
This paper investigates the secrecy sum-rate (SSR) performance of optical intelligent reflecting surface (OIRS)-assisted multi-user visible light communication (VLC) systems under line-of-sight (LoS) blockages. To mitigate physical obstructions and internal eavesdropping, a joint optimization problem is formulated to maximize the SSR through the co-design of the transmission precoder and OIRS units assignment. Due to the binary constraints and coupled variables, the problem is highly non-convex. To solve it efficiently, an alternating optimization (AO) framework integrating the concave-convex procedure (CCCP) and first-order Taylor approximations is developed. Simulation results demonstrate the convergence of the proposed algorithm and show that increasing the number of OIRS reflecting units yields significant SSR gains.
A class of Kuramoto models with a general coupling function that can be expressed in terms of a finite number of harmonics, each comprising sinusoidal terms, is studied. We propose a novel approach for certifying local phase synchronization in this class for all initial conditions lying on an arc. The trace parametrization property and Gram matrix representation of a trigonometric polynomial are utilized along with Putinar's Positivstellensatz to obtain semidefinite programming certificates for the stability of the phase-difference system, which in turn implies synchronization of the original system. The results can be extended to any system of coupled oscillators where the forward-invariance on arcs can be established.
This article presents a finite-horizon linear quadratic regulator for the control of the first-order Lighthill-Whitham-Richards traffic model with a triangular fundamental diagram. The in-domain control action is realized through variable speed limits implemented as a source term in the governing hyperbolic partial differential equation. Unlike prior studies on infinite-horizon formulations, this article develops a finite-horizon LQR framework, deriving a space and time varying state feedback function for hyperbolic PDEs. The solution to the finite time optimal control problem relies on the solution of another PDE, called the Riccati PDE. The resulting nonlinear Riccati PDE is solved analytically via the parametric method of characteristics. The Riccati PDE solution is a function of both time and space, as well as the traffic regime. A sensitivity analysis demonstrates the effects of the LQR parameters for both the infinite and finite time horizon problem in different traffic situations, while siulations validate the finite-horizon LQR's ability to guarentee finite-time convergence. Comapred to the infinite-horizon LQR, the proposed approach achieves significantly improved control performance across various scenarios, making it particularly suitable for time-sensitive traffic management applications.
We demonstrated hybrid free-space optics (FSO) and D-band (110-170GHz) millimetre wave transmitter enabled by a single phase-locked laser pair, simultaneously enabling ultra-low RF phase noise and optical linewidth for communications. Based on this, we further study combined capacity with beam angle misalignment using >100Gb/s signalling.
Graph Neural Networks (GNNs) have emerged as a powerful tool for wireless resource allocation that leverages the underlying graph structure of communication networks. Their transferability property enables models trained on small-scale graphs to generalize to large-scale deployments with little performance deterioration, a desirable property for currently growing networks. Wireless networks are sparse regimes, where a single node is connected to a small number of other users. This work establishes theoretical results for transferability of GNNs over graphs derived from sparse Random Geometric Graphs (RGGs). In particular, we focus on conflict graphs of RGGs used to model interference among links. Our approach considers the closeness between RGGs and Deterministic Grid Graphs (DGG) to establish bounds in the performance loss when a model is transferred across scales. We validate our theoretical findings through the problem of link scheduling, demonstrating that our learned policies consistently outperform existing benchmarks at scale. Finally, we examine the impact of our theoretical assumptions on empirical performance.
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.
Motivated by the vision of making sixth-generation (6G) networks sustainable, we study the sparse antenna/array activation problems in uplink cell-free massive multiple-input multiple-output (CF mMIMO) networks. We first develop an antenna-level optimal bilinear equalizer (OBE) weighting framework, in which each access point-user equipment (AP-UE) pair is assigned a matrix-valued long-term weight to shape the contribution of individual antenna elements, thereby generalizing the conventional large-scale fading decoding (LSFD) strategy from scalar coefficients to antenna-element-aware weighting. Building on this structure, we formulate sparse antenna activation as structured sparsity-inducing mean square error (MSE) minimization problems, and design four activation schemes at two granularities: antenna-level and array-level, each with UE-specific and network-wide (all-UEs) variants. The resulting convex problems are solved efficiently via the proximal method with closed-form group-wise updates, while the network-wide schemes are modeled through hierarchical sparsity and handled by a tree-structured proximal operator. Numerical results under correlated Rician channels and a detailed power consumption model demonstrate that the OBE weighting scheme consistently improves spectral efficiency over the LSFD, with gains increasing with the number of antennas. Meanwhile, the studied sparse activation schemes can achieve substantial energy efficiency improvement and power reduction with controllable spectral efficiency loss.
This letter proposes a novel distributed bearing-based pose estimator for time-varying multi-robot systems. The method uses angles computed from body-frame bearings to estimate the robots' positions in $\mathbb{R}^3$ without knowledge of their orientations. The orientations in $\mathrm{SO}(3)$ are recovered from the estimated positions, the bearings, and the bearing derivatives. The proposed observer only requires the (directed) sensing topology to be \textit{angle-rigid}, a weaker condition than the commonly used ones like bearing rigidity. Local uniform exponential stability of the proposed observer is established under the assumption of persistently exciting motions for a subset of robots. Simulations are presented and discussed to evaluate the scheme's effectiveness and practicality.
Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.
In this paper we introduce a novel block-based regression strategy for image denoising based on edge-aware Steered-Mixture-of-Experts (SMoE) models. SMoEs provide very sparse image representations, able to model sharp edges as well as smooth transitions in images efficiently with few parameters. A multi-model inference strategy is developed that improves significantly the denoising capacity of single SMoE models. We show that the important edge reconstruction properties of SMoEs are well preserved, even when many models are fused under severe noise. We investigate model-inference from local neighborhood blocks as well as from distant blocks using block-matching as in BM3D. Our initial results indicate that SMoE multi-model regression can provide promising results compared to state-of-the-art BM3D with excellent edge quality.
The absence of formal performance guarantees in machine learning (ML) has limited its adoption for safety-critical power system applications, where confidence and interpretability are as vital as accuracy. In this work, we present a probabilistic guarantee for power flow learning and voltage risk estimation, derived through the framework of Gaussian Process (GP) regression. Specifically, we establish a bound on the expected estimation error that connects the GP's predictive variance to confidence in voltage risk estimates, ensuring statistical equivalence with Monte Carlo-based ACPF risk quantification. To enhance model learnability in the low-data regime, we first design the Vertex-Degree Kernel (VDK), a topology-aware additive kernel that decomposes voltage-load interactions into local neighborhoods for efficient large-scale learning. Building on this, we introduce a network-swipe active learning (AL) algorithm that adaptively samples informative operating points and provides a principled stopping criterion without requiring out-of-sample validation. Together, these developments mitigate the principal bottleneck of ML-based power flow, its lack of guaranteed reliability, by combining data efficiency with analytical assurance. Empirical evaluations across IEEE 118-, 500-, and 1354-bus systems confirm that the proposed VDK-GP achieves mean absolute voltage errors below 1E-03 p.u., reproduces Monte Carlo-level voltage risk estimates with 15x fewer ACPF computations, and achieves over 120x reduction in evaluation time while conservatively bounding violation probabilities.
Inspired by the use of adaptive kernel-based Cohen's class time-frequency distributions (CCTFDs) for cross-term suppression, this paper aims to explore novel adaptive kernel functions for denoising, with a particular focus on non-stationary signal processing in practical applications}. We integrate Wiener filter principle and the time-frequency filtering mechanism of CCTFD to design the least-squares adaptive filter method in the Wigner-Ville distribution (WVD) domain, giving birth to the least-squares adaptive filter-based CCTFD whose kernel function can be adjusted with the input signal automatically to achieve the minimum mean-square error denoising in the WVD domain. {Numerical experiments on typical simulated radar signals and real-world electrocardiogram data comprehensively demonstrate that the proposed adaptive CCTFD outperforms several state-of-the-art methods in noise suppression.
This work presents a cognitive radar (CR) framework designed to track multiple aircraft under unknown disturbances using massive multiple-input multiple-output (MMIMO) systems. Since uniform power allocation is suboptimal across varying signal-to-noise ratios (SNRs), we couple an adaptive waveform design driven by Partially Observable Monte Carlo Planning (POMCP). By assigning an independent POMCP tree to each target, the system efficiently predicts target states. These predictions inform a constrained optimization problem that actively directs transmit energy toward weaker targets while maintaining sufficient power for stronger ones. Results confirm that the proposed POMCP method improves the detection probability for low-SNR targets from 0.6 to nearly 0.9, and yields more accurate tracking of the weakest target than a non-adaptive orthogonal waveform or a cognitive uniform-power POMCP baseline.
This work establishes a step forward in advancing data-driven trajectory-based methods for stochastic systems with unknown mathematical dynamics. In contrast to scenario-based approaches that rely on independent and identically distributed (i.i.d.) trajectories, this work develops a data-driven framework where each trajectory is gathered over a finite horizon and exhibits temporal dependence, referred to as a non-i.i.d. trajectory. To ensure safety of dynamical systems using such trajectories, the current body of literature primarily considers dynamics subject to unknown-but-bounded disturbances, which facilitates robust analysis. While promising, such bounds may be violated in practice and the resulting worst-case robust analysis tends to be overly conservative. To overcome these key challenges, this paper considers stochastic systems with unknown mathematical dynamics, influenced by process noise with arbitrary distributions. In the proposed framework, data is collected from stochastic systems under multiple realizations within a finite-horizon experiment, where each realization generates a non-i.i.d. trajectory. Leveraging the concept of stochastic control barrier certificates constructed from data, this work quantifies probabilistic safety guarantees with a certified confidence level. To achieve this, the proposed conditions are formulated as a sum-of-squares (SOS) optimization problem, relying solely on empirical average of the collected trajectories and statistical features of the process noise. The efficacy of the approach has been validated on three stochastic benchmarks with unknown models and arbitrary noise distributions. In one case study, it is shown that while no safety controller exists for the robust analysis of the system under bounded disturbances, the proposed stochastic framework yields a safety controller together with quantified probabilistic safety guarantees.
Graph neural networks (GNNs) excel in processing non-Euclidean data, but traditional spectral GNNs rely on static bases and fundamentally lack active spectral regulation. Although the graph fractional Fourier transform (GFRFT) introduces cross-domain modulation, it applies a uniform fractional parameter across all frequencies. This ignores frequency heterogeneity and restricts the models' adaptive capacity in graph node classification tasks. In the paper, we propose two novel types of multiple-parameter GFRFTs (MPGFRFTs) and establish their corresponding theoretical frameworks, including essential properties, computational complexity, and parameters differentiability. By assigning independent, learnable fractional parameters to distinct frequency bands, MPGFRFTs enable fine-grained spectral regulation. Then, we operationalize this mathematical framework by designing the adaptive multiple-parameter fractional spectral regulation (MPFSR) module, a plug-and-play component for mainstream spectral models. We also establish rigorous theoretical bounds on the spectral stability of this module, guaranteeing a stable and reliable convergence during the end-to-end parameters optimization. Experiments demonstrate that integrating the proposed MPFSR module alleviates the constraints of static bases and yields performance gains in node classification on complex graphs, advancing a novel paradigm for active spectral modulation in graph representation learning.
We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.
Traditional directed graph signal processing generally depends on fixed representation matrices, whose rigid structures limit the model's ability to adapt to complex graph topologies. To address this issue, this study employed the unified graph representation matrix (UGRM) to propose a generalized graph Fourier transform (UGRM-GFT) method based on singular value decomposition (SVD) for signal analysis on directed graphs and Cartesian product graphs. We defined UGRM-GFT for general directed graphs by introducing a parameterized UGRM that incorporates traditional representations such as the Laplacian matrix and adjacency matrix. The SVD is used to construct spectral transform pairs with both left and right singular vectors. We extended this approach to two types of UGRM-GFTs applied to directed Cartesian product graphs. UGRM-GFT-I performs SVD directly on the composite UGRM matrix of the two-dimensional graph structure, suitable for globally coupled graph signals. UGRM-GFT-II separately applies SVD to the UGRMs of the two-factor graphs and then combines the results, significantly reducing computational complexity while preserving spectral expressiveness. Theoretical analysis confirmed the monotonicity of the proposed method with respect to the parameters alpha and k embedded in the UGRM. Experimental results on real-world datasets demonstrated that the proposed method significantly outperforms traditional fixed-matrix approaches in denoising tasks, with a particular emphasis on signal-to-noise ratio and bandwidth efficiency.
Emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes FoundCAC, a universal foundational framework that resolves two challenges hindering the generalization of existing pipelines: the difficulty of scaling training data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase degradation diversity and construct AODLibpro, a large-scale, unbiased lens library based on a uniform sampling strategy that quantifies spatial-variation patterns and severity. In terms of model design, to leverage Point Spread Functions (PSFs) as guidance while maintaining the blind paradigm, we propose a multi-stage vector-quantized representation learning scheme. This paradigm is specifically designed to construct a Latent PSF Representation (LPR), explicitly encoding complex continuous PSFs into a discrete degradation prior to regularize the highly ill-posed restoration process. Through a simple yet effective codebook-freezing strategy, our framework leverages the discrete prior to elevate full-shot restoration performance and unlock highly efficient few-shot adaptation for unseen lenses. Experiments on diverse aberrations of synthetic LensLib and real-world lenses demonstrate that our framework achieves state-of-the-art zero-shot generalization while enabling highly efficient few-shot adaptation for specific lenses. The source code and datasets will be made publicly available at this https URL.
Accurate direction-of-arrival estimation with hybrid analog--digital millimeter-wave sensor arrays is important for localization, environment sensing, and measurement beam control for sensing applications. However, the limited number of radio-frequency chains and training beams in practical hardware makes it difficult to approach the angular resolution of fully digital arrays. This paper develops a covariance-guided discrete Fourier transform (DFT) beam selection framework tailored to beamspace ESPRIT for hybrid millimeter-wave receivers. A short hybrid training phase realizes a virtual centro-symmetric subarray and yields a sample covariance that is processed by forward--backward averaging, nonnegative least-squares power and noise fitting, and a Toeplitz positive-semidefinite projection to reconstruct a denoised full-aperture covariance matrix. This covariance is then used to score and select, within each coarse sector, small contiguous blocks of DFT beams that concentrate signal energy and preserve effective aperture under a strict beam budget. The selected beams feed a sparse beamspace ESPRIT stage that operates only on actually available adjacent beam pairs, so that the overall complexity is dominated by a single low-dimensional ESPRIT call. Monte Carlo simulations for a thirty-two-element uniform linear array with three paths indicate that, in the considered scenarios, the proposed method can reduce the gap to the Cramér--Rao bound, lower the failure rate, and provide favorable accuracy--runtime trade-offs compared with a sectorization-based baseline built from the same codebook and estimator. For the unitary DFT codebook studied here, the fine-stage beam selector reduces to a covariance-guided contiguous energy-window rule; the broader score formulation also accommodates non-unitary effective beam dictionaries arising from hardware non-idealities.
In this paper, a constrained control approach to variable speed limit (VSL) control for macroscopic partial differential equations (PDE) traffic models is developed. Control Lyapunov function (CLF) theory for ordinary differential equations (ODE) is extended to account for spatially and temporally varying states and control inputs. The stabilizing CLF is then unified with safety constraints through the introduction of spatially varying control barrier functions (sCBF). These methods are applied to in-domain VSL control of the Lighthill-Whitham-Richards (LWR) model to regulate traffic density to a desired profile while ensuring the density remains below prescribed limits enforced by the sCBF. Results show that incorporating constrained control minimally affects the stabilizing control input while successfully maintaining the density with the defined safe set.
Integrating contention-based random access procedures into low Earth orbit (LEO) satellite communication (SatCom) systems poses new challenges, including long propagation delays, large Doppler shifts, and a large number of simultaneous access attempts. These factors degrade the efficiency and responsiveness of conventional random access schemes, particularly in scenarios such as satellite-based internet of things and direct-to-device services. In this paper, we propose a deep learning-based random access framework designed for LEO SatCom systems. The framework incorporates an early preamble collision classifier that uses multi-antenna correlation features and a lightweight 1D convolutional neural network to estimate the number of collided users at the earliest stage. Based on this estimate, we introduce an opportunistic transmission scheme that balances access probability and resource efficiency to improve success rates and reduce delay. Simulation results under 3GPP-compliant LEO settings confirm that the proposed framework achieves higher access success probability, lower delay, better physical uplink shared channel utilization, and reduced computational complexity compared to existing schemes.
In the context of distributed fusion estimation, directly transmitting local estimates to the fusion center may cause a privacy leakage concerning exogenous inputs. Thus, it is crucial to protect exogenous inputs against full eavesdropping while achieving distributed fusion estimation. To address this issue, a noise injection strategy is provided by injecting mutually independent noises into the local estimates transmitted to the fusion center. To determine the covariance matrices of the injected noises, a constrained minimization problem is constructed by minimizing the sum of mean square errors of the local estimates while ensuring ({\epsilon}, {\delta})-differential privacy. Suffering from the non-convexity of the minimization problem, an approach of relaxation is proposed, which efficiently solves the minimization problem without sacrificing differential privacy level. Then, a differentially private distributed fusion estimation algorithm based on the covariance intersection approach is developed. Further, by introducing a feedback mechanism, the fusion estimation accuracy is enhanced on the premise of the same ({\epsilon}, {\delta})-differential privacy. Finally, an illustrative example is provided to demonstrate the effectiveness of the proposed algorithms, and the trade-off between differential privacy level and fusion estimation accuracy.
Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.
Sampling-based model predictive control (MPC) is effective for nonlinear systems but often produces non-smooth control inputs due to random sampling. To address this issue, we extend the model predictive path integral (MPPI) framework with deterministic sampling and improvements from cross-entropy method (CEM)--MPC, such as iterative optimization, proposing deterministic sampling MPPI (dsMPPI). This combination leverages the exponential weighting of MPPI alongside the efficiency of deterministic samples. Experiments demonstrate that dsMPPI achieves smoother trajectories compared to state-of-the-art methods.
In this paper, we address the problem of circumnavigation of a stationary target by a heterogeneous group comprising of $\textbf{n}$ autonomous agents, having unicycle kinematics. The agents are assumed to have constant linear speeds, we control only the angular speeds. Assuming limited sensing capabilities of the agents, only a subset of agents, termed as \textit{leaders}, know the target location. The rest, termed as \textit{followers}, do not. We propose a distributed guidance law which drives all the agents towards the desired objective; global asymptotic stability (GAS) is ensured by using Zubov's theorem. The efficacy of the approach is demonstrated through both numerical simulations and hardware experiments.
Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. In this work we propose a Lyapunov-Constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We learn a linear lifted surrogate of the error dynamics via Extended Dynamic Mode Decomposition (EDMD) and solve the Discrete Algebraic Riccati Equation (DARE) to obtain a closed-form quadratic candidate Control Lyapunov Function (CLF). This CLF is incorporated into the SAC actor update as a Lagrangian penalty that aggregates the worst-case tail of violations via a Conditional Value-at-Risk (CVaR) objective, concentrating constraint pressure on rare but severe instability events. We further introduce three structural EDMD refinements spectral-radius normalization of the lifted A-matrix prior to the DARE solve, a physically meaningful LQR state cost, and a value-bias anchor enforcing V(0)=0 that make the closed-form CLF well-posed for higher-dimensional lifted models such as the cartpole and 3D quadrotor. The ablation study shows that a hard Lagrangian constraint is essential, replacing it with reward shaping (Lyap-RS-SAC) destabilizes learning and collapses return on quadrotor tasks.
Cell-free massive-multiple-input-multiple-output (CFmMIMO) is a key enabler for sixth-generation (6G) wireless communication networks, where distributed access points (APs) jointly serve user equipments (UEs). In commonly adopted channel models for CFmMIMO networks, inter-AP channel correlation is assumed to be absent, thereby eliminating the potential benefits of centralized processing. However, by carefully designing the pilot transmission phase, the AP received signals during pilot transmission can become correlated, and thus, centralization can improve channel estimation performance, despite the absence of inter-AP channel correlation. In this paper, we propose a channel estimation scheme, termed master-assisted channel estimation (MACE), that aims to leverage inter-AP signal correlation by means of partially centralized processing and hence improve channel estimation performance. In MACE, a subset of APs fuse and forward their received pilot signals to a master AP, which then performs channel estimation using the fused signals together with its locally received signals. This scheme strikes a balance between local and fully centralized processing by leveraging inter-AP signal correlation, while reducing fronthaul signaling and computational complexity. Numerical experiments demonstrate that MACE consistently outperforms local channel estimation, where inter-AP signal correlation is neglected.
6G will require on-device antenna systems to operate at ultra-high frequency bands, achieve robust beamforming on the compact user devices, and be blockage-robust. Conventional edge-mounted antennas on devices have limited apertures, suffer from the 'death grip' caused by user-induced blockage, and have poor scalability at mmWave and sub-THz bands. To address these issues, motivated by the rapid evolution of transparent materials and antennas, we propose ScreenAnt in this work--which integrates a transparent antenna array onto the screens of future mobile devices. Specifically, we propose using a transparent on-screen uniform planar array and develop a framework to model its electromagnetic property, spatial configuration, and blockage robustness under realistic user-induced blockage. We also design a gradient-ascent-based algorithm to efficiently optimize power and phase control of on-screen antennas to maximize ScreenAnt's spectral efficiency. Our thorough simulations show that the proposed ScreenAnt can increase the uplink spectral efficiency by over 50% compared to edge-mounted antennas at 28 GHz, and by more than 150% at 300 GHz. ScreenAnt also demonstrates strong robustness against user-induced blockage, paving the way for practical and high-capacity 6G user device designs.
Non-destructive testing using ultrasound is based on the interaction of sound waves with the object being tested and any defects it may contain. The aim is to extract as much information as possible about the object and its defects from the scattered wave field. In this paper, the concept of information in the context of ultrasonic testing is formalized and quantified physically for the first time. To this end, a balance equation for information is derived, analogous to Poynting's theorem for elastic energy. Various examples demonstrate how structural information is generated and annihilated within a component and along which pathways it travels from the defect to the sensor. Subsequently, the significance and potential of this new information concept for practical ultrasonic testing, structural health monitoring, numerical simulation, and machine learning are discussed. Finally, similarities and differences to mathematical Shannon information and statistical Fisher information are highlighted.
Modern UAV architectures increasingly aim to unify high-level autonomy and low-level flight control on a single General-Purpose Operating System (GPOS). However, complex multi-core System-on-Chips (SoCs) introduce significant timing indeterminism due to shared resource contention. This paper performs an architectural analysis of the PREEMPT RT Linux kernel on a Raspberry Pi 5, specifically isolating the impact of kernel activation paths (deferred execution SoftIRQs versus real-time direct activation) on a 250 Hz control loop. Results show that under heavy stress, the standard kernel is unsuitable, exhibiting worst-case latencies exceeding 9 ms. In contrast, PREEMPT RT reduced the worst-case latency by nearly 88 percent to under 225 microseconds, enforcing a direct wake-up path that mitigates OS noise. These findings demonstrate that while PREEMPT RT resolves scheduling variance, the residual jitter on modern SoCs is primarily driven by hardware memory contention.
X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11--10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/ slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.
This paper investigates the secrecy sum rate (SSR) of rate-splitting multiple access (RSMA)-based visible light communication (VLC) systems considering internal eavesdropping, where legitimate users may intercept private data intended for others. We formulate an optimization problem to maximize the SSR of the system, which is inherently non-convex due to the complex coupling of the objective function and constraints. To this end, two different approaches based on the convex-concave procedure (CCCP) and semidefinite relaxation (SDR) are leveraged to solve the non-convex parameterized problem. A central focus of this work is the investigation of channel similarity (CS), which serves as a metric for quantifying spatial correlation, and its impact on SSR performance. To mitigate the performance degradation caused by high spatial correlation, we propose a channel similarity reduction (CSR) clustering strategy that proactively minimizes CS to restore the system's degrees of freedom (DoF). Numerical results are provided to demonstrate the performance of the two proposed algorithms under various levels of CS. More importantly, the findings reveal that our proposed CSR-clustering strategy significantly outperforms existing baselines, effectively overcoming the secrecy performance ceiling caused by high spatial correlation.
This paper presents the Domain-Agnostic Incremental Learning for Audio Classification Task of the DCASE 2026 Challenge. Incremental learning refers to sequentially learning new tasks with the same system while maintaining its knowledge and performance on the previously learned task. Domain-incremental learning for sound classification refers to learning the same sound classes but in different acoustic domains, and was formalized as a data challenge for the first time in DCASE 2026. Participants will train a system to learn ten sound classes in three different domains, with learning at each incremental task not having access to previous task data. Submitted systems will be ranked by the overall average accuracy calculated over the three domains. During the development stage, the provided baseline system obtains a modest performance of 52.5\% accuracy over the last two domains, mostly due to erroneous inference of the domain for the test sample.
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
In this paper, we propose a novel, computationally efficient reduced order method to solve linear parabolic inverse source problems. Our approach provides accurate numerical solutions without relying on specific training data. The forward solution is constructed using a Krylov sequence, while the source term is recovered via the conjugate gradient (CG) method. Under a weak regularity assumption on the solution of the parabolic partial differential equations (PDEs), we establish convergence of the forward solution and provide a rigorous error estimate for our method. Numerical results demonstrate that our approach offers substantial computational savings compared to the traditional finite element method (FEM) and retains equivalent accuracy.
Leader-based Byzantine-fault-tolerant (BFT) protocols provide low latency and simple communication structure, but they give the leader short-term control over transaction inclusion. A malicious leader can keep the protocol live while delaying or excluding time-sensitive transactions such as auction bids, oracle updates, liquidations, and bridge messages. Existing responses often build a fixed censorship-resistance, hiding, or ordering mechanism into the protocol path, forcing all transactions to pay for the same protection level. name follows the end-to-end principle: the consensus layer exposes inclusion primitives rather than hardcoding stronger policies. Higher-layer protocols can then choose their own submission strategies and resources, whether through replication, erasure coding, or other mechanisms, to obtain the censorship-resistance, hiding, ordering, or execution guarantees they need. At the core of BigDipper is censorship-resistant data availability, or DA-CR, which certifies available replica-contributed mini-blocks for use by leader-based consensus. A central design goal is that data remains sharded on the consensus critical path: validators do not reconstruct or execute the full payload before voting, but instead check commitments, availability evidence, and the DA-CR inclusion rule. We define DA-CR guarantees for data-tampering resistance, honest mini-block inclusion, and residual leader influence. We then give concrete constructions based on erasure coding and linear commitments, analyze client-tunable transaction submission, and instantiate BigDipper inside HotStuff-2.
In this work, we consider the $H_2$ optimal model reduction of dynamical systems that are linear in the state equation and up to quadratic nonlinearity in the output equation. As our primary theoretical contributions, we derive gradients of the squared $H_2$ system error with respect to the reduced model quantities and, from the stationary points of these gradients, introduce Gramian-based first-order necessary conditions for the $H_2$ optimal approximation of a linear quadratic output (LQO) system. The resulting $H_2$ optimality framework neatly generalizes the analogous Gramian-based optimality framework for purely linear systems. Computationally, we show how to enforce the necessary optimality conditions using Petrov-Galerkin projection; the corresponding projection matrices are obtained from a pair of Sylvester equations. Based on this result, we propose an iteratively corrected algorithm for the $H_2$ model reduction of LQO systems, which we refer to as LQO-TSIA (linear quadratic output two-sided iteration algorithm). Numerical examples are included to illustrate the effectiveness of the proposed computational method against other existing approaches.
This paper presents a differentiable optimization framework for the design of constrained Linear Differential Microphone Arrays (LDMAs). The proposed method leverages a non-uniform delay-and-sum beamformer as a light-weight base system model, proving its ability to achieve the optimal beampattern of LDMAs by jointly optimizing microphone positions and filter weights. The formulation enables the optimized design of a filter with a distortion-free constraint in the desired sound direction, while also imposing constraints on microphone positioning to ensure consistent performance. Through evaluation on multiple metrics, including Mean Squared Error (MSE), Directivity Index (DI), White Noise Gain (WNG), and computation time, and comparison with state-of-the-art methods, this approach demonstrates a flexible, directive, robust, and hardware-efficient design.
We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
This paper presents the Robust Recurrent Deep Network (R2DN), a scalable parameterization of robust recurrent neural networks for machine learning and data-driven control. We construct R2DNs as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, and directly parameterize the weights so that our models are stable (contracting) and robust to small input perturbations (Lipschitz) by design. Our parameterization uses a structure similar to the previously-proposed recurrent equilibrium network (REN), but without the requirement to iteratively solve an equilibrium layer at each time-step. This speeds up both model inference and backpropagation on GPUs, and makes it computationally feasible to scale up the network size, batch size, and input sequence length in comparison to RENs. We compare R2DNs to RENs on three representative problems in nonlinear system identification, observer design, and learning-based feedback control. We find that training and inference are both up to an order of magnitude faster with similar test set performance, and that they scale more favorably with respect to model expressivity.
We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at this https URL.
Millimeter-wave radar provides robust perception in fog, smoke, dust, and low light, making it attractive for size-, weight-, and power-constrained robotic platforms. Existing radar imaging methods typically rely on synthetic aperture or multi-frame aggregation to improve resolution, which is impractical for small aerial, inspection, or wearable systems. We present RadarSFD, a conditional latent diffusion framework that reconstructs dense LiDAR-like point clouds from a single radar frame without motion or SAR. Our approach transfers geometric priors from a pretrained monocular depth estimator into the diffusion backbone, anchors them to radar inputs via channel-wise latent concatenation, and regularizes outputs with a dual-space objective combining latent and pixel-space losses. On the RadarHD benchmark, RadarSFD achieves state-of-the-art performance against baseline models. Qualitative results show recovery of fine walls and narrow gaps, and experiments across new environments confirm strong generalization. Ablation studies highlight the importance of pretrained initialization, radar BEV conditioning, and the dual-space loss. Together, these results establish a practical single-frame, no-SAR mmWave radar pipeline for dense point cloud perception in compact robotic systems.
Diffusion probability models have shown significant promise in offline reinforcement learning by directly modeling trajectory sequences. However, existing approaches primarily focus on time-domain features while overlooking frequency-domain features, leading to frequency shift and degraded performance according to our observation. In this paper, we investigate the RL problem from a new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently introduce shifts in the low-frequency components of the frequency domain, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that integrates Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling for each component, WFDiffuser employs Short-Time Fourier Transform and cross attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experiment results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations-a technique inherited from its centralized predecessor, Muon-and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.
High-fidelity spectrum cartography is important for spectrum monitoring and wireless situational awareness, especially in satellite-based wide-area sensing scenarios where measurements are sparse, noisy, and often low-bit quantized. In such settings, two coupled challenges arise: accurate reconstruction from severely incomplete measurements and efficient allocation of additional sensing resources under a limited sensing budget. Existing methods usually address these problems separately, and, for reconstruction, they often rely on priors that are insufficiently expressive under sparse and quantized measurements. This paper proposes Generative Spectrum Cartography (GSC), a diffusion-based posterior inference framework for spectrum cartography with uncertainty-aware active sensing. Specifically, spectrum map recovery is formulated as a Bayesian inverse problem under a learned diffusion model prior, and closed-form posterior mean updates are derived for both linear and quantized measurement models. By embedding these updates into the reverse diffusion process, GSC enables gradient-free and measurement-consistent posterior sampling without relying on computationally costly likelihood-gradient guidance. The resulting posterior samples are further used to estimate spatial uncertainty and to guide diversity-aware selection of additional measurement locations for active sensing. Experiments on simulated electromagnetic maps and a high-fidelity simulated satellite monitoring scenario show that GSC achieves higher PSNR, lower LPIPS, and more efficient sensing than representative baseline methods under sparse, noisy, and low-bit quantized measurements.
We revisit the Gaussian broadcast channel (GBC) and explore the rate region achieved by purely discrete inputs with treating interference as noise (TIN) decoding. Specifically, we introduce a simple scheme based on superposition coding with identically and independently distributed (i.i.d.) inputs drawn from discrete constellations, e.g., pulse amplitude modulations (PAM). Most importantly, we prove that the resulting achievable rate region under TIN decoding is within a constant gap to the capacity region of the GBC, where the gap is independent of all channel parameters. In addition, we show via simulation that the weak user can achieve a higher rate with PAM than with Gaussian signaling in some cases.