This paper presents a cost optimization framework for electric vehicle (EV) charging stations that leverages on-site photovoltaic (PV) generation and explicitly accounts for electricity price uncertainty through a Bertsimas--Sim robust formulation. The model is formulated as a linear program that satisfies vehicle energy demands, respects charging and grid capacity constraints, and minimizes procurement cost. Evaluations on real charging data from the Caltech ACN dataset show average savings of about 12\% compared to a first-come--first-served baseline, with peak monthly reductions up to 19.2\%. A lightweight sensitivity analysis indicates that a modest $\sim$5\% increase in nominal cost can reduce worst-case exposure by 14\%. Computational tests confirm real-time feasibility, with instances of up to 50 concurrent EVs solved in under 5 seconds on a standard laptop. The proposed method provides a practical, grid-friendly, and scalable solution for future EV charging operations.
The increasing integration of renewable energy introduces a great challenge to the supply and demand balance of the power grid. To address this challenge, this paper formulates a Stackelberg Markov game (SMG) between an aggregator and multiple users, where the aggregator sets electricity prices and users make demand and storage decisions. Considering that users' storage levels are private information, we introduce private states and propose the new concepts of private Markovian strategies (PMS) and private Markovian equilibrium (PME). We establish the existence of a pure PME in the lower-level Markov game and prove that it can be computed in polynomial time. Notably, computing equilibrium in general Markov games is hard, and polynomial-time algorithms are rarely available. Based on these theoretical results, we develop a scalable solution framework combining centralized and decentralized algorithms for the lower-level PME computation with upper-level pricing optimization. Numerical simulations with up to 50 users based on real data validate the effectiveness and scalability of the proposed methods, whereas prior studies typically consider no more than 5 users.
Non-invasive glucose monitors often fail outside the lab because existing datasets ignore hardware noise, environmental drift, and person-to-person physiology. We introduce the first ultra-realistic near-infrared (NIR) simulator that injects 12-bit ADC quantisation, +/-0.1% LED ageing, photodiode dark noise, 15-45 C temperature, 30-90% relative humidity, contact-pressure variation, Fitzpatrick I-VI melanin, and diurnal glucose excursions (dawn phenomenon). Using this platform (rho glucose-NIR = 0.21), we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), three physics-informed neural networks (PINNs), a selective radiative-transfer PINN, and a shallow DNN. Beer-Lambert achieves 13.6 mg/dL RMSE, 95.8% Clarke-A and 93.8% +/-15% accuracy with only 56 parameters and 0.01 ms inference, outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL). Results overturn the assumption that deeper PINNs dominate and supply an open, end-to-end reference stack for rapid prototyping of embedded optical glucose sensors.
The probabilistic power flow (PPF) problem is essential to quantifying the distribution of the nodal voltages due to uncertain injections. The conventional PPF problem considers a fixed topology, and the solutions to such a PPF problem are associated with this topology. A change in the topology might alter the power flow patterns and thus require the PPF problem to be solved again. The previous PPF model and its solutions are no longer valid for the new topology. This practice incurs both inconvenience and computation burdens as more contingencies are foreseen due to high renewables and a large share of electric vehicles. This paper presents a novel topology-adaptive approach, based on the meta-model Neural Process (MMNP), for finding the solutions to PPF problems under varying N-1 topologies, particularly with one-line failures. By leveraging context set-based topology representation and conditional distribution over function learning techniques, the proposed MMNP enhances the robustness of PPF models to topology variations, mitigating the need for retraining PPF models on a new configuration. Simulations on an IEEE 9-bus system and IEEE 118-bus system validate the model's performance. The maximum %L1-relative error norm was observed as 1.11% and 0.77% in 9-bus and 118-bus, respectively. This adaptive approach fills a critical gap in PPF methodology in an era of increasing grid volatility.
Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.
Traditional single-input single-output (SISO) systems face fundamental limitations in achieving accurate three-dimensional (3D) localization due to limited spatial degrees of freedom (DoF) and the adverse impact of multipath propagation. This paper proposes a novel fluid antenna system (FAS)-active reconfigurable intelligent surface (ARIS) framework that transforms multipath effects from a hindrance into a resource for enhanced localization. By synergistically combining the signal amplification capabilities of ARIS with the spatial diversity enabled by FAS, the proposed system achieves robust 3D user equipment (UE) positioning -- without relying on auxiliary information such as time-of-arrival (ToA) or frequency diversity. The system exploits both line-of-sight (LoS) and non-line-of-sight (NLoS) components through a tailored signal decoupling strategy. We design novel UE pilot sequences and ARIS phase configurations to effectively separate LoS and NLoS channels, enabling independent parameter estimation. A multi-stage estimation algorithm is then applied: the multiple signal classification (MUSIC) algorithm estimates angle-of-arrival (AoA) from the direct path, while maximum likelihood estimation with interior-point refinement recovers cascaded channel parameters from the reflected path. Finally, geometric triangulation using least-squares estimation determines the UE's 3D position based on the extracted AoA information. Comprehensive performance analysis, including the derivation of Cramér-Rao bounds for both channel and position estimation, establishes theoretical benchmarks. Simulation results confirm that the proposed FAS-ARIS framework achieves near-optimal localization accuracy while maintaining robustness in rich multipath environments -- effectively turning conventional localization challenges into advantages.
Securing information in future mobile networks is challenging, especially for devices with limited computational resources. Physical layer security (PLS) offers a viable solution by leveraging wireless channel randomness. When full secrecy is unattainable, the partial secrecy regime provides a realistic alternative. This work analyzes partial secrecy performance under the generalized multicluster fluctuating two-ray (MFTR) fading model, which subsumes many classical fading cases. We study a system with a transmitter (A), legitimate receiver (B), and eavesdropper (E), both B and E using antenna arrays with maximal ratio combining (MRC), under i.n.i.d. fading. Exact and closed-form approximations are derived for key secrecy metrics: generalized secrecy outage probability (GSOP), average fractional equivocation (AFE), and average information leakage rate (AILR). The results, validated by Monte Carlo simulations, retain constant complexity regardless of diversity order. The MFTR model's flexibility enables comprehensive assessment across fading conditions, showing that more MRC branches at B enhance secrecy performance depending on the A-E link characteristics.
We study a stochastic model for the installation of renewable energy capacity under demand uncertainty and jump driven dynamics. The system is governed by a multidimensional Ornstein-Uhlenbeck (OU) process driven by a subordinator, capturing abrupt variations in renewable generation and electricity load. Installation decisions are modeled through control actions that increase capacity in response to environmental and economic conditions. We consider two distinct solution approaches. First, we implement a structured threshold based control rule, where capacity is increased proportionally when the stochastic capacity factor falls below a fixed level. This formulation leads to a nonlinear partial integro-differential equation (PIDE), which we solve by reformulating it as a backward stochastic differential equation with jumps. We extend the DBDP solver in \cite{hure2020deep} to the pure jump setting, employing a dual neural network architecture to approximate both the value function and the jump sensitivity. Second, we propose a fully data driven deep control algorithm that directly learns the optimal feedback policy by minimizing the expected cost functional using neural networks. This approach avoids assumptions on the form of the control rule and enables adaptive interventions based on the evolving system state. Numerical experiments highlight the strengths of both methods. While the threshold based BSDE approach offers interpretability and tractability, the deep control strategy achieves improved performance through flexibility in capacity allocation. Together, these tools provide a robust framework for decision support in long term renewable energy expansion under uncertainty.
With recent advancements in Connected Autonomous Vehicles (CAVs), Green Light Optimal Speed Advisory (GLOSA) emerges as a promising eco-driving strategy to reduce the number of stops and idle time at intersections, thereby reducing energy consumption and emissions. Existing studies typically improve energy and travel efficiency for individual CAVs without considering their impacts on the entire mixed-traffic platoon, leading to inefficient traffic flow. While Reinforcement Learning (RL) has the potential to achieve platoon-level control in a mixed-traffic environment, the training of RL is still challenged by (i) car-following safety, i.e., CAVs should not collide with their immediate preceding vehicles, and (ii) red-light safety, i.e., CAVs should not run red lights. To address these challenges, this paper develops a platoon-centric, safe RL-based GLOSA system that uses a multi-agent controller to optimize CAV speed while achieving a balance between energy consumption and travel efficiency. We further incorporate Control Barrier Functions (CBFs) into the RL-based policy to provide explicit safety guarantees in terms of car-following safety and red-light safety. Our simulation results illustrate that our proposed method outperforms state-of-the-art methods in terms of driving safety and platoon energy consumption.
Wearable photoplethysmography (PPG) is embedded in billions of devices, yet its optical waveform is easily corrupted by motion, perfusion loss, and ambient light, jeopardizing downstream cardiometric analytics. Existing signal-quality assessment (SQA) methods rely either on brittle heuristics or on data-hungry supervised models. We introduce the first fully unsupervised SQA pipeline for wrist PPG. Stage 1 trains a contrastive 1-D ResNet-18 on 276 h of raw, unlabeled data from heterogeneous sources (varying in device and sampling frequency), yielding optical-emitter- and motion-invariant embeddings (i.e., the learned representation is stable across differences in LED wavelength, drive intensity, and device optics, as well as wrist motion). Stage 2 converts each 512-D encoder embedding into a 4-D topological signature via persistent homology (PH) and clusters these signatures with HDBSCAN. To produce a binary signal-quality index (SQI), the acceptable PPG signals are represented by the densest cluster while the remaining clusters are assumed to mainly contain poor-quality PPG signals. Without re-tuning, the SQI attains Silhouette, Davies-Bouldin, and Calinski-Harabasz scores of 0.72, 0.34, and 6173, respectively, on a stratified sample of 10,000 windows. In this study, we propose a hybrid self-supervised-learning--topological-data-analysis (SSL--TDA) framework that offers a drop-in, scalable, cross-device quality gate for PPG signals.
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at this https URL.
Blood oxygen saturation (SpO2) is a vital marker for healthcare monitoring. Traditional SpO2 estimation methods often rely on complex clinical calibration, making them unsuitable for low-power, wearable applications. In this paper, we propose a transfer learning-based framework for the rapid adaptation of SpO2 estimation to energy-efficient wearable devices using low-sampling-rate (25Hz) dual-channel photoplethysmography (PPG). We first pretrain a bidirectional Long Short-Term Memory (BiLSTM) model with self-attention on a public clinical dataset, then fine-tune it using data collected from our wearable We-Be band and an FDA-approved reference pulse oximeter. Experimental results show that our approach achieves a mean absolute error (MAE) of 2.967% on the public dataset and 2.624% on the private dataset, significantly outperforming traditional calibration and non-transferred machine learning baselines. Moreover, using 25Hz PPG reduces power consumption by 40% compared to 100Hz, excluding baseline draw. Our method also attains an MAE of 3.284% in instantaneous SpO2 prediction, effectively capturing rapid fluctuations. These results demonstrate the rapid adaptation of accurate, low-power SpO2 monitoring on wearable devices without the need for clinical calibration.
Accurate and generalizable blood pressure (BP) estimation is vital for the early detection and management of cardiovascular diseases. In this study, we enforce subject-level data splitting on a public multi-wavelength photoplethysmography (PPG) dataset and propose a generalizable BP estimation framework based on curriculum-adversarial learning. Our approach combines curriculum learning, which transitions from hypertension classification to BP regression, with domain-adversarial training that confuses subject identity to encourage the learning of subject-invariant features. Experiments show that multi-channel fusion consistently outperforms single-channel models. On the four-wavelength PPG dataset, our method achieves strong performance under strict subject-level splitting, with mean absolute errors (MAE) of 14.2mmHg for systolic blood pressure (SBP) and 6.4mmHg for diastolic blood pressure (DBP). Additionally, ablation studies validate the effectiveness of both the curriculum and adversarial components. These results highlight the potential of leveraging complementary information in multi-wavelength PPG and curriculum-adversarial strategies for accurate and robust BP estimation.
Various neural network architectures are used in many of the state-of-the-art approaches for real-time nonlinear state estimation in dynamical systems. With the ever-increasing incorporation of these data-driven models into the estimation domain, models with reliable margins of error are required -- especially for safety-critical applications. This paper discusses a novel hybrid, data-driven state estimation approach based on the physics-informed attentive neural process (PI-AttNP), a model-informed extension of the attentive neural process (AttNP). We augment this estimation approach with the regression-based split conformal prediction (CP) framework to obtain quantified model uncertainty with probabilistic guarantees. After presenting the algorithm in a generic form, we validate its performance in the task of grey-box state estimation of a simulated under-actuated six-degree-of-freedom quadrotor with multimodal Gaussian sensor noise and several external perturbations typical to quadrotors. Further, we compare outcomes with state-of-the-art data-driven methods, which provide significant evidence of the physics-informed neural process as a viable novel approach for model-driven estimation.
The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists' limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors' workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system's ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models' limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI's potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.
Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
The aorta is the body's largest arterial vessel, serving as the primary pathway for oxygenated blood within the systemic circulation. Aortic aneurysms consistently rank among the top twenty causes of mortality in the United States. Thoracic aortic aneurysm (TAA) arises from abnormal dilation of the thoracic aorta and remains a clinically significant disease, ranking as one of the leading causes of death in adults. A thoracic aortic aneurysm ruptures when the integrity of all aortic wall layers is compromised due to elevated blood pressure. Currently, three-dimensional computed tomography (3D CT) is considered the gold standard for diagnosing TAA. The geometric characteristics of the aorta, which can be quantified from medical imaging, and stresses on the aortic wall, which can be obtained by finite element analysis (FEA), are critical in evaluating the risk of rupture and dissection. Deep learning based image segmentation has emerged as a reliable method for extracting anatomical regions of interest from medical images. Voxel based segmentation masks of anatomical structures are typically converted into structured mesh representation to enable accurate simulation. Hexahedral meshes are commonly used in finite element simulations of the aorta due to their computational efficiency and superior simulation accuracy. Due to anatomical variability, patient specific modeling enables detailed assessment of individual anatomical and biomechanics behaviors, supporting precise simulations, accurate diagnoses, and personalized treatment strategies. Finite element (FE) simulations provide valuable insights into the biomechanical behaviors of tissues and organs in clinical studies. Developing accurate FE models represents a crucial initial step in establishing a patient-specific, biomechanically based framework for predicting the risk of TAA.
In this paper, we propose a novel definition of stationary graph signals, formulated with respect to a symmetric graph shift, such as the graph Laplacian. We show that stationary graph signals can be generated by transmitting white noise through polynomial graph channels, and that their stationarity is preserved under polynomial channel transmission. In this paper, we also investigate Kalman filtering to dynamical systems characterized by polynomial state and observation matrices. We demonstrate that Kalman filtering maintains the stationarity of graph signals, while effectively incorporating both system dynamics and noise structure. In comparison to the static inverse filtering method and naive zero-signal strategy, the Kalman filtering procedure yields more accurate and adaptive signal estimates, highlighting its robustness and versatility in graph signal processing.
CattleSense is an innovative application of Internet of Things (IoT) technology for the comprehensive monitoring and management of cattle well-being. This research paper outlines the design and implementation of a sophisticated system using a Raspberry Pi Module 4B, RFID Card Reader, Electret Arduino Microphone Module, DHT11 Sensor, Arduino UNO, Neo-6M GPS Sensor, and Heartbeat Sensor. The system aims to provide real-time surveillance of the environment in which Cows are present and individual Cow parameters such as location, milking frequency, and heartbeat fluctuations. The primary objective is to simplify managing the Cattle in the shed, ensuring that the Cattle are healthy and safe.
Integrated sensing and communication (ISAC) is a promising technique for expanding the functionalities of wireless networks with enhanced spectral efficiency. The 3rd Generation Partnership Project (3GPP) has defined six basic sensing operation modes in wireless networks. To further enhance the sensing capability of wireless networks, this paper proposes a new sensing operation mode, i.e., the base station (BS) and user equipment (UE) cooperative sensing. Specifically, after decoding the communication data, the UE further processes the received signal to extract the target sensing information. We propose an efficient algorithm for fusing the sensing results obtained by the BS and UE, by exploiting the geometric relationship among BS, UE and targets as well as the expected sensing quality in the BS monostatic and BS-UE bistatic sensing. The results show that the proposed data fusion method for cooperative sensing can effectively improve the position and velocity estimation accuracy of multiple targets, and provide a new approach on the expansion of the sensing pattern.
In this paper, we propose a sustainable long short-term memory (LSTM)-based precoding framework for reconfigurable intelligent surface (RIS)-assisted millimeter-wave (mmWave) MIMO systems. Instead of explicit channel state information (CSI) estimation, the framework exploits uplink pilot sequences to implicitly learn channel characteristics, reducing both pilot overhead and inference complexity. Practical hardware constraints are addressed by incorporating the phase-dependent amplitude model of RIS elements, while a multi-label training strategy improves robustness when multiple near-optimal codewords yield comparable performance. Simulations show that the proposed design achieves over 90% of the spectral efficiency of exhaustive search (ES) with only 2.2% of its computation time, cutting energy consumption by nearly two orders of magnitude. The method also demonstrates resilience under distribution mismatch and scalability to larger RIS arrays, making it a practical and energy-efficient solution for sustainable 6G wireless networks.
Despite improvements in automatic speaker verification (ASV), vulnerability against spoofing attacks remains a major concern. In this study, we investigate the integration of ASV and countermeasure (CM) subsystems into a modular spoof-aware speaker verification (SASV) framework. Unlike conventional single-stage score-level fusion methods, we explore the potential of a multi-stage approach that utilizes the ASV and CM systems in multiple stages. By leveraging ECAPA-TDNN (ASV) and AASIST (CM) subsystems, we consider support vector machine and logistic regression classifiers to achieve SASV. In the second stage, we integrate their outputs with the original score to revise fusion back-end classifiers. Additionally, we incorporate another auxiliary score from RawGAT (CM) to further enhance our SASV framework. Our approach yields an equal error rate (EER) of 1.30% on the evaluation dataset of the SASV2022 challenge, representing a 24% relative improvement over the baseline system.
This short note gives a new framework for dealing with nonlinear sampled-data systems. We introduce a new idea of lifting, which is well known for linear systems, but not successfully generalized to nonlinear systems. This paper introduces a new lifting technique for nonlinear, time-invariant systems, which are different from the linear counterpart as developed in [Bamieh et al. 1991, Yamamoto 1994], etc. The main difficulty is that the direct feedthrough term effective in the linear case cannot be generalized to the nonlinear case. Instead, we will further lift the state trajectory, and obtain an equivalent time-invariant discrete-time system with function-space input and output spaces. The basic framework, as well as the closed-loop equation with a discrete-time controller, is given. As an application of this framework, we give a representation for the Koopman operator derived from the given original nonlinear system.
This paper proposes Mode-Aware Probabilistic Scheduling (MAPS), a novel adaptive control framework tailored for DC motor systems experiencing varying friction. MAPS uniquely integrates an Interacting Multiple Model (IMM) estimator with a Linear Parameter-Varying (LPV) based control strategy, leveraging real-time mode probability estimates to perform probabilistic gain scheduling. A key innovation of MAPS lies in directly using the updated mode probabilities as the interpolation weights for online gain synthesis in the LPV controller, thereby tightly coupling state estimation with adaptive control. This seamless integration enables the controller to dynamically adapt control gains in real time, effectively responding to changes in frictional operating modes without requiring explicit friction model identification. Validation on a Hardware-in-the-Loop Simulation (HILS) environment demonstrates that MAPS significantly enhances both state estimation accuracy and reference tracking performance compared to Linear Quadratic Regulator (LQR) controllers relying on predefined scheduling variables. These results establish MAPS as a robust, generalizable solution for friction-aware adaptive control in uncertain, time-varying environments, with practical real-time applicability.
Sensing-assisted predictive beamforming, as one of the enabling technologies for emerging integrated sensing and communication (ISAC) paradigm, shows significant promise for enhancing various future unmanned aerial vehicle (UAV) applications. However, current works predominately emphasized on spectral efficiency enhancement, while the impact of such beamforming techniques on the communication reliability was largely unexplored and challenging to characterize. To fill this research gap and tackle this issue, this paper investigates outage capacity maximization for UAV tracking under the sensing-assisted predictive beamforming scheme. Specifically, a cellular-connected UAV tracking scheme is proposed leveraging extended Kalman filtering (EKF), where the predicted UAV trajectory, sensing duration ratio, and target constant received signal-to-noise ratio (SNR) are jointly optimized to maximize the outage capacity at each time slot. To address the implicit nature of the objective function, closed-form approximations of the outage probabilities (OPs) at both prediction and measurement stages of each time slot are proposed based on second-order Taylor expansions, providing an efficient and full characterization of outage capacity. Subsequently, an efficient algorithm is proposed based on a combination of bisection search and successive convex approximation (SCA) to address the non-convex optimization problem with guaranteed convergence. To further reduce computational complexity, a second efficient algorithm is developed based on alternating optimization (AO). Simulation results validate the accuracy of the derived OP approximations, the effectiveness of the proposed algorithms, and the significant outage capacity enhancement over various benchmarks, while also indicating a trade-off between decreasing path loss and enjoying wide beam coverage for outage capacity maximization.
Extremely large-scale multiple-input multiple-output (XL-MIMO) systems, operating in the near-field region due to their massive antenna arrays, are a key enabler of next-generation wireless communications but face significant challenges in channel state information (CSI) feedback. Deep learning has emerged as a powerful tool by learning compact CSI representations for feedback. However, existing methods struggle to capture the intricate structure of near-field CSI while incurring prohibitive computational overhead on practical mobile devices. To overcome these limitations, we propose the Near-Field Efficient Feedback Transformer (NEFT) family for accurate and efficient near-field CSI feedback across diverse hardware platforms. Built on a hierarchical Vision Transformer backbone, NEFT is extended with lightweight variants to meet various deployment constraints: NEFT-Compact applies multi-level knowledge distillation (KD) to reduce complexity while maintaining accuracy, and NEFT-Hybrid and NEFT-Edge address encoder- and edge-constrained scenarios via attention-free encoding and KD. Extensive simulations show that NEFT achieves a 15--21 dB improvement in normalized mean-squared error (NMSE) over state-of-the-art methods, while NEFT-Compact and NEFT-Edge reduce total FLOPs by 25--36% with negligible accuracy loss. Moreover, NEFT-Hybrid lowers encoder-side complexity by up to 64%, enabling deployment in highly asymmetric device scenarios. These results establish NEFT as a practical and scalable solution for near-field CSI feedback in XL-MIMO systems.
Semantic communication (SemCom) has emerged as a transformative paradigm for future 6G networks, offering task-oriented and meaning-aware transmission that fundamentally redefines traditional bit-centric design. Recognized by leading standardization bodies including the institute of electrical and electronics engineers (IEEE) and the international telecommunication union (ITU), and actively discussed within the 3rd generation partnership project (3GPP) working groups, SemCom is rapidly gaining traction as a foundational enabler for native-AI 6G. This paper presents a comprehensive overview of recent progress in SemCom from both academic and industrial perspectives, with a focus on its ongoing and upcoming standardization activities. We systematically examine advances in representative application scenarios, architectural design, semantic-traditional system compatibility, unified evaluation metrics, and validation methodologies. Furthermore, we highlight several key enabling technologies, such as joint source-channel coding (JSCC), SemCom-based multiple access (MA) technologies such as model division MA (MDMA), and semantic knowledge base (KB), that support the practical implementation of SemCom in standard-compliant systems. Additionally, we present a case study for channel state information (CSI) feedback, illustrating the concrete performance gains of SemCom under 3GPP-compliant fading channels. Finally, we discuss emerging challenges and research opportunities for incorporating semantic-native mechanisms into the evolving 6G standardization landscape, and provide forward-looking insights into its development and global adoption.
Because GPS signals are weak, electronic systems and components that are placed near GPS receivers can easily cause disruptive electromagnetic interference through their unintended radiated emissions. In this paper, EMC limit level guidelines are presented for electronics that are intended to be placed near to GPS receivers, as often happens in automotive and other applications. One of the challenges of defining limit-levels for systems intended to be integrated with GPS receivers is that the impact of noise at the input of the receiver may vary substantially depending on the form of the noise due to the correlator function implemented by GPS receiver. The quality of the correlated signal is typically represented using the carrier-to-noise ratio ($C / N_0$). A theoretical model predicting the degredation of the carrier-to-noise ratio with radio frequency interference is presented in this paper and is validated with realistic noise sources. The model is then used to develop guidelines to assess the impact of unintended emissions from electronic devices on nearby GPS receivers based on the frequency, bandwidth, and magnitude of the noise. These guidelines provide a more nuanced method of evaluating emissions than simple limit lines that are used by many emissions standards.
Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert's annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN's gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for Ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
Ellipsoidal tube-based model predictive control methods effectively account for the propagation of the reachable set, typically employing linear feedback policies. In contrast, scenario-based approaches offer more flexibility in the feedback structure by considering different control actions for different branches of a scenario tree. However, they face challenges in ensuring rigorous guarantees. This work aims to integrate the strengths of both methodologies by enhancing ellipsoidal tube-based MPC with a scenario tree formulation. The uncertainty ellipsoids are partitioned by halfspaces such that each partitioned set can be controlled independently. The proposed ellipsoidal multi-stage approach is demonstrated in a human-robot system, highlighting its advantages in handling uncertainty while maintaining computational tractability.
We propose a statistical benchmark for diffusion posterior sampling (DPS) algorithms for Bayesian linear inverse problems. The benchmark synthesizes signals from sparse Lévy-process priors whose posteriors admit efficient Gibbs methods. These Gibbs methods can be used to obtain gold-standard posterior samples that can be compared to the samples obtained by the DPS algorithms. By using the Gibbs methods for the resolution of the denoising problems in the reverse diffusion, the framework also isolates the error that arises from the approximations to the likelihood score. We instantiate the benchmark with the minimum-mean-squared-error optimality gap and posterior coverage tests and provide numerical experiments for popular DPS algorithms on the inverse problems of denoising, deconvolution, imputation, and reconstruction from partial Fourier measurements. We release the benchmark code at this https URL. The repository exposes simple plug-in interfaces, reference scripts, and config-driven runs so that new algorithms can be added and evaluated with minimal effort. We invite researchers to contribute and report results.
This paper presents a closed-form analysis of spatial correlation and degrees of freedom (DoF) for arched holographic multiple-input multiple-output (HMIMO) arrays, which can be viewed as a special form of fluid antenna systems (FAS) when their geometry is fluidically adaptable. Unlike traditional planar configurations, practical HMIMO surfaces may exhibit curvature, significantly influencing their spatial characteristics and performance. We derive exact correlation expressions for both arched uniform linear arrays and arched uniform rectangular arrays, capturing curvature effects under far field propagation. Our results reveal that isotropic scattering results in DoF being dominated by the maximum span of the HMIMO array, such that shape effects are weakened, and bending does not significantly reduce the available spatial DoF. Numerical simulations validate the accuracy of the closed-form formulas and demonstrate the robustness of DoF against curvature variations, supporting flexible array designs. These findings offer fundamental insights into geometry-aware optimization for next-generation HMIMO/FAS systems and pave the way for practical implementations of curved HMIMO arrays.
The role of energy communities in grid operations is highly dependent on the spatial distribution of their participants. In particular, when local energy producers and consumers are concentrated in different feeders, economic incentives from energy communities have the potential to affect local grid congestion. To address this challenge, we propose a feeder-aware allocation strategy that reflects grid topology in energy sharing. This strategy prioritizes energy sharing within the same feeder, thus incentivizing local generation-demand balance and improving grid operation. Different sharing coefficients are tested, such as equal, proportional, and rank-based, in both static and dynamic formulations. The proposed strategy is tested on data from a real energy community, whose participants are assumed to be distributed across four feeders. The analysis is carried out from the perspectives of the community as a whole, individual feeders, and single participants. Simulation results show that the feeder-aware strategy, in addition to promoting local energy balance, leads to higher and more stable revenues for most participants.
We propose a posterior sampling algorithm for the problem of estimating multiple independent source signals from their noisy superposition. The proposed algorithm is a combination of Gibbs sampling method and plug-and-play (PnP) diffusion priors. Unlike most existing diffusion-model-based approaches for signal separation, our method allows source priors to be learned separately and flexibly combined without retraining. Moreover, under the assumption of perfect diffusion model training, the proposed method provably produces samples from the posterior distribution. Experiments on the task of heartbeat extraction from mixtures with synthetic motion artifacts demonstrate the superior performance of our method over existing approaches.
Accurate and personalized environment recognition is essential for seamless indoor positioning and optimized connectivity, yet traditional fingerprinting requires costly site surveys and lacks user-level adaptation. We present a survey-free, on-device sensor-fusion framework that builds a personalized, lightweight multi-source fingerprint (FP) database from pedestrian dead reckoning (PDR), WiFi/cellular, GNSS, and interaction time tags. Matching is performed by an AI-enhanced dynamic time warping module (AIDTW) that aligns noisy, asynchronous sequences. To turn perception into continually improving actions, a cloud-edge online Reinforcement Learning from Human Feedback (RLHF) loop aggregates desensitized summaries and human feedback in the cloud to optimize a policy via proximal policy optimization (PPO), and periodically distills updates to devices. Across indoor/outdoor scenarios, our system reduces network-transition latency (measured by time-to-switch, TTS) by 32-65% in daily environments compared with conventional baselines, without site-specific pre-deployment.
Reliable electricity supply depends on the seamless operation of high-voltage grid infrastructure spanning both transmission and sub-transmission levels. Beneath this apparent uniformity lies a striking structural diversity, which leaves a clear imprint on system vulnerability. In this paper, we present harmonized topological models of the high-voltage grids of 15 European countries, integrating all elements at voltage levels above 110 kV. Topological analysis of these networks reveals a simple yet robust pattern: node degree distributions consistently follow an exponential decay, but the rate of decay varies significantly across countries. Through a detailed and systematic evaluation of network tolerance to node and edge removals, we show that the decay rate delineates the boundary between systems that are more resilient to failures and those that are prone to large-scale disruptions. Furthermore, we demonstrate that this numerical boundary is highly sensitive to which layers of the infrastructure are included in the models. To our knowledge, this study provides the first quantitative cross-country comparison of 15 European high-voltage networks, linking topological properties with vulnerability characteristics.
The increasing variety of means of transportation, including light vehicles like e-scooters and e-bikes, together with the increasing weight of conventional vehicles due to electrification and consumer preferences for SUVs, are raising serious concerns regarding the safety of road networks. In this paper we design a two-level control algorithm to improve the safety of heterogeneous networks: first, an access control strategy decreases the heterogeneity of the network depending on actual traffic conditions; then, a speed control strategy mitigates the probability of serious injuries in potential collisions. Both control strategies are designed based on momentum considerations, as this is regarded as the most influential variable to assess injury risk. The road network mobility simulator SUMO is adopted to implement and validate our proposed control strategies.
This paper provides a comprehensive analysis and theoretical foundation for next-generation backscatter networks that move beyond communication and integrate RF location sensing capabilities. An end-to-end system model for wideband OFDM backscatter systems is derived, including detailed characterization of propagation channels, receiver chain impairments, RF tag operation, and unsynchronized network nodes. The theoretical system model is validated through experimental evaluation using actual hardware, demonstrating the detailed model's accuracy. A practical bistatic ranging method that can operate with unsynchronized nodes is presented, along with the Cramér-Rao Lower Bound (CRLB) derived to show the achievable performance limits. Our experimental results demonstrate the system performance for communication, RF sensing, and ranging, while also benchmarking against the derived theoretical limits. This analytical framework and experimental validation establish fundamental understanding of distributed, unsynchronized backscatter systems for future machine-type communication networks that are deployed in massive scale, while remaining energy-efficient.
Conventional analog-to-digital converters (ADCs) clip when signals exceed their input range. Modulo (unlimited) sampling overcomes this limitation by folding the signal before digitization, but existing recovery methods are either computationally intensive or constrained by loose oversampling bounds that demand high sampling rates. In addition, none account for sampling jitter, which is unavoidable in practice. This paper revisits difference-based recovery and establishes new theoretical and practical guarantees. In the noiseless setting, we prove that arbitrarily high difference order reduces the sufficient oversampling factor from $2\pi e$ to $\pi$, substantially tightening classical bounds. For fixed order $N$, we derive a noise-aware sampling condition that guarantees stable recovery. For second-order difference-based recovery ($N=2$), we further extend the analysis to non-uniform sampling, proving robustness under bounded jitter. An FPGA-based hardware prototype demonstrates reliable reconstruction with amplitude expansion up to $\rho = 108$, confirming the feasibility of high-performance unlimited sensing with a simple and robust recovery pipeline.
Precision agriculture demands non-invasive, energy-efficient, and sustainable plant monitoring solutions. In this work, we present the design and implementation of a lightweight, batteryless plant movement sensor powered solely by RF energy. This sensor targets Controlled Environment Agriculture (CEA) and utilizes inertial measurements units (IMUs) to monitor leaf motion, which correlates with plant physiological responses to environmental stress. By eliminating the battery, we reduce the ecological footprint, weight, and maintenance requirements, transitioning from lifetime-based to operation-based energy storage. Our design minimizes circuit complexity while enabling flexible, adaptive readout scheduling based on energy availability and sensor data. We detail the energy requirements, RF power transfer considerations, integration constraints, and outline future directions, including multi-antenna power delivery and networked sensor synchronization.
Channel charting has emerged as a powerful tool for user equipment localization and wireless environment sensing. Its efficacy lies in mapping high-dimensional channel data into low-dimensional features that preserve the relative similarities of the original data. However, existing channel charting methods are largely developed using simulated or indoor measurements, often assuming clean and complete channel data across all frequency bands. In contrast, real-world channels collected from base stations are typically incomplete due to frequency hopping and are significantly noisy, particularly at cell edges. These challenging conditions greatly degrade the performance of current methods. To address this, we propose a deep tensor learning method that leverages the inherent tensor structure of wireless channels to effectively extract informative while low-dimensional features (i.e., channel charts) from noisy and incomplete measurements. Experimental results demonstrate the reliability and effectiveness of the proposed approach in these challenging scenarios.
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec's design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody.
Indoor sensing is challenging because of the multi-bounce effect, spherical wavefront, and spatial nonstationarity (SNS) of the near-field effect. This paper addresses radio-based environment sensing considering these issues. Specifically, graph theory (GT) is used to model the multi-bounce propagation of the near field. In this manner, indoor reflectors/scatterers are modeled as vertices in a propagation graph, the multi-bounce paths are modeled by the edges linking the vertices. Besides, the coupled multipath parameters in the near field, i.e., range and angles, are denoted directly by the coordinates of vertices. Then, the space-alternating generalized expectation-maximization (SAGE) algorithm is adapted to the proposed Graph theory-based dictionary-aided Multi-bounce SAGE (GM-SAGE), where the searching parameters including range and angle of departure/arrival (AoD/AoA) are transformed to the coordinates of scatterers in the graph. The proposed algorithm is validated through measurement-calibrated ray tracing (RT) in a complex indoor office. The results demonstrate that the proposed GM-SAGE can deal with multi-bounce channels.
Spoof diarization identifies ``what spoofed when" in a given speech by temporally locating spoofed regions and determining their manipulation techniques. As a first step toward this task, prior work proposed a two-branch model for localization and spoof type clustering, which laid the foundation for spoof diarization. However, its simple structure limits the ability to capture complex spoofing patterns and lacks explicit reference points for distinguishing between bona fide and various spoofing types. To address these limitations, our approach introduces learnable tokens where each token represents acoustic features of bona fide and spoofed speech. These attractors interact with frame-level embeddings to extract discriminative representations, improving separation between genuine and generated speech. Vast experiments on PartialSpoof dataset consistently demonstrate that our approach outperforms existing methods in bona fide detection and spoofing method clustering.
We study data-driven least squares (LS) problems with semidefinite (SD) constraints and derive finite-sample guarantees on the spectrum of their optimal solutions when these constraints are relaxed. In particular, we provide a high confidence bound allowing one to solve a simpler program in place of the full SDLS problem, while ensuring that the eigenvalues of the resulting solution are $\varepsilon$-close of those enforced by the SD constraints. The developed certificate, which consistently shrinks as the number of data increases, turns out to be easy-to-compute, distribution-free, and only requires independent and identically distributed samples. Moreover, when the SDLS is used to learn an unknown quadratic function, we establish bounds on the error between a gradient descent iterate minimizing the surrogate cost obtained with no SD constraints and the true minimizer.
In recent years, deep learning has significantly advanced sound source localization (SSL). However, training such models requires large labeled datasets, and real recordings are costly to annotate in particular if sources move. While synthetic data using simulated room impulse responses (RIRs) and noise offers a practical alternative, models trained on synthetic data suffer from domain shift in real environments. Unsupervised domain adaptation (UDA) can address this by aligning synthetic and real domains without relying on labels from the latter. The few existing UDA approaches however focus on static SSL and do not account for the problem of sound source tracking (SST), which presents two specific domain adaptation challenges. First, variable-length input sequences create mismatches in feature dimensionality across domains. Second, the angular coverages of the synthetic and the real data may not be well aligned either due to partial domain overlap or due to batch size constraints, which we refer to as directional diversity mismatch. To address these, we propose a novel UDA approach tailored for SST based on two key features. We employ the final hidden state of a recurrent neural network as a fixed-dimensional feature representation to handle variable-length sequences. Further, we use importance-weighted adversarial training to tackle directional diversity mismatch by prioritizing synthetic samples similar to the real domain. Experimental results demonstrate that our approach successfully adapts synthetic-trained models to real environments, improving SST performance.
This paper presents MPC-CDF, a new approach integrating control density functions (CDFs) within a model predictive control (MPC) framework to ensure safety-critical control in nonlinear dynamical systems. By using the dual formulation of the navigation problem, we incorporate CDFs into the MPC framework, ensuring both convergence and safety in a discrete-time setting. These density functions are endowed with a physical interpretation, where the associated measure signifies the occupancy of system trajectories. Leveraging this occupancy-based perspective, we synthesize safety-critical controllers using the proposed MPC-CDF framework. We illustrate the safety properties of this framework using a unicycle model and compare it with a control barrier function-based method. The efficacy of this approach is demonstrated in the autonomous safe navigation of an underwater vehicle, which avoids complex and arbitrary obstacles while achieving the desired level of safety.
We address target detection in a single Delay-Doppler cell using spatially distributed two-channel passive radars. An unknown illuminator of opportunity (IO) is assumed to emit a waveform lying in a known low-dimensional subspace (e.g., OFDM). Each receiver transforms its reference and surveillance signals onto the IO subspace after noise-whitening, to obtain cross-correlation (CC) measurements. To save bandwidth, receivers collaboratively exchange and linearly combine the CC output, and only a subset transmits them to a fusion center (FC) over a multiple-access channel (MAC). Collaboration weights are designed using the moments of the FC measurement to enhance detection performance.
Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.
Modern power systems are becoming increasingly dynamic, with changing topologies and time-varying loads driven by renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Despite these changes, publicly available test cases remain scarce, due to security concerns and the significant effort required to anonymize real systems. Such limitations call for generative tools that can jointly synthesize grid structure and nodal dynamics. However, modeling the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles remains a major challenge, while preserving physical feasibility and avoiding prohibitive computational costs. We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. The core idea is dependence decomposition: the complex joint distribution is factorized into a chain of conditional distributions over feasible grid topologies, time-series bus loads, and other system attributes, leveraging their mutual dependencies. By constraining the generation process at each stage, we implement a hierarchical graph beta-diffusion process for structural synthesis, paired with a temporal autoencoder that embeds time-series data into a compact latent space, improving both training stability and sample fidelity. Experiments across benchmark settings show that PowerGrow not only outperforms prior diffusion models in fidelity and diversity but also achieves a 98.9\% power flow convergence rate and improved N-1 contingency resilience. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
This work introduces a physics-informed neural networks (PINNs)-based model predictive control (MPC) framework for susceptible-infected-recovered ($SIR$) spreading models. Existing studies in MPC design for epidemic control often assume either 1) measurable states of the dynamics, where the parameters are learned, or 2) known parameters of the model, where the states are learned. In this work, we address the joint real-time estimation of states and parameters within the MPC framework using only noisy infected states, under the assumption that 1) only the recovery rate is known, or 2) only the basic reproduction number is known. Under the first assumption, we propose MPC-PINNs and two novel PINNs algorithms, all of which are integrated into the MPC framework. First, we introduce MPC-PINNs, which are designed for $SIR$ models with control. We then propose log-scaled PINNs (MPC-LS-PINNs), which incorporate a log-scaled loss function to improve robustness against noise. Next, we present split-integral PINNs (MPC-SI-PINNs), which leverage integral operators and state coupling in the neural network training process to effectively reconstruct the complete epidemic state information. Building upon these methods, we further extend our framework for the second assumption. We establish the necessary conditions and extend our PINNs algorithms, where MPC-SI-PINNs are simplified as split-PINNs (MPC-S-PINNs). By incorporating these algorithms into the MPC framework, we simultaneously estimate the epidemic states and parameters while generating optimal control strategies. Experiment results demonstrate the effectiveness of the proposed methods under different settings.
Alzheimer's disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE.
Accurate prediction of machining deformation in structural components is essential for ensuring dimensional precision and reliability. Such deformation often originates from residual stress fields, whose distribution and influence vary significantly with geometric complexity. Conventional numerical methods for modeling the coupling between residual stresses and deformation are computationally expensive, particularly when diverse geometries are considered. Neural operators have recently emerged as a powerful paradigm for efficiently solving partial differential equations, offering notable advantages in accelerating residual stress-deformation analysis. However, their direct application across changing geometric domains faces theoretical and practical limitations. To address this challenge, a novel framework based on diffeomorphic embedding neural operators named neural diffeomorphic-neural operator (NDNO) is introduced. Complex three-dimensional geometries are explicitly mapped to a common reference domain through a diffeomorphic neural network constrained by smoothness and invertibility. The neural operator is then trained on this reference domain, enabling efficient learning of deformation fields induced by residual stresses. Once trained, both the diffeomorphic neural network and the neural operator demonstrate efficient prediction capabilities, allowing rapid adaptation to varying geometries. The proposed method thus provides an effective and computationally efficient solution for deformation prediction in structural components subject to varying geometries. The proposed method is validated to predict both main-direction and multi-direction deformation fields, achieving high accuracy and efficiency across parts with diverse geometries including component types, dimensions and features.
This paper investigates a behavioral-feedback SIR model in which the infection rate adapts dynamically based on the fractions of susceptible and infected individuals. We introduce an invariant of motion and we characterize the peak of infection. We further examine the system under a threshold constraint on the infection level. Based on this analysis, we formulate an optimal control problem to keep the infection curve below a healthcare capacity threshold while minimizing the economic cost. For this problem, we study a feasible strategy that involves applying the minimal necessary restrictions to meet the capacity constraint and characterize the corresponding cost.
This paper addresses the Longest Filled Common Subsequence (LFCS) problem, a challenging NP-hard problem with applications in bioinformatics, including gene mutation prediction and genomic data reconstruction. Existing approaches, including exact, metaheuristic, and approximation algorithms, have primarily been evaluated on small-sized instances, which offer limited insights into their scalability. In this work, we introduce a new benchmark dataset with significantly larger instances and demonstrate that existing datasets lack the discriminative power needed to meaningfully assess algorithm performance at scale. To solve large instances efficiently, we utilize an adaptive Construct, Merge, Solve, Adapt (CMSA) framework that iteratively generates promising subproblems via component-based construction and refines them using feedback from prior iterations. Subproblems are solved using an external black-box solver. Extensive experiments on both standard and newly introduced benchmarks show that the proposed adaptive CMSA achieves state-of-the-art performance, outperforming five leading methods. Notably, on 1,510 problem instances with known optimal solutions, our approach solves 1,486 of them -- achieving over 99.9% optimal solution quality and demonstrating exceptional scalability. We additionally propose a novel application of LFCS for song identification from degraded audio excerpts as an engineering contribution, using real-world energy-profile instances from popular music. Finally, we conducted an empirical explainability analysis to identify critical feature combinations influencing algorithm performance, i.e., the key problem features contributing to success or failure of the approaches across different instance types are revealed.
We present a traditional approach to symbolic piano music continuation for the MIREX 2025 Symbolic Music Generation challenge. While computational music generation has recently focused on developing large foundation models with sophisticated architectural modifications, we argue that simpler approaches remain more effective for constrained, single-instrument tasks. We thus return to a simple, unaugmented next-token-prediction objective on tokenized raw MIDI, aiming to outperform large foundation models by using better data and better fundamentals. We release model weights and code at this https URL.
We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought for audio question answering. The framework efficiently leverages existing high-quality dataset through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Integrated with GRPO training, these strategies enable the model to learn more effectively from informative samples. Experiments on MMAU-mini and MMAR demonstrate that Omni-CLST achieves competitive accuracy (73.80% on MMAU-mini) and establishes a new state of the art (64.30% on MMAR), highlighting its robustness and generalization capability in multimodal audio-language understanding.
Speech emotion recognition systems often predict a consensus value generated from the ratings of multiple annotators. However, these models have limited ability to predict the annotation of any one person. Alternatively, models can learn to predict the annotations of all annotators. Adapting such models to new annotators is difficult as new annotators must individually provide sufficient labeled training data. We propose to leverage inter-annotator similarity by using a model pre-trained on a large annotator population to identify a similar, previously seen annotator. Given a new, previously unseen, annotator and limited enrollment data, we can make predictions for a similar annotator, enabling off-the-shelf annotation of unseen data in target datasets, providing a mechanism for extremely low-cost personalization. We demonstrate our approach significantly outperforms other off-the-shelf approaches, paving the way for lightweight emotion adaptation, practical for real-world deployment.
Inertial motion capture is a promising approach for capturing motion outside the laboratory. However, as one major drawback, most of the current methods require different quantities to be calibrated or computed offline as part of the setup process, such as segment lengths, relative orientations between inertial measurement units (IMUs) and segment coordinate frames (IMU-to-segment calibrations) or the joint positions in the IMU frames. This renders the setup process inconvenient. This work contributes to real-time capable calibration-free inertial tracking of a kinematic chain, i.e. simultaneous recursive Bayesian estimation of global IMU angular kinematics and joint positions in the IMU frames, with a minimal state size. Experimental results on simulated IMU data from a three-link kinematic chain (manipulator study) as well as re-simulated IMU data from healthy humans walking (lower body study) show that the calibration-free and lightweight algorithm provides not only drift-free relative but also drift-free absolute orientation estimates with a global heading reference for only one IMU as well as robust and fast convergence of joint position estimates in the different movement scenarios.
Hyper-redundant tendon-driven manipulators of- fer greater flexibility and compliance over traditional manipu- lators. A common way of controlling such manipulators relies on adjusting tendon lengths, which is an accessible control parameter. This approach works well when the kinematic configuration is representative of the real operational con- ditions. However, when dealing with manipulators of larger size subject to gravity, it becomes necessary to solve a static force problem, using tendon force as the input and employing a mapping from the configuration space to retrieve tendon length. Alternatively, measurements of the manipulator posture can be used to iteratively adjust tendon lengths to achieve a desired posture. Hence, either tension measurement or state estimation of the manipulator are required, both of which are not always accurately available. Here, we propose a solution by reconciling cables tension and length as the input for the solution of the system forward statics. We develop a screw-based formulation for a tendon-driven, multi-segment, hyper-redundant manipulator with elastic joints and introduce a forward statics iterative solution method that equivalently makes use of either tendon length or tension as the input. This strategy is experimentally validated using a traditional tension input first, subsequently showing the efficacy of the method when exclusively tendon lengths are used. The results confirm the possibility to perform open-loop control in static conditions using a kinematic input only, thus bypassing some of the practical problems with tension measurement and state estimation of hyper-redundant systems.
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV's trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
In visuomotor policy learning, the control policy for the robotic agent is derived directly from visual inputs. The typical approach, where a policy and vision encoder are trained jointly from scratch, generalizes poorly to novel visual scene changes. Using pre-trained vision models (PVMs) to inform a policy network improves robustness in model-free reinforcement learning (MFRL). Recent developments in Model-based reinforcement learning (MBRL) suggest that MBRL is more sample-efficient than MFRL. However, counterintuitively, existing work has found PVMs to be ineffective in MBRL. Here, we investigate PVM's effectiveness in MBRL, specifically on generalization under visual domain shifts. We show that, in scenarios with severe shifts, PVMs perform much better than a baseline model trained from scratch. We further investigate the effects of varying levels of fine-tuning of PVMs. Our results show that partial fine-tuning can maintain the highest average task performance under the most extreme distribution shifts. Our results demonstrate that PVMs are highly successful in promoting robustness in visual policy learning, providing compelling evidence for their wider adoption in model-based robotic learning applications.
This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the modelś ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
Optimal-Transport Distributionally Robust Optimization (OT-DRO) robustifies data-driven decision-making under uncertainty by capturing the sampling-induced statistical error via optimal transport ambiguity sets. The standard OT-DRO pipeline consists of a two-step procedure, where the ambiguity set is first designed and subsequently embedded into the downstream OT-DRO problem. However, this separation between uncertainty quantification and optimization might result in excessive conservatism. We introduce an end-to-end pipeline to automatically learn decision-focused ambiguity sets for OT-DRO problems, where the loss function informs the shape of the optimal transport ambiguity set, leading to less conservative yet distributionally robust decisions. We formulate the learning problem as a bilevel optimization program and solve it via a hypergradient-based method. By leveraging the recently introduced nonsmooth conservative implicit function theorem, we establish convergence to a critical point of the bilevel problem. We present experiments validating our method on standard portfolio optimization and linear regression tasks.
This paper introduces a novel framework for estimation theory by introducing a second-order diagnostic for estimator design. While classical analysis focuses on the bias-variance trade-off, we present a more foundational constraint. This result is model-agnostic, domain-agnostic, and is valid for both parametric and non-parametric problems, Bayesian and frequentist frameworks. We propose to classify the estimators into three primary power regimes. We theoretically establish that any estimator operating in the `power-dominant regime' incurs an unavoidable mean-squared error penalty, making it structurally prone to sub-optimal performance. We propose a `safe-zone law' and make this diagnostic intuitive through two safe-zone maps. One map is a geometric visualization analogous to a receiver operating characteristic curve for estimators, and the other map shows that the safe-zone corresponds to a bounded optimization problem, while the forbidden `power-dominant zone' represents an unbounded optimization landscape. This framework reframes estimator design as a path optimization problem, providing new theoretical underpinnings for regularization and inspiring novel design philosophies.
Nonlinear optimization-based policies have seen large success in recent years, primarily due to the incredible capabilities of nonlinear Model Predictive Control (nMPC). These policies require solving computationally demanding nonlinear optimization programs (NLP) online at each time-step. The solution map of these NLPs, viewed as a function of the measured state of the system and design parameters, may not be differentiable, which poses significant challenges if the policy is designed with a policy optimization scheme. In this paper, we propose a principled way to regularize NLPs to obtain a surrogate derivative even if the NLP is not differentiable. The surrogate problem is differentiable by design and its solution map coincides with the solution of the unregularized problem. We demonstrate the effectiveness of our approach in a free-final-time optimal control problem and a receding-horizon nonlinear MPC example.
Visuotactile sensors typically employ sparse marker arrays that limit spatial resolution and lack clear analytical force-to-image relationships. To solve this problem, we present \textbf{MoiréTac}, a dual-mode sensor that generates dense interference patterns via overlapping micro-gratings within a transparent architecture. When two gratings overlap with misalignment, they create moiré patterns that amplify microscopic deformations. The design preserves optical clarity for vision tasks while producing continuous moiré fields for tactile sensing, enabling simultaneous 6-axis force/torque measurement, contact localization, and visual perception. We combine physics-based features (brightness, phase gradient, orientation, and period) from moiré patterns with deep spatial features. These are mapped to 6-axis force/torque measurements, enabling interpretable regression through end-to-end learning. Experimental results demonstrate three capabilities: force/torque measurement with R^2 > 0.98 across tested axes; sensitivity tuning through geometric parameters (threefold gain adjustment); and vision functionality for object classification despite moiré overlay. Finally, we integrate the sensor into a robotic arm for cap removal with coordinated force and torque control, validating its potential for dexterous manipulation.
In this work, deep neural networks made up of multiple hidden Long Short-Term Memory (LSTM) and Feedforward layers are trained to predict the thermal behavior of the joint motors of robot manipulators. A model-free and scalable approach is adopted. It accommodates complexity and uncertainty challenges stemming from the derivation, identification, and validation of a large number of parameters of an approximation model that is hardly available. To this end, sensed joint torques are collected and processed to foresee the thermal behavior of joint motors. Promising prediction results of the machine learning based capture of the temperature dynamics of joint motors of a redundant robot with seven joints are presented.
Robots are unrelentingly used to achieve operational efficiency in Industry 4.0 along with symbiotic and sustainable assistance for the work-force in Industry 5.0. As resilience, robustness, and well-being are required in anti-fragile manufacturing and human-centric societal tasks, an autonomous anticipation and adaption to thermal saturation and burns due to motors overheating become instrumental for human safety and robot availability. Robots are thereby expected to self-sustain their performance and deliver user experience, in addition to communicating their capability to other agents in advance to ensure fully automated thermally feasible tasks, and prolong their lifetime without human intervention. However, the traditional robot shutdown, when facing an imminent thermal saturation, inhibits productivity in factories and comfort in the society, while cooling strategies are hard to implement after the robot acquisition. In this work, smart digital twins endowed with generative AI, i.e., variational autoencoders, are leveraged to manage thermally anomalous and generate uncritical robot states. The notion of thermal difficulty is derived from the reconstruction error of variational autoencoders. A robot can use this score to predict, anticipate, and share the thermal feasibility of desired motion profiles to meet requirements from emerging applications in Industry 6.0 and Society 6.0.
We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the \emph{Structured-MoE STL Planner} (\textbf{S-MSP}), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A \emph{structure-aware} Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based \emph{safety filter} at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.
We propose a soft gradient boosting framework for sequential regression that embeds a learnable linear feature transform within the boosting procedure. At each boosting iteration, we train a soft decision tree and learn a linear input feature transform Q together. This approach is particularly advantageous in high-dimensional, data-scarce scenarios, as it discovers the most relevant input representations while boosting. We demonstrate, using both synthetic and real-world datasets, that our method effectively and efficiently increases the performance by an end-to-end optimization of feature selection/transform and boosting while avoiding overfitting. We also extend our algorithm to differentiable non-linear transforms if overfitting is not a problem. To support reproducibility and future work, we share our code publicly.
We present a probabilistic framework for modeling structured spatiotemporal dynamics from sparse observations, focusing on cardiac motion. Our approach integrates neural ordinary differential equations (NODEs), graph neural networks (GNNs), and neural processes into a unified model that captures uncertainty, temporal continuity, and anatomical structure. We represent dynamic systems as spatiotemporal multiplex graphs and model their latent trajectories using a GNN-parameterized vector field. Given the sparse context observations at node and edge levels, the model infers a distribution over latent initial states and control variables, enabling both interpolation and extrapolation of trajectories. We validate the method on three synthetic dynamical systems (coupled pendulum, Lorenz attractor, and Kuramoto oscillators) and two real-world cardiac imaging datasets - ACDC (N=150) and UK Biobank (N=526) - demonstrating accurate reconstruction, extrapolation, and disease classification capabilities. The model accurately reconstructs trajectories and extrapolates future cardiac cycles from a single observed cycle. It achieves state-of-the-art results on the ACDC classification task (up to 99% accuracy), and detects atrial fibrillation in UK Biobank subjects with competitive performance (up to 67% accuracy). This work introduces a flexible approach for analyzing cardiac motion and offers a foundation for graph-based learning in structured biomedical spatiotemporal time-series data.
Real-world speech communication is often hampered by a variety of distortions that degrade quality and intelligibility. While many speech enhancement algorithms target specific degradations like noise or reverberation, they often fall short in realistic scenarios where multiple distortions co-exist and interact. To spur research in this area, we introduce the Speech Restoration Challenge as part of the China Computer Federation (CCF) Advanced Audio Technology Competition (AATC) 2025. This challenge focuses on restoring speech signals affected by a composite of three degradation types: (1) complex acoustic degradations including non-stationary noise and reverberation; (2) signal-chain artifacts such as those from MP3 compression; and (3) secondary artifacts introduced by other pre-processing enhancement models. We describe the challenge's background, the design of the task, the comprehensive dataset creation methodology, and the detailed evaluation protocol, which assesses both objective performance and model complexity. Homepage: this https URL.
This paper introduces a learning-based control framework for a soft robotic actuator system designed to modulate intracranial pressure (ICP) waveforms, which is essential for studying cerebrospinal fluid dynamics and pathological processes underlying neurological disorders. A two-layer framework is proposed to safely achieve a desired ICP waveform modulation. First, a model predictive controller (MPC) with a disturbance observer is used for offset-free tracking of the system's motor position reference trajectory under safety constraints. Second, to address the unknown nonlinear dependence of ICP on the motor position, we employ a Bayesian optimization (BO) algorithm used for online learning of a motor position reference trajectory that yields the desired ICP modulation. The framework is experimentally validated using a test bench with a brain phantom that replicates realistic ICP dynamics in vitro. Compared to a previously employed proportional-integral-derivative controller, the MPC reduces mean and maximum motor position reference tracking errors by 83 % and 73 %, respectively. In less than 20 iterations, the BO algorithm learns a motor position reference trajectory that yields an ICP waveform with the desired mean and amplitude.
Safe and scalable deployment of end-to-end (E2E) autonomous driving requires extensive and diverse data, particularly safety-critical events. Existing data are mostly generated from simulators with a significant sim-to-real gap or collected from on-road testing that is costly and unsafe. This paper presents TeraSim-World, an automated pipeline that synthesizes realistic and geographically diverse safety-critical data for E2E autonomous driving at anywhere in the world. Starting from an arbitrary location, TeraSim-World retrieves real-world maps and traffic demand from geospatial data sources. Then, it simulates agent behaviors from naturalistic driving datasets, and orchestrates diverse adversities to create corner cases. Informed by street views of the same location, it achieves photorealistic, geographically grounded sensor rendering via the frontier video generation model Cosmos-Drive. By bridging agent and sensor simulations, TeraSim-World provides a scalable and critical~data synthesis framework for training and evaluation of E2E autonomous driving systems.
CoVariance Neural Networks (VNNs) perform graph convolutions on the empirical covariance matrix of signals defined over finite-dimensional Hilbert spaces, motivated by robustness and transferability properties. Yet, little is known about how these arguments extend to infinite-dimensional Hilbert spaces. In this work, we take a first step by introducing a novel convolutional learning framework for signals defined over infinite-dimensional Hilbert spaces, centered on the (empirical) covariance operator. We constructively define Hilbert coVariance Filters (HVFs) and design Hilbert coVariance Networks (HVNs) as stacks of HVF filterbanks with nonlinear activations. We propose a principled discretization procedure, and we prove that empirical HVFs can recover the Functional PCA (FPCA) of the filtered signals. We then describe the versatility of our framework with examples ranging from multivariate real-valued functions to reproducing kernel Hilbert spaces. Finally, we validate HVNs on both synthetic and real-world time-series classification tasks, showing robust performance compared to MLP and FPCA-based classifiers.
A rich vehicle routing problem is considered allowing multiple trips of heterogeneous vehicles stationed at distributed vehicle depots spread across diverse geographies having access to different modes of transportation. The problem arises from the real world requirement of optimizing the disaster response/preparedness time and minimizes the route duration of the vehicles to achieve the solution with the minimum highest-vehicle-route-duration. Multiple diversely-functional vertices are considered including the concept of Transhipment Ports as inter-modal resource transfer stations. Both simultaneous and split pickup and transferring of different types of delivery and pickup cargo is considered, along with Vehicle-Cargo and Transhipment Port-Cargo Compatibility. The superiority of the proposed cascaded minimization approach is shown over existing makespan minimization approaches through the developed MILP formulation. To solve the problem quickly for practical implementation within Disaster Management-specific Decision Support Systems, an extensive Heuristic Algorithm is devised. The Heuristic utilizes Decision Tree based structuring of possible routes and is able to inherently consider the compatibility issues. Preferential generation of small route elements are performed, which are integrated into route clusters; we consider multiple different logical integration approaches, as well as shuffling the logics to simultaneously produce multiple independent solutions. Finally perturbation of the different solutions are done to find better neighbouring solutions. The computational performance of the PSR-GIP Heuristic, on our created novel datasets, indicate that it is able to give good solutions swiftly for practical problems involving large integer instances which the MILP is unable to solve.
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.
Imaging methods based on array signal processing often require a fixed propagation speed of the medium, or speed of sound (SoS) for methods based on acoustic signals. The resolution of the images formed using these methods is strongly affected by the assumed SoS, which, due to multipath, nonlinear propagation, and non-uniform mediums, is challenging at best to select. In this letter, we propose a Bayesian approach to marginalize the influence of the SoS on beamformers for imaging. We adapt Bayesian direction-of-arrival estimation to an imaging setting and integrate a popular minimum variance beamformer over the posterior of the SoS. To solve the Bayesian integral efficiently, we use numerical Gauss quadrature. We apply our beamforming approach to shallow water sonar imaging where multipath and nonlinear propagation is abundant. We compare against the minimum variance distortionless response (MVDR) beamformer and demonstrate that its Bayesian counterpart achieves improved range and azimuthal resolution while effectively suppressing multipath artifacts.
We present Ungar, an open-source library to aid the implementation of high-dimensional optimal control problems (OCPs). We adopt modern template metaprogramming techniques to enable the compile-time modeling of complex systems while retaining maximum runtime efficiency. Our framework provides syntactic sugar to allow for expressive formulations of a rich set of structured dynamical systems. While the core modules depend only on the header-only Eigen and this http URL libraries, we bundle our codebase with optional packages and custom wrappers for automatic differentiation, code generation, and nonlinear programming. Finally, we demonstrate the versatility of Ungar in various model predictive control applications, namely, four-legged locomotion and collaborative loco-manipulation with multiple one-armed quadruped robots. Ungar is available under the Apache License 2.0 at this https URL.
This work proposes a robust data-driven tube-based zonotopic predictive control (TZPC) approach for discrete-time linear systems, designed to ensure stability and recursive feasibility in the presence of bounded noise. The proposed approach consists of two phases. In an initial learning phase, we provide an over-approximation of all models consistent with past input and noisy state data using zonotope properties. Subsequently, in a control phase, we formulate an optimization problem, which by integrating terminal ingredients is proven to be recursively feasible. Moreover, we prove that implementing this data-driven predictive control approach guarantees robust exponential stability of the closed-loop system. The effectiveness and competitive performance of the proposed control strategy, compared to recent data-driven predictive control methods, are illustrated through numerical simulations.
Accurate prediction of the state-of-health (SOH) of lithium-ion batteries is essential for ensuring the safety, reliability, and efficient operation of electric vehicles (EVs). Battery packs in EVs experience nonuniform degradation due to cell-to-cell variability (CtCV), posing a major challenge for real-time battery management. In this work, we propose an explainable, data-driven SOH prediction framework tailored for EV battery management systems (BMS). The approach combines robust feature engineering with a DLinear. Using NASA's battery aging dataset, we extract twenty meaningful features from voltage, current, temperature, and time profiles, and select key features using Pearson correlation and Shapley additive explanations (SHAP). The SHAP-based selection yields consistent feature importance across multiple cells, effectively capturing CtCV. The DLinear algorithm outperforms long short-term memory (LSTM) and Transformer architectures in prediction accuracy, while requiring fewer training cycles and lower computational cost. This work offers a scalable and interpretable framework for SOH forecasting, enabling practical implementation in EV BMS and promoting safer, more efficient electric mobility.
Defacing is often applied to head magnetic resonance image (MRI) datasets prior to public release to address privacy concerns. The alteration of facial and nearby voxels has provoked discussions about the true capability of these techniques to ensure privacy as well as their impact on downstream tasks. With advancements in deep generative models, the extent to which defacing can protect privacy is uncertain. Additionally, while the altered voxels are known to contain valuable anatomical information, their potential to support research beyond the anatomical regions directly affected by defacing remains uncertain. To evaluate these considerations, we develop a refacing pipeline that recovers faces in defaced head MRIs using cascaded diffusion probabilistic models (DPMs). The DPMs are trained on images from 180 subjects and tested on images from 484 unseen subjects, 469 of whom are from a different dataset. To assess whether the altered voxels in defacing contain universally useful information, we also predict computed tomography (CT)-derived skeletal muscle radiodensity from facial voxels in both defaced and original MRIs. The results show that DPMs can generate high-fidelity faces that resemble the original faces from defaced images, with surface distances to the original faces significantly smaller than those of a population average face (p < 0.05). This performance also generalizes well to previously unseen datasets. For skeletal muscle radiodensity predictions, using defaced images results in significantly weaker Spearman's rank correlation coefficients compared to using original images (p < 10-4). For shin muscle, the correlation is statistically significant (p < 0.05) when using original images but not statistically significant (p > 0.05) when any defacing method is applied, suggesting that defacing might not only fail to protect privacy but also eliminate valuable information.
Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) is critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose $\textbf{SuPreME}$, a $\textbf{Su}$pervised $\textbf{Pre}$-training framework for $\textbf{M}$ultimodal $\textbf{E}$CG representation learning. SuPreME is pre-trained using structured diagnostic labels derived from ECG report entities through a one-time offline extraction with Large Language Models (LLMs), which help denoise, standardize cardiac concepts, and improve clinical representation learning. By fusing ECG signals with textual cardiac queries instead of fixed labels, SuPreME enables zero-shot classification of unseen conditions without further fine-tuning. We evaluate SuPreME on six downstream datasets covering 106 cardiac conditions, achieving superior zero-shot AUC performance of $77.20\%$, surpassing state-of-the-art eSSLs by $4.98\%$. Results demonstrate SuPreME's effectiveness in leveraging structured, clinically relevant knowledge for high-quality ECG representations.
The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and is exacerbated by the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges such as the slow execution speed for practical deployments. In addition, other works trade off fast watermarking speeds but suffer greatly in their robustness or perceptual quality. In this work, we propose WaterFlow (WF), a fast and extremely robust approach for high fidelity visual watermarking based on a learned latent-dependent watermark. Our approach utilizes a pretrained latent diffusion model to encode an arbitrary image into a latent space and produces a learned watermark that is then planted into the Fourier Domain of the latent. The transformation is specified via invertible flow layers that enhance the expressivity of the latent space of the pre-trained model to better preserve image quality while permitting robust and tractable detection. Most notably, WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks. We validate our findings on three widely used real and generated datasets: MS-COCO, DiffusionDB, and WikiArt.
Heterogeneous morphological features and data imbalance pose significant challenges in rare thyroid carcinoma classification using ultrasound imaging. To address this issue, we propose a novel multitask learning framework, Channel-Spatial Attention Synergy Network (CSASN), which integrates a dual-branch feature extractor - combining EfficientNet for local spatial encoding and ViT for global semantic modeling, with a cascaded channel-spatial attention refinement module. A residual multiscale classifier and dynamically weighted loss function further enhance classification stability and accuracy. Trained on a multicenter dataset comprising more than 2000 patients from four clinical institutions, our framework leverages a residual multiscale classifier and dynamically weighted loss function to enhance classification stability and accuracy. Extensive ablation studies demonstrate that each module contributes significantly to model performance, particularly in recognizing rare subtypes such as FTC and MTC carcinomas. Experimental results show that CSASN outperforms existing single-stream CNN or Transformer-based models, achieving a superior balance between precision and recall under class-imbalanced conditions. This framework provides a promising strategy for AI-assisted thyroid cancer diagnosis.
We present a system for non-intrusive prediction of speech quality in noisy and enhanced speech, developed for Track 3 of the VoiceMOS 2024 Challenge. The task required estimating the ITU-T P.835 metrics SIG, BAK, and OVRL without reference signals and with only 100 subjectively labeled utterances for training. Our approach uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. The system achieved the best performance on BAK prediction (LCC=0.867) and a very close second place in OVRL (LCC=0.711) in the official evaluation. Post-challenge experiments show that adding artificially degraded data to the first fine-tuning stage substantially improves SIG prediction, raising correlation with ground truth scores from 0.207 to 0.516. These results demonstrate that transfer learning with targeted data generation is effective for predicting P.835 scores under severe data constraints.
Multiresolution image fusion is a key problem for real-time satellite imaging and plays a central role in detecting and monitoring natural phenomena such as floods. It aims to solve the trade-off between temporal and spatial resolution in remote sensing instruments. Although several algorithms have been proposed for this problem, the presence of outliers such as clouds downgrades their performance. Moreover, strategies that integrate robustness, recursive operation and learned models are missing. In this paper, a robust recursive image fusion framework leveraging location-aware neural networks (NN) to model the image dynamics is proposed. Outliers are modeled by representing the probability of contamination of a given pixel and band. A NN model trained on a small dataset provides accurate predictions of the stochastic image time evolution, which improves both the accuracy and robustness of the method. A recursive solution is proposed to estimate the high-resolution images using a Bayesian variational inference framework. Experiments fusing images from the Landsat 8 and MODIS instruments show that the proposed approach is significantly more robust against cloud cover, without losing performance when no clouds are present.
Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare OOD medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI development.
Identifying the actual cause of events in engineered systems is a fundamental challenge in system analysis. Finding such causes becomes more challenging in the presence of noise and stochastic behavior in real-world systems. In this paper, we adopt the notion of probabilistic actual causality by Fenton-Glynn, which is a probabilistic extension of Halpern and Pearl's actual causality, and propose a novel method to formally reason about causal effect of events in stochastic systems. We (1) formulate the discovery of probabilistic actual causes in computing systems as an SMT problem, and (2) address the scalability challenges by introducing an abstraction-refinement technique that improves efficiency by up to 95%. We demonstrate the effectiveness of our approach through three case studies, identifying probabilistic actual causes of safety violations in (1) the Mountain Car problem, (2) the Lunar Lander benchmark, and (3) MPC controller for an F-16 autopilot simulator.
In this chapter we provide a thorough overview of the use of energy-based models (EBMs) in the context of inverse imaging problems. EBMs are probability distributions modeled via Gibbs densities $p(x) \propto \exp{-E(x)}$ with an appropriate energy functional $E$. Within this chapter we present a rigorous theoretical introduction to Bayesian inverse problems that includes results on well-posedness and stability in the finite-dimensional and infinite-dimensional setting. Afterwards we discuss the use of EBMs for Bayesian inverse problems and explain the most relevant techniques for learning EBMs from data. As a crucial part of Bayesian inverse problems, we cover several popular algorithms for sampling from EBMs, namely the Metropolis-Hastings algorithm, Gibbs sampling, Langevin Monte Carlo, and Hamiltonian Monte Carlo. Moreover, we present numerical results for the resolution of several inverse imaging problems obtained by leveraging an EBM that allows for the explicit verification of those properties that are needed for valid energy-based modeling.
As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from a large collection of noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Addressing this issue, we introduce cryoGS, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. All these innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoGS over representative baselines. The code will be released upon publication.
This paper investigates a novel downlink symbiotic radio framework enabled by the pinching antenna system (PASS), designed to enhance both primary and secondary transmissions through reconfigurable antenna positioning. This reconfigurability introduces additional degrees of freedom for adaptive pinching beamforming, thereby enabling constructive signal enhancement and interference suppression tailored to the locations of the backscatter device, the Internet of Things (IoT) receiver, and the primary receivers. To fully exploit these benefits, we formulate a joint transmit and pinching beamforming optimization problem that maximizes the achievable sum rate while satisfying the IoT receiver's detection error probability constraint and feasible deployment constraints for the pinching antennas. The resulting problem is inherently nonconvex and highly coupled. To address this challenge, we develop two complementary solution approaches. The first is a learning-aided gradient descent method, where the constrained optimization is reformulated into a differentiable form and solved through end-to-end learning. In this approach, the pinching antenna position matrix is reparameterized to automatically satisfy minimum spacing constraints, while transmit power and waveguide length limits are enforced via projection and normalization. The second approach is an optimization-based successive convex approximation-particle swarm optimization method, which first determines the transmit beamforming solution using successive convex approximation and subsequently optimizes pinching beamforming via a particle swarm optimization search over candidate pinching antenna placements.
In this paper, we present an agentic double deep Q-network (DDQN) scheduler for licensed/unlicensed band allocation in New Radio (NR) sidelink (SL) networks. Beyond conventional reward-seeking reinforcement learning (RL), the agent perceives and reasons over a multi-dimensional context that jointly captures queueing delay, link quality, coexistence intensity, and switching stability. A capacity-aware, quality of service (QoS)-constrained reward aligns the agent with goal-oriented scheduling rather than static thresholding. Under constrained licensed bandwidth, the proposed design reduces blocking by up to 87.5% versus threshold policies while preserving throughput, highlighting the value of context-driven decisions in coexistence-limited NR SL systems.
Design under uncertainty is a challenging problem, as a systems performance can be highly sensitive to variations in input parameters and model uncertainty. A conventional approach to addressing such problems is robust optimization, which seeks to enhance design performance by reducing sensitivity to uncertainty. Alternatively, reliability-based design focuses on optimizing performance while ensuring that failure constraints are satisfied with a specified probability. While both methods are well established, their integration into multi-objective and multi-stakeholder decision-making frameworks remains a challenging problem. In this study, we extend the Compromise Decision Support Problem (cDSP) framework to incorporate reliability-based design considerations and evaluate its performance in comparison to the conventional robust-based cDSP formulation. The developed framework has been validated on a multidisciplinary hot rod rolling process including parametric and model uncertainties. The results compare the predicted performance under robust and reliable scenarios, validating the efficiency of the approach in managing uncertainties for complex, multidisciplinary systems. Specifically, we found that the two methods exhibit markedly different performance when the predicted performance follows a non-normal distribution, a situation that arises in non-linear systems with parametric uncertainty. Based on this insight, we offer guidance to designers on the conditions under which each method is most appropriate.
This paper introduces a Data-Fused Model Predictive Control (DFMPC) framework that combines physics-based models with data-driven representations of unknown dynamics. Leveraging Willems' Fundamental Lemma and an artificial equilibrium formulation, the method enables tracking of changing, potentially unreachable setpoints while explicitly handling measurement noise through slack variables and regularization. We provide guarantees of recursive feasibility and practical stability under input-output constraints for a specific class of reference signals. The approach is validated on the iRonCub flying humanoid robot, integrating analytical momentum models with data-driven turbine dynamics. Simulations show improved tracking and robustness compared to a purely model-based MPC, while maintaining real-time feasibility.
This paper introduces a Markov chain-based approach for the analysis and optimization of spare-management policies in large-scale satellite constellations. Focusing on the direct strategy, we model spare replenishment as a periodic-review reorder-point/order-quantity policy, where spares are deployed directly to constellation planes. The stochastic behavior of satellite failures and launch vehicle lead times is captured through Markov representations of both failure and replenishment dynamics. Based on this efficient and accurate framework, we construct and solve an optimization problem aimed at minimizing operational costs. The effectiveness of the proposed method is demonstrated through a case study using a real-world mega-constellation.
This letter explores intelligent scheduling of sensor-to-controller communication in networked control systems, particularly when data transmission incurs a cost. While the optimal controller in a standard linear quadratic Gaussian (LQG) setup can be computed analytically, determining the optimal times to transmit sensor data remains computationally and analytically challenging. We show that, through reformulation and the introduction of auxiliary binary variables, the scheduling problem can be cast as a computationally efficient mixed-integer linear program (MILP). This formulation not only simplifies the analysis but also reveals structural insights and provides clear decision criteria at each step. Embedding the approach within a model predictive control (MPC) framework enables dynamic adaptation, and we prove that the resulting scheduler performs at least as well as any deterministic strategy (e.g., periodic strategy). Simulation results further demonstrate that our method consistently outperforms traditional periodic scheduling.
Modern power systems face increasing vulnerability to sophisticated cyber-physical attacks beyond traditional N-1 contingency frameworks. Existing security paradigms face a critical bottleneck: efficiently identifying worst-case scenarios and rapidly coordinating defensive responses are hindered by intensive computation and time delays, during which cascading failures can propagate. This paper presents a novel tri-level robust constrained reinforcement learning (RCRL) framework for robust power system security. The framework generates diverse system states through AC-OPF formulations, identifies worst-case N-K attack scenarios for each state, and trains policies to mitigate these scenarios across all operating conditions without requiring predefined attack patterns. The framework addresses constraint satisfaction through Beta-blending projection-based feasible action mapping techniques during training and primal-dual augmented Lagrangian optimization for deployment. Once trained, the RCRL policy learns how to control observed cyber-physical attacks in real time. Validation on IEEE benchmark systems demonstrates effectiveness against coordinated N-K attacks, causing widespread cascading failures throughout the network. The learned policy can successfully respond rapidly to recover system-wide constraints back to normal within 0.21 ms inference times, establishing superior resilience for critical infrastructure protection.
Optimal transport (OT) has recently been shown as a promising criterion for unsupervised restoration when no explicit prior model is available. Despite its theoretical appeal, OT still significantly falls short of supervised methods on challenging tasks such as super-resolution, deraining, and dehazing. In this paper, we propose a \emph{sparsity-aware optimal transport} (SOT) framework to bridge this gap by leveraging a key observation: the degradations in these tasks exhibit distinct sparsity in the frequency domain. Incorporating this sparsity prior into OT can significantly reduce the ambiguity of the inverse mapping for restoration and substantially boost performance. We provide analysis to show exploiting degradation sparsity benefits unsupervised restoration learning. Extensive experiments on real-world super-resolution, deraining, and dehazing demonstrate that SOT offers notable performance gains over standard OT, while achieving superior perceptual quality compared to existing supervised and unsupervised methods. In particular, SOT consistently outperforms existing unsupervised methods across all three tasks and narrows the performance gap to supervised counterparts.
Adversarial training and adversarial purification are two widely used defense strategies for enhancing model robustness against adversarial attacks. However, adversarial training requires costly retraining, while adversarial purification often suffers from low efficiency. More critically, existing defenses are primarily designed under the perturbation-based adversarial threat model, which is ineffective against recently introduced unrestricted adversarial attacks. In this paper, we propose an effective and efficient defense framework that counters both perturbation-based and unrestricted adversarial attacks. Our approach is motivated by the observation that adversarial examples typically lie near the decision boundary and are highly sensitive to pixel-level perturbations. To address this, we introduce adversarial anti-aliasing, a preprocessing technique that mitigates adversarial noise by reducing the magnitude of pixel-level perturbations. In addition, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly restore high-quality images from adversarially degraded ones. Unlike image synthesis methods that generate entirely new images, adversarial super-resolution focuses on image restoration, making it more suitable for purification. Importantly, both techniques require no additional training and are computationally efficient since they do not rely on gradient computations. To further improve robustness across diverse datasets, we introduce a contrastive learning-based adversarial deblurring fine-tuning method. By incorporating adversarial priors during fine-tuning on the target dataset, this method enhances purification effectiveness without the need to retrain diffusion models.
The use of quantum stochastic models is widespread in dynamical reduction, simulation of open systems, feedback control and adaptive estimation. In many applications only part of the information contained in the filter's state is actually needed to reconstruct the target observable quantities; thus, filters of smaller dimensions could be in principle implemented to perform the same this http URL this work, we propose a systematic method to find, when possible, reduced-order quantum filters that are capable of exactly reproducing the evolution of expectation values of interest. In contrast with existing reduction techniques, the reduced model we obtain is exact and in the form of a Belavkin filtering equation, ensuring physical this http URL is attained by leveraging tools from the theory of both minimal realization and non-commutative conditional expectations. The proposed procedure is tested on prototypical examples, laying the groundwork for applications in quantum trajectory simulation and quantum feedback control.
Despite their groundbreaking performance, state-of-the-art autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training/environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge when deploying agents in the real world. Here, we introduce a Distributionally Robust Free Energy model (DR-FREE) that instills this core property by design. It directly wires robustness into the agent decision-making mechanisms via free energy minimization. By combining a robust extension of the free energy principle with a novel resolution engine, DR-FREE returns a policy that is optimal-yet-robust against ambiguity. The policy has an explicit, soft-max, structure that reveals the mechanistic role of ambiguity on optimal decisions and requisite Bayesian belief updating. We evaluate DR-FREE on an experimental testbed involving real rovers navigating an ambiguous environment filled with obstacles. Across all the experiments, DR-FREE enables robots to successfully navigate towards their goal even when, in contrast, state-of-the-art free energy models fail. In short, DR-FREE can tackle scenarios that elude previous methods: this milestone may inspire both deployment in multi-agent settings and, at a perhaps deeper level, the quest for a biologically plausible explanation of how natural agents -- with little or no training -- survive in capricious environments.
This paper extends sliding-mode control theory to nonlinear systems evolving on smooth manifolds. Building on differential geometric methods, we reformulate Filippov's notion of solutions, characterize well-defined vector fields on quotient spaces, and provide a consistent geometric definition of higher-order sliding modes. We generalize the regular form to non-Euclidean settings and design explicit first- and second-order sliding-mode controllers that respect the manifold structure. Particular attention is given to the role of topological obstructions, which are illustrated through examples on the cylinder, Möbius bundle, and 2-sphere. Our results highlight how geometric and topological properties fundamentally influence sliding dynamics and suggest new directions for robust control in nonlinear spaces.
Stacked intelligent metasurfaces (SIMs), which integrate multiple programmable metasurface layers, have recently emerged as a promising technology for advanced wave-domain signal processing. SIMs benefit from flexible spatial degree-of-freedom (DoF) while reducing the requirement for costly radio-frequency (RF) chains. However, current state-of-the-art SIM designs face challenges such as complex phase shift optimization and energy attenuation from multiple layers. To address these aspects, we propose incorporating meta-fibers into SIMs, with the aim of reducing the number of layers and enhancing the energy efficiency. First, we introduce a meta-fiber-connected 2-layer SIM that exhibits the same flexible signal processing capabilities as conventional multi-layer structures, and explains the operating principle. Subsequently, we formulate and solve the optimization problem of minimizing the mean square error (MSE) between the SIM channel and the desired channel matrices. Specifically, by designing the phase shifts of the meta-atoms associated with the transmitting-SIM and receiving-SIM, a non-interference system with parallel subchannels is established. In order to reduce the computational complexity, a closed-form expression for each phase shift at each iteration of an alternating optimization (AO) algorithm is proposed. We show that the proposed algorithm is applicable to conventional multi-layer SIMs. The channel capacity bound and computational complexity are analyzed to provide design insights. Finally, numerical results are illustrated, demonstrating that the proposed two-layer SIM with meta-fiber achieves over a 25% improvement in channel capacity while reducing the total number of meta-atoms by 59% as compared with a conventional seven-layer SIM.
This work studies a class of optimal control problems with scalar inputs and general constraints, whose solutions follow a bang-ride pattern that always activates a constraint and enables efficient numerical computation. As a motivating example, fast battery charging leads to computationally demanding optimal control problems when detailed electrochemical models are used. Recently proposed optimization-free heuristics reduce this computational cost while producing input profiles observed in practice, following a bang-ride pattern and applying the maximum feasible input. We investigate when such heuristics satisfy necessary optimality conditions. By leveraging Pontryagin's maximum principle, we unify and formalize existing insights on the bang-ride structure and on the optimal control attaining the maximum feasible input under monotonicity. We further establish a novel connection between the structured optimal control and the external positivity of the costate dynamics. These results provide a rigorous theoretical foundation for heuristic charging strategies and explain the efficiency of optimization-free algorithms.
We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies with a hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employs a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets
Code-switching (CS) presents a significant challenge for general Auto-Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8\% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.