Diabetic retinopathy (DR) is a major cause of visual impairment, and effective treatment depends heavily on timely and accurate diagnosis. Deep learning models have demonstrated great success in identifying DR from retinal images. However, relying on model predictions alone, without any indication of model confidence, poses significant risk in clinical settings. This paper investigates uncertainty-aware deep learning models equipped with a rejection mechanism that defers low-confidence predictions, mirroring deferred decision-making in clinical practice. The results reveal a trade-off between prediction coverage and the reliability of accepted predictions. The Variational Bayesian model adopted a more conservative strategy when predicting DR, rejecting the uncertain predictions. The model is evaluated using accuracy on accepted predictions, the proportion of accepted cases (coverage), the rejection ratio, and Expected Calibration Error (ECE). The findings demonstrate a clear trade-off between accuracy and caution, establishing that uncertainty estimation and selective rejection improve the model's reliability in safety-critical diagnostic use cases.
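As an illustration of how such a rejection mechanism and the reported metrics interact, the following sketch (ours, not the paper's code; the confidence threshold and bin count are arbitrary) computes selective accuracy, coverage, rejection ratio, and ECE from softmax probabilities:

```python
import numpy as np

def selective_metrics(probs, labels, threshold=0.8, n_bins=10):
    """Illustrative selective-prediction metrics: accuracy on accepted
    predictions, coverage, rejection ratio, and Expected Calibration Error."""
    conf = probs.max(axis=1)              # predictive confidence per sample
    preds = probs.argmax(axis=1)
    accepted = conf >= threshold          # reject low-confidence predictions

    coverage = accepted.mean()            # proportion of accepted cases
    rejection_ratio = 1.0 - coverage
    selective_acc = (preds[accepted] == labels[accepted]).mean() if accepted.any() else float("nan")

    # Expected Calibration Error over equal-width confidence bins (all samples)
    ece = 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            acc_bin = (preds[in_bin] == labels[in_bin]).mean()
            conf_bin = conf[in_bin].mean()
            ece += in_bin.mean() * abs(acc_bin - conf_bin)

    return {"selective_accuracy": selective_acc, "coverage": coverage,
            "rejection_ratio": rejection_ratio, "ece": ece}
```

Raising the threshold trades coverage for reliability, which is exactly the trade-off the abstract reports.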
Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals. However, the complex nature of brain activity introduces critical challenges: EEG signals simultaneously encode both cognitive processes and intrinsic neural states, creating a modality mismatch in paired EEG data that hinders effective cross-modal representation learning. Through a pilot investigation, we uncover complementary relationships between these modalities. Leveraging this insight, we propose mapping EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. To fully enable conversational capabilities, we further introduce WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations across four downstream tasks, thereby offering valuable insights for both neuroscience research and the development of general-purpose EEG models.
The integration of deep learning into medical imaging systems has transformed disease detection and diagnosis, with pneumonia identification being a prominent application. This study introduces a deep learning system based on Convolutional Neural Networks (CNNs) for automated pneumonia detection from chest X-ray images, boosting diagnostic precision and speed. The proposed CNN architecture integrates separable convolutions, batch normalization, and dropout regularization to enhance feature extraction while reducing overfitting. Through data augmentation techniques and adaptive learning rate strategies, the model was trained on an extensive collection of chest X-ray images to improve its generalization. Evaluation metrics including accuracy, precision, recall, and F1 score collectively verify the model's strong performance, with an accuracy of 91%. Beyond model performance, this study addresses critical clinical implementation obstacles such as data privacy protection, model interpretability, and integration with current healthcare systems. The approach further integrates medical ontologies and semantic technology with the machine learning outputs, coupling them to structured medical knowledge frameworks to improve diagnostic accuracy and interpretability. The findings demonstrate that AI-powered healthcare tools offer a scalable and efficient pneumonia detection solution. This study advances AI integration into clinical settings by developing more precise automated diagnostic methods that deliver consistent medical imaging results.
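For readers unfamiliar with the named building blocks, a minimal PyTorch sketch of a depthwise-separable convolution block with batch normalization and dropout is shown below; it is illustrative only and does not reproduce the authors' architecture or hyperparameters:

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise-separable convolution + batch norm + dropout, a common way to
    cut parameters while regularizing against overfitting."""
    def __init__(self, in_ch, out_ch, p_drop=0.2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        x = self.pointwise(self.depthwise(x))
        return self.drop(self.act(self.bn(x)))

# Example: a tiny grayscale chest X-ray classifier built from such blocks
model = nn.Sequential(
    SeparableConvBlock(1, 32), nn.MaxPool2d(2),
    SeparableConvBlock(32, 64), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2),
)
logits = model(torch.randn(4, 1, 224, 224))   # (batch, classes)
```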
Early and accurate diagnosis of Alzheimer's Disease is critical for effective clinical intervention, particularly in distinguishing it from Mild Cognitive Impairment, a prodromal stage marked by subtle structural changes. In this study, we propose a hybrid deep learning ensemble framework for Alzheimer's Disease classification using structural magnetic resonance imaging. Gray and white matter slices are used as inputs to three pretrained convolutional neural networks, namely ResNet50, NASNet, and MobileNet, each fine-tuned end to end. To further enhance performance, we incorporate a stacked ensemble learning strategy with a meta-learner and weighted averaging to optimally combine the base models. Evaluated on the Alzheimer's Disease Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art accuracy of 99.21% for Alzheimer's Disease vs. Mild Cognitive Impairment and 91.0% for Mild Cognitive Impairment vs. Normal Controls, outperforming conventional transfer learning and baseline ensemble methods. To improve interpretability in image-based diagnostics, we integrate Explainable AI techniques via Gradient-weighted Class Activation Mapping (Grad-CAM), which generates heatmaps and attribution maps that highlight critical regions in gray and white matter slices, revealing structural biomarkers that influence model decisions. These results highlight the framework's potential for robust and scalable clinical decision support in neurodegenerative disease diagnostics.
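The stacking strategy can be pictured with a small illustrative sketch; the base-model probabilities below are random placeholders standing in for softmax outputs, and the weights and meta-learner choice are assumptions rather than the authors' settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder validation-split probabilities from three base CNNs
rng = np.random.default_rng(0)
p_resnet, p_nasnet, p_mobile = (rng.random((200, 2)) for _ in range(3))
y_val = rng.integers(0, 2, 200)

# (1) Weighted averaging with fixed (e.g. validation-tuned) weights
w = np.array([0.4, 0.35, 0.25])
p_avg = w[0] * p_resnet + w[1] * p_nasnet + w[2] * p_mobile
pred_avg = p_avg.argmax(axis=1)

# (2) Meta-learner: stack base probabilities as features for a second-level model
meta_features = np.hstack([p_resnet, p_nasnet, p_mobile])
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y_val)
pred_meta = meta_learner.predict(meta_features)
```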
Effective stroke recovery requires continuous rehabilitation integrated with daily living. To support this need, we propose a home-based rehabilitation exercise and feedback system. The system consists of (1) a hardware setup with an RGB-D camera and wearable sensors to capture stroke patients' movements, (2) a mobile application for exercise guidance, and (3) an AI server for assessment and feedback. When a stroke patient exercises following the application guidance, the system records skeleton sequences, which are then assessed by the deep learning model RAST-G@. The model employs a spatio-temporal graph convolutional network (ST-GCN) to extract skeletal features and integrates transformer-based temporal attention to estimate action quality. For system implementation, we constructed the NRC dataset, which includes 10 upper-limb activities of daily living (ADL) and 5 range-of-motion (ROM) exercises collected from stroke and non-disabled participants, with score annotations provided by licensed physiotherapists. Results on the KIMORE and NRC datasets show that RAST-G@ improves over the baselines in terms of MAD, RMSE, and MAPE. Furthermore, the system provides user feedback that combines patient-centered assessment and monitoring. The results demonstrate that the proposed system offers a scalable approach for quantitative and consistent domiciliary rehabilitation assessment.
We present InfoVAE-Med3D, a latent-representation learning approach for 3D brain MRI that targets interpretable biomarkers of cognitive decline. Standard statistical models and shallow machine learning often lack power, while most deep learning methods behave as black boxes. Our method extends InfoVAE to explicitly maximize mutual information between images and latent variables, producing compact, structured embeddings that retain clinically meaningful content. We evaluate on two cohorts: a large healthy-control dataset (n=6527) with chronological age, and a clinical multiple sclerosis dataset from Charles University in Prague (n=904) with age and Symbol Digit Modalities Test (SDMT) scores. The learned latents support accurate brain-age and SDMT regression, preserve key medical attributes, and form intuitive clusters that aid interpretation. Across reconstruction and downstream prediction tasks, InfoVAE-Med3D consistently outperforms other VAE variants, indicating stronger information capture in the embedding space. By uniting predictive performance with interpretability, InfoVAE-Med3D offers a practical path toward MRI-based biomarkers and more transparent analysis of cognitive deterioration in neurological disease.
Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. In experiments on five publicly available datasets, DPsurv achieves the highest mean concordance index and the lowest mean integrated Brier score, validating its effectiveness and reliability. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis, using the open-source SCIN dataset. We leveraged the SkinGPT-4 backbone to develop fine-tuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8%. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.
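The fairness metrics referenced here can be computed per skin-tone group roughly as follows; this is an illustrative sketch with toy data, and the group labels and binarized predictions are assumptions rather than the paper's protocol:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, groups):
    """Illustrative group-fairness quantities across skin-tone groups.
    Demographic parity compares positive-prediction rates across groups;
    equalized odds compares true/false positive rates across groups."""
    out = {}
    for g in np.unique(groups):
        m = groups == g
        pos_rate = y_pred[m].mean()
        tpr = y_pred[m][y_true[m] == 1].mean() if (y_true[m] == 1).any() else np.nan
        fpr = y_pred[m][y_true[m] == 0].mean() if (y_true[m] == 0).any() else np.nan
        out[g] = {"positive_rate": pos_rate, "tpr": tpr, "fpr": fpr}
    return out

# Toy example: gaps between groups quantify the disparity being reported
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
groups = np.array(["I-II", "I-II", "V-VI", "V-VI", "V-VI", "I-II"])
print(fairness_metrics(y_true, y_pred, groups))
```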
This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, yielding a 5.86\% BD-Rate reduction over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
Osteoporosis silently erodes skeletal integrity worldwide; however, early detection through imaging can prevent most fragility fractures. Artificial intelligence (AI) methods now mine routine Dual-energy X-ray Absorptiometry (DXA), X-ray, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI) scans for subtle, clinically actionable markers, but the literature is fragmented. This survey unifies the field through a tri-axial framework that couples imaging modalities with clinical tasks and AI methodologies (classical machine learning, convolutional neural networks (CNNs), transformers, self-supervised learning, and explainable AI). Following a concise clinical and technical primer, we detail our Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-guided search strategy, introduce the taxonomy via a roadmap figure, and synthesize cross-study insights on data scarcity, external validation, and interpretability. By identifying emerging trends, open challenges, and actionable research directions, this review provides AI scientists, medical imaging researchers, and musculoskeletal clinicians with a clear compass to accelerate rigorous, patient-centered innovation in osteoporosis care. The project page of this survey can also be found on GitHub.
The necessity of new spectrum for 6G has intensified global interest in radio propagation measurements across emerging frequency bands, use cases, and antenna types. These measurements are vital for understanding radio channel properties in diverse environments, and involve time-consuming and expensive campaigns. A major challenge for the effective utilization of propagation measurement data has been the lack of a standardized format for reporting and archiving results. Although organizations such as NIST, NGA, and 3GPP have made commendable efforts for data pooling, a unified machine-readable data format for consolidating measurements across different institutions and frequencies remains a missing piece in advancing global standardization efforts. This paper introduces a standardized point-data format for radio propagation measurements and demonstrates how institutions may merge disparate campaigns into a common format. This data format, alongside an environmental map and a measurement summary metadata table, enables integration of data from disparate sources by using a structured representation of key parameters. Here, we show the efficacy of the point-data format standard using data gathered from two independent sub-THz urban microcell (UMi) campaigns: 142 GHz measurements at New York University (NYU) and 145 GHz measurements at the University of Southern California (USC). A joint path loss analysis using the close-in (CI) path loss model with a 1 m reference distance, applied to measurements pooled via the proposed standard, yields a refined estimate of the path loss exponent (PLE). Other statistics such as RMS delay spread and angular spread are also determined using a joint point-data table. Adopting this simple, unified format will accelerate channel model development, build multi-institutional datasets, and feed AI/ML applications with reliable training data in a common format from many sources.
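For context, the close-in model anchors path loss to free-space loss at 1 m, so pooling reduces to a one-parameter least-squares fit of the PLE over the merged point-data table. A minimal sketch (ours, with assumed column choices) is:

```python
import numpy as np

def fspl_1m_db(freq_ghz):
    """Free-space path loss at the 1 m reference distance, in dB."""
    return 20 * np.log10(4 * np.pi * freq_ghz * 1e9 / 3e8)

def fit_ci_ple(dist_m, pl_db, freq_ghz):
    """Least-squares fit of the exponent n in the close-in model:
    PL(d) = FSPL(f, 1 m) + 10 n log10(d) + X_sigma."""
    a = pl_db - fspl_1m_db(freq_ghz)   # excess loss over the 1 m anchor
    x = 10 * np.log10(dist_m)
    n = np.sum(a * x) / np.sum(x * x)  # closed-form least-squares estimate
    sigma = np.std(a - n * x)          # shadow-fading standard deviation (dB)
    return n, sigma

# Pooling: concatenate (distance, path loss, frequency) rows from both
# campaigns (e.g. 142 GHz and 145 GHz) before calling fit_ci_ple;
# freq_ghz may be a per-row array so each sample uses its own 1 m anchor.
```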
Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work, we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models, combined with novel adaptations to spatial audio, to generate 3rd-order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.
Attitude stabilization of unmanned aerial vehicles in uncertain environments presents significant challenges due to nonlinear dynamics, parameter variations, and sensor limitations. This paper presents a comparative study of $\mathcal{H}_\infty$ and classical PID controllers for multi-rotor attitude regulation in the presence of wind disturbances and gyroscope noise. The flight dynamics are modeled using a linear parameter-varying (LPV) framework, where nonlinearities and parameter variations are systematically represented as structured uncertainties within a linear fractional transformation formulation. A robust controller based on the $\mathcal{H}_\infty$ formulation is designed using only gyroscope measurements to ensure guaranteed performance bounds. Nonlinear simulation results demonstrate the effectiveness of the robust controller compared to classical PID control, showing significant improvement in attitude regulation under severe wind disturbances.
This paper introduces the Extended Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD SVDSR), a resource specifically designed to facilitate the creation of high-quality deepfakes and support the development of detection systems trained against them. The dataset comprises 45-minute audio recordings from 36 participants, each reading various newspaper articles recorded under controlled conditions and captured via five microphones of differing quality. By focusing on extended-duration audio, ELAD SVDSR captures a richer range of speech attributes, such as pitch contours, intonation patterns, and nuanced delivery, enabling models to generate more realistic and coherent synthetic voices. In turn, this approach allows for the creation of robust deepfakes that can serve as challenging examples in datasets used to train and evaluate synthetic voice detection methods. As part of this effort, 20 deepfake voices have already been created and added to the dataset to showcase its potential. Anonymized metadata on speaker demographics accompanies the dataset. ELAD SVDSR is expected to spur significant advancements in audio forensics, biometric security, and voice authentication systems.
We introduce a computationally efficient and tunable feedback delay network (FDN) architecture for real-time room impulse response (RIR) rendering that addresses the computational and latency challenges inherent in traditional convolution and Fourier transform based methods. Our approach directly optimizes FDN parameters to match target RIR acoustic and psychoacoustic metrics, such as clarity and definition, through novel differentiable programming-based optimization. Our method enables dynamic, real-time adjustments of room impulse responses that accommodate listener and source movement. When combined with previous work on representing head-related impulse responses (HRIRs) via infinite impulse responses, efficient rendering of auditory objects is possible when the HRIR and RIR are known. Our method produces renderings with quality similar to convolution with long binaural room impulse response (BRIR) filters, but at a fraction of the computational cost.
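As background for the acoustic targets mentioned above, clarity and definition are simple energy ratios of the RIR. A minimal NumPy sketch is shown below; it is not the paper's differentiable formulation, only an illustration of the metrics themselves:

```python
import numpy as np

def clarity_definition(rir, fs, t_ms=50):
    """Clarity C_t (dB) and definition D_t of a room impulse response:
    C_t = 10*log10(early energy / late energy), D_t = early / total,
    where the early window covers the first t_ms milliseconds."""
    n = int(fs * t_ms / 1000)
    energy = rir ** 2
    early, late = energy[:n].sum(), energy[n:].sum()
    c_t = 10 * np.log10(early / late)
    d_t = early / energy.sum()
    return c_t, d_t

# Rewritten with differentiable tensor operations, ratios of this kind are the
# sort of targets an FDN's gains and delay-line parameters can be optimized against.
```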
Own voice pickup technology for hearable devices facilitates communication in noisy environments. Own voice reconstruction (OVR) systems enhance the quality and intelligibility of the recorded noisy own voice signals. Since disturbances affecting the recorded own voice signals depend on individual factors, personalized OVR systems have the potential to outperform generic OVR systems. In this paper, we propose personalizing OVR systems through data augmentation and fine-tuning, comparing them to their generic counterparts. We investigate the influence of personalization on speech quality assessed by objective metrics and conduct a subjective listening test to evaluate quality under various conditions. In addition, we assess the prediction accuracy of the objective metrics by comparing predicted quality with subjectively measured quality. Our findings suggest that personalized OVR provides benefits over generic OVR for some talkers only. Our results also indicate that performance comparisons between systems are not always accurately predicted by objective metrics. In particular, certain disturbances lead to a consistent overestimation of quality compared to actual subjective ratings.
Objective, task-based measures of image quality (IQ) have been widely advocated for assessing and optimizing medical imaging technologies. Besides signal detection theory-based measures, information-theoretic quantities have been proposed to quantify task-based IQ. For example, task-specific information (TSI), defined as the mutual information between an image and a task variable, represents an optimal measure of how informative an image is for performing a specified task. However, like the ideal observer from signal detection theory, TSI does not quantify the amount of task-relevant information in an image that can be exploited by a sub-ideal observer. A recently proposed relaxation of TSI, termed predictive V-information (V-info), removes this limitation and can quantify the utility of an image with consideration of a specified family of sub-ideal observers. In this study, for the first time, V-info is proposed and investigated as an objective, task-specific IQ metric. To corroborate its usefulness, a stylized magnetic resonance image restoration problem is considered in which V-info is employed to quantify signal detection or discrimination performance. The presented results show that V-info correlates with the area under the receiver operating characteristic (ROC) curve for binary tasks, while being readily applicable to multi-class (>2) tasks where ROC analysis is challenging. Notably, V-info exhibits greater sensitivity in scenarios where conventional metrics saturate. These findings demonstrate that V-info represents a new objective IQ measure that can complement conventional signal detection theory-based ones.
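For reference, predictive V-information is usually defined as follows (a sketch in our notation, restricted to a predictive family $\mathcal{V}$ of sub-ideal observers; the paper's exact observer families and estimators are not reproduced here):

\[
H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\!\left[-\log f[\varnothing](Y)\right], \qquad
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\!\left[-\log f[X](Y)\right],
\]
\[
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X),
\]

so that $I_{\mathcal{V}}$ recovers Shannon mutual information when $\mathcal{V}$ is unrestricted, and otherwise measures only the task information that the chosen observer family can actually exploit.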
In this paper, we present the combined learning-and-control (CLC) approach, a new way to solve optimal control problems with unknown dynamics by unifying model-based control and data-driven learning. The key idea is simple: we design a controller to be optimal for a proxy objective built on an available model while penalizing mismatches with the real system, so that the resulting controller is also optimal for the actual system. Building on the original CLC formulation, we apply the framework to the linear quadratic regulator problem and make three advances: (i) we show that the CLC penalty is a sequence of stage-specific weights rather than a single constant; (ii) we identify when these weights can be set in advance and when they must depend on the (unknown) dynamics; and (iii) we develop a lightweight learning loop that tunes the weights directly from data without abandoning the benefits of a model-based design. We provide a complete algorithm and an empirical study against common baseline methods. The results clarify where prior knowledge suffices and where learning is essential, and they position CLC as a practical, theoretically grounded bridge between classical optimal control and modern learning methods.
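To make the LQR setting concrete, a standard finite-horizon Riccati recursion on a proxy (nominal) model is sketched below; the stage-specific CLC weights themselves are only indicated in comments, since their exact form is the paper's contribution and is not reproduced here:

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, Qf, N):
    """Backward Riccati recursion for the finite-horizon discrete LQR.
    Returns the stage feedback gains K_0..K_{N-1}, with u_k = -K_k x_k."""
    P = Qf
    gains = []
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# In a CLC-style setup, the stage cost would additionally carry per-stage
# weights penalizing mismatch between this proxy model and the real system;
# those weights can then be tuned from closed-loop data.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # proxy (nominal) model
B = np.array([[0.0], [0.1]])
K_seq = finite_horizon_lqr(A, B, np.eye(2), np.eye(1), np.eye(2), N=50)
```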
Diffusion Transformers (DiTs) enable high-quality audio synthesis but are often computationally intensive and require substantial storage, which limits their practical deployment. In this paper, we present a comprehensive evaluation of post-training quantization (PTQ) techniques for audio DiTs, analyzing the trade-offs between static and dynamic quantization schemes. We explore two practical extensions: (1) a denoising-timestep-aware smoothing method that adapts quantization scales per input channel and timestep to mitigate activation outliers, and (2) a lightweight low-rank adapter (LoRA)-based branch derived from singular value decomposition (SVD) to compensate for residual weight errors. Using Stable Audio Open, we benchmark W8A8 and W4A8 configurations across objective metrics and human perceptual ratings. Our results show that dynamic quantization preserves fidelity even at lower precision, while static methods remain competitive with lower latency. Overall, our findings show that low-precision DiTs can retain high-fidelity generation while reducing memory usage by up to 79%.
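The SVD-derived low-rank branch can be illustrated with a minimal sketch; the simple per-tensor quantizer, rank, and bit-width below are arbitrary stand-ins for the paper's W4A8/W8A8 settings, not its implementation:

```python
import numpy as np

def lowrank_quant_compensation(w, rank=8, n_bits=4):
    """Quantize a weight matrix, then fit a low-rank (LoRA-like) branch
    from the SVD of the residual quantization error."""
    # Symmetric per-tensor quantization as a simple stand-in
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    w_q = np.round(w / scale).clip(-2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1) * scale

    # Low-rank approximation of the residual error w - w_q
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out, rank)
    b = vt[:rank]                # (rank, in)
    return w_q, a, b             # effective weight: w_q + a @ b

w = np.random.randn(256, 256)
w_q, a, b = lowrank_quant_compensation(w)
err_plain = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))   # compensated error is smaller
```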
Existing beamforming-based full-duplex solutions for multi-antenna wireless systems often rely on explicit estimation of the self-interference channel. The pilot overhead of such estimation, however, can be prohibitively high in millimeter-wave and massive MIMO systems, thus limiting the practicality of existing solutions, especially in fast-fading conditions. In this work, we present a novel beam learning framework that bypasses explicit self-interference channel estimation by designing beam codebooks to efficiently obtain implicit channel knowledge that can then be processed by a deep learning network to synthesize transmit and receive beams for full-duplex operation. Simulation results using ray-tracing illustrate that our proposed technique can allow a full-duplex base station to craft serving beams that couple low self-interference while delivering high SNR, with 75-97% fewer measurements than would be required for explicit estimation of the self-interference channel.
Mosquito Species Classification (MSC) is crucial for vector surveillance and disease control. The collection of mosquito bioacoustic data is often limited by mosquito activity seasons and fieldwork. Mosquito recordings across regions, habitats, and laboratories often show non-biological variations from the recording environment, which we refer to as domain features. This study finds that models trained directly on audio recordings with domain features tend to rely on domain information rather than the species' acoustic cues for identification, resulting in deceptively good in-domain performance but poor cross-domain generalization. To this end, we propose a Domain-Robust Bioacoustic Learning (DR-BioL) framework that combines contrastive learning with distribution alignment. Contrastive learning promotes cohesion within the same species and mitigates inter-domain discrepancies, and species-conditional distribution alignment further enhances cross-domain species representation. Experiments on a multi-domain mosquito bioacoustic dataset from diverse environments show that DR-BioL improves accuracy and robustness over the baselines, highlighting its potential for reliable cross-domain MSC in the real world.
Learning Model Predictive Control (LMPC) improves performance on iterative tasks by leveraging data from previous executions. At each iteration, LMPC constructs a sampled safe set from past trajectories and uses it as a terminal constraint, with a terminal cost given by the corresponding cost-to-go. While effective, LMPC heavily depends on the initial trajectories: states with high cost-to-go are rarely selected as terminal candidates in later iterations, leaving parts of the state space unexplored and potentially missing better solutions. For example, in a reach-avoid task with two possible routes, LMPC may keep refining the initially shorter path while neglecting the alternative path that could lead to a globally better solution. To overcome this limitation, we propose Multi-Modal LMPC (MM-LMPC), which clusters past trajectories into modes and maintains mode-specific terminal sets and value functions. A bandit-based meta-controller with a Lower Confidence Bound (LCB) policy balances exploration and exploitation across modes, enabling systematic refinement of all modes. This allows MM-LMPC to escape high-cost local optima and discover globally superior solutions. We establish recursive feasibility, closed-loop stability, asymptotic convergence to the best mode, and a logarithmic regret bound. Simulations on obstacle-avoidance tasks validate the performance improvements of the proposed method.
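The bandit-based meta-controller can be sketched as an LCB rule over per-mode cost histories; this is illustrative only, and the exploration constant and tie-breaking are assumptions rather than the paper's settings:

```python
import numpy as np

def select_mode(costs_per_mode, t, c=1.0):
    """Lower-Confidence-Bound selection over trajectory modes (cost minimization):
    pick the mode whose optimistic (lower) cost bound is smallest."""
    lcb = []
    for i, costs in enumerate(costs_per_mode):
        if len(costs) == 0:
            return i                              # try unexplored modes first
        bonus = c * np.sqrt(np.log(t) / len(costs))  # shrinks as a mode is revisited
        lcb.append(np.mean(costs) - bonus)
    return int(np.argmin(lcb))

# Example: mode 0 = initially shorter route, mode 1 = alternative route
history = [[10.2, 9.8, 9.7], [11.5]]
mode = select_mode(history, t=5)   # the rarely tried mode may still be selected
```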
Gadolinium-based contrast agents (GBCAs) are widely used in magnetic resonance imaging (MRI) to enhance lesion detection and characterisation, particularly in the field of neuro-oncology. Nevertheless, concerns regarding gadolinium retention and accumulation in brain and body tissues, most notably for diseases that require close monitoring and frequent GBCA injection, have led to the need for strategies to reduce dosage. In this study, a deep learning framework is proposed for the virtual contrast enhancement of full-dose post-contrast T1-weighted MRI images from corresponding low-dose acquisitions. The contribution of the presented model is its utilisation of longitudinal information, which is achieved by incorporating a prior full-dose MRI examination from the same patient. A comparative evaluation against a non-longitudinal single session model demonstrated that the longitudinal approach significantly improves image quality across multiple reconstruction metrics. Furthermore, experiments with varying simulated contrast doses confirmed the robustness of the proposed method. These results emphasize the potential of integrating prior imaging history into deep learning-based virtual contrast enhancement pipelines to reduce GBCA usage without compromising diagnostic utility, thus paving the way for safer, more sustainable longitudinal monitoring in clinical MRI practice.
Accurate identification of mental health biomarkers can enable earlier detection and objective assessment of compromised mental well-being. In this study, we analyze electrodermal activity recorded during an Emotional Stroop task to capture sympathetic arousal dynamics associated with depression and suicidal ideation. We model the timing of skin conductance responses as a point process whose conditional intensity is modulated by task-based covariates, including stimulus valence, reaction time, and response accuracy. The resulting subject-specific parameter vector serves as input to a machine learning classifier for distinguishing individuals with and without depression. Our results show that the model parameters encode meaningful physiological differences associated with depressive symptomatology and yield superior classification performance compared to conventional feature extraction methods.
Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, it remains challenging due to complex nonlinearities, oscillations, and direction-dependent, piecewise dynamics introduced by affordable pneumatic valves and the bidirectional architecture. We present a model-based control framework that couples a physics-grounded switched nonlinear plant model (inflation/deflation modes) with a mixed-integer nonlinear model predictive controller (MI-NMPC). The controller co-optimizes mode scheduling and PWM inputs to realize accurate reference tracking while enforcing input constraints and penalizing energy consumption and excessive switching. To make discrete mode decisions tractable, we employ a Combinatorial Integral Approximation that relaxes binary mode variables to continuous surrogates within the valve-scheduling layer. With parameters identified from the physical system, simulations with step and sinusoidal references validate the proposed MI-NMPC, showing a consistently favorable trade-off among accuracy, control effort, and switching, and outperforming conventional PID and NMPC with heuristic mode selection.
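The Combinatorial Integral Approximation is commonly realized by sum-up rounding of the relaxed mode profile; a scalar single-mode sketch (ours, not the paper's implementation) is:

```python
import numpy as np

def sum_up_rounding(relaxed, dt=1.0):
    """Sum-up rounding: turn a relaxed mode profile b(t) in [0, 1] into a
    binary schedule whose accumulated (integrated) deviation stays bounded."""
    binary = np.zeros_like(relaxed)
    accum = 0.0
    for k, b in enumerate(relaxed):
        accum += b * dt
        # Switch on whenever the relaxed integral runs ahead of the binary one
        binary[k] = 1.0 if accum - binary[:k].sum() * dt >= 0.5 * dt else 0.0
    return binary

relaxed_mode = np.array([0.2, 0.6, 0.9, 0.4, 0.1])   # e.g. relaxed inflation-mode variable
print(sum_up_rounding(relaxed_mode))                  # -> [0. 1. 1. 0. 0.]
```

For multiple mutually exclusive modes (inflation/deflation), the same idea is applied jointly, activating at each step the mode with the largest accumulated deficit.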
This paper presents two innovative four-port probe stations developed by FormFactor Incorporated (FFI) and MPI Corporation (MPI), and a four-port calibration standard design up to 125 GHz for the probe stations. True four-port probing at mmWave and beyond does not yet exist, but is anticipated for future multi-band wireless devices using several antennas and RF chains. The four-port probe stations are housed in the THz measurement facility at NYU and allow simultaneous probing from East, West, North, and South orientations, which presents challenges for calibration. An on-chip Short-Open-Load-Reciprocal (SOLR) calibration (cal) standard is designed leveraging UMC's 28 nm CMOS process. S/O/L standard S-parameters are extracted using a virtual multiline Thru-Reflect-Line (mTRL) cal and used to validate SOLR cal performance via simulations up to 125 GHz. The novel probing solutions from MPI and FFI, along with the SOLR cal, open up considerable opportunities for precise RF characterization across wide frequency ranges.
Purpose: To develop a fast and precise method for searching rectangular regions in brain tumor images. Methods: The authors propose a new method for searching rectangular tumor regions in brain MR images. The proposed method consisted of a segmentation network and a fast search method with a user-controllable search metric. As the segmentation network, the U-Net whose encoder was replaced by the EfficientNet was used. In the fast search method, summed-area tables were used for accelerating sums of voxels in rectangular regions. Use of the summed-area tables enabled exhaustive search of the 3D offset (3D full search). The search metric was designed for giving priority to cubes over oblongs, and assigning better values for higher tumor fractions even if they exceeded target tumor fractions. The proposed computation and metric were compared with those used in a conventional method using the Brain Tumor Image Segmentation dataset. Results: When the 3D full search was used, the proposed computation (8 seconds) was 100-500 times faster than the conventional computation (11-40 minutes). When the user-controllable parts of the search metrics were changed variously, the tumor fractions of the proposed metric were higher than those of the conventional metric. In addition, the conventional metric preferred oblongs whereas the proposed metric preferred cubes. Conclusion: The proposed method is promising for implementing fast and precise search of rectangular tumor regions, which is useful for brain tumor diagnosis using MRI systems. The proposed computation reduced processing times of the 3D full search, and the proposed metric improved the quality of the assigned rectangular tumor regions.
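The speed-up comes from the classic integral-image idea extended to 3D: after one cumulative-sum pass, any rectangular box sum costs O(1) via inclusion-exclusion. A minimal sketch (ours, not the authors' code):

```python
import numpy as np

def summed_volume_table(vol):
    """3-D summed-area (integral) table: cumulative sums along each axis,
    zero-padded at the origin so box sums need no boundary special-casing."""
    s = vol.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(s, ((1, 0), (1, 0), (1, 0)))

def box_sum(svt, z0, y0, x0, z1, y1, x1):
    """Sum of vol[z0:z1, y0:y1, x0:x1] in O(1) via inclusion-exclusion."""
    return (svt[z1, y1, x1] - svt[z0, y1, x1] - svt[z1, y0, x1] - svt[z1, y1, x0]
            + svt[z0, y0, x1] + svt[z0, y1, x0] + svt[z1, y0, x0] - svt[z0, y0, x0])

vol = np.random.rand(64, 64, 64)          # e.g. a binary tumor segmentation mask
svt = summed_volume_table(vol)
assert np.isclose(box_sum(svt, 2, 3, 4, 10, 20, 30), vol[2:10, 3:20, 4:30].sum())
```

With tumor-fraction sums available in constant time per candidate box, an exhaustive 3D offset search becomes tractable.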
We present a frequency-domain system identification scheme based on barycentric interpolation and weight optimization. The scheme is related to the Adaptive Antoulas-Anderson (AAA) algorithm for model reduction, but uses an adaptive algorithm for selection of frequency points for interrogating the system response, as would be required in identification versus model reduction. The scheme is particularly suited for systems in which any one sinusoidal response run is long or expensive, and thus there is an incentive to reduce the total number of such runs. Two key features of our algorithm are the use of transient data from sinusoidal runs to optimize the barycentric weights, and automated next-frequency selection on an adaptive grid. Both are done with error criteria that are proxies for a system's $H^2$ and $H^\infty$ norms, respectively. Furthermore, the optimization problem we formulate is convex, and can optionally guarantee stability of the identified system. Computational results on a high-order, lightly damped structural system highlight the efficacy of this scheme.
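For reference, the barycentric form that underlies AAA-type methods represents the identified transfer function as (a sketch in standard notation; the paper's weight optimization and $H^2$/$H^\infty$ error proxies are not reproduced here):

\[
\widehat{H}(s) = \frac{\sum_{k=1}^{m} \dfrac{w_k\, h_k}{s - s_k}}{\sum_{k=1}^{m} \dfrac{w_k}{s - s_k}},
\qquad \widehat{H}(s_k) = h_k,
\]

where $s_k = j\omega_k$ are the interrogated frequency points, $h_k$ the measured frequency responses, and $w_k$ the barycentric weights. Interpolation at the support points holds for any nonzero weights, so the weights remain free parameters available for fitting the remaining data.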
Regular physiological monitoring of maternal and fetal parameters is indispensable for ensuring safe outcomes during pregnancy and parturition. Fetal electrocardiogram (fECG) assessment is crucial to detect fetal distress and developmental anomalies. Given the challenges of prenatal care arising from the shortage of medical professionals and limited accessibility, especially in remote and resource-poor areas, we develop a fECG monitoring system using novel non-contact electrodes (NCE) to record the fetal/maternal ECG (f/mECG) signals through clothes, thereby improving comfort during measurement. The system is designed to be incorporated into a maternity belt together with a data acquisition module, a data transmission module, and the novel NCEs. Thorough characterizations were carried out to evaluate the novel NCE against traditional wet electrodes (i.e., Ag/AgCl electrodes), showing comparable performance. A successful preliminary pilot feasibility study conducted with pregnant women (n = 10) between 25 and 32 weeks of gestation demonstrates the system's performance, usability and safety.
Ensuring the structural integrity of bridges is essential for maintaining infrastructure safety and promoting long-term sustainability. In this context, Indirect Structural Health Monitoring (ISHM) through drive-by bridge inspection emerges as a promising alternative to traditional inspection methods, offering a cost-effective and scalable solution by using vehicle-mounted sensors to assess the condition of bridges without requiring direct instrumentation. This study introduces the first purpose-built electric inspection vehicle specifically designed for drive-by bridge inspection. The autonomous platform is capable of maintaining a constant low speed and offers customisable operational parameters to maximise the accuracy and repeatability of indirect sensing, capabilities not achieved in previous studies. The vehicle is deployed within an ISHM framework and tested on two full-scale bridges to evaluate its effectiveness in capturing structural dynamic responses. Two unsupervised frameworks are then employed to analyse the collected data to identify features indicative of bridge properties and structural condition. The promising findings from this study demonstrate the practical feasibility of the approach. The study also shows the potential of ISHM as a viable tool for efficient bridge monitoring, contributing to the development of next-generation structural health monitoring systems that can enhance safety, optimise maintenance strategies, and support the longevity of critical infrastructure.
This article proposes a novel regularization method, named Geometric Spatio-Spectral Total Variation (GeoSSTV), for hyperspectral (HS) image denoising and destriping. HS images are inevitably affected by various types of noise due to the measurement equipment and environment. Total Variation (TV)-based regularization methods that model the spatio-spectral piecewise smoothness inherent in HS images are promising approaches for HS image denoising and destriping. However, existing TV-based methods are based on classical anisotropic and isotropic TVs, which cause staircase artifacts and lack rotation invariance, respectively, making it difficult to accurately recover round structures and oblique edges. To address this issue, GeoSSTV introduces a geometrically consistent formulation of TV that measures variations across all directions in a Euclidean manner. Through this formulation, GeoSSTV removes noise while preserving round structures and oblique edges. Furthermore, we formulate the HS image denoising problem as a constrained convex optimization problem involving GeoSSTV and develop an efficient algorithm based on a preconditioned primal-dual splitting method. Experimental results on HS images contaminated with mixed noise demonstrate the superiority of the proposed method over existing approaches.
The widespread use of uncrewed aerial vehicles (UAVs) has propelled the development of advanced techniques for countering unauthorized UAV flights. However, the resistance of legal UAVs to illegal interference remains under-addressed. This paper proposes a radiation-pattern-reconfigurable fluid antenna system (RPR-FAS)-empowered interference-resilient UAV communication scheme. This scheme integrates reconfigurable pixel antenna technology, which provides each antenna with an adjustable radiation pattern. Therefore, RPR-FAS can enhance the angular resolution of a UAV with a limited number of antennas, thereby improving spectral efficiency (SE) and interference resilience. Specifically, we first design dedicated radiation patterns adapted from 3GPP TR 38.901, where the beam direction and half-power beamwidth are tailored for UAV communications. Furthermore, we propose a low-storage-overhead orthogonal matching pursuit multiple measurement vectors algorithm, which accurately estimates the angle-of-arrival (AoA) of the communication link, even in the single-antenna case. In particular, by applying the Fourier transform to the radiation pattern gain matrix, we design a dimension-reduction technique that achieves a 1--2 order-of-magnitude reduction in storage requirements. Meanwhile, we propose a maximum likelihood interference AoA estimation method based on the law of large numbers, so that the SE can be further improved. Finally, alternating optimization is employed to obtain the optimal uplink radiation pattern and combiner, while an exhaustive search is applied to determine the optimal downlink pattern, complemented by the water-filling algorithm for beamforming. Comprehensive simulations demonstrate that the proposed schemes outperform traditional methods in terms of angular sensing precision and spectral efficiency.
Accurate medical image segmentation plays a crucial role in overall diagnosis and is one of the most essential tasks in the diagnostic pipeline. CNN-based models, despite their extensive use, suffer from a local receptive field and fail to capture the global context. A common approach that combines CNNs with transformers attempts to bridge this gap but fails to effectively fuse the local and global features. Recently emerged VLMs and foundation models have been adapted for downstream medical imaging tasks; however, they suffer from an inherent domain gap and high computational cost. To this end, we propose U-DFA, a unified DINOv2-Unet encoder-decoder architecture that integrates a novel Local-Global Fusion Adapter (LGFA) to enhance segmentation performance. LGFA modules inject spatial features from a CNN-based Spatial Pattern Adapter (SPA) module into frozen DINOv2 blocks at multiple stages, enabling effective fusion of high-level semantic and spatial features. Our method achieves state-of-the-art performance on the Synapse and ACDC datasets with only 33\% of the trainable model parameters. These results demonstrate that U-DFA is a robust and scalable framework for medical image segmentation across multiple modalities.
We present a distributed formation control strategy for multi-agent systems based only on rotation symmetry constraints. We propose a potential function that enforces inter-agent rotational symmetries, with its gradient defining the control law driving the agents toward a desired symmetric and planar configuration. We show that only $(n-1)$ edges, the minimal connectivity requirement, are sufficient to implement the control strategy, where $n$ is the number of agents. We further augment the design to address the maneuvering problem, enabling the formation to undergo coordinated translations, rotations, and scalings along a predefined virtual trajectory. Numerical simulations demonstrate the effectiveness and flexibility of the proposed method.
Accurate path loss (PL) prediction is crucial for successful network planning, antenna design, and performance optimization in wireless communication systems. Several conventional approaches for PL prediction have been adopted, but they have been demonstrated to lack flexibility and accuracy. In this work, we investigate the effectiveness of Machine Learning (ML) models in predicting PL, particularly for the sub-6 GHz band in a suburban campus of King Abdullah University of Science and Technology (KAUST). For training purposes, we generate synthetic datasets using the ray-tracing simulation technique. The feasibility and accuracy of the ML-based PL models are verified and validated using both synthetic and measurement datasets. The random forest regression (RFR) and the K-nearest neighbors (KNN) algorithms provide the best PL prediction accuracy compared to other ML models. In addition, we compare the performance of the developed ML-based PL models with the traditional propagation models, including COST-231 Hata, Longley-Rice, and Close-in models. The results show the superiority of the ML-based PL models compared to conventional models. Therefore, the ML approach using the ray-tracing technique can provide a promising and cost-effective solution for predicting and modeling radio wave propagation in various scenarios in a flexible manner.
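A minimal sketch of the two best-performing regressors on a synthetic stand-in dataset is shown below; the feature choice and data are illustrative assumptions, not the KAUST ray-tracing dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical per-link features a ray-tracing dataset might expose
# (e.g. frequency, 3-D distance, elevation difference, LOS indicator)
rng = np.random.default_rng(1)
X = rng.random((2000, 4))
y = 60 + 30 * np.log10(1 + 500 * X[:, 1]) + 3 * rng.standard_normal(2000)  # synthetic PL (dB)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in [("RFR", RandomForestRegressor(n_estimators=200, random_state=0)),
                    ("KNN", KNeighborsRegressor(n_neighbors=7))]:
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:.2f} dB")
```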
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
Radio frequency interference (RFI) poses a growing challenge to satellite communications, particularly in uplink channels of Low Earth Orbit (LEO) systems, due to increasing spectrum congestion and uncertainty in the location of terrestrial interferers. This paper addresses the impact of RFI source position uncertainty on beamforming-based interference mitigation. First, we analytically characterize how geographic uncertainty in RFI location translates into angular deviation as observed from the satellite. Building on this, we propose a robust null-shaping framework to increase resilience in the communication links by incorporating the probability density function (PDF) of the RFI location uncertainty into the beamforming design via stochastic optimization. This allows adaptive shaping of the antenna array's nulling pattern to enhance interference suppression under uncertainty. Extensive Monte Carlo simulations, incorporating realistic satellite orbital dynamics and various RFI scenarios, demonstrate that the proposed approach achieves significantly improved mitigation performance compared to conventional deterministic designs.
A method of simulating a single-input single-output reconfigurable intelligent surface (RIS) assisted channel is presented using three channel black boxes to represent the direct signal path, the transmit path to the RIS, and the reflected path from the RIS. The complex coefficients for each channel box are obtained by ray tracing in a scenario with geographic terrain information that also contains approximate building shapes. The electrical characteristics of the ground and building walls are also accounted for in the ray-tracing function. Simulations were conducted with reflected rays only and with reflected rays together with diffracted rays. The received power exhibits variations typical of multipath fading environments. In the best locations, the RIS-assisted channel simulation results agree well with theoretical models, with performance increasing with the square of the RIS size as the number of RIS elements grows. In the simplified theoretical model where the transmitter and receiver are in line and the RIS is placed orthogonally at a distance much smaller than the transmitter-receiver separation, the simulation results also corroborate that the best deployment is close to the transmitter or the receiver, with a U-shaped performance drop between them.
Next-generation wireless networks require intelligent traffic prediction to enable autonomous resource management and handle diverse, dynamic service demands. The Open Radio Access Network (O-RAN) framework provides a promising foundation for embedding machine learning intelligence through its disaggregated architecture and programmable interfaces. This work applies a Neural Architecture Search (NAS)-based framework that dynamically selects and orchestrates efficient Long Short-Term Memory (LSTM) architectures for traffic prediction in O-RAN environments. Our approach leverages the O-RAN paradigm by separating architecture optimisation (via non-RT RIC rApps) from real-time inference (via near-RT RIC xApps), enabling adaptive model deployment based on traffic conditions and resource constraints. Experimental evaluation across six LSTM architectures demonstrates that lightweight models achieve $R^2 \approx 0.91$--$0.93$ with high efficiency for regular traffic, while complex models reach near-perfect accuracy ($R^2 = 0.989$--$0.996$) during critical scenarios. Our NAS-based orchestration achieves a 70-75\% reduction in computational complexity compared to static high-performance models, while maintaining high prediction accuracy when required, thereby enabling scalable deployment in real-world edge environments.
Buildings represent a promising flexibility source to support the integration of renewable energy sources, as they may shift their heating energy consumption over time without impacting users' comfort. However, a building's predicted flexibility potential is based on uncertain ambient weather forecasts and a typically inaccurate building thermal model. Hence, this paper presents an uncertainty-aware flexibility quantifier using a chance-constrained formulation. Because such a quantifier may be conservative, we additionally model real-time feedback in the quantification, in the form of affine feedback policies. Such adaptation can take the form of intra-day trades or rebound around the flexibility provision period. To assess the flexibility quantification formulations, we further assume that flexible buildings participate in secondary frequency control markets. The results show some increase in flexibility and revenues when introducing affine feedback policies. Additionally, it is demonstrated that accounting for uncertainties in the flexibility quantification is necessary, especially when intra-day trades are not available. Even though an uncertainty-ignorant potential may seem financially profitable in secondary frequency control markets, it comes at the cost of significant thermal discomfort for inhabitants. Hence, we suggest a comfort-preserving approach, aiming to truly reflect thermal discomfort on the economic flexibility revenue, to obtain a fairer comparison.
The growing complexity of wireless systems has accelerated the move from traditional methods to learning-based solutions. Graph Neural Networks (GNNs) are especially well-suited here, since wireless networks can be naturally represented as graphs. A key property of GNNs is transferability: models trained on one graph often generalize to much larger graphs with little performance loss. While empirical studies have shown that GNN-based wireless policies transfer effectively, existing theoretical guarantees do not capture this phenomenon. Most works focus on dense graphs where node degrees scale with network size, an assumption that fails in wireless systems. In this work, we provide a formal theoretical foundation for transferability on Random Geometric Graphs (RGGs), a sparse and widely used model of wireless networks. We further validate our results through numerical experiments on power allocation, a fundamental resource management task.
Interactive Imitation Learning deals with training a novice policy from expert demonstrations in an online fashion. The established DAgger algorithm trains a robust novice policy by alternating between interacting with the environment and retraining the network. Many variants exist that differ in how they discern whether to let the novice act or return control to the expert. We propose the use of stochastic reachtubes, common in the verification of dynamical systems, as a novel method for estimating the necessity of expert intervention. Our approach does not require fine-tuning of decision thresholds per environment and effectively reduces the number of expert interventions, especially when compared with related approaches that make use of a doubt classification model.
Acoustic-to-articulatory inversion has often been limited to a small part of the vocal tract because the data are generally EMA (ElectroMagnetic Articulography) data, which require sensors to be glued to easily accessible articulators. The presented acoustic-to-articulatory model focuses on inversion of the entire vocal tract, from the glottis through the complete tongue and the velum to the lips. It relies on a real-time dynamic MRI database of more than 3 hours of speech. The data are the denoised speech signal and the automatically segmented articulator contours. Several bidirectional LSTM-based approaches have been used, either inverting each articulator individually or inverting all articulators simultaneously. To our knowledge, this is the first complete inversion of the vocal tract. The average RMSE on the test set is 1.65 mm, which compares with the pixel size of 1.62 mm.
Distributed multichannel active noise control (DMCANC) systems assign the high computational load of conventional centralized algorithms across multiple processing nodes, leveraging inter-node communication to collaboratively suppress unwanted noise. However, communication overhead can undermine algorithmic stability and degrade overall performance. To address this challenge, we propose a robust communication framework that integrates adaptive-fixed-filter switching and the mixed-gradient combination strategy. In this approach, each node independently executes a single-channel filtered reference least mean square (FxLMS) algorithm while monitoring real-time noise reduction levels. When the current noise reduction performance degrades compared to the previous state, the node halts its adaptive algorithm, switches to a fixed filter, and simultaneously initiates a communication request. The exchanged information comprises the difference between the current control filter and the filter at the time of the last communication, equivalent to the accumulated gradient sum during non-communication intervals. Upon receiving neighboring cumulative gradients, the node employs a mixed-gradient combination method to update its control filter, subsequently reverting to the adaptive mode. This proactive communication strategy and adaptive-fixed switching mechanism ensure system robustness by mitigating instability risks caused by communication issues. Simulations demonstrate that the proposed method achieves noise reduction performance comparable to centralized algorithms while maintaining stability under communication constraints, highlighting its practical applicability in real-world distributed ANC scenarios.
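The per-node adaptation is the standard single-channel FxLMS update; a minimal sketch (ours, not the authors' implementation) with a comment indicating how the accumulated-gradient exchange fits in:

```python
import numpy as np

def fxlms_step(w, xf_buf, e, mu=1e-3):
    """One filtered-reference LMS (FxLMS) update for a single-channel node.
    w      : control filter taps
    xf_buf : reference samples filtered by the secondary-path estimate
    e      : error-microphone sample
    Returns the updated filter taps (gradient step on the squared error)."""
    return w - mu * e * xf_buf

w = np.zeros(64)                 # control filter
xf = np.random.randn(64)         # example filtered-reference buffer
w = fxlms_step(w, xf, e=0.3)

# In the proposed scheme (sketch), each node tracks (w_now - w_last_sent),
# i.e. the gradient sum accumulated since its last communication, and only
# shares that difference with neighbours when its noise reduction degrades.
```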
Digital control has become increasingly widespread in modern power electronic converters. When acquiring feedback signals such as the inductor current, synchronizing the analog-to-digital converter (ADC) with the digital pulse-width modulator (DPWM) is commonly employed to accurately track their steady-state average. However, the small-signal implications of such synchronization have not been investigated. This paper presents an exact small-signal model for digitally controlled buck converters operating in forced continuous-conduction mode (FCCM) under constant-frequency current-mode control, explicitly accounting for DPWM-ADC synchronization. Using a sampled-data framework, the proposed model captures all sideband effects introduced by the sampling process, yielding precise predictions of both analog and digital loop gains, even at frequencies beyond the switching and sampling frequencies. Both asymmetrical and symmetrical carrier modulations are considered. Furthermore, the digital loop gain is derived in closed form using the modified z-transform, enabling low-complexity compensator design and stability assessment. Within this framework, the analog loop gain can be directly obtained from the digital loop gain, thereby eliminating the need for computationally intensive infinite series evaluations. The validity of the proposed model is confirmed through both simulation and experimental results.
The CL-UZH team submitted one system each for the fixed and open conditions of the NIST SRE 2024 challenge. For the closed-set condition, the audio-only results were obtained with an X-vector system developed with Kaldi. For the audio-visual results, we used only models developed for the visual modality. Two sets of results were submitted for the open-set and closed-set conditions: one based on a model pretrained on the VoxBlink2 and VoxCeleb2 datasets, and, for the closed set, an X-vector-based model trained from scratch on the CTS superset dataset. In addition to submitting the results of the SRE24 evaluation to the competition website, we discuss the performance of the proposed systems on the SRE24 evaluation in this report.
Vehicle data is essential for advancing data-driven development throughout the automotive lifecycle, including requirements engineering, design, verification and validation, and post-deployment optimization. Developers currently collect data in a decentralized and fragmented manner across simulations, test benches, and real-world driving. This fragmentation results in data silos, inconsistent storage structures, and limited interoperability, leading to redundant data collection, inefficient integration, and suboptimal use of data. To address these challenges, this article presents a structured literature review and develops an inductive taxonomy for automotive data. This taxonomy categorizes data according to its sources and applications, improving data accessibility and utilization. The analysis reveals a growing emphasis on real-world driving and machine learning applications while highlighting a critical gap in data availability for requirements engineering. By providing a systematic framework for structuring automotive data, this research contributes to more efficient data management and improved decision-making in the automotive industry.
For streaming speech recognition, Transformer-based encoders with block processing have been widely used. Although many studies have addressed reducing the emission latency of transducers, little work has explored reducing the encoding latency of block processing. We seek to reduce latency by emitting chunks frequently with a small shift rather than emitting large chunks infrequently, which, however, increases computational cost. To compute efficiently with a small chunk shift, we propose a new encoder, Spiralformer, tailored for block processing by combining layer dropping and early exiting. We skip layer computation in a cyclic manner and spirally shift the set of computed layers in each block, so that all layers are computed over the course of block processing. Experimentally, our method achieves a 21.6% reduction in averaged token emission delay on Librispeech, and 7.0% on CSJ, compared with a baseline of similar computational cost and word error rate.
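A minimal sketch of the cyclic layer schedule follows; the layer count and skip period are illustrative assumptions, not the paper's configuration.

    # Spiral schedule sketch: each block computes only a cyclically shifted subset
    # of encoder layers, so all layers are refreshed over consecutive blocks.
    num_layers, period = 12, 3

    def layers_for_block(block_idx):
        offset = block_idx % period
        return [l for l in range(num_layers) if l % period == offset]

    for b in range(4):
        print(f"block {b}: layers {layers_for_block(b)}")
    # block 0: layers [0, 3, 6, 9]; block 1: layers [1, 4, 7, 10]; ...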
Electric autonomous mobility-on-demand (EAMoD) systems are emerging all over the world. However, their potential swarm charging at depots may degrade power system operation, which in turn affects the EAMoD system's own optimal operation. To prevent this latent risk, we develop a real-time coordination framework for the EAMoD system and the power system. First, the temporal-spatial characteristics of EAMoD fleets, including serving trips, repositioning, and charging, are fully described based on a Markov decision process model. Second, the charger accessibility of EAMoD depot charging is modeled according to real-world configurations, including both fast and slow charging piles. Third, the power system regulation model provides real-time charging regulation constraints for the EAMoD system to prevent potential overload and undervoltage. To address the poor solution quality attributed to the complex decision space of the EAMoD system, this paper proposes a piecewise-linear approximate dynamic programming algorithm combined with model predictive control. Numerical experiments on the Manhattan network and a 14-node power distribution network validate the effectiveness of the proposed algorithm and underscore the necessity of system coordination.
In the last decade, charging service providers have emerged alongside the growing prevalence of electric vehicles. These providers need to optimize their charging prices strategically to improve profits while accounting for the operating conditions of the coupled power-transportation network. However, the optimal pricing problem generally involves a user equilibrium model, which leads to a mathematical program with equilibrium constraints. As a result, the pricing problem is non-convex and computationally intractable, especially for large-scale networks. To address this challenge, we propose a generalized sensitivity analysis approach for optimal pricing of electric vehicle charging on the coupled power-transportation network. Specifically, we adopt a sensitivity analysis to capture, in gradient form, the best response of charging demand to charging prices. Consequently, charging service providers can make pricing decisions based on this gradient information instead of the conventional KKT conditions of the user equilibrium model. We then propose a tailored gradient descent algorithm to solve the whole pricing problem. A mathematical proof of validity is given, and the time complexity of the proposed algorithm is shown to be polynomial. Numerical experiments on networks of different scales verify the computational efficiency of the proposed algorithm, indicating its potential for evaluating the impact of optimal pricing on the operational performance of large-scale coupled power-transportation networks.
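To make the pricing loop concrete, a minimal sketch follows; the demand-response model and its Jacobian are placeholders standing in for the user-equilibrium sensitivity analysis, and all numbers are illustrative.

    import numpy as np

    def demand_and_sensitivity(p):
        # Placeholder for the user-equilibrium response: charging demand d(p) and
        # its Jacobian dd/dp, which the paper obtains via sensitivity analysis.
        d = np.maximum(100.0 - 2.0 * p, 0.0)
        J = np.diag(np.where(d > 0, -2.0, 0.0))
        return d, J

    def profit_gradient(p, cost=0.1):
        d, J = demand_and_sensitivity(p)
        return d + J.T @ (p - cost)      # gradient of (p - cost) . d(p)

    p = np.full(3, 0.5)                  # initial prices at three stations (illustrative)
    for _ in range(200):
        p = np.clip(p + 1e-3 * profit_gradient(p), 0.0, 50.0)   # projected gradient ascent
    print(np.round(p, 3))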
A model-based extended state observer (MB-ESO) and its variant are proposed for discrete-time linear multivariable systems, where multiple disturbances are defined as an extended state vector in the same manner as in the original formulation of the ESO. The variant MB-ESO extends the MB-ESO to cases where the disturbance gain matrix is non-diagonal. Leveraging the connection between the variant MB-ESO and the well-known unknown input observer (UIO), the condition for the existence of an MB-ESO and its variant in multivariable systems is established for the first time: no invariant zeros exist between the disturbances and the plant outputs. It is shown that, with the observer eigenvalues all placed at the origin and the subsystems decoupled, the variant MB-ESO produces disturbance estimates identical to those of the UIO. Moreover, the error characteristics of the MB-ESO and its variant are analyzed, and the transfer functions associated with the disturbance estimation errors are derived. It is demonstrated both mathematically and in simulations that the disturbance estimation error of the MB-ESO decreases monotonically with respect to both the observer eigenvalues and time.
Dual-system UAVs with vertical take-off and landing capabilities have become increasingly popular in recent years. As safety-critical systems, dual-system UAVs must be able to maintain safe flight after faults or failures occur. This paper proposes a gain-scheduled passive fault-tolerant control (PFTC) method for the transition flight of dual-system UAVs. In this novel FTC design method, the model uncertainties arising from the loss of control effectiveness caused by actuator faults/failures are, for the first time, treated as model input uncertainty, allowing them to be represented by multiplicative uncertainty descriptions. The main advantage of the proposed method is that it significantly reduces the number of design points, thereby simplifying the control synthesis process and improving the efficiency of FTC design for dual-system UAV transition flight compared with existing FTC design methods. As a general method, it can be applied to the design of FTC systems with multiple uncertain parameters and multiple channels. The developed passive FTC system is validated on a nonlinear six-degree-of-freedom simulator. The simulation results demonstrate that the gain-scheduled structured H-infinity (GS SHIF) PFTC system provides superior fault-tolerance performance compared with LQR and structured H-infinity control systems, showcasing the effectiveness and advantages of the proposed GS SHIF PFTC approach.
The rapid expansion of data center infrastructure is reshaping power system dynamics by significantly increasing electricity demand while also offering potential for fast and controllable flexibility. To ensure reliable operation under such conditions, the frequency secured unit commitment problem must be solved with enhanced modeling of demand side frequency response. In this work, we propose a data-driven linearization framework based on decision tree based constraint learning to embed nonlinear nadir frequency constraints into mixed-integer linear programming. This approach enables tractable optimization of generation schedules and fast frequency response from data centers. Through case studies on both a benchmark system and a 2030 future scenario with higher DC penetration, we demonstrate that increasing the proportion of flexible DC load consistently improves system cost efficiency and supports renewable integration. However, this benefit exhibits diminishing marginal returns, motivating the introduction of the Marginal Flexibility Value metric to quantify the economic value of additional flexibility. The results highlight that as DCs become a larger share of system load, their active participation in frequency response will be increasingly indispensable for maintaining both economic and secure system operations.
This paper introduces a predictive control barrier function (PCBF) framework for enforcing state constraints in discrete-time systems with unknown relative degree, which can be caused by input delays or unmodeled input dynamics. Existing discrete-time CBF formulations typically require the construction of auxiliary barrier functions when the relative degree is greater than one, which complicates implementation and may yield conservative safe sets. The proposed PCBF framework addresses this challenge by extending the prediction horizon to construct a CBF for an associated system with relative degree one. As a result, the superlevel set of the PCBF coincides with the safe set, simplifying constraint enforcement and eliminating the need for auxiliary functions. The effectiveness of the proposed method is demonstrated on a discrete-time double integrator with input delay and a bicopter system with position constraints.
The Graph Fourier Transform (GFT) has recently demonstrated promising results in speech enhancement. However, existing GFT-based speech enhancement approaches often employ fixed graph topologies to build the graph Fourier basis, whose representation lacks adaptivity and flexibility. In addition, they suffer from the numerical errors and instability introduced by matrix inversion in GFT based on either Singular Value Decomposition (GFT-SVD) or Eigenvalue Decomposition (GFT-EVD). Motivated by these limitations, this paper proposes a simple yet effective learnable GFT-SVD framework for speech enhancement. Specifically, we leverage graph shift operators to construct a learnable graph topology and define a learnable graph Fourier basis from the singular value matrices using 1-D convolutional (Conv-1D) neural layers. This eliminates the need for matrix inversion, thereby avoiding the associated numerical errors and stability problems.
This paper addresses the challenge of synthesizing safety-critical controllers for high-order nonlinear systems, where constructing valid Control Barrier Functions (CBFs) remains computationally intractable. Leveraging layered control, we design CBFs in reduced-order models (RoMs) while regulating full-order models' (FoMs) dynamics at the same time. Traditional Lyapunov tracking functions are required to decrease monotonically, but systematic synthesis methods for such functions exist only for fully-actuated systems. To overcome this limitation, we introduce Recurrent Tracking Functions (RTFs), which replace the monotonic decay requirement with a weaker finite-time recurrence condition. This relaxation permits transient deviations of tracking errors while ensuring safety. By augmenting CBFs for RoMs with RTFs, we construct recurrent CBFs (RCBFs) whose zero-superlevel set is control $\tau$-recurrent, and guarantee safety for all initial states in such a set when RTFs are satisfied. We establish theoretical safety guarantees and validate the approach through numerical experiments, demonstrating RTFs' effectiveness and the safety of FoMs.
This paper examines how musical symbolism is produced and circulated in online communities by combining content-based music analysis with a lightweight network perspective on lyrics. Using a curated corpus of 275 chart-topping songs enriched with audio descriptors (energy, danceability, loudness, liveness, valence, acousticness, speechiness, popularity) and full lyric transcripts, we build a reproducible pipeline that (i) quantifies temporal trends in sonic attributes, (ii) models lexical salience and co-occurrence, and (iii) profiles mood by genre. We find a decade-long decline in energy (79 -> 58) alongside a rise in danceability (59 -> 73); valence peaks in 2013 (63) and dips in 2014-2016 (42) before partially recovering. Correlation analysis shows strong coupling of energy with loudness (r = 0.74) and negative associations for acousticness with both energy (r = -0.54) and loudness (r = -0.51); danceability is largely orthogonal to other features (|r| < 0.20). Lyric tokenization (>114k tokens) reveals a pronoun-centric lexicon "I/you/me/my" and a dense co-occurrence structure in which interpersonal address anchors mainstream narratives. Mood differs systematically by style: R&B exhibits the highest mean valence (96), followed by K-Pop/Pop (77) and Indie/Pop (70), whereas Latin/Reggaeton is lower (37) despite high danceability. Read through a subcultural identity lens, these patterns suggest the mainstreaming of previously peripheral codes and a commercial preference for relaxed yet rhythmically engaging productions that sustain collective participation without maximal intensity. Methodologically, we contribute an integrated MIR-plus-network workflow spanning summary statistics, correlation structure, lexical co-occurrence matrices, and genre-wise mood profiling that is robust to modality sparsity and suitable for socially aware recommendation or community-level diffusion studies.
Deep learning systems often struggle with processing long sequences, where computational complexity can become a bottleneck. Current methods for automated dementia detection using speech frequently rely on static, time-agnostic features or aggregated linguistic content, lacking the flexibility to model the subtle, progressive deterioration inherent in speech production. These approaches often miss the dynamic temporal patterns that are critical early indicators of cognitive decline. In this paper, we introduce TAI-Speech, a Temporal Aware Iterative framework that dynamically models spontaneous speech for dementia detection. The flexibility of our method is demonstrated through two key innovations: 1) Optical Flow-inspired Iterative Refinement: By treating spectrograms as sequential frames, this component uses a convolutional GRU to capture the fine-grained, frame-to-frame evolution of acoustic features. 2) Cross-Attention Based Prosodic Alignment: This component dynamically aligns spectral features with prosodic patterns, such as pitch and pauses, to create a richer representation of speech production deficits linked to functional decline (IADL). TAI-Speech adaptively models the temporal evolution of each utterance, enhancing the detection of cognitive markers. Experimental results on the DementiaBank dataset show that TAI-Speech achieves a strong AUC of 0.839 and 80.6\% accuracy, outperforming text-based baselines without relying on ASR. Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.
There is high demand for audio-visual editing in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present \textbf{Object-AVEdit}, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object-level editing effects, we propose a holistically optimized inversion-regeneration editing algorithm, ensuring both information retention during inversion and better regeneration quality. Extensive experiments demonstrate that our editing model achieves strong results in both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves strong performance. More results are available on our project page: this https URL.
Sleep apnea is a common and serious sleep-related breathing disorder that can impact health if left untreated. Currently, the traditional method for screening and diagnosis is overnight polysomnography, which is expensive and time-consuming and is not practical for screening large groups of people. In this paper, we explored a more accessible option: using respiratory audio recordings to spot signs of sleep apnea. We utilized 18 audio recordings. Our approach involved converting breathing sounds into spectrograms, balancing the dataset by oversampling apnea segments, and applying class weights to reduce bias toward the majority class. The model reached a recall of 90.55% for apnea detection, intentionally prioritizing catching apnea events over overall accuracy. Despite low precision, the high recall suggests potential as a low-cost screening tool that could be used at home or in basic clinical setups, potentially helping identify at-risk individuals much earlier.
Hyperspectral anomaly detection (HAD), a crucial approach for many civilian and military applications, seeks to identify pixels with spectral signatures that are anomalous relative to a preponderance of background signatures. Significant effort has been made to improve HAD techniques, but challenges arise due to complex real-world environments and, by definition, limited prior knowledge of potential signatures of interest. This paper introduces a novel HAD method by proposing a transport-based mathematical model to describe the pixels comprising a given hyperspectral image. In this approach, hyperspectral pixels are viewed as observations of a template pattern undergoing unknown deformations, which enables their representation in the signed cumulative distribution transform (SCDT) domain. An unsupervised subspace modeling technique is then used to construct a model of abundant background signals in this domain, whereupon anomalous signals are detected as deviations from the learned model. Comprehensive evaluations across five distinct datasets illustrate the superiority of our approach compared to state-of-the-art methods.
Nonlinear Model Predictive Control (NMPC) is a precise controller, but its heavy computational load often prevents application in robotic systems. Some studies have attempted to approximate NMPC using deep neural networks (NMPC-DNN). However, in the presence of unexpected disturbances or when operating conditions differ from training data, this approach lacks robustness, leading to large tracking errors. To address this issue, for the first time, the NMPC-DNN output is combined with a PI controller (Hybrid NMPC-DNN-PI). The proposed controller is validated by applying it to an exoskeleton robot during squat movement, which has a complex dynamic model and has received limited attention regarding robust nonlinear control design. A human-robot dynamic model with three active joints (ankle, knee, hip) is developed, and more than 5.3 million training samples are used to train the DNN. The results show that, under unseen conditions for the DNN, the tracking error in Hybrid NMPC-DNN-PI is significantly lower compared to NMPC-DNN. Moreover, human joint torques are greatly reduced with the use of the exoskeleton, with RMS values for the studied case reduced by 30.9%, 41.8%, and 29.7% at the ankle, knee, and hip, respectively. In addition, the computational cost of Hybrid NMPC-DNN-PI is 99.93% lower than that of NMPC.
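A minimal sketch of the hybrid control law is given below; the gains, joint ordering, and the stand-in network are illustrative assumptions, not the paper's identified values.

    import numpy as np

    Kp = np.diag([40.0, 60.0, 35.0])     # illustrative PI gains (ankle, knee, hip)
    Ki = np.diag([5.0, 8.0, 4.0])
    integral = np.zeros(3)

    def dnn_policy(q, q_ref):
        # Placeholder for the trained NMPC-approximating network.
        return np.zeros(3)

    def hybrid_control(q, q_ref, dt):
        """u = NMPC-DNN feedforward + PI correction of the residual tracking error."""
        global integral
        e = q_ref - q
        integral += e * dt
        return dnn_policy(q, q_ref) + Kp @ e + Ki @ integral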
Quantum network protocol development is crucial to realizing a production-grade network that can support distributed sensing, secure communication, and utility-scale quantum computation. However, the transition from laboratory demonstration to deployable networks requires software implementations of architectures and protocols tailored to the unique constraints of quantum systems. This paper reviews the current state of software implementations for quantum networks, organized around the three-plane abstraction of infrastructure, logical, and control/service planes. We cover software for both designing quantum network protocols (e.g., SeQUeNCe, QuISP, and NetSquid) and operating them, with a focus on essential control/service plane functions such as entanglement, topology, and resource management, in a proposed taxonomy. Our review highlights a persistent gap between theoretical protocol proposals and their realization in simulators or testbeds, particularly in dynamic topology and network management. We conclude by outlining open challenges and proposing a roadmap for developing scalable software architectures to enable hybrid, large-scale quantum networks.
Autonomous inspection systems are essential for ensuring the performance and longevity of industrial assets. Recently, agentic frameworks have demonstrated significant potential for automating inspection workflows but have been limited to digital tasks. Their application to physical assets in real-world environments, however, remains underexplored. In this work, our contributions are two-fold: first, we propose a hierarchical agentic framework for autonomous drone control, and second, a reasoning methodology for individual function executions which we refer to as ReActEval. Our framework focuses on visual inspection tasks in indoor industrial settings, such as interpreting industrial readouts or inspecting equipment. It employs a multi-agent system comprising a head agent and multiple worker agents, each controlling a single drone. The head agent performs high-level planning and evaluates outcomes, while worker agents implement ReActEval to reason over and execute low-level actions. Operating entirely in natural language, ReActEval follows a plan, reason, act, evaluate cycle, enabling drones to handle tasks ranging from simple navigation (e.g., flying forward 10 meters and land) to complex high-level tasks (e.g., locating and reading a pressure gauge). The evaluation phase serves as a feedback and/or replanning stage, ensuring actions align with user objectives while preventing undesirable outcomes. We evaluate the framework in a simulated environment with two worker agents, assessing performance qualitatively and quantitatively based on task completion across varying complexity levels and workflow efficiency. By leveraging natural language processing for agent communication, our approach offers a novel, flexible, and user-accessible alternative to traditional drone-based solutions, enabling autonomous problem-solving for industrial inspection without extensive user intervention.
Cellular sheaves and sheaf Laplacians provide a far-reaching generalization of graphs and graph Laplacians, resulting in a wide array of applications ranging from machine learning to multi-agent control. In the context of multi-agent systems, so called coordination sheaves provide a unifying formalism that models heterogeneous agents and coordination goals over undirected communication topologies, and applying sheaf diffusion drives agents to achieve their coordination goals. Existing literature on sheaf diffusion assumes that agents can communicate and compute updates synchronously, which is an unrealistic assumption in many scenarios where communication delays or heterogeneous agents with different compute capabilities cause disagreement among agents. To address these challenges, we introduce asynchronous nonlinear sheaf diffusion. Specifically, we show that under mild assumptions on the coordination sheaf and bounded delays in communication and computation, nonlinear sheaf diffusion converges to a minimizer of the Dirichlet energy of the coordination sheaf at a linear rate proportional to the delay bound. We further show that this linear convergence is attained from arbitrary initial conditions and the analysis depends on the spectrum of the sheaf Laplacian in a manner that generalizes the standard graph Laplacian case. We provide several numerical simulations to validate our theoretical results.
Vocal dereverberation remains a challenging task in audio processing, particularly for real-time applications where both accuracy and efficiency are crucial. Traditional deep learning approaches often struggle to suppress reverberation without degrading vocal clarity, while recent methods that jointly predict magnitude and phase have significant computational cost. We propose a real-time dereverberation framework based on residual mask prediction in the short-time Fourier transform (STFT) domain. A U-Net architecture is trained to estimate a residual reverberation mask that suppresses late reflections while preserving direct speech components. A hybrid objective combining binary cross-entropy, residual magnitude reconstruction, and time-domain consistency further encourages both accurate suppression and perceptual quality. Together, these components enable low-latency dereverberation suitable for real-world speech and singing applications.
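A compact sketch of the inference path under stated assumptions: the mask predictor is a stub for the U-Net, the STFT frame size is illustrative, and the noisy phase is reused for reconstruction.

    import numpy as np

    def predict_residual_mask(mag):
        # Stub for the U-Net: per-bin suppression of late reverberation in [0, 1].
        return np.clip(1.0 - 0.3 * (mag < np.median(mag)), 0.0, 1.0)

    def dereverb_frame(noisy_stft_frame):
        """Apply the predicted residual mask to the magnitude and reuse the noisy phase."""
        mag, phase = np.abs(noisy_stft_frame), np.angle(noisy_stft_frame)
        mask = predict_residual_mask(mag)
        return mag * mask * np.exp(1j * phase)

    # Example on a random complex STFT frame (257 frequency bins for a 512-point FFT):
    frame = np.random.randn(257) + 1j * np.random.randn(257)
    enhanced = dereverb_frame(frame)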
Bound states in the continuum (BICs) have emerged as a revolutionary paradigm in terahertz (THz) photonics, enabling metasurfaces with theoretically infinite quality factors (Q-factors) and unprecedented light-matter control. This review synthesizes a decade of progress in THz-BIC research, tracing the evolution from foundational symmetry-protected designs to application-optimized quasi-BICs. We dissect the multipolar origins, topological robustness, and symmetry-breaking strategies underpinning high-Q resonances, alongside computational frameworks for predictive design. The timeline highlights key milestones: early dielectric metasurfaces with high Q-factors, flexible biosensors achieving microgram detection limits, and Kerker-conditioned gas spectrometers reducing path lengths by a few orders of magnitude. Emerging frontiers in reconfigurable MEMS-BICs and chiral quantum photonics are critically evaluated. Despite these breakthroughs, scalability barriers persist for 6G integration, including nano-fabrication tolerances, material-loss trade-offs, and dynamic-control gaps. This review establishes BIC metasurfaces as pivotal enablers of compact, high-efficiency THz technologies poised to bridge the gap between fundamental discovery and the commercialization of THz-based 6G communication and MedTech.
Large Language models (LLMs) have shown promise as generators of symbolic control policies, producing interpretable program-like representations through iterative search. However, these models are not capable of separating the functional structure of a policy from the numerical values it is parametrized by, thus making the search process slow and inefficient. We propose a hybrid approach that decouples structural synthesis from parameter optimization by introducing an additional optimization layer for local parameter search. In our method, the numerical parameters of LLM-generated programs are extracted and optimized numerically to maximize task performance. With this integration, an LLM iterates over the functional structure of programs, while a separate optimization loop is used to find a locally optimal set of parameters accompanying candidate programs. We evaluate our method on a set of control tasks, showing that it achieves higher returns and improved sample efficiency compared to purely LLM-guided search. We show that combining symbolic program synthesis with numerical optimization yields interpretable yet high-performing policies, bridging the gap between language-model-guided design and classical control tuning. Our code is available at this https URL.
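A toy sketch of the decoupling follows; the policy string, stand-in environment, and optimizer settings are all illustrative assumptions. The LLM proposes a program structure, its numeric literals are extracted as a parameter vector, and a local optimizer tunes them against the task return.

    import re
    import numpy as np
    from scipy.optimize import minimize

    policy_src = "u = -10.0 * angle + -2.0 * angular_velocity"   # LLM-proposed structure
    theta0 = np.array([float(v) for v in re.findall(r"-?\d+\.\d+", policy_src)])

    def rollout_return(theta):
        # Toy stand-in environment: stabilize a linearized pendulum.
        p0, p1 = theta
        angle, vel, ret = 0.3, 0.0, 0.0
        for _ in range(200):
            u = p0 * angle + p1 * vel
            vel += (9.8 * angle + u) * 0.02
            angle += vel * 0.02
            ret -= angle ** 2 + 0.01 * u ** 2
        return ret

    # Structure stays fixed; only the numeric parameters are optimized locally.
    res = minimize(lambda t: -rollout_return(t), theta0, method="Nelder-Mead")
    print("tuned parameters:", np.round(res.x, 2))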
With the rapid growth of intelligent services, communication targets are shifting from humans to artificial intelligence (AI) agents, which require new paradigms to enable real-time perception, decision-making, and collaboration. Semantic communication, which conveys task-relevant meaning rather than raw data, offers a promising solution. However, its practical deployment remains constrained by dynamic environments and limited resources. To address these issues, this article proposes a semantic-driven AI agent communication framework and develops three enabling techniques. First, semantic adaptation transmission applies fine-tuning with real or generative samples to efficiently adapt models to varying environments. Second, semantic lightweight transmission incorporates pruning, quantization, and perception-aware sampling to reduce model complexity and alleviate the computational burden on edge agents. Third, semantic self-evolution control employs distributed hierarchical decision-making to optimize multi-dimensional resources, enabling robust multi-agent collaboration in dynamic environments. Simulation results show that the proposed solutions achieve faster convergence and stronger robustness, while the proposed distributed hierarchical optimization method significantly outperforms conventional decision-making schemes, highlighting its potential for AI agent communication networks.
We propose the multistep port-Hamiltonian Gaussian process (MS-PHS GP) to learn physically consistent continuous-time dynamics and a posterior over the Hamiltonian from noisy, irregularly-sampled trajectories. By placing a GP prior on the Hamiltonian surface $H$ and encoding variable-step multistep integrator constraints as finite linear functionals, MS-PHS GP enables closed-form conditioning of both the vector field and the Hamiltonian surface without latent states, while enforcing energy balance and passivity by design. We state a finite-sample vector-field bound that separates the estimation and variable-step discretization terms. Lastly, we demonstrate improved vector-field recovery and well-calibrated Hamiltonian uncertainty on mass-spring, Van der Pol, and Duffing benchmarks.
Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling significantly degrade quality, while recently proposed Byte Pair Encoding (BPE) methods - though effective on single-track piano data - suffer large performance drops in multi-track settings, as revealed by our analysis. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), adapted to music's structured symbolic representation, achieving about 30% inference speedup with only a negligible (about 0.4%) quality drop in objective evaluations and slight improvements in subjective listening tests. Our main contributions are (1) the first systematic study of BPE's generalizability in multi-track symbolic music, and (2) the introduction of AS-KVHS for low-latency symbolic music generation. Beyond these, we also release SAGE-Music, an open-source benchmark that matches or surpasses state-of-the-art models in generation quality.
This paper studies the adversarial robustness of conformal novelty detection. In particular, we focus on AdaDetect, a powerful learning-based framework for novelty detection with finite-sample false discovery rate (FDR) control. While AdaDetect provides rigorous statistical guarantees under benign conditions, its behavior under adversarial perturbations remains unexplored. We first formulate an oracle attack setting that quantifies the worst-case degradation of FDR, deriving an upper bound that characterizes the statistical cost of attacks. This idealized formulation directly motivates a practical and effective attack scheme that only requires query access to AdaDetect's output labels. Coupling these formulations with two popular and complementary black-box adversarial algorithms, we systematically evaluate the vulnerability of AdaDetect on synthetic and real-world datasets. Our results show that adversarial perturbations can significantly increase the FDR while maintaining high detection power, exposing fundamental limitations of current error-controlled novelty detection methods and motivating the development of more robust alternatives.
Low-altitude uncrewed aerial vehicles (UAVs) have become integral enablers for the Internet of Things (IoT) by offering enhanced coverage, improved connectivity, and access to remote areas. A critical challenge limiting their operational capacity lies in the energy constraints of both the aerial platforms and the ground-based sensors. This paper explores WLPT as a transformative solution for sustainable energy provisioning in UAV-assisted IoT networks. We first systematically investigate the fundamental principles of WLPT and analyze its comparative advantages. Then, we introduce three operational paradigms for system integration, identify key challenges, and discuss corresponding potential solutions. In a case study, we propose a multi-agent reinforcement learning framework to address the coordination and optimization challenges in WLPT-enabled UAV-assisted IoT data collection. Simulation results demonstrate that our framework significantly improves energy sustainability and data freshness. Finally, we discuss future directions.
Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models' understanding capability. However, exploration into assessing generative capabilities remains limited, especially for open-ended long-form content generation. Significant challenges lie in no reference standard answer, no unified evaluation metrics and uncontrollable human judgments. In this work, we take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework: 1) We construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. 2) We introduce a multimodal evaluation strategy and decompose the complex task into three dimensions: text, speech and audio, with different evaluation emphasis on "Content" and "Format". 3) For each modality, we design corresponding evaluation methods, involving both objective metrics and subjective listening test. We leverage representative podcast generation systems (including open-source, close-source, and human-made) in our experiments. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open-ended long-form audio. This project is open-source to facilitate public use: this https URL.
This work proposes a novel Alternating Direction Method of Multipliers (ADMM)-based Ensemble Kalman Inversion (EKI) algorithm for solving constrained nonlinear model predictive control (NMPC) problems. First, the stage-wise nonlinear inequality constraints in the NMPC problem are embedded via an augmented Lagrangian with nonnegative slack variables. We then show that the unconstrained augmented Lagrangian formulation of the NMPC admits a Bayesian interpretation: under a Gaussian observation model, its minimizers coincide with MAP estimators, enabling solution via EKI. However, since the nonnegativity constraint on the slacks cannot be enforced via Gaussian noise, our proposed algorithm results in a two-block ADMM that alternates between (i) a primal step that minimizes the unconstrained augmented Lagrangian, (ii) a nonnegativity projection for the slacks, and (iii) a dual ascent step. To balance exploration and convergence, an annealing schedule tempers covariances and penalty weights, thereby encouraging global search early and precise constraint satisfaction later. To demonstrate the performance of the proposed method, we compare it with another iterative sampling-based approach based on Model Predictive Path Integral (MPPI) control, called DIAL-MPC.
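A structural sketch of the two-block ADMM loop follows; the cost, constraints, the crude sampling stand-in for the EKI primal step, and the annealing factor are all illustrative assumptions.

    import numpy as np

    def cost(u):                       # placeholder NMPC objective
        return float(np.sum(u ** 2))

    def g(u):                          # placeholder inequality constraints g(u) <= 0
        return np.array([1.0 - np.sum(u)])

    def eki_minimize(f, u, samples=64, sigma=0.1):
        # Crude sampling stand-in for the ensemble Kalman inversion primal step.
        cands = u + sigma * np.random.randn(samples, u.size)
        vals = np.array([f(c) for c in cands])
        return cands[np.argmin(vals)] if vals.min() < f(u) else u

    def admm_eki(u0, rho=1.0, iters=30):
        u = u0
        s = np.maximum(0.0, -g(u0))                 # nonnegative slacks
        lam = np.zeros_like(s)
        for _ in range(iters):
            # (i) primal step: minimize the unconstrained augmented Lagrangian via EKI
            u = eki_minimize(lambda v: cost(v) + 0.5 * rho * np.sum((g(v) + s + lam / rho) ** 2), u)
            # (ii) slack step: closed-form nonnegativity projection
            s = np.maximum(0.0, -(g(u) + lam / rho))
            # (iii) dual ascent on the residual of g(u) + s = 0
            lam = lam + rho * (g(u) + s)
            rho *= 1.1                              # annealing: tighten constraints over iterations
        return u

    print(np.round(admm_eki(np.array([2.0, 2.0])), 3))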
In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
An accurate analytical form of the achievable bit error rate in the presence of multipath interference (MPI) is proposed for PAM4 for the first time, taking into account an ideal MPI estimate and compensation.
This work presents novel methods to reduce the computational and memory requirements of medical image segmentation with a large number of classes. Notably, we observe challenges in maintaining state-of-the-art segmentation performance with all of the explored options. Standard learning-based methods typically employ one-hot encoding of class labels, so computational complexity and memory requirements increase linearly with the number of classes. We propose a family of binary encoding approaches instead of one-hot encoding to reduce the computational complexity and memory requirements to logarithmic in the number of classes. In addition to vanilla binary encoding, we investigate the effects of error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees. We apply the methods to the use case of whole-brain parcellation with 108 classes based on 3D MRI images. While binary encodings have proven efficient in so-called extreme classification problems in computer vision, we faced challenges in reaching state-of-the-art segmentation quality with them. Compared to one-hot encoding (Dice Similarity Coefficient (DSC) = 82.4 (2.8)), we report reduced segmentation performance with the binary segmentation approaches, achieving DSCs in the range from 39.3 to 73.8. Informative negative results all too often go unpublished; we hope that this work inspires future research on compact encoding strategies for large multi-class segmentation tasks.
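A minimal sketch of the vanilla binary encoding with an identity codeword assignment is shown below; the paper additionally studies ECOCs, class weighting, soft decoding, and learned assignments.

    import numpy as np

    # 108 classes need only ceil(log2(108)) = 7 output channels instead of 108.
    num_classes = 108
    n_bits = int(np.ceil(np.log2(num_classes)))          # 7

    def encode(labels):
        """Map integer class labels to {0,1}^n_bits codewords (LSB first)."""
        return ((labels[..., None] >> np.arange(n_bits)) & 1).astype(np.float32)

    def decode(logits):
        """Hard-decode per-bit logits back to class indices, guarding invalid codewords."""
        bits = (logits > 0).astype(np.int64)
        idx = (bits * (1 << np.arange(n_bits))).sum(-1)
        return np.clip(idx, 0, num_classes - 1)

    y = np.array([0, 5, 107])
    print(encode(y).shape)              # (3, 7) instead of (3, 108)
    print(decode(encode(y) * 2 - 1))    # recovers [0, 5, 107]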
Utilizing teams of multiple robots is advantageous for handling bulky objects. Many related works focus on multi-manipulator systems, which are limited by workspace constraints. In this paper, we extend a classical hybrid motion-force controller to a team of legged manipulator systems, enabling collaborative loco-manipulation of rigid objects with a force-closed grasp. Our novel approach allows the robots to flexibly coordinate their movements, achieving efficient and stable object co-manipulation and transport, validated through extensive simulations and real-world experiments.
Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
This paper establishes the global convergence properties of the Oja flow, a continuous-time algorithm for principal component extraction, for general square matrices. The Oja flow is a matrix differential equation on the Stiefel manifold designed to extract a dominant subspace. While its analysis has traditionally been restricted to symmetric positive-definite matrices, where it acts as a gradient flow, recent applications have extended its use to general matrices. In this non-symmetric case, the flow extracts the invariant subspace corresponding to the eigenvalues with the largest real parts. However, prior convergence results have been purely local, leaving the global behavior as an open problem. This paper fills this gap by providing a comprehensive global convergence analysis, establishing that the flow converges exponentially for almost all initial conditions. We also propose a modification to the algorithm that enhances its numerical stability. As an application of this theory, we develop novel methods for the model reduction of linear dynamical systems and the synthesis of low-rank stabilizing controllers.
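A hedged numerical sketch of one common form of the Oja flow, dU/dt = (I - U U^T) A U, integrated with explicit Euler and QR re-orthonormalization for stability; the test matrix, step size, and the QR-based stabilization are illustrative choices, not the paper's proposed modification.

    import numpy as np

    def oja_flow(A, k, dt=1e-2, steps=5000, seed=0):
        n = A.shape[0]
        rng = np.random.default_rng(seed)
        U, _ = np.linalg.qr(rng.standard_normal((n, k)))
        for _ in range(steps):
            U = U + dt * (A @ U - U @ (U.T @ A @ U))   # Euler step of the Oja flow
            U, _ = np.linalg.qr(U)                     # keep U on the Stiefel manifold
        return U

    A = np.array([[0.0, 1.0, 0.0],
                  [-1.0, -0.1, 0.0],
                  [0.0, 0.0, -2.0]])
    U = oja_flow(A, k=2)
    # U should span the invariant subspace of the eigenvalues with largest real parts
    # (here the lightly damped oscillatory pair), i.e., roughly span(e1, e2).
    print(np.round(U, 3))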
The increasing integration of distributed energy resources (DERs), particularly renewables, poses significant challenges for power system protection, with fault classification (FC) and fault localization (FL) being among the most critical tasks. Conventional protection schemes, based on fixed thresholds, cannot reliably identify and localize short circuits with the increasing complexity of the grid under dynamic conditions. Machine learning (ML) offers a promising alternative; however, systematic benchmarks across models and settings remain limited. This work presents, for the first time, a comparative benchmarking study of classical ML models for FC and FL in power system protection based on EMT data. Using voltage and current waveforms segmented into sliding windows of 10 ms to 50 ms, we evaluate models under realistic real-time constraints. Performance is assessed in terms of accuracy, robustness to window size, and runtime efficiency. The best-performing FC model achieved an F1 score of 0.992$\pm$0.001, while the top FL model reached an R2 of 0.806$\pm$0.008 with a mean processing time of 0.563 ms.
Current products, especially in the automotive sector, are complex technical systems of a multi-disciplinary, mechatronic nature. Industrial standards supporting systems engineering and production typically (i) address the production phase only, rather than the complete product life cycle, and (ii) focus on production processes and resources rather than the products themselves. The presented approach is motivated by incorporating the impacts of the end-of-life phase of the product life cycle into the engineering phase. This paper proposes a modelling approach building on the Product-Process-Resource (PPR) modeling paradigm. It combines the requirements of (i) respecting the product structure as the basis of the model and (ii) incorporating repairing, remanufacturing, or upcycling within cyber-physical production systems. The proposed model, called PoPAN, should accompany the product during its entire life cycle as a digital shadow encapsulated within the Asset Administration Shell of the product. To facilitate the adoption of the proposed paradigm, the paper also proposes a serialization of the model in the AutomationML data format. The model is demonstrated on a use case for disassembling electric vehicle batteries to support their remanufacturing for stationary battery applications.
Vehicular Ad-hoc Networks (VANETs), a subclass of Mobile Ad-hoc Networks (MANETs), are expected to play a crucial role in the future of intelligent transportation systems (ITSs). A key objective of VANETs is to enable efficient and cost-effective communication among vehicles while supporting a large number of network participants and minimizing infrastructure dependency. However, the highly dynamic nature of vehicular networks poses significant challenges to their deployment. Clustering techniques are employed to address these challenges, with a strong emphasis on stability, as they directly influence the routing process and enhance the quality of service (QoS). This paper explores the feasibility of reducing reliance on roadside units (RSUs) in metropolitan areas while improving cluster stability. We propose an efficient clustering algorithm tailored for urban environments, leveraging existing metropolitan infrastructure to compensate for the absence of RSUs. Our approach designates public transportation buses as primary cluster heads (CHs), minimizing reliance on additional infrastructure, while stand-alone vehicles (SAVs) dynamically select additional CHs. Through comprehensive case studies and comparative analysis with existing algorithms, our results demonstrate the superior performance of the proposed method across different transmission ranges (TRs).
This paper presents a task-oriented computational framework to enhance Visual-Inertial Navigation (VIN) in robots, addressing challenges such as limited time and energy resources. The framework strategically selects visual features using a Mean Squared Error (MSE)-based, non-submodular objective function and a simplified dynamic anticipation model. To address the NP-hardness of this problem, we introduce four polynomial-time approximation algorithms: a classic greedy method with constant-factor guarantees; a low-rank greedy variant that significantly reduces computational complexity; a randomized greedy sampler that balances efficiency and solution quality; and a linearization-based selector based on a first-order Taylor expansion for near-constant-time execution. We establish rigorous performance bounds by leveraging submodularity ratios, curvature, and element-wise curvature analyses. Extensive experiments on both standardized benchmarks and a custom control-aware platform validate our theoretical results, demonstrating that these methods achieve strong approximation guarantees while enabling real-time deployment.
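For the classic greedy variant, a compact sketch under stated assumptions: random measurement Jacobians stand in for the visual features, a trace-of-covariance surrogate stands in for the MSE objective, and the budget is illustrative.

    import numpy as np

    def greedy_select(H_list, P0, R, budget):
        """Greedily pick features whose Kalman-style update most reduces trace(P)."""
        selected, P = [], P0.copy()
        for _ in range(budget):
            best, best_cost, best_P = None, np.inf, P
            for i, H in enumerate(H_list):
                if i in selected:
                    continue
                S = H @ P @ H.T + R
                P_new = P - P @ H.T @ np.linalg.solve(S, H @ P)   # posterior covariance
                cost = np.trace(P_new)                            # MSE surrogate
                if cost < best_cost:
                    best, best_cost, best_P = i, cost, P_new
            selected.append(best)
            P = best_P
        return selected

    rng = np.random.default_rng(0)
    H_list = [rng.standard_normal((2, 6)) for _ in range(20)]     # 20 candidate features
    print(greedy_select(H_list, P0=np.eye(6), R=0.1 * np.eye(2), budget=5))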
In the complex landscape of multivariate time series forecasting, achieving both accuracy and interpretability remains a significant challenge. This paper introduces the Fuzzy Transformer (Fuzzformer), a novel recurrent neural network architecture combined with multi-head self-attention and fuzzy inference systems to analyze multivariate stock market data and conduct long-term time series forecasting. The method leverages LSTM networks and temporal attention to condense multivariate data into interpretable features suitable for fuzzy inference systems. The resulting architecture offers forecasting performance comparable to conventional models such as ARIMA and LSTM while providing meaningful information flow within the network. The method was examined on the real-world stock market index S\&P500. Initial results show potential for interpretable forecasting and identify current performance tradeoffs, suggesting practical application in understanding and forecasting stock market behavior.
ROSflight is a lean, open-source autopilot ecosystem for unmanned aerial vehicles (UAVs). Designed by researchers for researchers, it is built to lower the barrier to entry to UAV research and accelerate the transition from simulation to hardware experiments by maintaining a lean (not full-featured), well-documented, and modular codebase. This publication builds on previous treatments and describes significant additions to the architecture that improve the modularity and usability of ROSflight, including the transition from ROS 1 to ROS 2, supported hardware, low-level actuator mixing, and the simulation environment. We believe that these changes improve the usability of ROSflight and enable ROSflight to accelerate research in areas like advanced-air mobility. Hardware results are provided, showing that ROSflight is able to control a multirotor over a serial connection at 400 Hz while closing all control loops on the companion computer.
We introduce a novel version of the geometric scattering transform for geometric graphs containing scalar and vector node features. This new scattering transform has desirable symmetries with respect to rigid-body roto-translations (i.e., $SE(3)$-equivariance) and may be incorporated into a geometric GNN framework. We empirically show that our equivariant scattering-based GNN achieves comparable performance to other equivariant message-passing-based GNNs at a fraction of the parameter count.
Unmanned aerial vehicle (UAV) research requires the integration of cutting-edge technology into existing autopilot frameworks. This process can be arduous, requiring extensive resources, time, and detailed knowledge of the existing system. ROSplane is a lean, open-source fixed-wing autonomy stack built by researchers for researchers. It is designed to accelerate research by providing clearly defined interfaces with an easily modifiable framework. Powered by ROS 2, ROSplane allows for rapid integration of low or high-level control, path planning, or estimation algorithms. A focus on lean, easily understood code and extensive documentation lowers the barrier to entry for researchers. Recent developments to ROSplane improve its capacity to accelerate UAV research, including the transition from ROS 1 to ROS 2, enhanced estimation and control algorithms, increased modularity, and an improved aerodynamic modeling pipeline. This aerodynamic modeling pipeline significantly reduces the effort of transitioning from simulation to real-world testing without requiring expensive system identification or computational fluid dynamics tools. ROSplane's architecture reduces the effort required to integrate new research tools and methods, expediting hardware experimentation.
In this article, we employ an input-output approach to expand the study of cooperative multi-agent control and optimization problems characterized by mean-field interactions that admit decentralized and selfish solutions. The setting involves $n$ independent agents that interact solely through a shared cost function, which penalizes deviations of each agent from the group's average collective behavior. Building on our earlier results established for homogeneous agents, we extend the framework to nonidentical agents and show that, under a diagonally dominant interaction structure of the collective dynamics with bounded local open-loop dynamics, the optimal controller for $H_\infty$ and $H_2$ norm minimization remains decentralized and selfish in the limit as the number of agents $n$ grows to infinity.
This work presents two methodologies to enhance vulnerability assessment in power systems using bilevel attacker-defender network interdiction models. First, we introduce a systematic evaluation procedure for comparing different optimal power flow formulations in the lower-level problem. We demonstrate the procedure for a comparison of the widely used DC approximation and a linearized AC optimal power flow model. Second, we propose a novel scoring methodology to identify and prioritize critical attack vectors across diverse load and generation scenarios. Both methodologies go beyond traditional worst-case analysis. Case studies on a SimBench high-voltage test grid show that the DC approach fails to detect a significant portion of critical vulnerabilities. The scoring methodology further demonstrates the dependency of vulnerabilities on the considered load case and time step, highlighting the importance of assessing multiple scenarios and going beyond worst-case solutions. The proposed methodologies enhance power system vulnerability assessment and can support the effective development of robust defense strategies for future power systems.
This work studies resilient leader-follower consensus with a bounded number of adversaries. Existing approaches typically require robustness conditions of the entire network to guarantee resilient consensus. However, the behavior of such systems when these conditions are not fully met remains unexplored. To address this gap, we introduce the notion of partial leader-follower consensus, in which a subset of non-adversarial followers successfully tracks the leader's reference state despite insufficient robustness. We propose a novel distributed algorithm - the Bootstrap Percolation and Mean Subsequence Reduced (BP-MSR) algorithm - and establish sufficient conditions for individual followers to achieve consensus via the BP-MSR algorithm in arbitrary time-varying graphs. We validate our findings through simulations, demonstrating that our method guarantees partial leader-follower consensus, even when standard resilient consensus algorithms fail.
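For context, a sketch of the Mean Subsequence Reduced trimming step that BP-MSR builds on; the bootstrap-percolation bookkeeping of which followers are deemed trusted is omitted, and F, the mixing weight, and the sample values are illustrative.

    import numpy as np

    def msr_update(x_i, neighbor_values, F, leader_value=None, alpha=0.5):
        """Discard the F largest and F smallest neighbor values relative to x_i, then average."""
        vals = np.asarray(neighbor_values, dtype=float)
        higher = np.sort(vals[vals > x_i])[::-1]
        lower = np.sort(vals[vals < x_i])
        keep = np.concatenate([higher[F:], lower[F:], vals[vals == x_i]])
        if leader_value is not None:
            keep = np.append(keep, leader_value)   # leader's reference, if received
        if keep.size == 0:
            return x_i
        return (1 - alpha) * x_i + alpha * keep.mean()

    # Extreme neighbor values (e.g., adversarial) are trimmed before averaging:
    print(msr_update(0.0, [5.0, -4.0, 0.2, -0.1, 100.0], F=1))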
While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.
This study addresses the challenge of estimating traffic states for road links. We propose an innovative approach that leverages partial trajectory data captured by camera-equipped probe vehicles traveling in the opposite lane. The methodology combines state-of-the-art computer vision algorithms for extracting vehicle trajectories from street-view video sequences with a novel estimation technique based on the Cell Transmission Model (CTM) and Genetic Algorithms (GA). Our approach first calibrates Fundamental Diagram (FD) parameters using observed cell densities, then estimates boundary conditions for all space-time diagrams. We validate the method using simulated traffic data from three different types of links and parameter settings. Results show that the proposed methodology can estimate traffic densities in unobserved regions, even with limited data availability. This research contributes to the field by introducing a cost-effective, high-resolution traffic data collection method and a robust estimation technique for comprehensive traffic state information. While the study shows promising results, it also identifies areas for improvement, including refining models, optimizing processes, and testing with real-world data to enhance accuracy and scalability.
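A minimal sketch of one CTM step with a triangular fundamental diagram is given below; the FD parameters are illustrative placeholders for the values the GA calibrates from camera-observed cell densities.

    import numpy as np

    vf, w, kj, qmax = 25.0, 6.0, 0.15, 0.5   # free-flow speed, wave speed, jam density, capacity
    dx, dt = 100.0, 2.0                      # cell length (m), time step (s); CFL: vf*dt/dx <= 1

    def ctm_step(k, inflow, outflow_cap):
        """Advance cell densities k (veh/m) by one step of the Cell Transmission Model."""
        demand = np.minimum(vf * k, qmax)                 # sending flow of each cell (veh/s)
        supply = np.minimum(w * (kj - k), qmax)           # receiving flow of each cell (veh/s)
        q = np.minimum(demand[:-1], supply[1:])           # flows at interior cell boundaries
        q = np.concatenate([[min(inflow, supply[0])], q,
                            [min(demand[-1], outflow_cap)]])
        return k + (dt / dx) * (q[:-1] - q[1:])

    k = np.full(10, 0.03)
    for _ in range(100):
        k = ctm_step(k, inflow=0.4, outflow_cap=0.5)
    print(np.round(k, 3))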
We study a class of graphon particle systems with time-varying random coefficients. In a graphon particle system, the interactions among particles are characterized by the coupled mean field terms through an underlying graphon, and the randomness of the coefficients comes from exogenous stochastic processes. By constructing two-level approximated sequences converging in 2-Wasserstein distance, we prove the existence and uniqueness of the solution to the system. In addition, by constructing two-level approximated functions converging to the graphon mean field terms, we establish the law of large numbers, which reveals that if the number of particles tends to infinity and the discretization step tends to zero, then the discrete-time interacting particle system over a large-scale network converges to the graphon particle system. As a byproduct, we discover that the graphon particle system can describe the limiting dynamics of the distributed stochastic gradient descent algorithm over a large-scale network. In particular, if the gradients of the local cost functions are Lipschitz continuous, the graphon particle system can be regarded as the spatio-temporal approximation of the discrete-time distributed stochastic gradient descent algorithm as the number of network nodes tends to infinity and the algorithm step size tends to zero.
We study the distributed optimization problem over a graphon with a continuum of nodes, which is regarded as the limit of distributed networked optimization as the number of nodes goes to infinity. Each node has a private local cost function. The global cost function, which all nodes cooperatively minimize, is the integral of the local cost functions over the node set. We propose stochastic gradient descent and gradient tracking algorithms over the graphon. We establish a general lemma on upper bound estimation for a class of time-varying differential inequalities with negative linear terms, based on which we prove that, for both kinds of algorithms, the second moments of the nodes' states are uniformly bounded. In particular, for the stochastic gradient tracking algorithm, we transform the convergence analysis into the asymptotic behavior of coupled nonlinear differential inequalities with time-varying coefficients and develop a decoupling method. For both kinds of algorithms, we show that by choosing the time-varying algorithm gains properly, all nodes' states achieve $\mathcal{L}^{\infty}$-consensus for a connected graphon. Furthermore, if the local cost functions are strongly convex, then all nodes' states converge to the minimizer of the global cost function, and the auxiliary states in the stochastic gradient tracking algorithm converge, uniformly in mean square, to the gradient value of the global cost function at the minimizer.
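For orientation, the finite-node analogue of the gradient tracking recursion considered here has the standard form $x_i(t+1) = \sum_{j=1}^{n} w_{ij}\, x_j(t) - \alpha(t)\, y_i(t)$ and $y_i(t+1) = \sum_{j=1}^{n} w_{ij}\, y_j(t) + \nabla f_i(x_i(t+1)) - \nabla f_i(x_i(t))$ with $y_i(0) = \nabla f_i(x_i(0))$, where $w_{ij}$ are doubly stochastic mixing weights and $\alpha(t)$ is the time-varying gain; the graphon version studied in the paper replaces the sums by integrals over the continuum of nodes and the gradients by their noisy, stochastic counterparts. This notation is assumed here for illustration and is not necessarily the paper's.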
While control barrier functions (CBFs) are employed in addressing safety, control synthesis methods based on them generally rely on accurate system dynamics. This is a critical limitation, since the dynamics of complex systems are often not fully known. Supervised machine learning techniques hold great promise for alleviating this weakness by inferring models from data. We propose a novel approach for safe event-triggered learning of Gaussian process models in CBF-based continuous-time control for unknown control-affine systems. By applying a finite excitation at triggering times, our approach ensures a sufficient information gain to maintain the feasibility of the CBF-based safety condition with high probability. Our approach probabilistically guarantees safety based on a suitable GP prior and rules out Zeno behavior in the triggering scheme. The effectiveness of the proposed approach and theory is demonstrated in simulations.
We propose networked policy gradient play for solving Markov potential games with continuous and/or discrete state-action pairs. During the game, agents use parametrized and differentiable policies that depend on the current state and the policy parameters of other agents. During training, agents update their policy parameters following stochastic gradients. The gradient estimation involves two consecutive episodes, generating unbiased estimators of reward and policy score functions. In addition, it involves keeping estimates of others' parameters using consensus steps given local estimates received through a time-varying communication network. In Markov potential games, there exists a potential value function among agents whose gradients correspond to the gradients of the local value functions. Using this structure, we prove almost sure convergence to a stationary point of the potential value function with rate $O(1/\epsilon^2)$. Compared to previous works, our results do not require bounded policy gradients or initial agreement on the values of individual policy parameters. Numerical experiments on a dynamic multi-agent newsvendor problem verify the convergence of local beliefs and gradients. They further show that networked policy gradient play converges as fast as independent policy gradient updates, while collecting higher rewards.
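In illustrative notation (assumed here, not necessarily the paper's), each agent $i$ combines two ingredients per training step: a consensus update of its local copy $\hat{\theta}^i_j$ of every other agent $j$'s parameters, $\hat{\theta}^i_j(k+1) = \sum_{l} w_{il}(k)\, \hat{\theta}^l_j(k)$ over the time-varying communication weights $w_{il}(k)$, and a stochastic gradient step on its own parameters, $\theta_i(k+1) = \theta_i(k) + \alpha_k\, \hat{\nabla}_{\theta_i} V_i\big(\theta_i(k), \hat{\theta}^i_{-i}(k)\big)$, where $\hat{\nabla}$ denotes the unbiased two-episode score-function estimator described above.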
This paper considers the application of Model Predictive Control (MPC) to a weighted coverage path planning (WCPP) problem. The problem appears in a wide range of practical applications, including search and rescue (SAR) missions. The basic setup is that one (or multiple) agents can move around a given search space and collect rewards from a given spatial distribution. Unlike an artificial potential field, each reward can only be collected once. In contrast to a Traveling Salesman Problem (TSP), the agent moves in a continuous space, is not obliged to cover all locations, and may return to previously visited locations. The WCPP problem is tackled by a new MPC formulation with so-called Coverage Constraints (CCs). It is shown that the solution becomes more effective if the solver is initialized with a TSP-based heuristic. With and without this initialization, the proposed MPC approach clearly outperforms a naive MPC formulation, as demonstrated in a small simulation study.
We tackle the problem of system identification, where we select inputs, observe the corresponding outputs from the true system, and optimize the parameters of our model to best fit the data. We propose a practical and computationally tractable methodology that is compatible with any system and parametric family of models. Our approach only requires input-output data from the system and first-order information of the model with respect to the parameters. Our approach consists of two modules. First, we formulate the problem of system identification from a Bayesian perspective and use a linear Gaussian model approximation to iteratively optimize the model's parameters. In each iteration, we propose to use the input-output data to tune the covariance of the linear Gaussian model. This online covariance calibration stabilizes fitting and signals model inaccuracy. Second, we define a Gaussian-based uncertainty measure for the model parameters, which we can then minimize with respect to the next selected input. We test our method with linear and nonlinear dynamics.
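A minimal sketch of one plausible instantiation of the first module, assuming a model $y \approx g(\theta)$ with an available Jacobian and a Gaussian prior on $\theta$; the residual-based covariance calibration and all names below are illustrative, not the paper's exact recipe.

```python
import numpy as np

def linear_gaussian_update(theta, g, jac, y, prior_cov):
    """One illustrative iteration: linearize the model around theta, calibrate the
    observation covariance from the current residuals, and take the resulting
    Gauss-Newton / MAP step under the linear Gaussian approximation."""
    r = y - g(theta)                             # residuals on the input-output data
    J = jac(theta)                               # model Jacobian w.r.t. parameters
    sigma2 = max(float(np.mean(r ** 2)), 1e-8)   # crude online covariance calibration
    R_inv = np.eye(len(y)) / sigma2
    P_inv = np.linalg.inv(prior_cov)
    cov = np.linalg.inv(J.T @ R_inv @ J + P_inv)  # Gaussian posterior covariance
    return theta + cov @ (J.T @ R_inv @ r), cov

# Toy usage: identify theta in y = theta[0]*x + theta[1] from noisy observations.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.5 + 0.05 * np.random.randn(50)
g = lambda th: th[0] * x + th[1]
jac = lambda th: np.stack([x, np.ones_like(x)], axis=1)
theta, cov = linear_gaussian_update(np.zeros(2), g, jac, y, np.eye(2))
print(theta, np.diag(cov))
```

The posterior covariance returned here is also a natural starting point for the second module, since a scalar uncertainty measure (e.g., its trace or log-determinant) can be minimized over the next selected input.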
Sleep is essential for maintaining human health and quality of life. Analyzing physiological signals during sleep is critical in assessing sleep quality and diagnosing sleep disorders. However, manual diagnoses by clinicians are time-intensive and subjective. Despite advances in deep learning that have enhanced automation, these approaches remain heavily dependent on large-scale labeled datasets. This study introduces SynthSleepNet, a multimodal hybrid self-supervised learning framework designed for analyzing polysomnography (PSG) data. SynthSleepNet effectively integrates masked prediction and contrastive learning to leverage complementary features across multiple modalities, including electroencephalogram (EEG), electrooculography (EOG), electromyography (EMG), and electrocardiogram (ECG). This approach enables the model to learn highly expressive representations of PSG data. Furthermore, a temporal context module based on Mamba was developed to efficiently capture contextual information across signals. SynthSleepNet achieved superior performance compared to state-of-the-art methods across three downstream tasks: sleep-stage classification, apnea detection, and hypopnea detection, with accuracies of 89.89%, 99.75%, and 89.60%, respectively. The model demonstrated robust performance in a semi-supervised learning environment with limited labels, achieving accuracies of 87.98%, 99.37%, and 77.52% on the same tasks. These results underscore the potential of the model as a foundational tool for the comprehensive analysis of PSG data. With its consistently superior performance across multiple downstream tasks, SynthSleepNet is well positioned to set a new standard for sleep disorder monitoring and diagnostic systems.
The application of machine learning (ML) to electroencephalography (EEG) has great potential to advance both neuroscientific research and clinical applications. However, the generalisability and robustness of EEG-based ML models often hinge on the amount and diversity of training data. It is common practice to split EEG recordings into small segments, thereby increasing the number of samples substantially compared to the number of individual recordings or participants. We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance with respect to the overall sample size and the participant diversity through large-scale empirical studies. We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems: data augmentations and self-supervised learning. Our findings show that model performance scaling can be severely constrained by participant distribution shifts and provide actionable guidance for data collection and ML research. The code for our experiments is publicly available online.
This study investigates the problem of learning linear block codes optimized for Belief-Propagation decoders, significantly improving performance compared to the state of the art. Our previous research is extended with an enhanced system design that facilitates a more effective learning process for the parity check matrix. We simplify the input dataset, restrict the number of parameters to learn, and improve the gradient back-propagation within the model. We also introduce novel optimizers specifically designed for discrete-valued weights. Based on conventional gradient computation, these optimizers provide discrete weight updates, enabling finer control and improving the explainability of the learning process. Through these changes, we consistently achieve improved code performance, provided the hyper-parameters are appropriately chosen. To rigorously evaluate the performance of learned codes in the context of short to medium block lengths, we propose a comprehensive code performance assessment framework. This framework enables a fair comparison between our learning methodology and random search approaches, ensuring statistical significance in our results. The proposed model paves the way for a new approach to the efficient learning of linear block codes tailored to specific decoder structures.
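One common way to realize gradient-driven updates of discrete-valued weights is a straight-through-style scheme: accumulate conventional gradients in a real-valued latent copy and re-quantize after each step. The sketch below is a generic illustration for binary {0, 1} entries, not necessarily how the paper's dedicated optimizers operate.

```python
import numpy as np

class DiscreteWeightOptimizer:
    """Illustrative straight-through-style optimizer: real-valued shadow weights
    are updated with ordinary gradients; the discrete {0, 1} weights used by the
    decoder are obtained by thresholding the shadow copy."""
    def __init__(self, init_weights, lr=0.1):
        self.latent = np.asarray(init_weights, dtype=float)  # shadow weights
        self.lr = lr

    def step(self, grad):
        self.latent -= self.lr * grad                 # conventional gradient update
        self.latent = np.clip(self.latent, 0.0, 1.0)  # keep latents in [0, 1]
        return (self.latent > 0.5).astype(int)        # discrete weights for the model

opt = DiscreteWeightOptimizer(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(opt.step(np.array([[0.8, -0.8], [0.0, 0.0]])))
```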
Pinching-antenna systems (PASS) improve wireless links by configuring the locations of activated pinching antennas along dielectric waveguides, namely pinching beamforming. In this paper, a novel adjustable power radiation model is proposed for PASS, where the power radiation ratios of pinching antennas can be flexibly controlled by tuning the spacing between pinching antennas and waveguides. A closed-form pinching antenna spacing arrangement strategy is derived to achieve the commonly assumed equal-power radiation. Based on this, a practical PASS framework relying on discrete activation is considered, where pinching antennas can only be activated at a set of predefined locations. A transmit power minimization problem is formulated, which jointly optimizes the transmit beamforming, the pinching beamforming, and the numbers of activated pinching antennas, subject to each user's minimum rate requirement. (1) To solve the resulting highly coupled mixed-integer nonlinear programming (MINLP) problem, branch-and-bound (BnB)-based algorithms are proposed for both single-user and multi-user scenarios, which are guaranteed to converge to globally optimal solutions. (2) A low-complexity many-to-many matching algorithm is further developed. Combined with Karush-Kuhn-Tucker (KKT) theory, locally optimal and pairwise-stable solutions are obtained within polynomial-time complexity. Simulation results demonstrate that: (i) PASS significantly outperforms conventional multi-antenna architectures, particularly when the number of users and the spatial range increase; and (ii) the proposed matching-based algorithm achieves near-optimal performance, resulting in only a slight performance loss while significantly reducing computational overheads. Code is available at this https URL
In this paper, we propose a coordinated routing strategy aimed at improving bus schedule adherence and enhancing travel efficiency for connected and automated vehicles (CAVs) operating within a mixed-traffic urban network. Our approach capitalizes on the existence of dedicated lanes for buses and CAVs, leveraging real-time traffic data to dynamically reroute CAVs in anticipation of congestion. By continuously monitoring traffic conditions on dedicated lanes and tracking the real-time positions of buses, we enable the system to proactively adjust CAV routes when potential interference with bus operations is detected. This coordination mitigates delays affecting transit services and reduces travel time for CAVs. We evaluate the proposed strategy through simulation studies conducted in SUMO. The results demonstrate significant improvements in both transit reliability and CAV operational performance across a range of traffic conditions.
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
Visceral adipose tissue (VAT) is a key marker of both metabolic health and habitual physical activity (PA). Excess VAT is highly correlated with type 2 diabetes and insulin resistance. The mechanistic basis for this pathophysiology relates to overloading the liver with fatty acids. VAT is also a highly labile fat depot, with increased turnover stimulated by catecholamines during exercise. VAT can be measured with sophisticated imaging technologies, but can also be inferred directly from PA. We tested this relationship using National Health and Nutrition Examination Survey (NHANES) data from 2011-2014, for individuals aged 20-60 years with 7 days of accelerometry data (n=2,456 men; 2,427 women) [1]. Two approaches were used for estimating VAT from activity. The first used engineered features based on movements during gait and sleep, and then ridge regression to map summary statistics of these features into a VAT estimate. The second approach used deep neural networks trained on 24 hours of continuous accelerometry. A foundation model first mapped each 10s frame into a high-dimensional feature vector. A transformer model then mapped each day's feature vector time series into a VAT estimate, and these daily estimates were averaged over multiple days. For both approaches, the most accurate estimates were obtained with the addition of covariate information about subject demographics and body measurements. The best performance was obtained by combining the two approaches, resulting in VAT estimates with correlations of r=0.86. These findings demonstrate a strong relationship between PA and VAT and, by extension, between PA and metabolic health risks.
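A minimal sketch of the final step of the first approach, assuming a per-subject table of summary statistics of the engineered gait/sleep features plus covariates; the placeholder data and the use of scikit-learn's RidgeCV are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# X: per-subject summary statistics of engineered movement features plus
# demographic / body-measurement covariates; y: reference VAT values.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))                                # placeholder features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)    # placeholder target

model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 25)))
pred = cross_val_predict(model, X, y, cv=5)
print("cross-validated correlation r =", np.corrcoef(pred, y)[0, 1])
```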
Demand charge often constitutes a significant portion of electricity costs for commercial electric vehicle (EV) charging station operators. This paper explores control methods to reduce peak power consumption at workplace EV charging stations in a joint price and power optimization framework. We optimize a menu of price options to incentivize users to select controllable charging service. Using this framework, we propose a model predictive control approach to reduce both demand charge and overall operator costs. Through a Monte Carlo simulation, we find that our algorithm outperforms a state-of-the-art benchmark optimization strategy and can significantly reduce station operator costs.
Ultrasonic Guided Waves (UGWs) represent a promising diagnostic tool for Structural Health Monitoring (SHM) in thin-walled structures, and their integration with machine learning (ML) algorithms is increasingly being adopted to enable real-time monitoring capabilities. However, the large-scale deployment of UGW-based ML methods is constrained by data scarcity and limited generalisation across different materials and sensor configurations. To address these limitations, this work proposes a novel transfer learning (TL) framework based on Multilinear Principal Component Analysis (MPCA). First, a Convolutional Neural Network (CNN) for regression is trained to perform damage localisation for a plated structure. Then, MPCA and fine-tuning are combined to adapt the CNN to a different plate. By jointly applying MPCA to the source and target domains, the method extracts shared latent features, enabling effective domain adaptation without requiring prior assumptions about dimensionality. Following MPCA, fine-tuning enables adapting the pre-trained CNN to a new domain without the need for a large training dataset. The proposed MPCA-based TL method was tested on 12 case studies involving different composite materials and sensor arrays. Statistical metrics were used to assess domain alignment both before and after MPCA, and the results demonstrate a substantial reduction in localisation error compared to standard TL techniques. Hence, the proposed approach emerges as a robust, data-efficient, and statistically grounded TL framework for UGW-based SHM.
The inherent control switching of renewable energy sources (RESs) during intricate transient processes introduces complexity to the dynamic behavior of modern power systems. This paper reveals the dynamic coupling between grid-forming (GFM)/grid-following (GFL)-based RES and dominant instability modes of the hybrid system. First, six control combinations are systematically investigated by pairing the two GFM-RES modes, normal control (NC) and current saturation (CS), with the three GFL-RES modes: normal control, low voltage ride-through (LVRT), and high voltage ride-through (HVRT). Based on switching system theory, the coupled power flow and dynamic motion models are developed considering multi-mode switching characteristics. It is revealed that the hybrid system exhibits two distinct instability modes when the GFM-RES and GFL-RES exceed their P-f and V-f desynchronization boundaries, respectively. The two-dimensional spatiotemporal damping characteristics of GFL-RES induced by GFM-RES are also uncovered for the first time. A novel criterion is proposed to quantify the impact of GFM-RES on GFL-RES dynamics, capturing both its stabilizing and destabilizing effects under different control combinations. High-fidelity electromagnetic transient simulations validate the correctness of the analysis framework.
The recent progress in low-loss hollow-core fibers allows one to speculate on the possibility of building a transatlantic submarine cable that achieves the goal of 1 Pb/s per direction, leveraging bidirectional transmission, while at the same time drastically increasing span length, theoretically to 200 km. In this version, we add an analysis of the impact of Rayleigh backscattering.
Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the success of these models is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained models do not exist for 3D, significantly limiting progress. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language remains unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression, and enhanced medical training by visualizing hypothetical conditions in realistic detail. Our work takes a step toward this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging, where faithful three-dimensional modeling is essential. On two neurological MRI datasets, our framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity. Our results lay the groundwork for prompt-driven disease progression analysis in 3D medical imaging. Project link - this https URL.
We propose DeepASA, a multi-purpose model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA is designed for complex auditory scenes where multiple, often similar, sound sources overlap in time and move dynamically in space. To achieve robust and consistent inference across tasks, we introduce an object-oriented processing (OOP) strategy. This approach encapsulates diverse auditory features into object-centric representations and refines them through a chain-of-inference (CoI) mechanism. The pipeline comprises a dynamic temporal kernel-based feature extractor, a transformer-based aggregator, and an object separator that yields per-object features. These features feed into multiple task-specific decoders. Our object-centric representations naturally resolve the parameter association ambiguity inherent in traditional track-wise processing. However, early-stage object separation can lead to failure in downstream ASA tasks. To address this, we implement temporal coherence matching (TCM) within the chain-of-inference, enabling multi-task fusion and iterative refinement of object features using estimated auditory parameters. We evaluate DeepASA on representative spatial audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental results show that our model achieves state-of-the-art performance across all evaluated tasks, demonstrating its effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.
Multi-agent reinforcement learning (MARL) optimizes strategic interactions in non-cooperative dynamic games, where agents have misaligned objectives. However, data-driven methods such as multi-agent policy gradients (MA-PG) often suffer from instability and limit-cycle behaviors. Prior stabilization techniques typically rely on entropy-based exploration, which slows learning and increases variance. We propose a model-based approach that incorporates approximate priors into the reward function as regularization. In linear quadratic (LQ) games, we prove that such priors stabilize policy gradients and guarantee local exponential convergence to an approximate Nash equilibrium. We then extend this idea to infinite-horizon nonlinear games by introducing Multi-agent Guided Policy Search (MA-GPS), which constructs short-horizon local LQ approximations from trajectories of current policies to guide training. Experiments on nonlinear vehicle platooning and a six-player strategic basketball formation show that MA-GPS achieves faster convergence and more stable learning than existing MARL methods.
This paper presents a controller design framework aiming to balance control performance and actuation rate. Control performance is evaluated by an infinite-horizon average cost, and the number of control actions is penalized via sparsity-promoting regularization. Since the formulated optimal control problem has a combinatorial nature, we employ a rollout algorithm to obtain a tractable suboptimal solution. In the proposed scheme, actuation timings are determined through a multistage minimization procedure based on a receding-horizon approach, and the corresponding control inputs are computed online. We establish theoretical performance guarantees with respect to periodic control and prove the stability of the closed-loop system. The effectiveness of the proposed method is demonstrated through a numerical example.
We study the identification of continuous-time vector fields from irregularly sampled trajectories. We introduce Spectral Flow Learning (SFL), which learns in a windowed flow space using a lag-linear label operator that aggregates lagged Koopman actions. We provide finite-sample high-probability (FS-HP) guarantees for the class of variable-step linear multistep methods (vLMM). The FS-HP rates are constructed using spectral regularization with qualification-controlled filters for flow predictors under standard source and filter assumptions. A multistep observability inequality links flow error to vector-field error and yields two-term bounds that combine a statistical rate with an explicit discretization bias from vLMM theory. This preliminary preprint states the results and sketches proofs, with full proofs and extensions deferred to a journal version.
Non-fixed flexible antenna architectures, such as fluid antenna system (FAS), movable antenna (MA), and pinching antenna, have garnered significant interest in recent years. Among them, rotatable antenna (RA) technology has recently drawn significant attention in wireless systems owing to its unique ability to exploit additional spatial degrees-of-freedom (DoFs) by dynamically adjusting the three-dimensional (3D) boresight direction of each antenna. In this letter, we propose a new RA-assisted cognitive radio (CR) system designed to achieve efficient spectrum sharing while mitigating interference between primary and secondary communication links. Specifically, we formulate an optimization problem for the joint design of the transmit beamforming and the boresight directions of RAs at the secondary transmitter (ST), aimed at maximizing the received signal-to-interference-plus-noise ratio (SINR) at the secondary receiver (SR), while satisfying both interference constraint at the primary receiver (PR) and the maximum transmit power constraint at the ST. Although the formulated problem is challenging to solve due to its non-convexity and coupled variables, we develop an efficient algorithm by leveraging alternating optimization (AO) and successive convex approximation (SCA) techniques to acquire high-quality solutions. Numerical results demonstrate that the proposed RA-assisted system substantially outperforms conventional benchmark schemes in spectrum-sharing CR systems, validating RA's capability to simultaneously enhance the communication quality at the SR and mitigate interference at the PR.
In this paper we revisit a fundamental technical issue within the theory of stochastic approximation (SA) in a Markovian framework, first proposed in the book by Djereveckii and Fradkov (1981), and further developed in much detail in the book by Benveniste, Métivier, and Priouret (1990). This theory is instrumental in many application areas such as the statistical analysis of Hidden Markov Models arising in telecommunication, quantized linear stochastic systems, and more recently in active learning and reinforcement learning. The problem at hand is the verification of the existence, uniqueness, and Lipschitz-continuity of the solution of a parameter-dependent Poisson equation, in an appropriate weighted sup-norm, associated with a collection of Markov chains on general state spaces. Verification of the above facts is vital in the analysis of SA processes presented in (Benveniste et al., 1990) via the ODE (ordinary differential equations) method, requiring substantial technical effort. The motivation and focus of the paper is to address this technical issue by presenting a simple set of conditions under which the above properties of the Poisson equation at hand can be conveniently established. The starting point of our work is an intricate result of Hairer and Mattingly (2011) proving that, by tilting standard conditions of mainstream stability theory for Markov chains, the transition kernels prove to be contractions in the space of differences of probability measures in a suitable metric. To demonstrate the applicability of our results, the proposed conditions are verified for a class of queueing systems with open-loop control.
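For reference, the parameter-dependent Poisson equation in question takes the standard form (in notation assumed here): with $P_\theta$ the transition kernel of the chain, $\pi_\theta$ its unique invariant probability measure, and $f_\theta$ the function of interest, one seeks $V_\theta$ such that $V_\theta(x) - (P_\theta V_\theta)(x) = f_\theta(x) - \pi_\theta(f_\theta)$ for all states $x$; the properties to be verified are the existence and uniqueness of $V_\theta$ in a weighted sup-norm and its Lipschitz continuity in $\theta$.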
We consider the problem of finite-time identification of linear dynamical systems from $T$ samples of a single trajectory. Recent results have predominantly focused on the setup where either no structural assumption is made on the system matrix $A^* \in \mathbb{R}^{n \times n}$, or specific structural assumptions (e.g. sparsity) are made on $A^*$. We assume prior structural information on $A^*$ is available, which can be captured in the form of a convex set $\mathcal{K}$ containing $A^*$. For the solution of the ensuing constrained least squares estimator, we derive non-asymptotic error bounds in the Frobenius norm that depend on the local size of $\mathcal{K}$ at $A^*$. To illustrate the usefulness of these results, we instantiate them for four examples, namely when (i) $A^*$ is sparse and $\mathcal{K}$ is a suitably scaled $\ell_1$ ball; (ii) $\mathcal{K}$ is a subspace; (iii) $\mathcal{K}$ consists of matrices each of which is formed by sampling a bivariate convex function on a uniform $n \times n$ grid (convex regression); (iv) $\mathcal{K}$ consists of matrices each row of which is formed by uniform sampling (with step size $1/T$) of a univariate Lipschitz function. In all these situations, we show that $A^*$ can be reliably estimated for values of $T$ much smaller than what is needed for the unconstrained setting.
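Concretely, with $x_1, \dots, x_T$ denoting the observed trajectory, the constrained least squares estimator referred to here can be written as $\hat{A} = \arg\min_{A \in \mathcal{K}} \sum_{t=1}^{T-1} \|x_{t+1} - A x_t\|_2^2$ (notation assumed here for illustration), and the non-asymptotic bounds control $\|\hat{A} - A^*\|_F$ in terms of the local size of $\mathcal{K}$ at $A^*$.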
Parameter estimation, which represents a classical inverse problem, is often ill-posed, as different parameter combinations can yield identical outputs. This non-uniqueness poses a critical barrier to accurate and unique identification. This work introduces a novel parameter estimation framework to address such limits: the Joint Conditional Diffusion Model-based Inverse Problem Solver (JCDI). By leveraging the stochasticity of diffusion models, JCDI produces possible solutions revealing the underlying distributions. Joint conditioning on multiple observations further narrows the posterior distributions of non-identifiable parameters. For a challenging task in dynamic power systems, composite load model parameterization, JCDI achieves a 58.6% reduction in parameter estimation error compared to the single-condition model. It also accurately replicates the system's dynamic responses under various electrical faults, with root mean square errors below $4\times 10^{-3}$, outperforming existing deep-reinforcement-learning and supervised learning approaches. Given its data-driven nature, JCDI provides a universal framework for parameter estimation while effectively mitigating the non-uniqueness challenge across scientific domains.
Deploying artificial intelligence (AI) models on edge devices involves a delicate balance between meeting stringent complexity constraints, such as limited memory and energy resources, and ensuring reliable performance in sensitive decision-making tasks. One way to enhance reliability is through uncertainty quantification via Bayesian inference. This approach, however, typically necessitates maintaining and running multiple models in an ensemble, which may exceed the computational limits of edge devices. This paper introduces a low-complexity methodology to address this challenge by distilling calibration information from a more complex model. In an offline phase, predictive probabilities generated by a high-complexity cloud-based model are leveraged to determine a threshold based on the typical divergence between the cloud and edge models. At run time, this threshold is used to construct credal sets -- ranges of predictive probabilities that are guaranteed, with a user-selected confidence level, to include the predictions of the cloud model. The credal sets are obtained through thresholding of a divergence measure in the simplex of predictive probabilities. Experiments on visual and language tasks demonstrate that the proposed approach, termed Conformalized Distillation for Credal Inference (CD-CI), significantly improves calibration performance compared to low-complexity Bayesian methods, such as Laplace approximation, making it a practical and efficient solution for edge AI deployments.
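A minimal sketch of the offline thresholding step and the run-time membership test, assuming KL divergence as the divergence measure and a standard split-conformal quantile; the names and the specific divergence are illustrative assumptions, not necessarily those used in CD-CI.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical predictive distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def calibrate_threshold(cloud_probs, edge_probs, alpha=0.1):
    """Offline phase: split-conformal quantile of cloud-vs-edge divergences, so a
    fresh cloud prediction falls inside the credal set with probability >= 1-alpha."""
    scores = np.array([kl(c, e) for c, e in zip(cloud_probs, edge_probs)])
    k = int(np.ceil((len(scores) + 1) * (1.0 - alpha)))
    return np.sort(scores)[min(k, len(scores)) - 1]

def in_credal_set(candidate, edge_prob, threshold):
    """Run time: the credal set contains every distribution within the calibrated
    divergence of the edge model's prediction."""
    return kl(candidate, edge_prob) <= threshold

# Toy usage over 3 classes with synthetic calibration predictions.
rng = np.random.default_rng(0)
cloud = rng.dirichlet(np.ones(3), size=200)
edge = 0.8 * cloud + 0.2 * rng.dirichlet(np.ones(3), size=200)
tau = calibrate_threshold(cloud, edge, alpha=0.1)
print(tau, in_credal_set(cloud[0], edge[0], tau))
```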
The R2D2 Deep Neural Network (DNN) series was recently introduced for image formation in radio interferometry. It can be understood as a learned version of CLEAN, whose minor cycles are substituted with DNNs. We revisit R2D2 on the grounds of series convergence, training methodology, and DNN architecture, improving its robustness in terms of generalizability beyond training conditions, capability to deliver high data fidelity, and epistemic uncertainty. First, while still focusing on telescope-specific training, we enhance the learning process by randomizing Fourier sampling integration times, incorporating multiscan multinoise configurations, and varying imaging settings, including pixel resolution and visibility-weighting scheme. Second, we introduce a convergence criterion whereby the reconstruction process stops when the data residual is compatible with noise, rather than simply using all available DNNs. This not only increases the reconstruction efficiency by reducing its computational cost, but also refines training by pruning out the data/image pairs for which optimal data fidelity is reached before training the next DNN. Third, we substitute R2D2's early U-Net DNN with a novel architecture (U-WDSR) combining U-Net and WDSR, which leverages wide activation, dense skip connections, weight normalization, and low-rank convolution to improve feature reuse and reconstruction precision. As before, R2D2 is trained for monochromatic intensity imaging with the Very Large Array at a fixed $512 \times 512$ image size. Simulations on a wide range of inverse problems and a case study on real data reveal that the new R2D2 model consistently outperforms its earlier version in image reconstruction quality, data fidelity, and epistemic uncertainty.
The autoencoder (AE) is key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of the AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of the plain autoencoder and the image-conditioned I2V VAE, achieving multifunctionality in a single VAE network while even enhancing quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. The latent consistency loss outperforms prior auxiliary losses, including LPIPS, GAN, and DWT losses, in terms of both quality improvements and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile devices, and outperforms prior art in reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: this https URL, code is available at: this https URL
An approach is proposed to identify optimal asset protection strategies based on vulnerability assessment outcomes. Traditional bilevel attacker-defender models emphasize worst-case scenarios but offer limited defensive guidance. In contrast, trilevel models introduce high computational complexity and rely on fixed network configurations. The proposed critical-components method leverages vulnerability assessment results to determine protection strategies, effectively outsourcing the upper-level defense decision. This enables adaptability to diverse network topologies, assessment techniques, and cyber-physical energy systems without the overhead of multi-level optimization. Case studies demonstrate the potential for improved system resilience across varying operational conditions.
Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from the scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires time-consuming case-specific optimization and fails under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Building on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, it achieves high-quality results even with a single transmitter, a setting where previous methods consistently fail. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.
Real-time planning among many uncertain, dynamic obstacles is challenging because predicting every agent with high fidelity is both unnecessary and computationally expensive. We present Heterogeneous Predictor-based Risk-Aware Planning (H-PRAP), a framework that allocates prediction effort to where it matters. H-PRAP introduces the Probability-based Collision Risk Index (P-CRI), a closed-form, horizon-level collision index obtained by calibrating a Gaussian surrogate with conformal prediction. P-CRI drives a router that assigns high-risk obstacles to accurate but expensive predictors and low-risk obstacles to lightweight predictors, while preserving distribution-free coverage across heterogeneous predictors through conformal prediction. The selected predictions and their conformal radii are embedded in a chance-constrained model predictive control (MPC) problem, yielding receding-horizon policies with explicit safety margins. We analyze the safety-efficiency trade-off under a prediction compute budget: a larger share of low-fidelity predictions reduces the residual risk from dropped obstacles, but at the same time induces larger conformal radii, which degrades trajectory efficiency and shrinks MPC feasibility. Extensive numerical simulations in dense, uncertain environments validate that H-PRAP attains the best balance between trajectory success rate (i.e., no collisions) and the time to reach the goal (i.e., trajectory efficiency) compared to single-predictor architectures.
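A minimal sketch of the routing idea, assuming the risk index has already been computed for each obstacle and the compute budget caps how many obstacles may use the expensive predictor; the names are illustrative, not H-PRAP's actual router.

```python
def route_predictors(risk_index, budget):
    """Assign the accurate-but-expensive predictor to the highest-risk obstacles
    (up to the compute budget) and the lightweight predictor to the rest.
    risk_index: dict mapping obstacle id -> risk score (e.g., P-CRI)."""
    ranked = sorted(risk_index, key=risk_index.get, reverse=True)
    high_risk = set(ranked[:budget])
    return {oid: ("high_fidelity" if oid in high_risk else "lightweight")
            for oid in risk_index}

print(route_predictors({"car_1": 0.92, "ped_3": 0.15, "bike_2": 0.67}, budget=1))
```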
Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Bandwidth Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. These are complemented by a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests with five human judges, NDSI-BWE establishes a new SoTA in BWE.
This paper introduces a methodology for identifying and simulating financial and economic systems using stochastically structured reservoir computers (SSRCs). The framework combines structure-preserving embeddings with graph-informed coupling matrices to model inter-agent dynamics while enhancing interpretability. A constrained optimization scheme guarantees compliance with both stochastic and structural constraints. Two empirical case studies, a nonlinear stochastic dynamic model and regional inflation network dynamics, demonstrate the effectiveness of the approach in capturing complex nonlinear patterns and enabling interpretable predictive analysis under uncertainty.
Across the globe there are growing calls to streamline and improve ever more complex income tax codes. Executing reform has proven difficult. Even when the desired outcomes are clear, the tools to design fitting reforms are lacking. To remedy this, we developed \texttt{TaxSolver}: a methodology to help policymakers realize optimal tax reform. \texttt{TaxSolver} allows policymakers to focus solely on what they aim to achieve with a reform -- like redistributing wealth, incentivizing labor market participation or reducing complexity -- and the guarantees within which reform is acceptable -- like limited fluctuations in taxpayer incomes or shocks to overall tax revenue. Given these goals and fiscal guarantees, \texttt{TaxSolver} finds the optimal set of tax rules that satisfies all the criteria or shows that the set of demands are not mathematically feasible. We illustrate \texttt{TaxSolver} by reforming various simulated examples of tax codes, including some that reflect the complexity and size of a real-world tax system.
Scene recognition of audiologically relevant environments is important for hearing aids; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying these models on resource-constrained edge devices presents another challenge. Our solution is two-fold: we leverage several open source datasets to create AHEAD-DS, a dataset designed for scene recognition of audiologically relevant environments, and introduce YAMNet+, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. YAMNet+ is designed for deployment on edge devices like smartphones connected to hearing devices, such as hearing aids and wireless earphones with hearing aid functionality, serving as a baseline model for sound-based scene recognition. YAMNet+ achieved a mean average precision of 0.83 and an accuracy of 0.93 on the testing set of AHEAD-DS across fourteen categories of audiologically relevant environments. We found that applying transfer learning from the pretrained YAMNet model was essential. We demonstrated real-time sound-based scene recognition capabilities on edge devices by deploying YAMNet+ to an Android smartphone. Even with a Google Pixel 3 (a phone with modest specifications, released in 2018), loading the model takes approximately 50 ms, and processing time grows approximately linearly at about 30 ms per second of audio. Our website and code are available at this https URL.
Artificial intelligence (AI) systems can detect disease-related acoustic patterns in cough sounds, offering a scalable and cost-effective approach to tuberculosis (TB) screening in high-burden, resource-limited settings. Previous studies have been limited by small datasets, under-representation of symptomatic non-TB patients, and recordings collected in controlled environments. In this study, we enrolled 512 participants at two hospitals in Zambia, categorised into three groups: bacteriologically confirmed TB (TB+), symptomatic patients with other respiratory diseases (OR), and healthy controls (HC). Usable cough recordings with demographic and clinical data were obtained from 500 participants. Deep learning classifiers based on pre-trained speech foundation models were fine-tuned on cough recordings to predict diagnostic categories. The best-performing model, trained on 3-second audio clips, achieved an AUROC of 85.2% for distinguishing TB coughs from all other participants (TB+/Rest) and 80.1% for TB+ versus symptomatic OR participants (TB+/OR). Incorporating demographic and clinical features improved performance to 92.1% for TB+/Rest and 84.2% for TB+/OR. At a probability threshold of 0.38, the multimodal model reached 90.3% sensitivity and 73.1% specificity for TB+/Rest, meeting WHO target product profile benchmarks for TB screening. Adversarial testing and stratified analyses show that the model was robust to confounding factors including background noise, recording time, and device variability. These results demonstrate the feasibility of cough-based AI for TB screening in real-world, low-resource settings.
This paper addresses the problem of protecting network information from privacy-compromising system identification (SI) attacks when sharing cyber-physical system simulations. We model analyst observations of networked states as time-series outputs of a graph filter driven by differentially private (DP) nodal excitations, with the analyst aiming to infer the underlying graph shift operator (GSO). Unlike traditional SI, which estimates system parameters, we study the inverse problem: what assumptions prevent adversaries from identifying the GSO while preserving utility for legitimate analysis. We show that applying DP mechanisms to inputs provides formal privacy guarantees for the GSO, linking the $(\epsilon,\delta)$-DP bound to the spectral properties of the graph filter and noise covariance. More precisely, for DP Gaussian signals, the spectral characteristics of both the filter and the noise covariance determine the privacy bound, with smooth filters and low-condition-number covariances yielding greater privacy.
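For context, the classical Gaussian mechanism is one standard way to produce such DP nodal excitations: with $\ell_2$-sensitivity $\Delta_2$ and $\epsilon < 1$, adding zero-mean Gaussian noise with standard deviation $\sigma \ge \Delta_2 \sqrt{2\ln(1.25/\delta)}/\epsilon$ yields $(\epsilon,\delta)$-DP. A small sketch, with a placeholder sensitivity value:

```python
import numpy as np

def gaussian_mechanism_sigma(l2_sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism for (epsilon, delta)-DP
    (valid for epsilon < 1)."""
    return l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

sigma = gaussian_mechanism_sigma(l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)
excitation = np.random.normal(0.0, sigma, size=16)  # DP nodal excitation (illustrative)
print(sigma)
```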
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs with privately held ground-truth responses. Unlike prior VQA datasets, which typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs and that encoding and reasoning over details is still a challenging task for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy on all questions overall. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: this http URL