Polycystic Ovary Syndrome (PCOS) is a widespread disorder in women of reproductive age, characterized by hormonal imbalance, irregular periods, and multiple ovarian cysts. Infertility, metabolic syndrome, and cardiovascular risks are long-term complications that make early detection essential. In this paper, we design a transfer learning framework utilizing DenseNet201 and ResNet50 for classifying ovarian ultrasound images. The model was trained on an online dataset containing 3856 ultrasound images of cyst-affected and unaffected patients. Each ultrasound frame was resized to 224x224 pixels and labeled with precise pathological indicators. The MixUp and CutMix augmentation strategies, with alpha values of 0.25 and 0.4, respectively, were used to improve generalization, yielding a peak validation accuracy of 99.80% and a validation loss of 0.617 with DenseNet201. We evaluated the model's interpretability using leading Explainable AI (XAI) approaches such as SHAP, Grad-CAM, and LIME, which provide explicit visual explanations for the model's behavior and thereby increase its transparency. This study proposes an automated system for medical image diagnosis that may be used effectively and confidently in clinical practice.
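Only the augmentation recipe is spelled out in the abstract, so the following is a minimal, framework-agnostic NumPy sketch of MixUp with the reported alpha of 0.25; the batch shapes, class count, and use of a single mixing coefficient per batch are illustrative assumptions, not the authors' pipeline (CutMix and the DenseNet201/ResNet50 training loop are omitted).

```python
import numpy as np

def mixup_batch(images, labels, alpha=0.25, rng=np.random.default_rng(0)):
    """Minimal MixUp: blend each image/label pair with a shuffled partner.

    images: (N, H, W, C) float array, labels: (N, num_classes) one-hot array.
    """
    lam = rng.beta(alpha, alpha)              # mixing coefficient ~ Beta(alpha, alpha)
    perm = rng.permutation(len(images))       # random partner for each sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y

# Toy usage: 8 fake ultrasound frames resized to 224x224, two classes.
x = np.random.rand(8, 224, 224, 1).astype(np.float32)
y = np.eye(2)[np.random.randint(0, 2, size=8)]
mx, my = mixup_batch(x, y, alpha=0.25)
print(mx.shape, my.shape)
```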
In this work, we study multi-sensor scheduling for remote state estimation over wireless multiple-input multiple-output (MIMO) fading channels using a novel semantic over-the-air (SemOTA) aggregation approach. We first revisit Kalman filtering with conventional over-the-air (OTA) aggregation and highlight its transmit power limitations. To balance power efficiency and estimation performance, we formulate the scheduling task as a finite-horizon dynamic programming (DP) problem. By analyzing the structure of the optimal Q-function, we show that the resulting scheduling policy exhibits a semantic structure that adapts online to the estimation error covariance and channel variations. To obtain a practical solution, we derive a tractable upper bound on the Q-function via a positive semidefinite (PSD) cone decomposition, which enables an efficient approximate scheduling policy and a low-complexity remote estimation algorithm. Numerical results confirm that the proposed scheme outperforms existing methods in both estimation accuracy and power efficiency.
Purpose: To investigate whether routinely acquired longitudinal MR-Linac images can be leveraged to characterize treatment-induced changes during radiotherapy, particularly subtle inter-fraction changes over short intervals (average of 2 days). Materials and Methods: This retrospective study included a series of 0.35T MR-Linac images from 761 patients. An artificial intelligence (deep learning) model was used to characterize treatment-induced changes by predicting the temporal order of paired images. The model was first trained with the images from the first and the last fractions (F1-FL), then with all pairs (All-pairs). Model performance was assessed using quantitative metrics (accuracy and AUC), compared to a radiologist's performance, and qualitative analyses, namely saliency map evaluation to investigate affected anatomical regions. Input ablation experiments were performed to identify the anatomical regions altered by radiotherapy. The radiologist also performed the task on partial images reconstructed from saliency-map regions and reported their observations. Quantitative image analysis was conducted to investigate the results from the model and the radiologist. Results: The F1-FL model yielded near-perfect performance (AUC of 0.99), significantly outperforming the radiologist. The All-pairs model yielded an AUC of 0.97. This performance reflects therapy-induced changes, supported by the correlation of performance with fraction intervals, the ablation tests, and the expert's interpretation. Primary regions driving the predictions were the prostate, bladder, and pubic symphysis. Conclusion: The model accurately predicts the temporal order of MR-Linac fractions and detects radiation-induced changes over one or a few days, including prostate and adjacent organ alterations confirmed by experts. This underscores MR-Linac's potential for advanced image analysis beyond image guidance.
Carrier phase positioning (CPP) can enable cm-level accuracy in next-generation wireless systems, while recent literature shows that accuracy remains high using phase-only measurements in distributed MIMO (D-MIMO). However, the impact of phase synchronization errors on such systems remains insufficiently explored. To address this gap, we first show that the proposed hyperbola intersection method achieves highly accurate positioning even in the presence of phase synchronization errors, when trained on appropriate data reflecting such impairments. We then introduce a deep learning (DL)-based D-MIMO antenna point (AP) selection framework that ensures high-precision localization under phase synchronization errors. Simulation results show that the proposed framework improves positioning accuracy compared to prior-art methods, while reducing inference complexity by approximately 19.7%.
Nonlinear ordinary differential equations (ODEs) are powerful tools for modeling real-world dynamical systems. However, propagating initial state uncertainty through nonlinear dynamics, especially when the ODE is unknown and learned from data, remains a major challenge. This paper introduces a novel continuum dynamics perspective for model learning that enables formal uncertainty propagation by constructing Taylor series approximations of probabilistic events. We establish sufficient conditions for the soundness of the approach and prove its asymptotic convergence. Empirical results demonstrate the framework's effectiveness, particularly when predicting rare events.
White matter segmentation methods for diffusion magnetic resonance imaging range from streamline clustering-based approaches to bundle mask delineation, but none have proposed a pediatric-specific approach. We hypothesize that a deep learning model with an approach similar to TractSeg will improve similarity between an algorithm-generated mask and an expert-labeled ground truth. Given a cohort of 56 manually labelled white matter bundles, we take inspiration from TractSeg's 2D UNet architecture, and we modify the inputs to match bundle definitions as determined by pediatric experts, the evaluation to use k-fold cross-validation, and the loss function to use masked Dice loss. We evaluate Dice score, volume overlap, and volume overreach of 16 major regions of interest compared to the expert-labeled dataset. To test whether our approach offers statistically significant improvements over TractSeg, we compare Dice voxels, volume overlap, and adjacency voxels with a Wilcoxon signed-rank test followed by false discovery rate correction. We find statistical significance across all bundles for all metrics, with one exception in volume overlap. After we run TractSeg and our model, we combine their output masks into a 60-label atlas to evaluate whether TractSeg and our model combined can generate a robust, individualized atlas, and we observe smoothed, continuous masks in cases where TractSeg did not produce an anatomically plausible output. With improved white matter pathway segmentation masks, we can further understand neurodevelopment at a population scale, and we can produce reliable estimates of individualized anatomy in pediatric white matter diseases and disorders.
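The abstract names a masked Dice loss as the modified objective; below is a minimal PyTorch sketch of one common masked Dice formulation (binary, per-bundle, with the mask restricting which voxels contribute), offered under the assumption that the authors' exact masking and reduction details may differ.

```python
import torch

def masked_dice_loss(logits, target, mask, eps=1e-6):
    """Dice loss computed only over voxels where mask == 1.

    logits, target, mask: tensors of shape (B, 1, H, W); target/mask are {0, 1}.
    """
    prob = torch.sigmoid(logits) * mask
    tgt = target * mask
    inter = (prob * tgt).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + tgt.sum(dim=(1, 2, 3))
    dice = (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Toy check: a prediction that strongly matches the target inside the mask
# gives a loss near zero.
t = torch.zeros(1, 1, 8, 8); t[..., 2:6, 2:6] = 1.0
m = torch.ones_like(t)
print(float(masked_dice_loss(t * 20 - 10, t, m)))   # logits matching the target
```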
While the rapid expansion of data centers poses challenges for power grids, it also offers new opportunities as potentially flexible loads. Existing power system research often abstracts data centers as aggregate resources, while computer system research primarily focuses on optimizing GPU energy efficiency and largely ignores the grid impacts of optimized GPU power consumption. To bridge this gap, we develop a GPU-to-Grid framework that couples device-level GPU control with power system objectives. We study distribution-level voltage regulation enabled by flexibility in LLM inference, using batch size as a control knob that trades off the voltage impacts of GPU power consumption against inference latency and token throughput. We first formulate this problem as an optimization problem and then realize it as an online feedback optimization controller that leverages measurements from both the power grid and GPU systems. Our key insight is that reducing GPU power consumption alleviates violations of lower voltage limits, while increasing GPU power mitigates violations near upper voltage limits in distribution systems; this runs counter to the common belief that minimizing GPU power consumption is always beneficial to power grids.
Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.
Modern video codecs and learning-based approaches struggle to achieve semantic reconstruction at extremely low bit-rates due to their reliance on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video and then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are captured by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS that features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances the semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with a high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Secondly, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features `Glass Box' transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model's visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.
Unmanned aerial vehicles (UAVs) are increasingly deployed in mission-critical applications such as target tracking, where they must simultaneously sense dynamic environments, ensure reliable communication, and achieve precise control. A key challenge here is to jointly guarantee tracking accuracy, communication reliability, and control stability within a unified framework. To address this issue, we propose an integrated sensing, communication, and control (ISCC) framework for UAV-assisted target tracking, where the considered tracking system is modeled as a discrete-time linear control process, with the objective of driving the deviation between the UAV and target states toward zero. We formulate a stochastic model predictive control (MPC) optimization problem for joint control and beamforming design, which is highly non-convex and intractable in its original form. To overcome this difficulty, the target state is first estimated using an extended Kalman filter (EKF). Then, by deriving the closed-form optimal beamforming solution under a given control input, the original problem is equivalently reformulated into a tractable control-oriented form. Finally, we convexify the remaining non-convex constraints via a relaxation-based convex approximation, yielding a computationally tractable convex optimization problem that admits efficient global solution. Numerical results show that the proposed ISCC framework achieves tracking accuracy comparable to a non-causal benchmark while maintaining stable communication, and it significantly outperforms the conventional control and tracking method.
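The pipeline above first estimates the target state with an extended Kalman filter before reformulating the MPC problem. As a reference point, here is a generic EKF predict/update step in NumPy; the function signatures, the user-supplied dynamics, measurement functions, and Jacobians, and the toy constant-velocity example are illustrative assumptions rather than the paper's specific target-tracking model.

```python
import numpy as np

def ekf_step(x, P, u, z, f, F_jac, h, H_jac, Q, R):
    """One predict/update cycle of an extended Kalman filter."""
    # Predict
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update
    H = H_jac(x_pred)
    y = z - h(x_pred)                      # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy usage: near-constant-velocity target, position-only measurements.
dt, Q, R = 0.1, 0.01 * np.eye(2), np.array([[0.04]])
f = lambda x, u: np.array([x[0] + dt * x[1], x[1]])
F_jac = lambda x, u: np.array([[1.0, dt], [0.0, 1.0]])
h = lambda x: np.array([x[0]])
H_jac = lambda x: np.array([[1.0, 0.0]])
x, P = np.zeros(2), np.eye(2)
x, P = ekf_step(x, P, None, np.array([0.5]), f, F_jac, h, H_jac, Q, R)
print(x)
```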
Exterior sound field interpolation is a challenging problem that often requires specific array configurations and prior knowledge of the source conditions. We propose an interpolation method based on Gaussian processes using a point-source reproducing kernel with a trainable inner-product formulation tailored to exterior sound fields. While this estimation does not have a closed-form expression, it defines a flexible estimator that is not restricted to a particular microphone distribution and automatically attenuates higher harmonic orders, with parameters optimized directly from the recordings. The proposed kernel estimator is compared in simulated experiments to the conventional method using spherical wave functions and an established physics-informed machine learning model, achieving lower interpolation error by approximately 2 dB on average over the analyzed frequency range of 100 Hz to 2.5 kHz and reconstructing the ground-truth sound field more consistently within the target region.
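The paper's trainable point-source kernel does not admit a closed form, so for orientation only, the sketch below shows the general GP/kernel-ridge interpolation scaffolding with the classical closed-form diffuse-field (sinc) kernel in its place; the microphone layout, regularization weight, and placeholder recordings are assumptions, not the proposed estimator.

```python
import numpy as np

def sinc_kernel(r1, r2, k):
    """Diffuse-field kernel j0(k * ||r1 - r2||) between two sets of 3D points."""
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)
    return np.sinc(k * d / np.pi)          # np.sinc(x) = sin(pi x)/(pi x)

def kernel_interpolate(mic_pos, pressures, eval_pos, k, reg=1e-3):
    """Kernel-ridge/GP estimate of the sound field at eval_pos from recordings."""
    K = sinc_kernel(mic_pos, mic_pos, k)
    weights = np.linalg.solve(K + reg * np.eye(len(mic_pos)), pressures)
    return sinc_kernel(eval_pos, mic_pos, k) @ weights

# Toy usage at 1 kHz (wavenumber k = 2*pi*f/c), arbitrary microphone layout.
c, f = 343.0, 1000.0
k = 2 * np.pi * f / c
mics = np.random.uniform(-0.5, 0.5, size=(32, 3))
p = np.random.randn(32) + 1j * np.random.randn(32)   # placeholder recordings
grid = np.random.uniform(-0.2, 0.2, size=(10, 3))
print(kernel_interpolate(mics, p, grid, k).shape)
```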
Control of nonlinear systems with high levels of uncertainty is practically relevant and theoretically challenging. This paper presents a numerical investigation of an adaptive nonlinear model predictive control (MPC) technique that relies entirely on online system identification without prior modeling, training, or data collection. In particular, the paper considers predictive cost adaptive control (PCAC), which is an extension of generalized predictive control. Nonlinear PCAC (NPCAC) uses recursive least squares (RLS) with subspace of information forgetting (SIFt) to identify a discrete-time, pseudo-linear, input-output model, which is used with iterative MPC for nonlinear receding-horizon optimization. The performance of NPCAC is illustrated using polynomial, Fourier, and cubic-spline basis functions.
Ground-penetrating radar (GPR) has emerged as a prominent tool for imaging internal defects in cylindrical structures, such as columns, utility poles, and tree trunks. However, accurately reconstructing both the shape and permittivity of the defects inside cylindrical structures remains challenging due to complex wave scattering phenomena and the limited accuracy of the existing signal processing and deep learning techniques. To address these issues, this study proposes a migration-assisted deep learning scheme for reconstructing the shape and permittivity of defects within cylindrical structures. The proposed scheme involves three stages of GPR data processing. First, a dual-permittivity estimation network extracts the permittivity values of the defect and the cylindrical structure, the latter of which is estimated with the help of a novel structural similarity index measure-based autofocusing technique. Second, a modified Kirchhoff migration incorporating the extracted permittivity of the cylindrical structure maps the signals reflected from the defect to the imaging domain. Third, a shape reconstruction network processes the migrated image to recover the precise shape of the defect. The image of the interior defect is finally obtained by combining the reconstructed shape and extracted permittivity of the defect. The proposed scheme is validated using both synthetic and experimental data from a laboratory trunk model and real tree trunk samples. Comparative results show superior performance over existing deep learning methods, while generalization tests on live trees confirm its feasibility for in-field deployment. The underlying principle can further be applied to other circumferential GPR imaging scenarios. The code and database are available at: this https URL.
This paper proposes a user-centric split federated learning (UCSFL) framework for user-centric cell-free multiple-input multiple-output (CF-MIMO) networks to support split federated learning (SFL). In the proposed UCSFL framework, users deploy split sub-models locally, while complete models are maintained and updated at access point (AP)-side distributed processing units (DPUs), followed by a two-level aggregation procedure across DPUs and the central processing unit (CPU). Under standard machine learning (ML) assumptions, we provide a theoretical convergence analysis for UCSFL, which reveals that the AP-cluster size is a key factor influencing model training accuracy. Motivated by this result, we introduce a new performance metric, termed the latency-to-accuracy ratio, defined as the ratio of a user's per-iteration training latency to the weighted size of its AP cluster. Based on this metric, we formulate a joint optimization problem to minimize the maximum latency-to-accuracy ratio by jointly optimizing uplink power control, downlink beamforming, model splitting, and AP clustering. The resulting problem is decomposed into two sub-problems operating on different time scales, for which dedicated algorithms are developed to handle the short-term and long-term optimizations, respectively. Simulation results verify the convergence of the proposed algorithms and demonstrate that UCSFL effectively reduces the latency-to-accuracy ratio of the VGG16 model compared with baseline schemes. Moreover, the proposed framework adaptively adjusts splitting and clustering strategies in response to varying communication and computation resources. An MNIST-based handwritten digit classification example further shows that UCSFL significantly accelerates the convergence of the VGG16 model.
Non-terrestrial networks (NTNs) have gained significant attention for their scalability and wide coverage in next-generation communication systems. A large number of NTN nodes, such as satellites, are required to establish a global NTN, but not all operators have the capability to deploy such a system. Therefore, cooperation among multiple operators, facilitated by an orchestrator, enables the construction of virtually large-scale constellations. In this paper, we propose a weak-control-based orchestration framework that coordinates multiple NTN operators while ensuring that operations align with the policies of both the orchestrator and the individual operators. Unlike centralized orchestration frameworks, where the orchestrator determines the entire route from source to destination, the proposed framework allows each operator to select preferred routes from multiple candidates provided by the orchestrator. To evaluate the effectiveness of our proposed framework, we conducted numerical simulations under various scenarios and network configurations including dynamic NTN environments with time-varying topologies, showing that inter-operator cooperation improves the availability of feasible end-to-end routes. Furthermore, we analyzed the iterative negotiation process to address policy conflicts and quantitatively demonstrated the "price of autonomy," where strict individual policies degrade global feasibility and performance. The results also demonstrate that outcomes of the proposed framework depend on the operators' policies and that hop count and latency increase as the number of operators grows. These findings validate the proposed framework's ability to deliver practical benefits of orchestrated multi-operator collaboration in future NTN environments.
This paper introduces a data-based integral sliding mode control scheme for the robustification of model-reference controllers, accommodating generic multivariable linear systems with unknown dynamics affected by matched disturbances. Specifically, an integral sliding mode control (ISMC) law is recast into a data-based framework relying on an integral sliding variable that depends only on the reference model, without the need to model the plant. The main strength of the proposed approach is the enforcement of the desired reference model in closed loop under sliding mode conditions, despite the lack of knowledge of the model dynamics and the presence of matched disturbances. Moreover, the conditions required to guarantee integral sliding mode generation and closed-loop stability are formally analyzed in the paper, highlighting the generality of the proposed data-driven integral sliding mode control (DD-ISMC) with respect to its model-based counterpart. Finally, the main practices for the data-based design of the proposed control scheme are discussed in depth, and the proposed method is tested in simulation on a benchmark example and experimentally on a real laboratory setup. Simulation and experimental evidence fully corroborates the theoretical analysis, motivating further research in this direction.
We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model and a generative adversarial network. The proposed method incorporates the following key improvements: 1. By introducing trainable priors, the inference process starts from noise close to the target speech instead of Gaussian noise. 2. Reference-aware gain adjustment is performed by constraining the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we show that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from data-driven features, while requiring fewer iterations than WaveFit. Moreover, we show that the proposed method works robustly with respect to the depth at which SSL features are extracted. Code and pre-trained models are available from this https URL.
Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the "domain gap" and "feature absence" problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code and weights are available at: this https URL
Monitoring drift into failure is hindered by Euclidean anomaly detection that can conflate safe operational trade-offs with risk accumulation in signals expressed as shares, and by architectural churn that makes fixed schemas (and learned models) stale before rare boundary events occur. Rasmussen's dynamic safety model motivates drift under competing pressures, but operationalizing it for software is difficult because many high-value operational signals (effort, remaining margin, incident impact) are compositional and their parts evolve. We propose a vision for drift observability on the simplex: model drift and boundary proximity in Aitchison geometry to obtain coordinate-invariant direction and distance-to-safety in interpretable balance coordinates. To remain comparable under churn, a monitor would continuously refresh its part inventory and policy-defined boundaries from engineering artifacts and apply lineage-aware aggregation. We outline early-warning diagnostics and falsifiable hypotheses for future evaluation.
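As a concrete anchor for the Aitchison-geometry idea, the sketch below computes the centered log-ratio transform and the resulting coordinate-invariant distance between two compositions; the three-part effort breakdown and its numbers are hypothetical, and the full monitor (balance coordinates, lineage-aware aggregation, policy-defined boundaries) is not shown.

```python
import numpy as np

def clr(x, eps=1e-12):
    """Centered log-ratio transform of a composition (parts summing to 1)."""
    x = np.asarray(x, dtype=float) + eps
    g = np.exp(np.mean(np.log(x)))         # geometric mean of the parts
    return np.log(x / g)

def aitchison_distance(x, y):
    """Coordinate-invariant distance between two compositions on the simplex."""
    return np.linalg.norm(clr(x) - clr(y))

# Hypothetical effort shares: a part collapsing toward zero is far in Aitchison
# geometry even when the Euclidean change in shares looks modest.
baseline = [0.50, 0.30, 0.20]
drifted  = [0.50, 0.45, 0.05]
print(aitchison_distance(baseline, drifted))
```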
Radio transmissions in millimeter wave (mmWave) bands have gained significant interest for applications demanding precise device localization and trajectory estimation. This paper explores novel neural network (NN) architectures suitable for trajectory estimation and path determination in a mmWave multiple-input multiple-output (MIMO) outdoor system based on localization data from beamformed fingerprints (BFF). The NN architecture captures sequences of BFF signals from different users and, through learning mechanisms, subsequently estimates their trajectories. In turn, this information is employed to find the shortest path to the target, thereby enabling more efficient navigation. Specifically, we propose a two-stage procedure for trajectory estimation and optimal path finding. In the first stage, a transformer network (TN) based on attention mechanisms is developed to predict trajectories of wireless devices using BFF sequences captured in a mmWave MIMO outdoor system. In the second stage, a novel algorithm based on Informed Rapidly-exploring Random Trees (iRRT*) is employed to determine the optimal path to target locations using the trajectory estimates derived in the first stage. The effectiveness of the proposed schemes is validated through numerical experiments, using a comprehensive dataset of radio measurements generated by ray tracing simulations modeling outdoor propagation at 28 GHz. We show that our proposed TN-based trajectory estimator outperforms other methods from the recent literature and can successfully generalize to new trajectories outside the training set. Furthermore, our proposed iRRT* algorithm consistently provides the shortest path to the target.
Estimating the depth of a monoharmonic sound source at a fixed range using a vertical linear array (VLA) is challenging in the absence of seabed environmental parameters, and relevant research remains scarce. The orthogonality constrained modal search based depth estimation (OCMS-D) method is proposed in this paper, which enables the estimation of the depth of a monoharmonic source at a fixed range using a VLA under unknown seabed parameters. Using the sparsity of propagating normal modes and the orthogonality of mode depth functions, OCMS-D first estimates the normal mode parameters under a fixed source-array distance. The estimated normal mode parameters are then used to estimate the source depth. To ensure the precision of the source depth estimation, the method utilizes information on both the amplitude distribution and the sign (positive/negative) patterns of the estimated mode depth functions at the inferred source depth. Numerical simulations evaluate the performance of OCMS-D under different conditions. The effectiveness of OCMS-D is also verified by the Yellow Sea experiment and the SWellEx-96 experiment. In the Yellow Sea experiment, the depth estimation absolute errors of OCMS-D with a 4-second time window are less than 2.4 m, while in the SWellEx-96 experiment with a 10-second time window they are less than 5.4 m for the shallow source and less than 10.8 m for the deep source.
The deployment of pixel-based antennas and fluid antenna systems (FAS) is hindered by prohibitive channel state information (CSI) acquisition overhead. While radio maps enable proactive mode selection, reconstructing high-fidelity maps from sparse measurements is challenging. Existing physics-agnostic or data-driven methods often fail to recover fine-grained shadowing details under extreme sparsity. We propose a Physics-Regularized Low-Rank Tensor Completion (PR-LRTC) framework for radio map reconstruction. By modeling the signal field as a three-way tensor, we integrate environmental low-rankness with deterministic antenna physics. Specifically, we leverage Effective Aerial Degrees-of-Freedom (EADoF) theory to derive a differential gain topology map as a physical prior for regularization. The resulting optimization problem is solved via an efficient Alternating Direction Method of Multipliers (ADMM)-based algorithm. Simulations show that PR-LRTC achieves a 4 dB gain over baselines at a 10% sampling ratio. It effectively preserves sharp shadowing edges, providing a robust, physics-compliant solution for low-overhead beam management.
The integration of sensing and communication (ISAC) is an essential function of future wireless systems. Owing to the large available bandwidth, millimeter-wave (mmWave) ISAC systems are able to achieve high sensing accuracy. In this paper, we consider the multiple base-station (BS) collaborative sensing problem in a multi-input multi-output (MIMO) orthogonal frequency division multiplexing (OFDM) mmWave communication system. Our aim is to sense a remote target's shape from the collected signals, which consist of both reflection and scattering components. We first characterize the mmWave scattering and reflection effects based on the Lambertian scattering model. We then apply the periodogram technique to obtain a rough scattering point detection, and further incorporate the subspace method to achieve more precise scattering and reflection point detection. Based on these, a reconstruction algorithm based on the Hough transform and principal component analysis (PCA) is designed for a single convex polygon target scenario. To improve the accuracy and completeness of the reconstruction results, we propose a method to further fuse the scattering and reflection points. Extensive simulation results validate the effectiveness of the proposed algorithms.
We present an injustice-aware innovation-diffusion model extending the Generalized Linear Threshold framework by assigning agents activation thresholds drawn from a Beta distribution to capture the stochastic nature of adoption shaped by inequalities. Because incentive policies themselves can inadvertently amplify these inequalities, building on this model, we design a fair Model Predictive Control (MPC) scheme that incorporates equality and equity objectives for allocating incentives. Simulations using real mobility-habit data show that injustice reduces overall adoption, while equality smooths incentive distribution and equity reduces disparities in the final outcomes. Thus, incorporating fairness ensures effective diffusion without exacerbating existing social inequalities.
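To make the threshold mechanism concrete, here is a minimal NumPy simulation of linear-threshold diffusion with agent thresholds drawn from a Beta distribution; the graph model, Beta(2, 5) parameters, seed set, and adoption rule (fraction of adopted neighbors) are illustrative assumptions, and the MPC incentive-allocation layer is not shown.

```python
import numpy as np

def simulate_threshold_diffusion(adj, thresholds, seeds, steps=20):
    """Linear threshold dynamics: a node adopts once the adopted fraction of
    its neighbors reaches its personal threshold."""
    active = seeds.copy()
    deg = adj.sum(axis=1)
    for _ in range(steps):
        frac = (adj @ active) / np.maximum(deg, 1)
        new_active = active | (frac >= thresholds)
        if np.array_equal(new_active, active):
            break
        active = new_active
    return active

rng = np.random.default_rng(0)
n = 200
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.maximum(adj, adj.T); np.fill_diagonal(adj, 0)
# Heterogeneous (inequality-shaped) thresholds drawn from a Beta distribution.
thresholds = rng.beta(2.0, 5.0, size=n)
seeds = np.zeros(n, dtype=bool); seeds[rng.choice(n, 10, replace=False)] = True
print(simulate_threshold_diffusion(adj, thresholds, seeds).mean())
```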
This paper proposes a decentralized controller for large-scale heterogeneous multi-agent systems subject to bounded external disturbances, where agents must satisfy Signal Temporal Logic (STL) specifications requiring cooperation among non-communicating agents. To address the lack of direct communication, we employ a decentralized k-hop Prescribed Performance State Observer (k-hop PPSO) to provide each agent with state estimates of those agents it cannot communicate with. By leveraging the performance bounds on the state estimation errors guaranteed by the k-hop PPSO, we first modify the space robustness of the STL tasks to account for these errors, and then exploit the modified robustness to design a decentralized continuous-time feedback controller that ensures satisfaction of the STL tasks even under worst-case estimation errors. A simulation result is provided to validate the proposed framework.
This paper proposes an Improved Noisy Deep Q-Network (Noisy DQN) to enhance the exploration and stability of Unmanned Aerial Vehicles (UAVs) when applying deep reinforcement learning in simulated environments. The method enhances exploration by combining a residual NoisyLinear layer with an adaptive noise scheduling mechanism, while improving training stability through a smooth loss and soft target network updates. Experiments show that the proposed model converges faster and achieves up to $+40$ higher rewards than standard DQN, and quickly reaches the minimum number of steps (28) required for the task in the 15 x 15 grid navigation environment. The results show that our improvements to the NoisyNet network structure, exploration control, and training stability jointly enhance the efficiency and reliability of deep Q-learning.
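For reference, a standard factorized-Gaussian NoisyLinear layer (the NoisyNet-style building block the abstract extends) is sketched below in PyTorch; the residual connection around it, the adaptive noise schedule, the smooth loss, and the soft target updates are not reproduced, and sigma0 = 0.5 is the conventional default rather than the authors' setting.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Factorized-Gaussian noisy linear layer: y = (mu_W + sigma_W * eps_W) x + ..."""
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("eps_in", torch.zeros(in_features))
        self.register_buffer("eps_out", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.bias_sigma, sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x):
        return x.sign() * x.abs().sqrt()   # factorized noise shaping

    def reset_noise(self):
        self.eps_in.copy_(self._f(torch.randn(self.in_features)))
        self.eps_out.copy_(self._f(torch.randn(self.out_features)))

    def forward(self, x):
        if self.training:
            w = self.weight_mu + self.weight_sigma * torch.outer(self.eps_out, self.eps_in)
            b = self.bias_mu + self.bias_sigma * self.eps_out
        else:
            w, b = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, w, b)

layer = NoisyLinear(8, 4)
layer.reset_noise()
print(layer(torch.randn(2, 8)).shape)
```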
This study introduces a novel approach for estimating plane-wave coefficients in sound field reconstruction, specifically addressing challenges posed by error-in-variable phase perturbations. Such systematic errors typically arise from sensor mis-calibration, including uncertainties in sensor positions and response characteristics, leading to measurement-induced phase shifts in plane wave coefficients. Traditional methods often result in biased estimates or non-convex solutions. To overcome these issues, we propose an optimal transport (OT) framework. This framework operates on a set of lifted non-negative measures that correspond to observation-dependent shifted coefficients relative to the unperturbed ones. By applying OT, the supports of the measures are transported toward an optimal average in the phase space, effectively morphing them into an indistinguishable state. This optimal average, known as barycenter, is linked to the estimated plane-wave coefficients using the same lifting rule. The framework addresses the ill-posed nature of the problem, due to the large number of plane waves, by adding a constant to the ground cost, ensuring the sparsity of the transport matrix. Convex consistency of the solution is maintained. Simulation results confirm that our proposed method provides more accurate coefficient estimations compared to baseline approaches in scenarios with both additive noise and phase perturbations.
This paper proposes a novel Bayesian reciprocity calibration method that consistently ensures uplink and downlink channel reciprocity in repeater-assisted multiple-input multiple-output (MIMO) systems. The proposed algorithm is formulated under the minimum mean-square error (MMSE) criterion. Its Bayesian framework incorporates complete statistical knowledge of the signal model, noise, and prior distributions, enabling a coherent design that achieves both low computational complexity and high calibration accuracy. To further enhance phase alignment accuracy, which is critical for calibration tasks, we develop a von Mises denoiser that exploits the fact that the target parameters lie on the circle in the complex plane. Simulation results demonstrate that the proposed MMSE algorithm achieves substantially improved estimation accuracy compared with conventional deterministic non-linear least-squares (NLS) methods, while maintaining comparable computational complexity. Furthermore, the proposed method exhibits remarkably fast convergence, making it well suited for practical implementation.
This work examines a disc-centric approach for automated severity grading of lumbar spinal stenosis from sagittal T2-weighted MRI. The method combines contrastive pretraining with disc-level fine-tuning, using a single anatomically localized region of interest per intervertebral disc. Contrastive learning is employed to help the model focus on meaningful disc features and reduce sensitivity to irrelevant differences in image appearance. The framework includes an auxiliary regression task for disc localization and applies a weighted focal loss to address class imbalance. Experiments demonstrate a 78.1% balanced accuracy and a reduced severe-to-normal misclassification rate of 2.13% compared with supervised training from scratch. Detecting discs with moderate severity remains challenging, but focusing on disc-level features provides a practical way to assess lumbar spinal stenosis.
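The abstract mentions a weighted focal loss without giving its exact form; below is a minimal PyTorch sketch of one common multi-class weighted focal loss, where the per-class weights, gamma = 2, and the toy four-grade setup are assumptions for illustration rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, class_weights, gamma=2.0):
    """Multi-class focal loss with per-class weights to counter imbalance.

    logits: (B, C), targets: (B,) int64 class indices,
    class_weights: (C,) tensor, larger weight for rarer classes.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of true class
    pt = log_pt.exp()
    w = class_weights[targets]
    return (-w * (1.0 - pt) ** gamma * log_pt).mean()

# Toy usage: four severity grades, heavier weight on the rarer severe class.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
weights = torch.tensor([1.0, 1.0, 2.0, 4.0])
print(float(weighted_focal_loss(logits, targets, weights)))
```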
We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Our models utilize flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline using the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that fine-tuning on enhanced audio yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
We consider a novel algorithm for the completion of partially observed low-rank tensors whose entries are drawn from a discrete finite alphabet, as in common image processing problems where the entries represent RGB values. The proposed low-rank tensor completion (TC) method builds on the conventional nuclear norm (NN) minimization-based low-rank TC paradigm by adding a discrete-aware regularizer that enforces discreteness in the objective: an $\ell_0$-norm penalty approximated by a continuous and differentiable function, normalized via fractional programming (FP), with the resulting problem solved under a proximal gradient (PG) framework. Simulation results demonstrate the superior performance of the new method in terms of both normalized mean square error (NMSE) and convergence, compared to conventional state-of-the-art (SotA) techniques, including NN minimization approaches as well as a combination of the latter with a matrix factorization approach.
In multi-agent systems, dynamic average consensus (DAC) is a decentralized estimation strategy in which a set of agents tracks the average of time-varying reference signals. Because DAC requires exchanging state information with neighbors, attackers may gain access to these states and infer private information. In this paper, we develop a privacy-preserving method that protects each agent's reference signal from external eavesdroppers and honest-but-curious agents while achieving the same convergence accuracy and convergence rate as conventional DAC. Our approach masks the reference signals by having each agent draw a random real number for each neighbor, exchange that number over an encrypted channel at initialization, and compute a masking value to form a masked reference. The agents then run the conventional DAC algorithm using the masked references. Convergence and privacy analyses show that the proposed algorithm matches the convergence properties of conventional DAC while preserving the privacy of the reference signals. Numerical simulations validate the effectiveness of the proposed privacy-preserving DAC algorithm.
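One natural way to realize the masking step described above is a zero-sum pairwise construction, sketched below in NumPy: each agent draws a value per neighbor and, after the encrypted exchange, uses the pairwise differences as its mask, so the network-wide average of the masked references equals that of the true references. This is an illustrative construction consistent with the abstract, not necessarily the paper's exact scheme.

```python
import numpy as np

def build_masks(adj, rng=np.random.default_rng(0)):
    """Agent i draws r[i, j] for every neighbor j; after the pairwise exchange,
    agent i uses mask_i = sum_j (r[i, j] - r[j, i]). The masks sum to zero,
    so the average of the masked references is unchanged."""
    n = adj.shape[0]
    r = rng.standard_normal((n, n)) * adj   # nonzero only on graph edges
    return (r - r.T).sum(axis=1)

# Ring of 5 agents with scalar reference signals.
n = 5
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[i, (i - 1) % n] = 1
refs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
masked = refs + build_masks(adj)
print(refs.mean(), masked.mean())           # identical averages
```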
LiDAR point clouds captured in rain or snow are often corrupted by weather-induced returns, which can degrade perception and safety-critical scene understanding. This paper proposes Intensity- and Distance-Aware Statistical Outlier Removal (IDSOR), a range-adaptive filtering method that jointly exploits intensity cues and neighborhood sparsity. By incorporating an empirical, range-dependent distribution of weather returns into the threshold design, IDSOR suppresses weather-induced points while preserving fine structural details without cumbersome manual parameter tuning. We also propose a variant that uses a previously proposed method to estimate the weather return distribution from data, and integrates it into IDSOR. Experiments on simulation-augmented level-crossing measurements and on the Winter Adverse Driving dataset (WADS) demonstrate that IDSOR achieves a favorable precision-recall trade-off, maintaining both precision and recall above 90% on WADS.
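The sketch below is an illustrative statistical-outlier-removal variant in the spirit of IDSOR, with a range-adaptive threshold and an intensity gate; the neighborhood size, gains, and intensity threshold are assumed values, and the paper's empirical weather-return distribution and exact decision rule are not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

def idsor_like_filter(points, intensity, k=8, base_mult=1.0,
                      range_gain=0.05, intensity_thresh=10.0):
    """Flag locally sparse, low-intensity points as likely weather returns.
    (Illustrative variant, not the exact IDSOR rule.)"""
    tree = cKDTree(points)
    d, _ = tree.query(points, k=k + 1)          # includes the point itself
    mean_knn = d[:, 1:].mean(axis=1)            # mean distance to k neighbors
    rng_dist = np.linalg.norm(points, axis=1)   # range from the sensor
    mu, sigma = mean_knn.mean(), mean_knn.std()
    # Farther points are naturally sparser, so relax the threshold with range.
    thresh = mu + (base_mult + range_gain * rng_dist) * sigma
    keep = (mean_knn <= thresh) | (intensity >= intensity_thresh)
    return points[keep], keep

pts = np.random.uniform(-30, 30, size=(2000, 3))
inten = np.random.uniform(0, 255, size=2000)
filtered, mask = idsor_like_filter(pts, inten)
print(len(pts), "->", len(filtered))
```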
Safety filtering is an effective method for enforcing constraints in safety-critical systems, but existing methods typically assume perfect state information. This limitation is especially problematic for systems that rely on neural network (NN)-based state estimators, which can be highly sensitive to noise and adversarial input perturbations. We address these problems by introducing GUARDIAN: Guaranteed Uncertainty-Aware Reachability Defense against Adversarial INterference, a safety filtering framework that provides formal safety guarantees for systems with NN-based state estimators. At runtime, GUARDIAN uses neural network verification tools to provide guaranteed bounds on the system's state estimate given possible perturbations to its observation. It then uses a modified Hamilton-Jacobi reachability formulation to construct a safety filter that adjusts the nominal control input based on the verified state bounds and safety constraints. The result is an uncertainty-aware filter that ensures safety despite the system's reliance on an NN estimator with noisy, possibly adversarial, input observations. Theoretical analysis and numerical experiments demonstrate that GUARDIAN effectively defends systems against adversarial attacks that would otherwise lead to a violation of safety constraints.
Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification ($\rho$ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on Github.
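As a toy illustration of the within-modality frequency-transform idea (DCT for the video stream), the sketch below truncates DCT coefficients along time to suppress noise in a feature sequence; the keep ratio, feature shapes, and synthetic signal are assumptions, and the wavelet branch, bottleneck tokens, and energy-based reconstruction are not shown.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_compress(features, keep_ratio=0.25):
    """Compress a (time, dim) feature sequence by keeping only the
    lowest-frequency DCT coefficients along time, then reconstruct."""
    coeffs = dct(features, type=2, norm="ortho", axis=0)
    k = max(1, int(keep_ratio * coeffs.shape[0]))
    coeffs[k:, :] = 0.0                          # drop high-frequency content
    return idct(coeffs, type=2, norm="ortho", axis=0)

# Toy feature stream: smooth signal plus additive noise.
t = np.linspace(0, 4 * np.pi, 200)[:, None]
clean = np.sin(t) @ np.ones((1, 16))
noisy = clean + 0.3 * np.random.randn(200, 16)
rec = dct_compress(noisy, keep_ratio=0.1)
# Error of the truncated reconstruction vs. the raw noisy features.
print(((rec - clean) ** 2).mean(), ((noisy - clean) ** 2).mean())
```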
Hyperbolic representation learning has been widely used to extract implicit hierarchies within data, and recently it has found its way to the open-world classification task of Generalized Category Discovery (GCD). However, prior hyperbolic GCD methods only use hyperbolic geometry for representation learning and transform back to Euclidean geometry when clustering. We hypothesize this is suboptimal. Therefore, we present Hyperbolic Clustered GCD (HC-GCD), which learns embeddings in the Lorentz Hyperboloid model of hyperbolic geometry, and clusters these embeddings directly in hyperbolic space using a hyperbolic K-Means algorithm. We test our model on the Semantic Shift Benchmark datasets, and demonstrate that HC-GCD is on par with the previous state-of-the-art hyperbolic GCD method. Furthermore, we show that using hyperbolic K-Means leads to better accuracy than Euclidean K-Means. We carry out ablation studies showing that clipping the norm of the Euclidean embeddings leads to decreased accuracy in clustering unseen classes, and increased accuracy for seen classes, while the overall accuracy is dataset dependent. We also show that using hyperbolic K-Means leads to more consistent clusters when varying the label granularity.
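To ground the idea of clustering directly in hyperbolic space, here is a minimal NumPy sketch of K-Means in the Lorentz (hyperboloid) model using the geodesic distance and a common closed-form Lorentzian centroid; the embeddings are random stand-ins, and HC-GCD's embedding network, initialization, and exact centroid update may differ.

```python
import numpy as np

def linner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + x1*y1 + ..."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def ldist(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-linner(x, y), 1.0, None))

def to_hyperboloid(v):
    """Lift Euclidean vectors v onto the hyperboloid (x0 = sqrt(1 + |v|^2))."""
    x0 = np.sqrt(1.0 + (v ** 2).sum(-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def hyperbolic_kmeans(x, n_clusters, iters=20, rng=np.random.default_rng(0)):
    centers = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(iters):
        d = ldist(x[:, None, :], centers[None, :, :])   # (N, K) distances
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            pts = x[labels == k]
            if len(pts):
                s = pts.sum(axis=0)                     # Lorentzian centroid
                centers[k] = s / np.sqrt(np.maximum(-linner(s, s), 1e-9))
    return labels, centers

emb = to_hyperboloid(0.5 * np.random.randn(300, 8))     # stand-in embeddings
labels, _ = hyperbolic_kmeans(emb, n_clusters=5)
print(np.bincount(labels))
```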
This paper studies the safe control of very large multi-agent systems via a generalized framework that employs so-called Banach Control Barrier Functions (B-CBFs). Modeling a large swarm as a probability distribution over a spatial domain, we show how B-CBFs can be used to capture a variety of macroscopic constraints that integrate with large-scale swarm objectives. Leveraging this framework, we define stable and filtered gradient flows for large swarms, paying special attention to optimal transport algorithms. Further, we show how to derive agent-level, microscopic algorithms that are consistent with their macroscopic counterparts in the large-scale limit. We then identify conditions under which a group of agents can compute a distributed solution that only requires local information from other agents within a communication range. Finally, we showcase the theoretical results on swarm systems in simulation.
Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment struggles with estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by using either monocular images or combinations of the image and auxiliary inputs to precisely predict the portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that embeds a rigid fingernail within a compliant pulp. This design shapes contact behavior to enable diverse interaction modes across a range of object geometries. We develop a strain-energy-based bending-indentation model to guide the fingertip design and to explain how guided contact preserves local indentation while suppressing global bending. Experimental results show that the proposed robotic hand design demonstrates improved pinching stability, enhanced force observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. Together, these results show that coupling structured contact geometry with a force-motion transparent mechanism provides a principled, physically embodied approach to precise manipulation.
The intersection of Safety of Intended Functionality (SOTIF) and Functional Safety (FuSa) analysis of driving automation features has traditionally excluded Quality Management (QM) components from rigorous safety impact evaluations. While QM components are not typically classified as safety-relevant, recent developments in artificial intelligence (AI) integration reveal that such components can contribute to SOTIF-related hazardous risks. Compliance with emerging AI safety standards, such as ISO/PAS 8800, necessitates re-evaluating safety considerations for these components. This paper examines the necessity of conducting holistic safety analysis and risk assessment on AI components, emphasizing their potential to introduce hazards with the capacity to violate risk acceptance criteria when deployed in safety-critical driving systems, particularly in perception algorithms. Using case studies, we demonstrate how deficiencies in AI-driven perception systems can emerge even in QM-classified components, leading to unintended functional behaviors with critical safety implications. By bridging theoretical analysis with practical examples, this paper argues for the adoption of comprehensive FuSa, SOTIF, and AI standards-driven methodologies to identify and mitigate risks in AI components. The findings demonstrate the importance of revising existing safety frameworks to address the evolving challenges posed by AI, ensuring comprehensive safety assurance across all component classifications spanning multiple safety standards.
Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
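To make the shared mechanism concrete, here is a minimal NumPy sketch of SAGA on a ridge-regression finite sum, showing the per-component gradient table and running average whose delayed entries the unified analysis bounds (SAG and IAG reuse the same table with different update rules); the step size 1/(3L), the problem sizes, and the toy data are illustrative choices, not the paper's setup.

```python
import numpy as np

def saga(A, b, lam=0.01, step=None, epochs=30, rng=np.random.default_rng(0)):
    """SAGA on f(x) = (1/n) * sum_i [0.5*(a_i^T x - b_i)^2 + 0.5*lam*|x|^2]."""
    n, d = A.shape
    L = (A ** 2).sum(axis=1).max() + lam        # per-component smoothness bound
    step = step or 1.0 / (3.0 * L)
    x = np.zeros(d)
    grad_table = A * (A @ x - b)[:, None] + lam * x   # stored component gradients
    grad_avg = grad_table.mean(axis=0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        g_new = A[i] * (A[i] @ x - b[i]) + lam * x
        x = x - step * (g_new - grad_table[i] + grad_avg)   # variance-reduced step
        grad_avg += (g_new - grad_table[i]) / n
        grad_table[i] = g_new
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)); x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(200)
x_hat = saga(A, b)
x_star = np.linalg.solve(A.T @ A / 200 + 0.01 * np.eye(10), A.T @ b / 200)
print(np.linalg.norm(x_hat - x_star))        # distance to the exact minimizer
```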
Neural Lyapunov and barrier certificates have recently been used as powerful tools for verifying the safety and stability properties of deep reinforcement learning (RL) controllers. However, existing methods offer guarantees only under fixed ideal unperturbed dynamics, limiting their reliability in real-world applications where dynamics may deviate due to uncertainties. In this work, we study the problem of synthesizing \emph{robust neural Lyapunov barrier certificates} that maintain their guarantees under perturbations in system dynamics. We formally define a robust Lyapunov barrier function and specify sufficient conditions based on Lipschitz continuity that ensure robustness against bounded perturbations. We propose practical training objectives that enforce these conditions via adversarial training, Lipschitz neighborhood bounds, and global Lipschitz regularization. We validate our approach in two practically relevant environments, Inverted Pendulum and 2D Docking. The former is a widely studied benchmark, while the latter is a safety-critical task in autonomous systems. We show that our methods significantly improve both certified robustness bounds (up to $4.6$ times) and empirical success rates under strong perturbations (up to $2.4$ times) compared to the baseline. Our results demonstrate the effectiveness of training robust neural certificates for safe RL under perturbations in dynamics.
Dynamic games are powerful tools for modeling multi-agent decision-making, yet computing Nash (generalized Nash) equilibria remains a central challenge in such settings. Complexity arises from tightly coupled optimality conditions, nested optimization structures, and poor numerical conditioning. Existing game-theoretic solvers address these challenges by directly solving the joint game, typically requiring explicit modeling of all agents' objective functions and constraints, while learning-based approaches often decouple interaction through prediction or policy approximation, sacrificing equilibrium consistency. This paper introduces a conceptually novel formulation for dynamic games that restructures the equilibrium computation. Rather than solving a fully coupled game or decoupling agents through prediction or policy approximation, a data-driven structural reduction of the game is proposed that removes nested optimization layers and derivative coupling by embedding an offline-compiled best-response map as a feasibility constraint. Under standard regularity conditions, when the best-response operator is exact, any converged solution of the reduced problem corresponds to a local open-loop Nash (GNE) equilibrium of the original game; with a learned surrogate, the solution is approximately equilibrium-consistent up to the best-response approximation error. The proposed formulation is supported by mathematical proofs and accompanied by a large-scale Monte Carlo study in a two-player open-loop dynamic game motivated by autonomous racing. Comparisons are made against state-of-the-art joint game solvers, and results are reported on solution quality, computational cost, and constraint satisfaction.
The control of large buildings encounters challenges in computational efficiency due to their size and nonlinear components. To address these issues, this paper proposes a Piecewise Affine (PWA)-based distributed scheme for Model Predictive Control (MPC) that optimizes energy and comfort through PWA-based quadratic programming. We utilize the Alternating Direction Method of Multipliers (ADMM) for effective decomposition and apply the PWA technique to handle the nonlinear components. To solve the resulting large-scale nonconvex problems, the paper introduces a convex ADMM algorithm that transforms the nonconvex problem into a series of smaller convex problems, significantly enhancing computational efficiency. Furthermore, we demonstrate that the convex ADMM algorithm converges to a local optimum of the original problem. A case study involving 36 zones validates the effectiveness of the proposed method. Our proposed method reduces execution time by 86\% compared to the centralized version.
This paper addresses robotic system engineering for safety- and mission-critical applications by bridging the gap between high-level objectives and formal, executable specifications. The proposed method, Robotic System Task to Model Transformation Methodology (RSTM2) is an ontology-driven, hierarchical approach using stochastic timed Petri nets with resources, enabling Monte Carlo simulations at mission, system, and subsystem levels. A hypothetical case study demonstrates how the RSTM2 method supports architectural trades, resource allocation, and performance analysis under uncertainty. Ontological concepts further enable explainable AI-based assistants, facilitating fully autonomous specification synthesis. The methodology offers particular benefits to complex multi-robot systems, such as the NASA CADRE mission, representing decentralized, resource-aware, and adaptive autonomous systems of the future.
SLO-as-code has made per-service reliability declarative, but user experience is defined by journeys whose reliability is an emergent property of microservice topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. As a result, journey objectives (e.g., "checkout p99 < 400 ms") are often maintained outside code and drift as the system evolves, forcing teams to either miss user expectations or over-provision and gate releases with ad-hoc heuristics. We propose Emergence-as-Code (EmaC), a vision for making journey reliability computable and governable via intent plus evidence. An EmaC spec declares journey intent (objective, control-flow operators, allowed actions) and binds it to atomic SLOs and telemetry. A runtime inference component consumes operational artifacts (e.g., tracing and traffic configuration) to synthesize a candidate journey model with provenance and confidence. From the last accepted model, the EmaC compiler/controller derives bounded journey SLOs and budgets under explicit correlation assumptions (optimistic independence vs. pessimistic shared fate), and emits control-plane artifacts (burn-rate alerts, rollout gates, action guards) that are reviewable in a Git workflow. An anonymized artifact repository provides a runnable example specification and generated outputs.
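As a rough illustration of how journey bounds can be derived from atomic SLOs under the two correlation assumptions mentioned above, the sketch below combines hypothetical per-service availability targets; it is not taken from the EmaC artifact and uses a plain union bound for the shared-fate case.

```python
# Hypothetical atomic SLOs for the services a journey traverses.
atomic_slos = {"frontend": 0.999, "cart": 0.9995, "payments": 0.999}

def journey_availability_bounds(slos):
    failure_probs = [1.0 - a for a in slos.values()]
    # Optimistic independence: failures are uncorrelated, so the journey
    # succeeds only if every dependency succeeds independently.
    optimistic = 1.0
    for availability in slos.values():
        optimistic *= availability
    # Pessimistic shared fate: with possibly shared failure domains, the
    # union bound (valid under any correlation) caps the journey budget.
    pessimistic = max(0.0, 1.0 - sum(failure_probs))
    return pessimistic, optimistic

lo, hi = journey_availability_bounds(atomic_slos)
print(f"journey availability bounded between {lo:.5f} and {hi:.5f}")
```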
GNSSs are vulnerable to two kinds of attacks: jamming (i.e., denying access to the signal) and spoofing (i.e., impersonating a legitimate satellite). These attacks have been extensively studied, and a myriad of countermeasures exist to mitigate them. In this paper we expose a new type of attack, SpAmming, which combines both approaches to achieve the same effects in a more subtle way. Exploiting the CDMA multiplexing present in most GNSSs, and through a spoofing attack, this approach leads the receiver to lose access to the signal of a legitimate satellite, which is equivalent to a denial of service; however, existing countermeasures against jamming or spoofing cannot defend against it, since it is strictly neither of the two. An experimental proof-of-concept is presented in which its impact is evaluated as a function of the previous state of the receiver. Using an SDR-based system developed at the Space Security Centre, the attack is executed against a cold-started receiver, a warm-started receiver, and a receiver that has already acquired the PVT solution and is navigating. Different attack configurations are also tested, ranging from a raw emission of the false signal to surgical configuration of the Doppler effect, code offset, and other parameters. Although it is shown to be particularly successful against cold-started receivers, the results show that it is also effective in other scenarios, especially if accompanied by other attacks. The article concludes by outlining possible countermeasures to detect and, eventually, counteract it, as well as possible avenues of research to better understand its impact, especially for authenticated services such as OSNMA, and to characterize it in order to improve the response to similar attacks.
Path marginal cost (PMC) is a crucial component in solving path-based system-optimal dynamic traffic assignment (SO-DTA), dynamic origin-destination demand estimation (DODE), and network resilience analysis. However, accurately evaluating PMC in heterogeneous traffic conditions poses significant challenges. Previous studies often focus on homogeneous traffic flow of a single vehicle class and do not adequately address the interactive effect of heterogeneous traffic flows and the resultant computational issues. This study proposes a novel yet simple method for approximately evaluating PMC in complex heterogeneous traffic conditions. The method decomposes PMC into intra-class and inter-class terms and uses conversion factors derived from heterogeneous link dynamics to explicitly model the intricate relationships between vehicle classes. Additionally, the method considers the non-differentiability issue that arises when mixed traffic flow approaches system-optimum conditions. The proposed method is tested on a small corridor network with synthetic demand and a large-scale network with calibrated demand from real-world data. Results demonstrate that our method exhibits superior performance in solving bi-class SO-DTA problems, yielding lower total travel cost and capturing the multi-class flow competition at the system optimum state.
In this paper, we study efficient beam coverage design for multi-antenna systems in both far-field and near-field cases. To reduce the computational complexity of existing sampling-based optimization methods, we propose a new low-complexity yet efficient beam coverage design. To this end, we first formulate a general beam coverage optimization problem to maximize the worst-case beamforming gain over a target region. For the far-field case, we show that the beam coverage design can be viewed as a spatial-frequency filtering problem, where angular coverage can be achieved by weight-shaping in the antenna domain via an inverse Fourier transform (FT), yielding an infinite-length weighting sequence. Under the constraint of a finite number of antennas, a surrogate scheme is proposed by directly truncating this sequence, which inevitably introduces a roll-off effect at the angular boundaries, degrading the worst-case beamforming gain. To address this issue, we characterize the finite-antenna-induced roll-off effect, based on which a roll-off-aware design with a protective zoom is developed to ensure a flat beamforming-gain profile within the target angular region. Next, we extend the proposed method to the near-field case. Specifically, by applying a first-order Taylor approximation to the near-field channel steering vector (CSV), the two-dimensional (2D) beam coverage design (in both angle and inverse range) can be transformed into a 2D inverse FT, leading to a low-complexity beamforming design. Furthermore, an inherent near-field range defocusing effect is observed, indicating that sufficiently wide angular coverage results in range-insensitive beam steering. Finally, numerical results demonstrate that the proposed FT-based approach achieves worst-case beamforming performance comparable to that of conventional sampling-based optimization methods while significantly reducing the computational complexity.
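A minimal far-field sketch of the inverse-FT idea follows, for a half-wavelength uniform linear array and a rectangular coverage mask in u = sin(theta): truncating the ideal weight sequence to N antennas produces the roll-off at the sector edges, and widening the design sector (a stand-in for the protective zoom) pushes that roll-off outside the target region. All parameter values are illustrative, not the paper's design.

```python
import numpy as np

N = 32                                   # number of antennas
n = np.arange(N) - N // 2                # centered element indices
u_lo, u_hi = -0.25, 0.25                 # target sector in u = sin(theta)
zoom = 0.05                              # protective widening of the design sector

def truncated_weights(a, b):
    # Ideal weights are the inverse FT (Fourier-series coefficients) of a
    # rectangular mask over [a, b]; keeping only N terms truncates them.
    safe_n = np.where(n == 0, 1.0, n)
    w = (np.exp(-1j * np.pi * n * a) - np.exp(-1j * np.pi * n * b)) / (2j * np.pi * safe_n)
    return np.where(n == 0, (b - a) / 2.0, w)

u = np.linspace(-1.0, 1.0, 2001)
A = np.exp(1j * np.pi * np.outer(n, u))  # array response over the u axis
inside = (u >= u_lo) & (u <= u_hi)

for a, b, tag in [(u_lo, u_hi, "plain truncation"),
                  (u_lo - zoom, u_hi + zoom, "protective zoom")]:
    pattern = np.abs(truncated_weights(a, b) @ A)
    print(f"{tag}: worst-case in-sector gain = {pattern[inside].min():.3f}")
```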
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
Robotic navigation has historically struggled to reconcile reactive, sensor-based control with the decisive capabilities of model-based planners. This duality becomes critical when the absence of a predominant option among goals leads to indecision, challenging reactive systems to break symmetries without computationally intensive planners. We propose a parsimonious neuromorphic control framework that bridges this gap for vision-guided navigation and tracking. Image pixels from an onboard camera are encoded as inputs to dynamic neuronal populations that directly transform visual target excitation into egocentric motion commands. A dynamic bifurcation mechanism resolves indecision by delaying commitment until a critical point induced by the environmental geometry. Inspired by recently proposed mechanistic models of animal cognition and opinion dynamics, the neuromorphic controller provides real-time autonomy with a minimal computational burden and a small number of interpretable parameters, and can be seamlessly integrated with application-specific image processing pipelines. We validate our approach in simulation environments as well as on an experimental quadrotor platform.
Controlling the false discovery rate (FDR) in high-dimensional variable selection requires balancing rigorous error control with statistical power. Existing methods with provable guarantees are often overly conservative, creating a persistent gap between the realized false discovery proportion (FDP) and the target FDR level. We introduce a learning-augmented enhancement of the T-Rex Selector framework that narrows this gap. Our approach replaces the analytical FDP estimator with a neural network trained solely on diverse synthetic datasets, enabling a substantially tighter and more accurate approximation of the FDP. This refinement allows the procedure to operate much closer to the desired FDR level, thereby increasing discovery power while maintaining effective approximate control. Through extensive simulations and a challenging synthetic genome-wide association study (GWAS), we demonstrate that our method achieves superior detection of true variables compared to existing approaches.
This is the third article in a series of three dealing with the exploitation of speckle for imaging purposes. In complex media, a fundamental limit is the multiple scattering phenomenon that completely blurs the imaging process in depth. Matrix imaging can provide a relevant framework for solving this problem. As it proved to be an adequate tool for probing reverberations in speckle [E. Giraudat et al., Part I], we will show how it can be used to tailor complex spatio-temporal focusing laws to monitor the interference between the multiply-reflected paths and the ballistic component of the wave-field. To do so, we extend the distortion matrix concept to the frequency domain. An iterative phase reversal process operated from the space-time Fourier space is then used to compensate for reverberations and optimize both the axial and transverse resolution of the confocal image. Here, we first present an experimental proof-of-concept consisting in imaging a tissue-mimicking phantom through a reverberating plate before outlining the potential and the limits of this strategy for transcranial ultrasound and beyond.
Hydraulic systems are widely utilized in industrial applications due to their high force generation, precise control, and ability to function in harsh environments. Hydraulic cylinders, the actuators in these systems, deliver force and positioning through the displacement of hydraulic fluid, but their operation is significantly influenced by friction force. Achieving precision in hydraulic cylinders requires an accurate friction model under various operating conditions. Existing analytical models, often derived from experimental tests, necessitate the identification or estimation of influencing factors but are limited in adaptability and computational efficiency. This research introduces a data-driven, hybrid algorithm based on Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation. The algorithm effectively combines feature detection and estimation processes using training data acquired from an experimental hydraulic test setup. It achieves a consistent and stable model error of less than 10% across diverse operating conditions and external load variations, ensuring robust performance in complex situations. The computational cost of the algorithm is 1.51 milliseconds per estimation, making it suitable for real-time applications. The proposed method addresses the limitations of analytical models by delivering high precision and computational efficiency. The algorithm's performance is validated through detailed analysis and experimental results, including direct comparisons with the LuGre model. The comparison highlights that while the LuGre model offers a theoretical foundation for friction modeling, its performance is limited by its inability to dynamically adjust to varying operational conditions of the hydraulic cylinder, further emphasizing the advantages of the proposed hybrid approach in real-time applications.
We study the Rectified Linear Unit (ReLU) dual, an existing dual formulation for stochastic programs that reformulates non-anticipativity constraints using ReLU functions to generate tight, non-convex, and mixed-integer representable cuts. While this dual reformulation guarantees convergence with mixed-integer state variables, it admits multiple optimal solutions that can yield weak cuts. To address this issue, we propose normalizing the dual in the extended space to identify solutions that yield stronger cuts. We prove that the resulting normalized cuts are tight and Pareto-optimal in the original state space. We further compare normalization with existing regularization-based approaches for handling dual degeneracy and explain why normalization offers key advantages. In particular, we show that normalization can recover any cut obtained via regularization, whereas the converse does not hold. Computational experiments demonstrate that the proposed approach outperforms existing methods by consistently yielding stronger cuts and reducing solution times on harder instances.
The co-existence of terrestrial and non-terrestrial networks (NTNs) is essential for achieving global coverage in sixth-generation cellular networks. Due to increasing spectrum demand, there is discussion at the global level about sharing some frequencies used in terrestrial networks (TNs) with NTNs, which results in co-channel interference and performance degradation. This paper analyzes the interference caused by satellite networks on TNs in the S-band. We examine the transmission mechanisms of satellite signals and conduct simulations to evaluate interference intensity across varying slant ranges. Our findings indicate that the angle between the user equipment direction and the sub-satellite point direction from the beam center significantly impacts the interference level.
This paper investigates stochastic generalized dynamic games with coupling chance constraints, where agents have incomplete information about uncertainties satisfying a concentration of measure property. This problem, in general, is non-convex and NP-hard. To address this, we propose a convex under-approximation by replacing chance constraints with tightened expected-value constraints, yielding a tractable game. We prove the existence of a stochastic generalized Nash equilibrium (SGNE) in this new game and show that its variational SGNE is an $\boldsymbol{\varepsilon}$-SGNE for the original game, with $\boldsymbol{\varepsilon}$ expressed via the approximation errors and Lagrange multipliers. A semi-decentralized, sampling-based algorithm with time-varying step sizes is developed, requiring no prior knowledge of the uncertainty distribution or expectation evaluations. Unlike existing methods, it avoids step-size tuning based on Lipschitz constants or adaptive rules. Under standard assumptions on the pseudo-gradient, the algorithm converges almost surely to an SGNE.
Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations under noise interference. However, although CL methods have achieved great success in many computer vision tasks, they still require specific adaptation for Remote Sensing (RS) images due to the significant domain gap. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. Our framework provides high-quality features by ensuring consistency between teacher and student and predicting learnable mask tokens. Compared to previous contrastive methods, our method demonstrates higher memory efficiency and can be trained with larger batches due to its sparse inputs. Additionally, the proposed method demonstrates remarkable adaptability to uncurated RS data and reduces the impact of potential semantic inconsistency. We also collect an unlabeled pre-training dataset, which contains about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods with a limited model scale, demonstrating the effectiveness of our approach. We hope this work will contribute to practical remote sensing interpretation.
Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers $1.8$--$3.6\times$ speedups on an NVIDIA RTX 4090 GPU and $1.8$--$7\times$ speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model's outputs.
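For readers unfamiliar with Nyström attention, the numpy sketch below shows the generic Nyströmformer-style construction that approximates softmax attention with landmark tokens in linear complexity; it illustrates the mechanism only and is not the exact PnP-Nystra module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=16):
    """Nystrom-style linear-complexity approximation of softmax attention.

    Landmarks are segment means of Q and K (generic construction, not the
    paper's exact module).
    """
    n, d = Q.shape
    Q_l = Q.reshape(m, n // m, d).mean(axis=1)   # landmark queries
    K_l = K.reshape(m, n // m, d).mean(axis=1)   # landmark keys
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)               # (n, m)
    A = softmax(Q_l @ K_l.T * scale)             # (m, m)
    B = softmax(Q_l @ K.T * scale)               # (m, n)
    return F @ np.linalg.pinv(A) @ (B @ V)       # O(n*m) instead of O(n^2)

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = rng.standard_normal((3, n, d))
approx = nystrom_attention(Q, K, V)
exact = softmax(Q @ K.T / np.sqrt(d)) @ V
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```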
This paper investigates maximum likelihood estimation for direct system identification in networks of dynamical systems. We establish that the proposed approach is both consistent and efficient. In addition, it is more generally applicable than existing methods, since it can be employed even when measurements are unavailable for all network nodes, provided that network identifiability is satisfied. Finally, we demonstrate that the maximum likelihood problem can be formulated without relying on a predictor, which is key to achieving computationally efficient numerical solutions.
Mispronunciation detection and diagnosis (MDD) is a significant component of modern computer-aided language learning (CALL) systems. Most systems implementing phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method that takes all possible segmentations of the canonical transcription into account (GOP-SF). We give a theoretical account of our definition of GOP-SF, an implementation that addresses potential numerical issues, and a normalization that allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the speechocean762 data, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
Understanding the performance of multi-user multiple-input multiple-output (MU-MIMO) systems under imperfect channel state information at the transmitter (CSIT) remains a critical challenge in next-generation wireless networks. In this context, accurate statistical modeling of the signal-to-interference-plus-noise ratio (SINR) is essential for enabling tractable performance analysis of multi-user systems. This paper presents an improved statistical approximation of the SINR for downlink (DL) MU-MIMO systems with imperfect CSIT. The proposed model retains the analytical simplicity of existing approaches (e.g., Gamma-based approximations) while overcoming their limitations, particularly the underestimation of SINR variance. We evaluate the proposed approximation in the context of Rate-Splitting Multiple Access (RSMA)-enabled MIMO DL systems with outdated CSIT. The results demonstrate excellent accuracy across a wide range of system configurations, including varying numbers of users, antennas, and degrees of CSIT staleness.
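The Gamma-based baselines the abstract refers to are typically obtained by moment matching; a toy sketch is shown below on a synthetic SINR sample (not the paper's MU-MIMO model), matching the Gamma shape and scale to the empirical mean and variance and checking a tail probability.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
num_samples = 200_000
# Toy SINR samples: exponential signal power over interference plus noise.
signal = rng.exponential(1.0, num_samples)
interference = rng.exponential(0.2, (num_samples, 3)).sum(axis=1)
sinr = signal / (interference + 0.1)

mean, var = sinr.mean(), sinr.var()
shape, scale = mean**2 / var, var / mean        # Gamma moment matching
print(f"matched Gamma: shape={shape:.3f}, scale={scale:.3f}")

# Compare an upper-tail probability of the fitted Gamma with the data.
x = np.quantile(sinr, 0.95)
print("empirical P(SINR > x):", float(np.mean(sinr > x)),
      "Gamma approx:", float(gamma.sf(x, a=shape, scale=scale)))
```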
This paper introduces a dimension-decomposed geometric learning framework called Sliced Learning for disturbance identification in quadrotor geometric attitude control. Instead of conventional learning-from-states, this framework adopts a learning-from-error strategy by using the Lie-algebraic error representation as the input feature, enabling axis-wise space decomposition (``slicing'') while preserving the SO(3) structure. This is highly consistent with the geometric mechanism of cognitive control observed in neuroscience, where neural systems organize adaptive representations within structured subspaces to enable cognitive flexibility and efficiency. Based on this framework, we develop a lightweight and structurally interpretable Sliced Adaptive-Neuro Mapping (SANM) module. The high-dimensional mapping for online identification is axially ``sliced'' into multiple low-dimensional submappings (``slices''), implemented by shallow neural networks and adaptive laws. These neural networks and adaptive laws are updated online via Lyapunov-based adaptation within their respective shared subspaces. To enhance interpretability, we prove exponential convergence despite time-varying disturbances and inertia uncertainties. To our knowledge, Sliced Learning is among the first frameworks to demonstrate lightweight online neural adaptation at 400 Hz on resource-constrained microcontroller units (MCUs), such as STM32, with real-world experimental validation.
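The Lie-algebraic error feature that feeds each slice can be sketched as the SO(3) logarithm map between the desired and current attitudes; the helper below is a generic implementation of that map (axis-angle vector), not the paper's SANM module, and ignores the near-$\pi$ corner case.

```python
import numpy as np

def so3_log(R):
    """Logarithm map of a rotation matrix: returns the axis-angle vector in so(3)."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    if angle < 1e-8:                     # near identity: first-order approximation
        skew = (R - R.T) / 2.0
    else:                                # (valid away from angle == pi)
        skew = angle / (2.0 * np.sin(angle)) * (R - R.T)
    return np.array([skew[2, 1], skew[0, 2], skew[1, 0]])

def attitude_error_feature(R_desired, R_current):
    # Lie-algebraic attitude error e = log(R_d^T R), used axis-wise as the
    # low-dimensional input of each "slice".
    return so3_log(R_desired.T @ R_current)

theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(attitude_error_feature(np.eye(3), Rz))   # approximately [0, 0, 0.3]
```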
The increasing deployment of distributed Battery Energy Storage Systems (BESSs) in modern power grids necessitates effective coordination strategies to ensure state-of-charge (SoC) balancing and accurate power delivery. While distributed control frameworks offer scalability and resilience, they also raise significant privacy concerns due to the need for inter-agent information exchange. This paper presents a novel privacy-preserving distributed control algorithm for SoC balancing in a networked BESS. The proposed framework includes a distributed power allocation law designed on the basis of two privacy-preserving distributed estimators, one for the average unit state and the other for the average desired power. The average unit state estimator is designed via the state decomposition method without disclosing sensitive internal states. The proposed power allocation law based on these estimators ensures asymptotic SoC balancing and global power delivery while safeguarding agent privacy from external eavesdroppers. The effectiveness and privacy-preserving properties of the proposed control strategy are demonstrated through simulation results.
Plantar pressure measurement, or pedobarography, is an essential tool for analyzing human motion in healthy individuals and patients. Across the reviewed literature, sensor insoles are motivated as wearable, mobile solutions for assessing pressure distribution in applications including diabetic foot monitoring, rehabilitation guidance, assistive device control, and sports performance analysis. This review evaluates the current state of the art with particular attention to sensor technologies, sensor quantity and placement, participant cohorts, and reference standards. The focus lies on original works with innovative designs, preferably supported by ambulation experiments. The modalities covered include resistive, capacitive, inductive, piezoelectric, triboelectric, and optical sensing approaches. We identify a lack of proper sensor calibration, gait-based verification, and human study validation, and propose a gold standard based on testing machines and instrumented treadmills to ensure comparability across studies. The bidirectional interaction between insole insertion and foot-sole mechanics is examined, with tissue stiffness identified as a key source of uncertainty in sensor signals. Guidelines are provided for sensor dimensions and unobtrusive insole designs to foster natural gait. Finally, future directions include the development of multimodal sensors to compensate for the limitations of individual modalities and the emerging trend of multiaxial sensing for capturing shear components in pressure distributions.
This letter presents a locality-aware bearing fault diagnosis framework that operates on time-frequency representations and enables spatially interpretable decision-making. One-dimensional vibration signals are first mapped to two-dimensional time-frequency spectrograms using the continuous wavelet transform (CWT) with Morlet wavelets to enhance transient fault signatures. The diagnosis task is then formulated as object detection on the time-frequency plane, where YOLOv9, YOLOv10, and YOLOv11 are employed to localize fault-relevant regions and classify fault types simultaneously. Experiments on three public benchmarks, including Case Western Reserve University (CWRU), Paderborn University (PU), and Intelligent Maintenance System (IMS), demonstrate strong cross-dataset generalization compared with a representative MCNN-LSTM baseline. In particular, YOLOv11 achieves mAP@0.5 of 99.0% (CWRU), 97.8% (PU), and 99.5% (IMS), while providing region-aware visualization of fault patterns in the time-frequency domain. These results suggest that detection-based inference on CWT spectrograms provides an effective and interpretable complementary approach to conventional global classification for rotating machinery condition monitoring.
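The input representation described above can be reproduced with a standard continuous wavelet transform, assuming PyWavelets is available; the sketch below applies a Morlet CWT to a synthetic vibration segment (the fault impulses and sampling rate are illustrative) and scales the magnitude map to an 8-bit image suitable for an object detector.

```python
import numpy as np
import pywt

fs = 12_000                                    # sampling rate in Hz (illustrative)
t = np.arange(0, 0.2, 1.0 / fs)
signal = np.sin(2 * np.pi * 1_600 * t)         # tone standing in for shaft vibration
signal[::400] += 3.0                           # periodic impulses as a "fault" signature

scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / fs)
spectrogram = np.abs(coeffs)                   # 2-D time-frequency image: scales x time

# Normalize to an 8-bit image, ready to be fed to a detector such as YOLO.
image = (255 * (spectrogram - spectrogram.min())
         / (np.ptp(spectrogram) + 1e-12)).astype(np.uint8)
print(image.shape, float(freqs.min()), float(freqs.max()))
```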
This paper demonstrates that, in both theory and practice, the latent optimally partitioned (LOP)-$\ell_2/\ell_1$ penalty is effective for exploiting block-sparsity without knowledge of the concrete block structure. More precisely, we first present a novel theoretical result showing that the optimized block partition in the LOP-$\ell_2/\ell_1$ penalty satisfies a condition required for accurate recovery of block-sparse signals. Motivated by this result, we present a new application of the LOP-$\ell_2/\ell_1$ penalty to estimation of angular power spectrum, which is block-sparse with unknown block partition, in MIMO communication systems. Numerical simulations show that the proposed use of block-sparsity with the LOP-$\ell_2/\ell_1$ penalty significantly improves the estimation accuracy of the angular power spectrum.
Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
Joint detection and localization of users and scatterers in multipath-rich channels on multiple bands is critical for integrated sensing and communication (ISAC) in 6G. Existing multiband sensing methods are limited by classical beamforming or computationally expensive approaches. This paper introduces alternating direction method of multipliers (ADMM)-assisted compressed multiband sensing (CMS), hereafter referred to as ADMM-CMS, a novel framework for multiband sensing using uplink quadrature-amplitude-modulated pilot symbols. To solve the CMS problem, we develop an adaptive ADMM algorithm that adjusts to noise and stops automatically upon convergence. ADMM combines the decomposability of dual ascent with the robustness of augmented Lagrangian methods, making it suitable for large-scale structured optimization. Simulations show that ADMM-CMS achieves higher spatial resolution and improved denoising compared to Bartlett-type beamforming, yielding a 34 dB gain in per-antenna transmit power for achieving a 0.9 successful recovery probability (SRP). Moreover, compared to performing compressed sensing separately on the constituent 7 GHz and 10 GHz sub-bands, ADMM-CMS achieves reductions in delay root mean squared error of 34.46% and 40.76%, respectively, at -41 dBm per-antenna transmit power, while also yielding improved SRP. Our findings demonstrate ADMM-CMS as an efficient enabler of ISAC in frequency range 3 (FR3, 7-24 GHz) for 6G systems.
Wolff-Parkinson-White (WPW) syndrome, a congenital cardiac conduction abnormality with low prevalence, carries a significant risk of sudden cardiac death. Early identification remains challenging due to screening costs and professional resource scarcity. This retrospective real-world study systematically evaluates an integrated Artificial Intelligence-enabled mobile screening system comprising portable single-lead devices, AI primary screening, and cardiologist review. Analyzing 3,566,626 ECG records from 87,836 individuals between 2019 and 2025, the AI model achieved an AUC of 0.6676 and a specificity of 95.92% in complex real-world signal environments. Despite predictive probability bias inherent in ultra-low prevalence contexts, the model demonstrated stable risk stratification, with high-confidence scores concentrated among true positive individuals. The risk of detecting WPW in AI-positive records was 86.2-fold higher than in AI-negative records. By implementing a human-AI collaborative workflow, the volume of ECGs requiring manual review was reduced by approximately 99.5% compared to universal screening. In an ideal collaborative scenario, an average of only 18 ECGs required review to confirm one WPW case, representing a more than 60-fold increase in screening efficiency. Compared to traditional 12-lead ECGs and electrophysiological studies, this system significantly reduced time and medical costs. Our findings suggest that a risk-stratification-based human-AI collaborative system provides a promising paradigm for the early public health detection of low-prevalence, high-risk arrhythmias.
Accelerating magnetic resonance imaging (MRI) remains challenging, particularly under realistic acquisition noise. While diffusion models have recently shown promise for reconstructing undersampled MRI data, many approaches lack an explicit link to the underlying MRI physics, and their parameters are sensitive to measurement noise, limiting their reliability in practice. We introduce Implicit-MAP (ImMAP), a diffusion-based reconstruction framework that integrates the acquisition noise model directly into a maximum a posteriori (MAP) formulation. Specifically, we build on the stochastic ascent method of Kadkhodaie et al. and generalize it to handle MRI encoding operators and realistic measurement noise. Across both simulated and real noisy datasets, ImMAP consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods. By clarifying the practical behavior and limitations of diffusion models under realistic noise conditions, ImMAP establishes a more reliable and interpretable approach to accelerated MRI reconstruction.
This paper introduces a Ramanujan inner product and its corresponding norm, establishing a novel framework for the stability analysis of hybrid and discrete-time systems as an alternative to traditional Euclidean metrics. We establish new $\epsilon$-$\delta$ stability conditions that utilize the unique properties of Ramanujan summations and their relationship with number-theoretic concepts. The proposed approach provides enhanced robustness guarantees and reveals fundamental connections between system stability and arithmetic properties of the system dynamics. Theoretical results are rigorously proven, and simulation results on numerical examples are presented to validate the efficacy of the proposed approach.
In physical-layer security schemes, radio frequency fingerprint (RFF) identification of WiFi devices is susceptible to receiver differences, which can significantly degrade classification performance when a model is trained on one receiver but tested on another. In this paper, we propose a division-based receiver-agnostic RFF extraction method for WiFi systems, which removes the receivers' effects by dividing different preambles in the frequency domain. The proposed method requires only a single receiver for training and does not rely on additional calibration or stacking processes. First, for flat fading channel scenarios, the legacy short training field (L-STF) and legacy long training field (L-LTF) of the unknown device are divided by those of the reference device in the frequency domain. The receiver-dependent effects can be eliminated with only a single receiver required for training, and higher-dimensional RFF features can be extracted. Second, for frequency-selective fading channel scenarios, the high-throughput long training field (HT-LTF) is divided by the L-LTF in the frequency domain. Only a single receiver is required for training, and higher-dimensional RFF features that are both channel-invariant and receiver-agnostic are extracted. Finally, simulation and experimental results demonstrate that the proposed method effectively mitigates the impacts of channel variations and receiver differences. The classification results show that, even when training on a single receiver and testing on a different one, the proposed method achieves classification accuracy improvements of 15.5% and 28.45% over the state-of-the-art approach in flat fading and frequency-selective fading channel scenarios, respectively.
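A toy numpy illustration of the flat-fading division step is given below: frames from an unknown device and a reference device captured by the same receiver are divided in the frequency domain, cancelling the common receiver response. The distortion models are synthetic placeholders, not the paper's measured hardware impairments.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sc = 64                                                 # subcarriers
stf = np.exp(1j * 2 * np.pi * rng.random(n_sc))           # known L-STF symbols (freq domain)

def device_distortion(seed, level=0.05):
    # Placeholder multiplicative fingerprint of a transmitter front end.
    r = np.random.default_rng(seed)
    return 1.0 + level * (r.standard_normal(n_sc) + 1j * r.standard_normal(n_sc))

rff_unknown = device_distortion(10)
rff_reference = device_distortion(11)
receiver_resp = 1.0 + 0.3 * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))

rx_unknown = receiver_resp * rff_unknown * stf            # captured by receiver A
rx_reference = receiver_resp * rff_reference * stf        # captured by the same receiver

feature = rx_unknown / rx_reference                       # receiver response cancels
print(np.allclose(feature, rff_unknown / rff_reference))  # True: receiver-agnostic feature
```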
Particle filtering algorithms have enabled practical solutions to problems in autonomous robotics (self-driving cars, UAVs, warehouse robots), target tracking, and econometrics, with further applications in speech processing and medicine (patient monitoring). Yet, their inherent weakness at representing the likelihood of the observation (which often leads to particle degeneracy) remains unaddressed for real-time resource-constrained systems. Improvements such as the optimal proposal and the auxiliary particle filter mitigate this issue under specific circumstances and with increased computational cost. This work presents a new particle filtering method and its implementation, which enables tunably-approximative representation of arbitrary likelihood densities as program transformations of parametric distributions. The method leverages a recent computing platform that can perform deterministic computation on probability distribution representations (UxHw) without relying on stochastic sampling. For non-Gaussian non-linear systems and with an optimal-auxiliary particle filter, we benchmark the likelihood evaluation error and speed for a total of 294840 evaluation points. For such models, the results show that the UxHw method leads to as much as 37.7x speedup compared to the Monte Carlo alternative. For narrow uniform measurement uncertainty, the particle filter falsely assigns zero likelihood as much as 81.89% of the time, whereas UxHw achieves a 1.52% false-zero rate. The UxHw approach achieves a filter RMSE improvement of as much as 18.9% (average 3.3%) over the Monte Carlo alternative.
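The degeneracy issue targeted above can be seen in a minimal bootstrap particle filter step; the sketch below uses a standard 1-D nonlinear benchmark-style model with a narrow uniform measurement likelihood (all values illustrative), under which many or all particle weights collapse to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
num_particles = 1_000
particles = rng.normal(0.0, 1.0, num_particles)

def propagate(x, rng):
    # Benchmark-style nonlinear transition with additive Gaussian noise.
    return 0.5 * x + 25.0 * x / (1.0 + x ** 2) + rng.normal(0.0, 1.0, x.shape)

def likelihood(y, x, half_width=0.05):
    # Narrow uniform measurement noise: particles outside the window get
    # exactly zero weight, which is the degeneracy failure mode.
    return (np.abs(y - x ** 2 / 20.0) <= half_width).astype(float)

y_obs = 1.3                                     # one measurement
particles = propagate(particles, rng)
weights = likelihood(y_obs, particles)

if weights.sum() == 0.0:
    print("all weights collapsed to zero (false-zero likelihood)")
else:
    weights /= weights.sum()
    ess = 1.0 / np.sum(weights ** 2)            # effective sample size
    idx = rng.choice(num_particles, num_particles, p=weights)
    particles = particles[idx]                  # multinomial resampling
    print(f"effective sample size: {ess:.1f} of {num_particles}")
```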
This paper considers the problem of obtaining bounded time-average expected queue sizes in a single-queue system with a partial-feedback structure. Time is slotted; in slot $t$ the transmitter chooses a rate $V(t)$ from a continuous interval. Transmission succeeds if and only if $V(t)\le C(t)$, where channel capacities $\{C(t)\}$ and arrivals are i.i.d. draws from fixed but unknown distributions. The transmitter observes only binary acknowledgments (ACK/NACK) indicating success or failure. Let $\varepsilon>0$ denote a sufficiently small lower bound on the slack between the arrival rate and the capacity region. We propose a phased algorithm that progressively refines a discretization of the uncountably infinite rate space and, without knowledge of $\varepsilon$, achieves a $\mathcal{O}\!\big(\log^{3.5}(1/\varepsilon)/\varepsilon^{3}\big)$ time-average expected queue size uniformly over the horizon. We also prove a converse result showing that for any rate-selection algorithm, regardless of whether $\varepsilon$ is known, there exists an environment in which the worst-case time-average expected queue size is $\Omega(1/\varepsilon^{2})$. Thus, while a gap remains in the setting without knowledge of $\varepsilon$, we show that if $\varepsilon$ is known, a simple single-stage UCB-type policy with a fixed discretization of the rate space achieves $\mathcal{O}\!\big(\log(1/\varepsilon)/\varepsilon^{2}\big)$, matching the converse up to logarithmic factors.
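The known-$\varepsilon$ baseline mentioned at the end can be pictured as a standard UCB1 index over a fixed rate grid driven only by ACK/NACK feedback. The sketch below simplifies away the queue dynamics and the paper's exact discretization and index; the capacity distribution is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
rates = np.linspace(0.05, 1.0, 20)              # fixed discretization of the rate interval
counts = np.zeros_like(rates)
mean_throughput = np.zeros_like(rates)
T = 20_000

def channel_capacity(rng):
    return rng.beta(5.0, 2.0)                   # unknown capacity distribution (placeholder)

for t in range(1, T + 1):
    ucb = mean_throughput + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb[counts == 0] = np.inf                   # try every rate at least once
    arm = int(np.argmax(ucb))
    ack = rates[arm] <= channel_capacity(rng)   # only binary feedback is observed
    reward = rates[arm] if ack else 0.0         # throughput delivered this slot
    counts[arm] += 1
    mean_throughput[arm] += (reward - mean_throughput[arm]) / counts[arm]

best = int(np.argmax(mean_throughput))
print(f"selected rate {rates[best]:.2f}, estimated throughput {mean_throughput[best]:.3f}")
```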
Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be used with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and privacy protection against lazy-informed attackers, though showing a 15% relative degradation against semi-informed attackers.
We address the problem of time-frequency audio inpainting, where the goal is to fill missing spectrogram portions with reliable information. Despite recent advances, existing approaches still face limitations in both reconstruction quality and computational efficiency. To bridge this gap, we propose a method that utilizes a phase-aware signal prior which exploits estimates of the instantaneous frequency. An optimization problem is formulated and solved using the generalized Chambolle-Pock algorithm. The proposed method is evaluated against other time-frequency inpainting methods, specifically a deep-prior audio inpainting neural network and the autoregression-based approach known as Janssen-TF. Our proposed approach surpassed these methods by a large margin in the objective evaluation as well as in the conducted subjective listening test, improving the state of the art. In addition, the reconstructions are obtained with a substantially reduced computational cost compared to alternative methods.
Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, $k$-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels ($k$), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for $k$-means color quantization.
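The per-colorspace comparison can be sketched with off-the-shelf tools, assuming scikit-learn and scikit-image are available; the snippet below quantizes one sample image in RGB and in CIE-LUV and compares a simple reconstruction error (the study's VIF scoring is not reproduced here).

```python
import numpy as np
from skimage import color, data
from sklearn.cluster import KMeans

k = 16                                              # quantization level (number of colors)
rgb = data.astronaut() / 255.0                      # any RGB image scaled to [0, 1]

def quantize(pixels_3d, k):
    # Cluster all pixels and replace each pixel by its cluster center.
    h, w, _ = pixels_3d.shape
    flat = pixels_3d.reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(flat)
    return km.cluster_centers_[km.labels_].reshape(h, w, 3)

quant_rgb = quantize(rgb, k)                                        # clustering in RGB
quant_luv = np.clip(color.luv2rgb(quantize(color.rgb2luv(rgb), k)), 0, 1)  # clustering in CIE-LUV

for name, q in [("RGB", quant_rgb), ("CIE-LUV", quant_luv)]:
    mse = np.mean((q - rgb) ** 2)
    print(f"{name}: MSE against original = {mse:.5f}")
```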
This paper aims to provide a clear and rigorous understanding of commonly recognized safety constraints in physical human-robot interaction, particularly regarding ISO/TS 15066. We investigate the derivation of these constraints, critically examine the underlying assumptions, and evaluate their practical implications for system-level safety and performance in industrially relevant scenarios. Key design parameters within safety-critical control architectures are identified, and numerical examples are provided to quantify performance degradation arising from typical approximations and design decisions in manufacturing environments. Within this analysis, the fundamental role of energy in safety assessment is emphasized, providing focused insights into energy-based safety methodologies for collaborative industrial robot systems.
Model compression has become an important tool for making image super-resolution models more efficient. However, the gap between the best compressed models and the full-precision model remains large, and a deeper understanding of compression on more performant models is still needed. Prior research on quantization of LLMs has shown that Hadamard transformations lead to weights and activations with reduced outliers, which leads to improved performance. We argue that while the Hadamard transform does reduce the effect of outliers, an empirical analysis of how the transform functions is still needed. By studying the distributions of weights and activations of SwinIR-light, we show with statistical analysis that the lower errors are caused by the Hadamard transform's ability to reduce the ranges and increase the proportion of values around $0$. Based on these findings, we introduce CompSRT, a more performant way to compress the image super-resolution transformer network SwinIR-light. We perform Hadamard-based quantization, and we also perform scalar decomposition to introduce two additional trainable parameters. Our quantization performance statistically significantly surpasses the SOTA in metrics, with gains as large as 1.53 dB, and visibly improves visual quality by reducing blurriness at all bitwidths. At $3$-$4$ bits, to show our method is compatible with pruning for increased compression, we also prune $40\%$ of weights and show that we can achieve a $6.67$-$15\%$ reduction in bits per parameter with comparable performance to SOTA.
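The range-reduction mechanism can be checked empirically on a synthetic weight matrix with injected outlier channels (not SwinIR-light weights); the sketch below quantizes the matrix uniformly at 4 bits with and without an orthonormal Hadamard rotation and compares the reconstruction error.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d = 256
W = rng.standard_normal((512, d))
W[:, rng.choice(d, 8, replace=False)] *= 25.0       # inject outlier channels

H = hadamard(d) / np.sqrt(d)                        # orthonormal Hadamard transform, H @ H.T = I

def uniform_quantize(M, bits=4):
    lo, hi = M.min(), M.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((M - lo) / step) * step + lo

mse_plain = np.mean((uniform_quantize(W) - W) ** 2)
W_hat = uniform_quantize(W @ H) @ H.T               # quantize in the rotated basis, rotate back
mse_rotated = np.mean((W_hat - W) ** 2)
print(f"4-bit quantization MSE: plain {mse_plain:.3f} vs Hadamard-rotated {mse_rotated:.3f}")
print(f"ranges: plain {W.max() - W.min():.1f} vs rotated {(W @ H).max() - (W @ H).min():.1f}")
```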
Urban mobility systems are transitioning toward electric, on-demand services, creating operational challenges for fleet management under energy and service-quality constraints. The Electric Dial-a-Ride Problem (E-DARP) extends the classical dial-a-ride problem by incorporating limited battery capacity and nonlinear charging dynamics, increasing computational complexity and limiting the scalability of exact methods for real-time use. This paper proposes a deep reinforcement learning approach based on an edge-centric graph neural network encoder and an attention-driven route construction policy. By operating directly on edge attributes such as travel time and energy consumption, the method captures non-Euclidean, asymmetric, and energy-dependent routing costs in real road networks. The learned policy jointly optimizes routing, charging, and service quality without relying on Euclidean assumptions or handcrafted heuristics. The approach is evaluated on two case studies using ride-sharing data from San Francisco. On benchmark instances, the method achieves solutions within 0.4% of best-known results while reducing computation times by orders of magnitude. A second case study considers large-scale instances with up to 250 request pairs, realistic energy models, and nonlinear charging. On these instances, the learned policy outperforms Adaptive Large Neighborhood Search (ALNS) by 9.5% in solution quality while achieving 100% service completion, with inference times under 10 seconds compared to hours for the metaheuristic. Finally, sensitivity analyses quantify the impact of battery capacity, fleet size, ride-sharing capacity, and reward weights, while robustness experiments show that deterministically trained policies generalize effectively under stochastic conditions.
Foundation models for echocardiography often struggle to disentangle anatomical signal from the stochastic speckle and acquisition artifacts inherent to ultrasound. We present EchoJEPA, a foundation model trained on 18 million echocardiograms across 300K patients, representing the largest pretraining corpus for this modality to date. By leveraging a latent predictive objective, EchoJEPA learns robust anatomical representations that ignore speckle noise. We validate this using a novel multi-view probing framework with frozen backbones, where EchoJEPA outperforms state-of-the-art baselines by approximately 20% in left ventricular ejection fraction (LVEF) estimation and 17% in right ventricular systolic pressure (RVSP) estimation. The model also exhibits remarkable sample efficiency, reaching 79% view classification accuracy with only 1% of labeled data versus 42% for the best baseline trained on 100%. Crucially, EchoJEPA demonstrates superior generalization, degrading by only 2% under physics-informed acoustic perturbations compared to 17% for competitors. Most remarkably, its zero-shot performance on pediatric patients surpasses fully fine-tuned baselines, establishing latent prediction as a superior paradigm for robust, generalizable medical AI.
Growing renewable penetration introduces substantial uncertainty into power system operations, necessitating frequent adaptation of dispatch objectives and constraints and challenging expertise-intensive, near-real-time modeling workflows. Large Language Models (LLMs) provide a promising avenue for automating this process by translating natural-language (NL) operational requirements into executable optimization models via semantic reasoning and code synthesis. Yet existing LLM datasets and benchmarks for optimization modeling primarily target coarse-grained cross-domain generalization, offering limited, rigorous evaluation in power-system settings, particularly for Optimal Power Flow (OPF). We therefore introduce \textbf{ProOPF-D} and \textbf{ProOPF-B}, a dataset and benchmark for professional-grade OPF modeling: ProOPF-D contains 12K instances pairing NL requests with parameter adjustments and structural extensions to a canonical OPF, together with executable implementations; ProOPF-B provides 121 expert-annotated test cases with ground-truth code, enabling end-to-end evaluation under both concrete and abstract OPF modeling regimes.
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale MrHiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
Central to the efficacy of prognostics and health management methods is the acquisition and analysis of degradation data, which encapsulates the evolving health condition of engineering systems over time. Degradation data serves as a rich source of information, offering invaluable insights into the underlying degradation processes, failure modes, and performance trends of engineering systems. This paper provides an overview of publicly available degradation data sets.
A system of non-tradable credits that flow between individuals like karma, hence proposed under that name, is a mechanism for repeated resource allocation that comes with attractive efficiency and fairness properties, in theory. In this study, we test karma in an online experiment in which human subjects repeatedly compete for a resource with time-varying and stochastic individual preferences or urgency to acquire the resource. We confirm that karma has significant and sustained welfare benefits even in a population with no prior training. We find the mechanism to be particularly effective for welfare improvements when it is used in contexts with sporadic high urgency, more so than with frequent moderate urgency, and when it is implemented as a simple (binary) karma bidding scheme: relatively larger aggregate efficiency gains are realized that are (almost) Pareto superior. These findings provide guidance for further testing and for future implementation plans of such mechanisms in the real world.
Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.
This is the second article in a series of three dealing with the exploitation of speckle for aberration correction and reverberation compensation in reflection imaging. When probing heterogeneous media with waves, we have to cope with multi-scale fluctuations of the wave velocity. On the one hand, short-scale heterogeneities induce back-scattered echoes whose random interference generates a speckle pattern on the beamformed image. On the other hand, large-scale fluctuations of the wave velocity can distort the focused wave-fronts, resulting in aberrations on the same image. In this paper, we show how the self-portrait of the wave evolves as a function of the speed-of-sound model. Strikingly, a Gouy phase shift is observed when the speed-of-sound model is optimal. This particularly sensitive feature enables: (i) an optimization of the speed-of-sound model for each pixel of the image; (ii) a local and fine compensation of defocus across the field-of-view, thereby compensating for most aberrations in the image. Experiments in a tissue-mimicking phantom and numerical simulations are first presented to validate our method. It is then applied to in-vivo liver data of a difficult-to-image patient. The speed-of-sound optimization allows an axial compensation of aberrations and a depth reassignment of each singly-scattered echo to the actual position of the associated scatterer. As distance measurement is often critical for diagnosis, such a wave speed optimization can be crucial for ultrasound, but also for any other imaging method based on the principle of echo-location.
Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across different representations. Our method outperforms state-of-the-art models in objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music aligned to video both emotionally and temporally. Our demo and output samples are available at this https URL.
Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism and with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at this https URL.
Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of the POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) we propose BACHI, a symbolic chord recognition model that decomposes the task into different decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.
Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.