Fluid antenna systems represent an innovative approach in wireless communication, recently applied in multiple access to optimize the signal-to-interference-plus-noise ratio through port selection. This letter frames the port selection problem as a multi-label classification task for the first time, improving best-port selection with limited port observations. We address this challenge by leveraging liquid neural networks (LNNs) to predict the optimal port under emerging fluid antenna multiple access scenarios alongside a more general $\alpha$-$\mu$ fading model. We also apply hyperparameter optimization to refine LNN architectures for different observation scenarios. Our approach yields lower outage probability values than existing methods.
Machine Learning models are increasingly used in businesses to detect faults and anomalies in complex systems. In this work, we take this approach a step further: beyond merely detecting anomalies, we aim to identify the optimal control strategy that restores the system to a safe state with minimal disruption. We frame this challenge as a counterfactual problem: given a Machine Learning model that classifies system states as either good or anomalous, our goal is to determine the minimal adjustment to the system's control variables (i.e., its current status) that is necessary to return it to the good state. To achieve this, we leverage a mathematical model that finds the optimal counterfactual solution while respecting system-specific constraints. Notably, most counterfactual analysis in the literature focuses on individual cases in which a person seeks to alter their status relative to a decision made by a classifier, such as for loan approval or medical diagnosis. Our work addresses a fundamentally different challenge: optimizing counterfactuals for a complex energy system, specifically an oil-type transformer in an offshore wind turbine. This application not only advances counterfactual optimization in a new domain but also opens avenues for broader research in this area. Our tests on real-world data provided by our industrial partner show that our methodology easily adapts to user preferences and brings savings on the order of 3 million euros per year in a typical farm.
Alzheimer's Disease (AD) is an irreversible neurodegenerative disease whose main symptom is progressive cognitive decline. In deep learning-assisted AD diagnosis, traditional convolutional neural networks with simple feature concatenation fail to effectively exploit the complementary information between multimodal data, and concatenation is prone to losing key information during modal fusion. In recent years, advances in deep learning have opened new possibilities for effectively fusing multimodal features. This paper proposes a novel deep learning framework to assist medical professionals in AD diagnosis. By fusing medical multi-view information such as brain fluorodeoxyglucose positron emission tomography (PET), magnetic resonance imaging (MRI), genetic data, and clinical data, it can accurately classify subjects as AD, Mild Cognitive Impairment (MCI), or Cognitively Normal (CN). The innovation of the framework lies in an asymmetric cross-modal cross-attention mechanism, which effectively captures the key interaction features between different data modalities. This paper compares the asymmetric cross-modal cross-attention mechanism against traditional unimodal and multimodal deep learning frameworks for AD diagnosis, and evaluates its importance. The model achieves an accuracy of 94.88% on the test set.
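As an illustration of the mechanism named above, here is a minimal PyTorch sketch of one plausible asymmetric cross-modal cross-attention block, where one modality supplies the queries and the other supplies keys and values. The module name, dimensions, and the MRI/PET pairing are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AsymmetricCrossAttention(nn.Module):
    """Queries come from one modality (e.g., MRI features), keys/values from
    another (e.g., PET features); the asymmetry is this one-way attention."""
    def __init__(self, dim_q, dim_kv, dim_out, n_heads=4):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, dim_out)
        self.proj_kv = nn.Linear(dim_kv, dim_out)
        self.attn = nn.MultiheadAttention(embed_dim=dim_out, num_heads=n_heads,
                                          batch_first=True)

    def forward(self, x_q, x_kv):
        q = self.proj_q(x_q)            # (B, Lq, dim_out)
        kv = self.proj_kv(x_kv)         # (B, Lkv, dim_out)
        fused, _ = self.attn(q, kv, kv) # queries attend to the other modality
        return fused                    # (B, Lq, dim_out)

# Illustrative usage: fuse 64 MRI tokens with 32 PET tokens
mri = torch.randn(2, 64, 128)
pet = torch.randn(2, 32, 256)
fusion = AsymmetricCrossAttention(dim_q=128, dim_kv=256, dim_out=128)
print(fusion(mri, pet).shape)           # torch.Size([2, 64, 128])
```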
This paper investigates the asymptotic behavior of the deterministic and stochastic Cramér-Rao Bounds (CRB) for semi-blind channel estimation in massive multiple-input multiple-output (MIMO) systems. We derive and analyze mathematically tractable expressions for both metrics under various asymptotic regimes, which govern the growth rates of the number of antennas, the number of users, the training sequence length, and the transmission block length. Unlike existing work, our results show that the CRB can be made arbitrarily small as the transmission block length increases, but only when the training sequence length grows at the same rate and the number of users remains fixed. However, if the number of training sequences remains proportional to the number of users, the channel estimation error is always lower-bounded by a non-vanishing constant. Numerical results are presented to support our findings and demonstrate the advantages of semi-blind channel estimation in reducing the required number of training sequences.
Introduction: Chest CT scans are increasingly used in dyspneic patients for whom acute heart failure (AHF) is a key differential diagnosis. Interpretation remains challenging and radiology reports are frequently delayed due to a radiologist shortage, even though flagging such findings for emergency physicians would have therapeutic implications. Artificial intelligence (AI) can be a complementary tool to enhance diagnostic precision. We aim to develop an explainable AI model that detects radiological signs of AHF in chest CT with an accuracy comparable to thoracic radiologists. Methods: A single-center, retrospective study was conducted during 2016-2021 at Copenhagen University Hospital - Bispebjerg and Frederiksberg, Denmark. A Boosted Trees model was trained to predict AHF based on measurements of segmented cardiac and pulmonary structures from acute thoracic CT scans. Diagnostic labels for training and testing were extracted from radiology reports. Structures were segmented with TotalSegmentator. Shapley Additive Explanations (SHAP) values were used to explain the impact of each measurement on the final prediction. Results: Of the 4,672 subjects, 49% were female. The final model incorporated twelve key features of AHF and achieved an area under the ROC curve of 0.87 on the independent test set. Expert radiologist review of model misclassifications found that 24 out of 64 (38%) false positives and 24 out of 61 (39%) false negatives were actually correct model predictions, with the errors originating from inaccuracies in the initial radiology reports. Conclusion: We developed an explainable AI model with strong discriminatory performance, comparable to thoracic radiologists. The AI model's stepwise, transparent predictions may support decision-making.
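To make the modeling pipeline concrete, here is a minimal sketch of a boosted-trees classifier explained with SHAP values. The synthetic features and labels are placeholders for the paper's twelve segmentation-derived cardiac and pulmonary measurements, and scikit-learn's gradient boosting stands in for the paper's Boosted Trees model.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))   # stand-in for 12 structural measurements
# synthetic AHF label driven by the first two "measurements"
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

# Shapley Additive Explanations: signed per-feature contribution to each
# individual prediction, giving the stepwise transparency described above
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
print(shap_values.shape)         # (n_test_samples, 12)
```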
Accurate channel estimation is crucial for improving signal processing performance in wireless communications. However, traditional model-based methods frequently struggle in dynamic environments, and alternative machine-learning approaches typically lack generalization across datasets due to variations in channel characteristics. To address this issue, we propose a novel domain adaptation approach to bridge the gap between the quasi-static channel model (QSCM) and the map-based channel model (MBCM). Specifically, we first propose a channel estimation pipeline that uses realistic channel simulation to train our foundation model. We then propose domain adaptation methods to address the estimation problem. By using simulation-based training to reduce the amount of data required in practical wireless environments, we find that the proposed strategy enables robust model performance even with limited true channel information.
Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread adoption raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Regions of Interest (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs, namely LLaVA, Instruct-BLIP, and BLIP2-T5, demonstrate up to a 98% reduction in the detection of targeted ROIs, while keeping global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy-conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: this https URL.
Functional magnetic resonance imaging (fMRI) has been commonly used to construct functional connectivity networks (FCNs) of the human brain. These FCNs are primarily limited to quantifying pairwise relationships between regions of interest (ROIs), ignoring higher-order dependencies among multiple brain regions. Recently, hypergraph construction methods from fMRI time series data have been proposed to characterize such high-order relations among multiple ROIs. While there have been multiple methods for constructing hypergraphs from fMRI time series, the question of how to characterize the topology of these hypergraphs remains open. In this paper, we make two key contributions to community detection in brain hypernetworks. First, we construct a hypergraph for each subject capturing high-order dependencies between regions. Second, we introduce a spectral clustering-based approach on hypergraphs to detect overlapping community structure. Finally, the proposed method is applied to detect the consensus community structure across multiple subjects. We apply the proposed method to resting-state fMRI data from the Human Connectome Project to summarize the overlapping community structure across a group of healthy young adults.
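The spectral step can be sketched with the standard normalized hypergraph Laplacian of Zhou et al.; the toy version below recovers non-overlapping communities only, whereas the paper extends spectral clustering to overlapping structure. The incidence matrix and community count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def hypergraph_spectral_clustering(H, k, w=None):
    """H: (n_nodes, n_edges) incidence matrix; k: number of communities.
    Uses the Zhou et al. normalized hypergraph Laplacian:
    L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    n, m = H.shape
    w = np.ones(m) if w is None else w
    dv = H @ w                       # vertex degrees
    de = H.sum(axis=0)               # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    L = (np.eye(n)
         - Dv_inv_sqrt @ H @ np.diag(w) @ np.diag(1.0 / de) @ H.T @ Dv_inv_sqrt)
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                  # k smallest eigenvectors as embedding
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Toy hypergraph: 6 "ROIs", 3 hyperedges
H = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0],
              [0, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
print(hypergraph_spectral_clustering(H, k=2))
```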
Identifying the actual cause of events in engineered systems is a fundamental challenge in system analysis. Finding such causes becomes more challenging in the presence of noise and uncertainty in real-world systems. In this paper, we adopt the notion of probabilistic actual causality by Fenton-Glynn, which is a probabilistic extension of Halpern and Pearl's actual causality, and propose a novel method to formally reason about the causal effects of events in systems subject to uncertainty. We (1) formulate the discovery of probabilistic actual causes in computing systems as an SMT problem, and (2) address the scalability challenges by introducing an abstraction-refinement technique that significantly improves efficiency. We demonstrate the effectiveness of our approach through three case studies, identifying probabilistic causes of safety violations in (1) the Mountain Car problem, (2) the Lunar Lander benchmark, and (3) an MPC controller for an F-16 autopilot simulator.
Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage, where source speaker traits persist, remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution. Audio samples can be accessed at this https URL.
This study presents the modeling, control design, and performance analysis of a DC-DC buck converter using state-space averaging techniques. Buck converters are essential in modern power electronics for regulating DC voltages in renewable energy and electric vehicle systems. The paper first introduces the basic operation of buck converters and emphasizes the need for voltage regulation through closed-loop control. A state-space averaged model is derived to simplify the nonlinear switched dynamics, enabling more effective analysis and controller design. The small-signal transfer function from the duty cycle to the output voltage is obtained to support control development. In addition, Proportional-Integral (PI) control based on frequency-domain methods is explored. The PI controller is tuned to achieve various phase margins and is evaluated through Bode plots, step responses, and performance metrics, revealing trade-offs between overshoot, settling time, and steady-state error. A complete simulation of the controlled buck converter verifies its ability to maintain a stable output voltage across wide input voltage variations. The results validate the effectiveness of state-space averaging in control design and highlight the robustness of feedback systems in power electronic converters.
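A minimal sketch of the averaged small-signal model and a frequency-domain PI check follows. The component values, Kp, and Ki are illustrative, not taken from the paper; the ideal continuous-conduction-mode duty-to-output transfer function G_vd(s) = Vin / (LCs^2 + (L/R)s + 1) is the standard result of state-space averaging.

```python
import numpy as np
from scipy import signal

Vin, L, C, R = 24.0, 100e-6, 470e-6, 1.0   # input voltage, inductor, cap, load

# Averaged small-signal duty-cycle-to-output transfer function (ideal buck):
#   G_vd(s) = Vin / (L*C*s^2 + (L/R)*s + 1)
plant = signal.TransferFunction([Vin], [L * C, L / R, 1.0])

# PI controller C(s) = Kp + Ki/s; loop gain L(s) = C(s) * G_vd(s)
Kp, Ki = 0.01, 20.0
loop = signal.TransferFunction(np.polymul([Kp, Ki], plant.num),
                               np.polymul([1.0, 0.0], plant.den))

# Estimate gain crossover and phase margin from the Bode response
w, mag, phase = signal.bode(loop, n=2000)
i = np.argmin(np.abs(mag))                  # grid point nearest 0 dB
print(f"crossover ~ {w[i]:.0f} rad/s, phase margin ~ {180 + phase[i]:.0f} deg")
```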
The motion planning problem of generating dynamically feasible, collision-free trajectories in non-convex environments is a fundamental challenge for autonomous systems. Decomposing the problem into path planning and path tracking improves tractability, but integrating these components in a theoretically sound and computationally efficient manner is challenging. We propose the Path Feasibility Governor (PathFG), a framework for integrating path planners with nonlinear Model Predictive Control (MPC). The PathFG manipulates the reference passed to the MPC controller, guiding it along a path while ensuring constraint satisfaction, stability, and recursive feasibility. The PathFG is modular, compatible with replanning, and improves computational efficiency and reliability by reducing the need for long prediction horizons. We prove safety and asymptotic stability with a significantly expanded region of attraction, and validate its real-time performance through a simulated case study of quadrotor navigation in a cluttered environment.
In spinal vertebral mobility disease, accurately extracting and contouring vertebrae is essential for assessing mobility impairments and monitoring variations during flexion-extension movements. Precise vertebral contouring plays a crucial role in surgical planning; however, this process is traditionally performed manually by radiologists or surgeons, making it labour-intensive, time-consuming, and prone to human error. In particular, mobility disease analysis requires the individual contouring of each vertebra, which is both tedious and susceptible to inconsistencies. Automated methods provide a more efficient alternative, enabling vertebra identification, segmentation, and contouring with greater accuracy and reduced time consumption. In this study, we propose a novel U-Net variation designed to accurately segment thoracic vertebrae from anteroposterior-view X-ray images. Our proposed approach, incorporating a ``sandwich" U-Net structure with dual activation functions, achieves a 4.1\% improvement in Dice score compared to the baseline U-Net model, enhancing segmentation accuracy while ensuring reliable vertebral contour extraction.
Large language models have shown a remarkable ability to extract meaning from unstructured data, offering new ways to interpret biomedical signals beyond traditional numerical methods. In this study, we present a matrix factorization framework for bioacoustic signal analysis which is enhanced by large language models. The focus is on separating bioacoustic signals that commonly overlap in clinical recordings, using matrix factorization to decompose the mixture into interpretable components. A large language model is then applied to the separated signals to associate distinct acoustic patterns with potential medical conditions such as cardiac rhythm disturbances or respiratory abnormalities. Recordings were obtained from a digital stethoscope applied to a clinical manikin to ensure a controlled and high-fidelity acquisition environment. This hybrid approach does not require labeled data or prior knowledge of source types, and it provides a more interpretable and accessible framework for clinical decision support. The method demonstrates promise for integration into future intelligent diagnostic tools.
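A minimal sketch of the separation step follows, using off-the-shelf NMF on a magnitude spectrogram with Wiener-style masking. The synthetic "heart" and "lung" signals stand in for real stethoscope recordings, and the LLM interpretation stage is not shown.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

fs = 4000
t = np.arange(0, 4, 1 / fs)
# crude stand-ins: periodic tone bursts ("heart") plus broadband noise ("lung")
heart = np.sin(2 * np.pi * 60 * t) * (np.sin(2 * np.pi * 1.2 * t) > 0.9)
lung = 0.3 * np.random.default_rng(0).normal(size=t.size)
mix = heart + lung

f, tt, Z = stft(mix, fs=fs, nperseg=256)
V = np.abs(Z)                                   # magnitude spectrogram

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)                      # spectral templates
H = model.components_                           # temporal activations

# Wiener-style masks reconstruct each interpretable component
sources = []
for k in range(2):
    mask = np.outer(W[:, k], H[k]) / (W @ H + 1e-9)
    _, src = istft(mask * Z, fs=fs, nperseg=256)
    sources.append(src)
```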
Bistatic Integrated Sensing and Communication (ISAC) is poised to become a key application for next-generation communication networks (e.g., B5G/6G), providing simultaneous sensing and communication services with minimal changes to existing network infrastructure and hardware. However, a significant challenge in bistatic cooperative sensing is clock asynchronism, arising from the use of different clocks at widely separated transmitters and receivers. This asynchrony leads to Timing Offsets (TOs) and Carrier Frequency Offsets (CFOs), potentially causing sensing ambiguity. Traditional synchronization methods typically rely on static reference links or GNSS-based timing sources, both of which are often unreliable or unavailable in UAV-based bistatic ISAC scenarios. To overcome these limitations, we propose a Time-Varying Offset Estimation (TVOE) framework tailored for clock-asynchronous bistatic ISAC systems, which leverages the geometrically predictable characteristics of the Line-of-Sight (LoS) path to enable robust, infrastructure-free synchronization. The framework treats the LoS delay and Doppler shift as dynamic observations and models their evolution as a hidden stochastic process. A state-space formulation is developed to jointly estimate the TO and CFO via an Extended Kalman Filter (EKF), enabling real-time tracking of clock offsets across successive frames. The estimated offsets are subsequently applied to correct the timing misalignment of all Non-Line-of-Sight (NLoS) components, thereby enhancing high-resolution target sensing performance. Extensive simulation results demonstrate that the proposed TVOE method improves the estimation accuracy by 60%.
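A generic sketch of the state-space tracking idea follows: a constant-rate model for TO and CFO updated from LoS delay/Doppler observations. With the linear observation used here the EKF reduces to a standard Kalman filter; the paper's EKF additionally linearizes the LoS geometry. All noise levels and values are illustrative.

```python
import numpy as np

dt = 1e-3                                     # frame interval (s), illustrative
F = np.array([[1, dt, 0, 0],                  # state: [TO, TO_rate, CFO, CFO_rate]
              [0, 1,  0, 0],
              [0, 0,  1, dt],
              [0, 0,  0, 1]])
Hm = np.array([[1.0, 0, 0, 0],                # LoS path observes TO and CFO
               [0, 0, 1.0, 0]])
Q = 1e-12 * np.eye(4)                         # process noise (clock drift)
R = np.diag([1e-14, 0.25])                    # measurement noise (delay, Doppler)

x, P = np.zeros(4), np.eye(4)
rng = np.random.default_rng(1)
for frame in range(200):
    # synthetic measurement: slowly drifting TO (s), constant CFO (Hz)
    z = np.array([5e-6 + 1e-8 * frame, 120.0]) + rng.normal(0.0, [1e-7, 0.5])
    x, P = F @ x, F @ P @ F.T + Q             # predict
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - Hm @ x)                  # update with LoS observation
    P = (np.eye(4) - K @ Hm) @ P
print(f"estimated TO = {x[0]:.2e} s, CFO = {x[2]:.1f} Hz")
```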
Bistatic Integrated Sensing and Communication (ISAC) is poised to become a cornerstone technology in next-generation communication networks, such as Beyond 5G (B5G) and 6G, by enabling the concurrent execution of sensing and communication functions without requiring significant modifications to existing infrastructure. Despite its promising potential, a major challenge in bistatic cooperative sensing lies in the degradation of sensing accuracy, primarily caused by the inherently weak received signals resulting from high reflection losses in complex environments. Traditional methods have predominantly relied on adaptive filtering techniques to enhance the Signal-to-Noise Ratio (SNR) by dynamically adjusting the filter coefficients. However, these methods often struggle to adapt effectively to the increasingly complex and diverse network topologies. To address these challenges, we propose a novel Image Super-Resolution-based Signal Enhancement (ISR-SE) framework that significantly improves the recognition and recovery capabilities of ISAC signals. Specifically, we first perform a time-frequency analysis by applying the Short-Time Fourier Transform (STFT) to the received signals, generating spectrograms that capture the frequency, magnitude, and phase components. These components are then mapped into RGB images, where each channel represents one of the extracted features, enabling a more intuitive and informative visualization of the signal structure. To enhance these RGB images, we design an improved denoising network that combines the strengths of the UNet architecture and diffusion models. This hybrid architecture leverages UNet's multi-scale feature extraction and the generative capacity of diffusion models to perform effective image denoising, thereby improving the quality and clarity of signal representations under low-SNR conditions.
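The spectrogram-to-RGB mapping described above can be sketched as follows: frequency, magnitude, and phase channels from an STFT, each scaled to [0, 1]. The toy chirp, sampling rate, and normalization are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 1e6
t = np.arange(0, 0.01, 1 / fs)
sig = np.exp(2j * np.pi * (50e3 * t + 2e6 * t**2))   # toy chirp echo

# two-sided STFT since the baseband signal is complex
f, tt, Z = stft(sig, fs=fs, nperseg=128, return_onesided=False)

mag_db = 20 * np.log10(np.abs(Z) + 1e-12)
phase = np.angle(Z)
freq_map = np.broadcast_to(f[:, None], Z.shape)

def norm01(a):
    return (a - a.min()) / (a.max() - a.min() + 1e-12)

# one channel per extracted feature: frequency, magnitude, phase
rgb = np.stack([norm01(freq_map), norm01(mag_db), norm01(phase)], axis=-1)
print(rgb.shape)   # (n_freq, n_frames, 3) image fed to the denoising network
```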
Neural speaker diarization is widely used for overlap-aware speaker diarization, but it requires large multi-speaker datasets for training. To meet this data requirement, large datasets are often constructed by combining multiple corpora, including those originally designed for multi-speaker automatic speech recognition (ASR). However, ASR datasets often feature loosely defined segment boundaries that do not align with the stricter conventions of diarization benchmarks. In this work, we show that such boundary looseness significantly impacts the diarization error rate, reducing evaluation reliability. We also reveal that models trained on data with varying boundary precision tend to learn dataset-specific looseness, leading to poor generalization across out-of-domain datasets. Training with standardized tight boundaries via forced alignment improves not only diarization performance, especially in streaming scenarios, but also ASR performance when combined with simple post-processing.
There has been increasing interest in the generation of high-quality, realistic synthetic medical images in recent years. Such synthetic datasets can mitigate the scarcity of public datasets for artificial intelligence research, and can also be used for educational purposes. In this paper, we propose a combination of diffusion-based generation (PanoDiff) and Super-Resolution (SR) for generating synthetic dental panoramic radiographs (PRs). The former generates a low-resolution (LR) seed of a PR (256 × 128), which is then processed by the SR model to yield a high-resolution (HR) PR of size 1024 × 512. For SR, we propose a state-of-the-art transformer that learns local-global relationships, resulting in sharper edges and textures. Experimental results demonstrate a Frechet inception distance score of 40.69 between 7243 real and synthetic images (in HR). Inception scores were 2.55, 2.30, 2.90 and 2.98 for real HR, synthetic HR, real LR and synthetic LR images, respectively. Among a diverse group of six clinical experts, all evaluating a mixture of 100 synthetic and 100 real PRs in a time-limited observation, the average accuracy in distinguishing real from synthetic images was 68.5% (with 50% corresponding to random guessing).
Lensless cameras replace traditional optics with thin masks, leading to highly multiplexed measurements akin to encryption. However, static masks in conventional designs leave systems vulnerable to simple attacks. This work explores the use of programmable masks to enhance security by dynamically varying the mask patterns. We perform our experiments with a low-cost system (around 100 USD) based on a liquid crystal display. Experimental results demonstrate that variable masks successfully block a variety of attacks while enabling high-quality recovery for legitimate users. The system's encryption strength exceeds AES-256, achieving effective key lengths over 2,500 bits. Additionally, we demonstrate how a programmable mask enables robust authentication and verification, as each mask pattern leaves a unique fingerprint on the image. When combined with a lensed system, lensless measurements can serve as analog certificates, providing a novel solution for verifying image authenticity and combating deepfakes.
The dense and distributed deployment of sub-THz radio units (RUs) alongside sub-10 GHz access points (APs) is a promising approach to provide high data rates and reliable coverage for future 6G applications. However, beam search or RU selection for the sub-THz RUs incurs significant overhead and high power consumption. To address this, we introduce a method that leverages deep learning to infer a suitable sub-THz RU candidate from a set of sub-THz RUs using the sub-10 GHz channel characteristics. A novel aspect of this work is the consideration of the inter-band beam configuration (IBBC), defined as the broadside angle between the low-band and high-band antenna patterns of the user equipment (UE). Since the IBBC reveals beamforming information or the UE's orientation, it is typically not shared with the network as part of signalling. Therefore, we propose a solution strategy to infer a suitable sub-THz RU even when UEs do not share their IBBC information. Simulation results illustrate the performance of the inferred sub-THz RU and highlight the detrimental impact of neglecting UE orientation on system performance.
Affine frequency division multiplexing (AFDM) has recently emerged as an excellent backward-compatible 6G waveform. In this paper, an enhanced AFDM is proposed in which the delay-Doppler (DD) coupling phase is taken into account. Specifically, we study matched filtering (MF) assisted channel estimation (CE) for AFDM systems in complex doubly selective channels. By deriving the complete input-output relationship, the inter-chirp-carrier interference, signal-to-interference-plus-noise ratio (SINR), and effective SINR loss of AFDM are investigated in the discrete affine Fourier transform (DAFT) domain. Further, we examine the path ambiguity problem and show that it may lead to severe performance deterioration in fractional-delay fractional-Doppler channels. To address this problem, we introduce an MF assisted CE scheme building upon a novel pilot arrangement across two consecutive AFDM transmissions. This allows us to sequentially estimate the parameters of each path by exploiting the separability and approximate orthogonality of different paths in the DAFT domain, leading to significantly reduced complexity. Furthermore, based on generalized Fibonacci search (GFS), an MF-GFS scheme is proposed to avoid significant redundant computation, and it can be extended to typical wide-band systems. Extensive simulation results indicate that the proposed schemes offer superior communication performance at lower complexity.
The power system Unit Commitment (UC) problem determines the generator commitment schedule and dispatch decisions needed for reliable and economic operation of power networks. The growing penetration of stochastic renewables and demand behaviors makes it necessary to solve the UC problem in a timely manner. Lightweight, faster-to-solve UC models can be derived via constraint screening, which eliminates redundant constraints. However, the screening process remains computationally cumbersome due to the need to solve numerous linear programming (LP) problems. To reduce the number of LPs to solve, we introduce a novel perspective on classic LP-based screening. Our key insight is that redundant constraints will be satisfied by all vertices of the screened feasible region. Using UC decision variables' bounds tightened by solving far fewer LPs, we build an outer approximation of the UC feasible region as the screened region. A matrix operation is then designed and applied to the outer approximation's vertices to identify all redundant constraints on the fly. Adjustments to the outer approximation are further explored to improve screening efficiency by considering the load operating range and cutting planes derived from UC cost and discrete unit status prediction. Extensive simulations are performed on a set of testbeds with up to 2,383 buses to substantiate the effectiveness of the proposed schemes. Compared to classic LP-based screening, our schemes achieve up to 8.8x acceleration while finding the same redundant constraints.
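For reference, the classic LP-based screening that the paper accelerates can be sketched as follows: a constraint a_i.x <= b_i is redundant if maximizing a_i.x subject to the remaining constraints cannot exceed b_i. The tiny two-variable system is illustrative, not a UC instance.

```python
import numpy as np
from scipy.optimize import linprog

# Feasible region: x >= 0, x1 <= 1, x2 <= 1; the last two rows cannot bind.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0, 3.0, 5.0])

for i in range(len(b)):
    mask = np.arange(len(b)) != i
    # linprog minimizes, so maximize a_i.x by minimizing -a_i.x
    res = linprog(-A[i], A_ub=A[mask], b_ub=b[mask], bounds=[(0, None)] * 2)
    if res.success and -res.fun <= b[i] + 1e-9:
        print(f"constraint {i} is redundant")   # prints for i = 2 and i = 3
```

One such LP is solved per candidate constraint, which is exactly the per-constraint cost the outer-approximation scheme above avoids.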
This study introduces a novel multi-objective reinforcement learning (MORL) approach for autonomous intersection management, aiming to balance traffic efficiency and environmental sustainability across electric and internal combustion vehicles. The proposed method utilizes MORL to identify Pareto-optimal policies, with a post-hoc fairness criterion guiding the selection of the final policy. Simulation results in a complex intersection scenario demonstrate the approach's effectiveness in optimizing traffic efficiency and emissions reduction while ensuring fairness across vehicle categories. We believe that this criterion can lay the foundation for ensuring equitable service, while fostering safe, efficient, and sustainable practices in smart urban mobility.
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at this https URL.
Enhancing the user's own voice for head-worn microphone arrays is an important task in noisy environments, enabling easier speech communication and user-device interaction. However, a rarely addressed challenge is the change in the microphones' transfer functions when one or more of the microphones gets occluded by skin, clothes, or hair. The underlying problem for beamforming-based speech enhancement is the (potentially rapidly) changing transfer functions of both the own-voice and noise components, which must be accounted for to achieve optimal performance. In this paper, we address the problem of an occluded microphone in a head-worn microphone array. We investigate three alternative mitigation approaches: (i) conventional adaptive beamforming, (ii) switching between a-priori estimates of the beamformer coefficients for the occluded and unoccluded states, and (iii) a hybrid approach using a switching-adaptive beamformer. In an evaluation with real-world recordings and simulated occlusion, we demonstrate the advantages of the different approaches in terms of noise reduction, own-voice distortion, and robustness against voice activity detection errors.
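A minimal sketch of approach (ii) follows: switching between a-priori MVDR beamformers for the unoccluded and occluded states. The 4-mic array, steering vectors, and covariances are illustrative assumptions, not the paper's measured quantities.

```python
import numpy as np

M = 4  # number of head-worn microphones

def mvdr(d, R):
    """MVDR weights: w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

d_free = np.ones(M, dtype=complex)       # own-voice steering, all mics open
d_occl = d_free.copy()
d_occl[3] *= 0.1                         # occluded mic strongly attenuated
R_free = np.eye(M)                       # a-priori noise covariances per state
R_occl = np.diag([1.0, 1.0, 1.0, 4.0])   # occluded mic also noisier

weights = {False: mvdr(d_free, R_free), True: mvdr(d_occl, R_occl)}

def enhance(frame, occluded):
    """Apply the pre-computed beamformer matching the detected state."""
    return weights[occluded].conj() @ frame

rng = np.random.default_rng(0)
frame = rng.normal(size=M) + 1j * rng.normal(size=M)
print(enhance(frame, occluded=True))
```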
Deep learning-based hearing loss compensation (HLC) seeks to enhance speech intelligibility and quality for hearing-impaired listeners using neural networks. One major challenge of HLC is the lack of a ground-truth target. Recent works have used neural networks to emulate non-differentiable auditory peripheral models in closed-loop frameworks, but this approach lacks flexibility. Alternatively, differentiable auditory models allow direct optimization, yet previous studies focused on individual listener profiles, or on joint noise reduction (NR) and HLC without balancing the two tasks. This work formulates NR and HLC as a multi-task learning problem, training a system to simultaneously predict denoised and compensated signals from noisy speech and audiograms using a differentiable auditory model. Results show the system achieves objective metric performance similar to systems trained for each task separately, while being able to adjust the balance between NR and HLC during inference.
Conventional wisdom suggests that single-photon lidar (SPL) should operate in low-light conditions to minimize dead-time effects. Many methods have been developed to mitigate these effects in synchronous SPL systems. However, solutions for free-running SPL remain limited despite the advantage of reduced histogram distortion from dead times. To improve the accuracy of free-running SPL, we propose a computationally efficient joint maximum likelihood estimator of the signal flux, the background flux, and the depth using only histograms, along with a complementary regularization framework that incorporates a learned point cloud score model as a prior. Simulations and experiments demonstrate that free-running SPL yields lower estimation errors than its synchronous counterpart under identical conditions, with our regularization further improving accuracy.
In Inverse Synthetic Aperture Radar (ISAR), randomly missing entries of the received radar echo matrix deteriorate imaging quality, compromising target distinction from the background. Compressive sensing techniques or matrix completion prior to conventional imaging have been used in recent years to address this issue. However, while the former fail to preserve target continuity due to the sparsity constraint, the latter fails at high missing ratios. This paper proposes to use a deep image prior (DIP) to complete the complex radar data and then obtain the radar image by conventional Fourier imaging. The real and imaginary parts are completed separately by independent deep structures and then combined for imaging. The proposed DIP-based imaging method is compared with the IALM, 2D-SL0, and NNM methods, visually and quantitatively, on both simulated and real data. Quantitatively, the results demonstrate improvements of up to 100% in RMSE for some extreme cases, 50% in correlation, and 30% in the IC metric.
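A minimal sketch of deep-image-prior completion for one real-valued channel follows (the paper runs two such networks, for the real and imaginary parts, then combines them for Fourier imaging). The network architecture, sizes, and data are illustrative stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = W = 64
target = torch.randn(1, 1, H, W)                 # stand-in radar data channel
mask = (torch.rand(1, 1, H, W) > 0.5).float()    # 1 = observed entry

net = nn.Sequential(                             # small untrained CNN prior
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
z = torch.randn(1, 8, H, W)                      # fixed random input code
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for it in range(2000):
    opt.zero_grad()
    out = net(z)
    loss = ((mask * (out - target)) ** 2).mean() # fit observed entries only
    loss.backward()
    opt.step()

completed = net(z).detach()   # the CNN's structural bias fills missing entries
```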
Effective channel estimation (CE) is critical for optimizing the performance of 5G New Radio (NR) systems, particularly in dynamic environments where traditional methods struggle with complexity and adaptability. This paper introduces GraphNet, a novel lightweight Graph Neural Network (GNN)-based estimator designed to enhance CE in 5G NR. Our proposed method utilizes a GNN architecture that minimizes computational overhead while capturing essential features necessary for accurate CE. We evaluate GraphNet across various channel conditions, from slow-varying to highly dynamic environments, and compare its performance to ChannelNet, a well-known deep learning-based CE method. GraphNet not only matches ChannelNet's performance in stable conditions but significantly outperforms it in high-variation scenarios, particularly in terms of Block Error Rate. It also includes built-in noise estimation that enhances robustness in challenging channel conditions. Furthermore, its significantly lighter computational footprint makes GraphNet highly suitable for real-time deployment, especially on edge devices with limited computational resources. By underscoring the potential of GNNs to transform CE processes, GraphNet offers a scalable and robust solution that aligns with the evolving demands of 5G technologies, highlighting its efficiency and performance as a next-generation solution for wireless communication systems.
Recently, hybrid non-orthogonal multiple access (H-NOMA) technology, which effectively utilizes both NOMA and orthogonal multiple access (OMA) through flexible resource allocation in a single transmission, has demonstrated immense potential for enhancing the performance of wireless communication systems. To further unlock the potential of H-NOMA, this paper proposes a novel H-NOMA design that jointly incorporates hybrid successive interference cancellation (HSIC) and power adaptation (PA) in the NOMA transmission phase. To reveal the potential of the proposed HSIC-PA aided H-NOMA scheme, a closed-form expression is rigorously derived for the probability that H-NOMA achieves a higher data rate than pure OMA while consuming less energy. Furthermore, asymptotic analysis demonstrates that this probability approaches 1 in the high signal-to-noise ratio (SNR) regime without any constraints on either the users' target rates or the transmit power ratios. This represents a significant improvement over conventional H-NOMA schemes, which, as shown in existing work, require specific restrictive conditions to achieve probability 1 at high SNRs. These observations indicate that, with less energy consumption, the proposed HSIC-PA aided H-NOMA scheme achieves a higher data rate than pure OMA with probability 1 at high SNRs, and hence higher energy efficiency. Finally, numerical results verify the accuracy of the analysis and demonstrate the superior performance of the proposed H-NOMA scheme.
We present the DKU system for Task 2 of the MLC-SLM Challenge, which aims to perform multi-speaker automatic speech recognition directly from raw audio without oracle speaker labels or time boundaries. Our approach builds upon a diarization-aware framework integrating speaker embeddings and temporal utterance boundaries into a Qwen2.5-based large language model (LLM). We then enhance the system's multilingual performance by fine-tuning language-specific adapters and LoRA modules within the LLM decoder. Finally, our system achieves tcpWERs of 23.56\% and 18.08\% on the development and test sets of the MLC-SLM dataset, respectively, substantially outperforming the official baseline.
This paper proposes a neural stochastic optimization method for efficiently solving the two-stage stochastic unit commitment (2S-SUC) problem under high-dimensional uncertainty scenarios. The proposed method approximates the second-stage recourse problem using a deep neural network trained to map commitment decisions and uncertainty features to recourse costs. The trained network is subsequently embedded into the first-stage UC problem as a mixed-integer linear program (MILP), allowing for explicit enforcement of operational constraints while preserving the key uncertainty characteristics. A scenario-embedding network is employed to enable dimensionality reduction and feature aggregation across arbitrary scenario sets, serving as a data-driven scenario reduction mechanism. Numerical experiments on IEEE 5-bus, 30-bus, and 118-bus systems demonstrate that the proposed neural two-stage stochastic optimization method achieves solutions with an optimality gap of less than 1%, while enabling orders-of-magnitude speedup compared to conventional MILP solvers and decomposition-based methods. Moreover, the model's size remains constant regardless of the number of scenarios, offering significant scalability for large-scale stochastic unit commitment problems.
In the context of Synthetic Aperture Radar (SAR) image recognition, traditional methods often struggle with the intrinsic limitations of SAR data, such as weak texture, high noise, and ambiguous object boundaries. This work explores a novel perspective by reformulating SAR target recognition as a multimodal reasoning task. We leverage multimodal large language models (MLLMs), specifically GPT-4o, to perform target classification based on SAR imagery, guided by candidate categories and enhanced with Chain-of-Thought (CoT) reasoning. A new dataset is constructed based on the FAIR-CSAR benchmark, comprising raw SAR images, structured target annotations, candidate label sets, and GPT-generated CoT reasoning chains. Experimental results show that the MLLMs are capable of generating logically coherent and interpretable inferences in most scenarios. Our analysis highlights both the strengths and current limitations of MLLMs in interpreting SAR imagery, and we provide detailed insights into model behavior through failure case analysis. This work demonstrates the feasibility of incorporating MLLMs into SAR analysis pipelines and establishes a foundation for future research in SAR-oriented visual reasoning.
This article presents a physics-aware convolutional long short-term memory (PC-LSTM) network for efficient and accurate extraction of mutual impedance matrices in dipole antenna arrays. By reinterpreting the Green's function through a physics-aware neural network and embedding it into an adaptive loss function, the proposed machine learning-based approach achieves enhanced physical interpretability in mutual coupling modeling. Also, an attention mechanism is carefully designed to calibrate complex-valued features by fusing the real and imaginary parts of the Green's function matrix. These fused representations are then processed by a convolutional long short-term memory network, and the impedance matrix of the linear antenna array can be finally derived. Validation against five benchmarks underscores the efficacy of the proposed approach, demonstrating accurate impedance extraction with up to a 7x speedup compared to CST Microwave Studio, making it a fast alternative to full-wave simulations for mutual coupling characterization.
Pre-training methods have greatly improved the performance of sound event localization and detection (SELD). However, existing Transformer-based models still face high computational cost. To solve this problem, we present a stereo SELD system using a pre-trained PSELDnet and a bidirectional Mamba sequence model. Specifically, we replace the Conformer module with a BiMamba module. We also use asymmetric convolutions to better capture the time and frequency relationships in the audio signal. Test results on the DCASE2025 Task 3 development dataset show that our method performs better than both the baseline and the original PSELDnet with a Conformer decoder. In addition, the proposed model costs fewer computing resources than the baselines. These results show that the BiMamba architecture is effective for solving key challenges in SELD tasks. The source code is publicly accessible at this https URL alexandergwm/DCASE2025 TASK3 Stereo PSELD Mamba.
We propose a novel framework for phase retrieval that leverages Langevin dynamics to enable efficient posterior sampling, yielding reconstructions that explicitly balance distortion and perceptual quality. Unlike conventional approaches that prioritize pixel-wise accuracy, our method navigates the perception-distortion tradeoff through a principled combination of stochastic sampling, learned denoising, and model-based updates. The framework comprises three variants of increasing complexity, integrating theoretically grounded Langevin inference, adaptive noise schedule learning, parallel reconstruction sampling, and warm-start initialization from classical solvers. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, both in terms of fidelity and perceptual quality.
Phase retrieval involves recovering a signal from intensity-only measurements, crucial in many fields such as imaging, holography, optical computing, crystallography, and microscopy. Although there are several well-known phase retrieval algorithms, including classical iterative solvers, the reconstruction performance often remains sensitive to initialization and measurement noise. Recently, image-to-image diffusion models have gained traction in various image reconstruction tasks, yielding significant theoretical insights and practical breakthroughs. In this work, we introduce a novel phase retrieval approach based on an image-to-image diffusion framework called Inversion by Direct Iteration. Our method begins with an enhanced initialization stage that leverages a hybrid iterative technique, combining the Hybrid Input-Output and Error Reduction methods and incorporating a novel acceleration mechanism to obtain a robust crude estimate. Then, it iteratively refines this initial crude estimate using the learned image-to-image pipeline. Our method achieves substantial improvements in both training efficiency and reconstruction quality. Furthermore, our approach utilizes aggregation techniques to refine quality metrics and demonstrates superior results compared to both classical and contemporary techniques. This highlights its potential for effective and efficient phase retrieval across various applications.
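The classical initialization stage builds on the well-known Error Reduction (ER) and Hybrid Input-Output (HIO) iterations. A minimal 1-D sketch of an HIO/ER cycle with a support constraint follows; the paper's acceleration mechanism and learned image-to-image refinement are not reproduced here, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
support = np.zeros(n, dtype=bool)
support[:32] = True                              # known object support
x_true = np.zeros(n)
x_true[:32] = rng.random(32)
mag = np.abs(np.fft.fft(x_true))                 # intensity-only measurements

def project_fourier(x):
    """Impose the measured Fourier magnitudes, keep the current phases."""
    X = np.fft.fft(x)
    return np.real(np.fft.ifft(mag * np.exp(1j * np.angle(X))))

x, beta = project_fourier(rng.random(n)), 0.9
for it in range(500):
    y = project_fourier(x)
    good = support & (y >= 0)                    # object-domain constraints hold
    if it % 50 < 40:                             # HIO steps: negative feedback
        x = np.where(good, y, x - beta * y)
    else:                                        # ER steps: hard projection
        x = np.where(good, y, 0.0)

err = np.linalg.norm(np.abs(np.fft.fft(x)) - mag) / np.linalg.norm(mag)
print(f"relative magnitude error: {err:.3f}")    # crude estimate for refinement
```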
This paper presents a novel identification approach of Koopman models of nonlinear systems with inputs under rather general noise conditions. The method uses deep state-space encoders based on the concept of state reconstructability and an efficient multiple-shooting formulation of the squared loss of the prediction error to estimate the dynamics and the lifted state from input-output data. Furthermore, the Koopman model structure includes an innovation noise term that is used to handle process and measurement noise. It is shown that the proposed approach is statistically consistent and computationally efficient due to the multiple-shooting formulation where, on subsections of the data, multi-step prediction errors can be calculated in parallel. The latter allows for efficient batch optimization of the network parameters and, at the same time, excellent long-term prediction capabilities of the obtained models. The performance of the approach is illustrated by nonlinear benchmark examples.
Proton Pump Inhibitors (PPIs) are the standard of care for gastric acid disorders but carry significant risks when administered chronically at high doses. Precise long-term control of gastric acidity is challenged by the impracticality of invasive gastric acid monitoring beyond 72 hours and by wide inter-patient variability. We propose a noninvasive, symptom-based framework that tailors PPI dosing solely on patient-reported reflux and digestive symptom patterns. A Bayesian Neural Network prediction model learns to predict patient symptoms, and quantifies its uncertainty, from historical symptom scores, meal, and PPI intake data. These probabilistic forecasts feed a chance-constrained Model Predictive Control (MPC) algorithm that dynamically computes future PPI doses to minimize drug usage while enforcing acid suppression with high confidence, without any direct acid measurement. In silico studies over diverse dietary schedules and virtual patient profiles demonstrate that our learning-augmented MPC reduces total PPI consumption by 65 percent compared to standard fixed regimens, while maintaining acid suppression with at least 95 percent probability. The proposed approach offers a practical path to personalized PPI therapy, minimizing treatment burden and overdose risk without invasive sensors.
With the development of wireless network technologies, wireless image transmission has become a prominent research area. The need for high resolution, dense data traffic, widespread multimedia applications, and high-rate, reliable image transmission in medical and military fields necessitates the design of novel, high-performance wireless image transmission systems. This paper proposes a code index modulation (CIM)-based image transmission (CIM-IT) system that utilizes the spreading code index and quadrature amplitude modulation (QAM) symbols for image transmission over a wireless channel. The proposed CIM-IT system maps bits to each pixel value of the image to be transmitted and conveys these bits over a wireless channel using a single-input multiple-output system comprising code index modulation and QAM techniques. At the receiver, the active spreading code index and the selected QAM symbol are estimated using a despreading-based maximum likelihood detector, and the corresponding bits are recovered. The image is then reconstructed at the receiver side from the pixel values corresponding to the bits. The resulting noisy image is enhanced using established enhancement filters, and an advanced filter is also proposed to improve the degraded image with optimum results. Furthermore, the error performance, spectral efficiency, energy efficiency, and throughput of the CIM-IT system are evaluated, and the results are compared with traditional wireless communication techniques.
Public electric vehicle (EV) charging infrastructure is crucial for accelerating EV adoption and reducing transportation emissions; however, disparities in infrastructure access have raised significant equity concerns. This systematic review synthesizes existing knowledge and identifies gaps regarding equity in EV public charging research. Following structured review protocols, 91 peer-reviewed studies from Scopus and Google Scholar were analyzed, focusing explicitly on equity considerations. The findings indicate that current research on EV public charging equity has mainly adopted geographic information systems (GIS), network optimization, behavioral modeling, and hybrid analytical frameworks, yet it lacks consistent normative frameworks for assessing equity outcomes. Equity assessments highlight four key dimensions: spatial accessibility, cost burdens, reliability and usability, and user awareness and trust. Socio-economic disparities, particularly income, housing tenure, and ethnicity, frequently exacerbate inequitable access, disproportionately disadvantaging low-income, renter, and minority populations. Additionally, infrastructure-specific choices, including charger reliability, strategic location, and pricing strategies, significantly influence adoption patterns and equity outcomes. However, the existing literature primarily reflects North American, European, and Chinese contexts, revealing substantial geographical and methodological limitations. This review suggests the need for more robust normative evaluations of equity, comprehensive demographic data integration, and advanced methodological frameworks, thereby guiding targeted, inclusive, and context-sensitive infrastructure planning and policy interventions.
Medical imaging is one of the crucial diagnostic tools for bone-related diseases, especially bone fractures. This paper investigates the robustness of pre-trained deep learning models for classifying bone fractures in X-ray images and seeks to address global healthcare disparity through the lens of technology. Three pre-trained architectures, ResNet50, VGG16, and EfficientNetV2, are compared under varying simulated equipment-quality conditions. These models performed bone fracture classification as images were progressively degraded with noise. Specifically, this paper empirically studies how noise affects bone fracture detection and how the performance of pre-trained models changes as noise degrades X-ray image quality, aiming to replicate the real-world challenges experienced by medical imaging technicians across the world. The paper thus establishes a methodological framework for assessing AI model degradation using transfer learning and controlled noise augmentation. The findings provide practical insight into how robust and generalizable different pre-trained deep learning-powered computer vision models are when used in different contexts.
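A minimal sketch of the controlled-degradation protocol follows: evaluating a pre-trained ResNet50 on images with increasing Gaussian noise. The ImageNet weights and random placeholder image stand in for the fracture-fine-tuned model and X-ray data described above.

```python
import torch
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def predict_under_noise(img, sigma):
    """img: (3, 224, 224) tensor in [0, 1]; sigma: Gaussian noise std dev."""
    noisy = (img + sigma * torch.randn_like(img)).clamp(0, 1)
    with torch.no_grad():
        logits = model(normalize(noisy).unsqueeze(0))
    return logits.argmax(1).item()

img = torch.rand(3, 224, 224)              # placeholder for an X-ray image
for sigma in [0.0, 0.05, 0.1, 0.2]:        # progressive equipment degradation
    print(sigma, predict_under_noise(img, sigma))
```

Sweeping sigma over a labeled test set and plotting accuracy per noise level gives the degradation curves the study compares across architectures.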
Optimal power management of battery energy storage systems (BESS) is crucial for their safe and efficient operation. Numerical optimization techniques are frequently utilized to solve optimal power management problems. However, these techniques often fall short of delivering real-time solutions for large-scale BESS due to their computational complexity. To address this issue, this paper proposes a computationally efficient approach. We introduce a new set of decision variables called power-sharing ratios, one per cell, indicating each cell's allocated share of the output power demand. We then formulate an optimal power management problem to minimize system-wide power losses while ensuring compliance with safety, balancing, and power supply-demand match constraints. To solve this problem efficiently, a parameterized control policy is designed and leveraged to transform the optimal power management problem into a parameter estimation problem. We then implement ensemble Kalman inversion to estimate the optimal parameter set. The proposed approach significantly reduces computational requirements due to 1) the much lower dimensionality of the decision parameters and 2) the estimation treatment of the optimal power management problem. Finally, we conduct extensive simulations to validate the effectiveness of the proposed approach. The results show promise in accuracy and computation time compared with the numerical optimization techniques explored.
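A generic sketch of ensemble Kalman inversion for parameter estimation follows. The quadratic forward map is a toy stand-in for the paper's power-loss and constraint model; the ensemble size, noise covariance, and true parameters ([1, 2]) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(theta):                                    # forward map: params -> outputs
    return np.array([theta[0] + theta[1], theta[0] * theta[1]])

y = np.array([3.0, 2.0])                         # observed target outputs
Gamma = 1e-4 * np.eye(2)                         # observation noise covariance
ens = rng.normal([0.5, 2.5], 0.5, size=(50, 2))  # ensemble of parameter guesses

for _ in range(30):
    G = np.array([g(th) for th in ens])
    dth, dG = ens - ens.mean(0), G - G.mean(0)
    C_tg = dth.T @ dG / len(ens)                 # parameter-output covariance
    C_gg = dG.T @ dG / len(ens)                  # output covariance
    K = C_tg @ np.linalg.inv(C_gg + Gamma)       # Kalman-style gain
    noise = rng.multivariate_normal(np.zeros(2), Gamma, size=len(ens))
    ens = ens + (y + noise - G) @ K.T            # update every ensemble member

print("estimated parameters:", ens.mean(0))      # converges near [1, 2]
```

Because each iteration only needs forward-model evaluations (no gradients), the method scales to policies whose loss is defined by a simulation.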
Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children's Medical Center. To address limited data and class imbalance, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings (this https URL).
Analog in-memory computing (AIMC) is an energy-efficient alternative to digital architectures for accelerating machine learning and signal processing workloads. However, its energy efficiency is limited by the high energy cost of the column analog-to-digital converters (ADCs). Reducing the ADC precision is an effective approach to lowering its energy cost. However, doing so also reduces the AIMC's computational accuracy, making it critical to identify the minimum precision required to meet a target accuracy. Prior works overestimate the ADC precision requirements by modeling quantization error as input-independent noise, maximizing the signal-to-quantization-noise ratio (SQNR), and ignoring the discrete nature of the ideal pre-ADC signal. We address these limitations by developing analytical expressions for estimating the compute signal-to-noise ratio (CSNR), a true metric of accuracy for AIMCs, and propose CACTUS, an algorithm to obtain CSNR-optimal ADC parameters. Using a circuit-aware behavioral model of an SRAM-based AIMC in a 28nm CMOS process, we show that for a 256-dimensional binary dot product, CACTUS reduces the ADC precision requirements by 3b while achieving 6dB higher CSNR over prior methods. We also delineate operating conditions under which our proposed CSNR-optimal ADCs outperform conventional SQNR-optimal ADCs.
We address the problem of optimal joint scheduling of deferrable and nondeferrable demand involving colocated stochastic supply. Deferrable demand can be delayed within its service deadline, whereas nondeferrable demand must be scheduled immediately. Under a finite-horizon stochastic dynamic programming formulation, we show that the optimal scheduling policy is a ``procrastination policy'' that delays scheduling as much as possible and is characterized by three procrastination parameters. Exploiting the low-dimensional parameterization of the optimal policy, we propose a Procrastination Threshold Reinforcement Learning algorithm. Numerical experiments based on real-world test data confirm that the threshold-learning algorithm closely approximates the optimal policy and outperforms standard benchmarks.
The Deep Prior framework has emerged as a powerful generative tool which can be used for reconstructing sound fields in an environment from few sparse pressure measurements. It employs a neural network that is trained solely on a limited set of available data and acts as an implicit prior which guides the solution of the underlying optimization problem. However, a significant limitation of the Deep Prior approach is its inability to generalize to new acoustic configurations, such as changes in the position of a sound source. As a consequence, the network must be retrained from scratch for every new setup, which is both computationally intensive and time-consuming. To address this, we investigate transfer learning in Deep Prior via Low-Rank Adaptation (LoRA), which enables efficient fine-tuning of a pre-trained neural network by introducing a low-rank decomposition of trainable parameters, thus allowing the network to adapt to new measurement sets with minimal computational overhead. We embed LoRA into a MultiResUNet-based Deep Prior model and compare its adaptation performance against full fine-tuning of all parameters as well as classical retraining, particularly in scenarios where only a limited number of microphones are used. The results indicate that fine-tuning, whether done completely or via LoRA, is especially advantageous when the source location is the sole changing parameter, preserving high physical fidelity, and highlighting the value of transfer learning for acoustics applications.
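A minimal sketch of the LoRA mechanism follows: the pre-trained weight is frozen and only a low-rank update B @ A (rank r) is trained when adapting to, say, a new source position. The layer sizes, rank, and scaling are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable
    low-rank update: y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 4096 trainable values vs. 65,792 for full fine-tuning
```

Initializing B to zero makes the adapted network exactly match the pre-trained one before fine-tuning begins, which is the usual LoRA starting point.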
Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) that operates without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete-token solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters -- 193M for our Base and 462M for our Large models.
Model Predictive Control (MPC)-based Reinforcement Learning (RL) offers a structured and interpretable alternative to Deep Neural Network (DNN)-based RL methods, with lower computational complexity and greater transparency. However, standard MPC-RL approaches often suffer from slow convergence, suboptimal policy learning due to limited parameterization, and safety issues during online adaptation. To address these challenges, we propose a novel framework that integrates MPC-RL with Multi-Objective Bayesian Optimization (MOBO). The proposed MPC-RL-MOBO utilizes noisy evaluations of the RL stage cost and its gradient, estimated via a Compatible Deterministic Policy Gradient (CDPG) approach, and incorporates them into a MOBO algorithm using the Expected Hypervolume Improvement (EHVI) acquisition function. This fusion enables efficient and safe tuning of the MPC parameters to achieve improved closed-loop performance, even under model imperfections. A numerical example demonstrates the effectiveness of the proposed approach in achieving sample-efficient, stable, and high-performance learning for control systems.
Central to Earth observation is the trade-off between spatial and temporal resolution. For temperature, this is especially critical because real-world applications require high spatiotemporal resolution data. Current technology allows for hourly temperature observations at 2 km, but only every 16 days at 100 m, a gap further exacerbated by cloud cover. Earth system models offer continuous hourly temperature data, but at a much coarser spatial resolution (9-31 km). Here, we present a physics-guided deep learning framework for temperature data reconstruction that integrates these two data sources. The proposed framework uses a convolutional neural network that incorporates the annual temperature cycle and includes a linear term that maps the coarse Earth system model output to the fine-scale temperature values observed from satellites. We evaluated this framework using data from two satellites, GOES-16 (2 km, hourly) and Landsat (100 m, every 16 days), and demonstrated effective temperature reconstruction with hold-out and in situ data across four datasets. This physics-guided deep learning framework opens new possibilities for generating high-resolution temperature data across spatial and temporal scales, under all weather conditions and globally.
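A minimal sketch of such a physics-guided architecture is given below, assuming one coarse temperature channel, a learned linear amplification term, and the annual cycle encoded as sin/cos features; the layer sizes and the exact way the paper injects the annual cycle are assumptions.

```python
import torch
import torch.nn as nn

class PhysicsGuidedDownscaler(nn.Module):
    """Hypothetical sketch: fine-scale temperature predicted as a learned
    linear amplification of the coarse model field plus a CNN residual,
    with the annual cycle supplied as sin/cos input channels."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.linear_gain = nn.Conv2d(1, 1, kernel_size=1)   # scalar gain and bias, applied pixelwise
        self.residual = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, coarse_t, day_of_year):
        b, _, h, w = coarse_t.shape
        phase = 2 * torch.pi * day_of_year / 365.25          # annual temperature cycle
        cyc = torch.stack([torch.sin(phase), torch.cos(phase)], dim=1)
        x = torch.cat([coarse_t, cyc.view(b, 2, 1, 1).expand(b, 2, h, w)], dim=1)
        return self.linear_gain(coarse_t) + self.residual(x)
```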
In Zak-OTFS (orthogonal time frequency space) modulation the carrier waveform is a pulse in the delay-Doppler (DD) domain, formally a quasi-periodic localized function with specific periods along delay and Doppler. When the channel delay spread is less than the delay period, and the channel Doppler spread is less than the Doppler period, the response to a single Zak-OTFS carrier provides an image of the scattering environment and can be used to predict the effective channel at all other carriers. The image of the scattering environment changes slowly, making it possible to employ precoding at the transmitter. Precoding techniques were developed more than thirty years ago for wireline modem channels (V.34 standard) defined by linear convolution, where a pulse in the time domain (TD) is used to probe the one-dimensional partial response channel. The action of a doubly spread channel on Zak-OTFS modulation determines a two-dimensional partial response channel defined by twisted convolution, and we develop a novel precoding technique for this channel. The proposed precoder leads to separate equalization of each DD carrier, which has significantly lower complexity than joint equalization of all carriers. Further, the effective precoded channel results in non-interfering DD carriers, which substantially reduces the overhead of the guard carriers separating data and pilot carriers and thereby significantly improves spectral efficiency.
Massive Aerial Processing for X (MAP-X) is an innovative framework for reconstructing spatially correlated ground data, such as environmental or industrial measurements distributed across a wide area, into data maps using a single high altitude pseudo-satellite (HAPS) and a large number of distributed sensors. With subframe-level data reconstruction, MAP-X provides a transformative solution for latency-sensitive IoT applications. This article explores two distinct approaches for AI integration in the post-processing stage of MAP-X. The deep neural network (DNN)-based pointwise estimation approach enables real-time, adaptive reconstruction through online training, while the convolutional neural network (CNN)-based image reconstruction approach improves reconstruction accuracy through offline training with non-real-time data. Simulation results show that both approaches significantly outperform the conventional inverse discrete Fourier transform (IDFT)-based linear post-processing method. Furthermore, to enable AI-enhanced MAP-X, we propose a ground-HAPS cooperation framework, where terrestrial stations collect, process, and relay training data to the HAPS. With its enhanced capability in reconstructing field data, AI-enhanced MAP-X is applicable to various real-world use cases, including disaster response and network management.
This study investigates the effectiveness of U-Net architectures integrated with various convolutional neural network (CNN) backbones for automated lung cancer detection and segmentation in chest CT images, addressing the critical need for accurate diagnostic tools in clinical settings. A balanced dataset of 832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to 128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50, VGG16, and Xception, to segment lung regions. After segmentation, CNN-based classifiers and hybrid models combining CNN feature extraction with traditional machine learning classifiers (Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC. U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For classification, the CNN model using U-Net with Xception achieved 99.1 percent accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent F1-score. Compared to prior methods, our framework consistently outperformed existing models. In conclusion, combining U-Net with advanced CNN backbones provides a powerful method for both segmentation and classification of lung cancer in CT scans, supporting early diagnosis and clinical decision-making.
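The hybrid stage can be reproduced in spirit with a few lines of scikit-learn: deep features extracted from a CNN backbone are fed to a classical classifier under 5-fold cross-validation. The feature and label files below are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical inputs: deep features from the penultimate layer of a CNN
# backbone (e.g. Xception) on segmented lungs, plus binary cancer labels.
deep_features = np.load("xception_features.npy")   # shape (n_samples, d), assumed file
labels = np.load("labels.npy")                     # shape (n_samples,), assumed file

# CNN features -> SVM classifier, evaluated with 5-fold cross-validation.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, deep_features, labels, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```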
Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that performs ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance the efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
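The core idea of interpolation mixing can be sketched as follows, assuming per-image mixing weights over three fixed interpolators (in IM-LUT the weights are predicted by IM-Net from local patterns and the target scale, and the whole pipeline is baked into look-up tables):

```python
import numpy as np
from scipy.ndimage import zoom

def mix_interpolations(lr_img, scale, weights):
    """Blend fixed interpolators with predicted weights. Here the weights
    are a per-image placeholder; the real method predicts them locally."""
    candidates = [
        zoom(lr_img, scale, order=0),   # nearest neighbour
        zoom(lr_img, scale, order=1),   # bilinear
        zoom(lr_img, scale, order=3),   # bicubic
    ]
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # convex combination
    return sum(w * c for w, c in zip(weights, candidates))

sr = mix_interpolations(np.random.rand(32, 32), scale=2.5, weights=[0.1, 0.3, 0.6])
```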
This work investigates speech enhancement (SE) from the perspective of language models (LMs). We propose a novel method that leverages Direct Preference Optimization (DPO) to improve the perceptual quality of enhanced speech. Using UTMOS, a neural MOS prediction model, as a proxy for human ratings, our approach guides optimization toward perceptually preferred outputs. This differs from existing LM-based SE methods that focus on maximizing the likelihood of clean speech tokens, which may misalign with human perception and degrade quality despite low prediction error. Experiments on the 2020 Deep Noise Suppression Challenge test sets demonstrate that applying DPO to a pretrained LM-based SE model yields consistent improvements across various speech quality metrics, with relative gains of up to 56%. To our knowledge, this is the first application of DPO to SE and the first to incorporate proxy perceptual feedback into LM-based SE training, pointing to a promising direction for perceptually aligned SE.
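For reference, the standard DPO objective on (preferred, rejected) pairs looks as follows in PyTorch; in the SE setting described above, preference would come from UTMOS scores of the enhanced outputs rather than human labels (a hedged sketch, not the authors' training code):

```python
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO objective on (preferred, rejected) pairs of sequence
    log-probabilities. In the SE setting, 'preferred' would be the enhanced
    utterance that UTMOS rates higher, 'rejected' the one rated lower."""
    ratio_w = logp_w_policy - logp_w_ref    # log-ratio for the preferred output
    ratio_l = logp_l_policy - logp_l_ref    # log-ratio for the rejected output
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```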
Multiple advantages have been identified with the integration of data acquisition into existing system configurations and implementations. Using data acquisition to support a monitoring system not only improves its overall performance and reliability but also lowers its operational and maintenance costs through real-time data collection from node sensors. For renewable energy to be sustainable and to fully support the energy demand of communities, its management and control still need to be improved and enhanced. Smart systems are considered the next-generation technological improvement of any existing system and are the prelude to autonomous systems, from industrial applications to home automation. Data acquisition is only one part of these smart systems, enabling the remote management and control of such devices. Remote monitoring functionality enhances operation and reliability, supporting proactive decisions during critical situations. Even with these enhancements, there is still room for improving data acquisition implementations with respect to data security, privacy, and the accuracy of the information exchanged between nodes. Current technological advancements have already shown promising results and have widened the utilization spectrum of data acquisition to cover almost any field of specialization. However, the increasing implementation and design complexity that comes with these enhancements raises challenges and issues that need to be addressed to mitigate their effects.
In multiple-input multiple-output integrated sensing and communication (MIMO ISAC) systems, radio frequency chain (i.e., RF chain) selection plays a vital role in reducing hardware cost, power consumption, and computational complexity. However, designing an effective RF chain selection strategy is challenging due to the disparity in performance metrics between communication and sensing: mutual information (MI) for the former versus beam-pattern mean-squared error (MSE) or the Cramér-Rao lower bound (CRLB) for the latter. To overcome this, we propose a low-complexity greedy RF chain selection framework maximizing a unified MI-based performance metric applicable to both functions. By decomposing the total MI into individual contributions of each RF chain, we introduce two approaches: greedy eigen-based selection (GES) and greedy cofactor-based selection (GCS), which iteratively identify and remove the RF chains with the lowest contribution. We further extend our framework to beam selection for beamspace MIMO ISAC systems, introducing diagonal beam selection (DBS) as a simplified solution. Simulation results show that our proposed methods achieve near-optimal performance with significantly lower complexity than exhaustive search, demonstrating their practical effectiveness for MIMO ISAC systems.
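A generic version of such greedy MI-based selection, assuming a Gaussian MI of the form $\log_2\det(\mathbf{I} + \mathrm{snr}\,\mathbf{H}\mathbf{H}^H)$ and dropping the lowest-contribution chain each round (a simplification of GES/GCS, not the paper's exact decomposition), can be sketched as:

```python
import numpy as np

def greedy_chain_selection(H, n_keep, snr=1.0):
    """Drop, one at a time, the RF chain (row of H) whose removal costs
    the least Gaussian mutual information log2 det(I + snr * H H^H)."""
    def mi(rows):
        Hs = H[rows]
        G = np.eye(len(rows)) + snr * Hs @ Hs.conj().T
        return np.log2(np.linalg.det(G).real)

    active = list(range(H.shape[0]))
    while len(active) > n_keep:
        base = mi(active)
        # Contribution of chain c = MI lost if c were removed.
        losses = [(base - mi([r for r in active if r != c]), c) for c in active]
        active.remove(min(losses)[1])
    return active

kept = greedy_chain_selection(np.random.randn(8, 16), n_keep=4, snr=0.5)
```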
Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is essential for neuro-oncology diagnosis and treatment planning. Despite advances in deep learning methods, automatic segmentation remains challenging due to tumor morphological heterogeneity and complex three-dimensional spatial relationships. Current techniques primarily rely on visual features extracted from MRI sequences while underutilizing semantic knowledge embedded in medical reports. This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information, facilitating comprehensive processing from low-level data to high-level concepts. The semantic-level fusion pathway combines the semantic understanding capabilities of Contrastive Language-Image Pre-training (CLIP) models with the spatial feature extraction advantages of 3D U-Net through three mechanisms: 3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based attention mechanisms. Experimental validation on the BraTS 2020 dataset demonstrates that the proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with a 7.3% Dice coefficient increase in the clinically important enhancing tumor (ET) region.
Wireless channel modeling in complex environments is crucial for wireless communication system design and deployment. Traditional channel modeling approaches face challenges in balancing accuracy, efficiency, and scalability, while recent neural approaches such as neural radiance field (NeRF) suffer from long training and slow inference. To tackle these challenges, we propose voxelized radiance field (VoxelRF), a novel neural representation for wireless channel modeling that enables fast and accurate synthesis of spatial spectra. VoxelRF replaces the costly multilayer perceptron (MLP) used in NeRF-based methods with trilinear interpolation over a voxel grid-based representation, together with two shallow MLPs that model propagation and transmitter-dependent effects. To further accelerate training and improve generalization, we introduce progressive learning, empty space skipping, and an additional background entropy loss function. Experimental results demonstrate that VoxelRF achieves competitive accuracy with significantly reduced computation and limited training data, making it more practical for real-time and resource-constrained wireless applications.
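The replacement of a deep MLP query by a cheap voxel-grid lookup rests on trilinear interpolation, sketched below for a single query point (feature dimensions and boundary handling omitted for brevity):

```python
import numpy as np

def trilinear(grid, p):
    """Trilinear interpolation of a voxel grid at a continuous point
    p = (x, y, z) in voxel coordinates: the cheap lookup that replaces
    a deep MLP query."""
    i = np.floor(p).astype(int)
    f = p - i                                   # fractional offsets in [0, 1)
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (f[0] if dx else 1 - f[0]) \
                  * (f[1] if dy else 1 - f[1]) \
                  * (f[2] if dz else 1 - f[2])
                out = out + w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

voxels = np.random.rand(16, 16, 16)
print(trilinear(voxels, np.array([3.2, 7.9, 0.5])))
```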
Brain tumor segmentation plays a critical role in clinical diagnosis and treatment planning, yet the variability in imaging quality across different MRI scanners presents significant challenges to model generalization. To address this, we propose the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to adaptively fine-tune segmentation models based on clinician feedback, thereby enhancing robustness to scanner-specific imaging characteristics. Central to this system is the Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive Encoder (M2AE) to extract multi-scale semantic features efficiently, and a Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model complementary cross-modal relationships via graph structures. Additionally, we introduce a novel Voxel Refinement UpSampling Module (VRUM) that synergistically combines linear interpolation and multi-scale transposed convolutions to suppress artifacts while preserving high-frequency details, improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million parameters, representing a 98% reduction compared to mainstream 3D Transformer models, and significantly outperforms existing lightweight approaches. This work demonstrates that high-accuracy, resource-efficient brain tumor segmentation is achievable, making the approach suitable for deployment in resource-constrained clinical environments.
This paper presents a trust-based predictive multi-agent consensus protocol that analyses neighbours' anticipation data and makes coordination decisions. Agents in the network share their future predicted data over a finite look-ahead horizon with their neighbours and update their predictions in a rolling-horizon fashion. The prediction data is then used by agents to learn both the trust and the commitment traits exhibited by their neighbours over time. The proposed protocol is named the Anticipatory Distributed Coordination (ADC) protocol. A Lyapunov-based proof of agreement convergence between agents is provided, followed by demonstrations using numerical simulations.
A broad range of applications involve signals with irregular structures that can be represented as a graph. As the underlying structures can change over time, tracking dynamic graph topologies from observed signals is a fundamental challenge in graph signal processing (GSP), with applications in various domains, such as power systems, brain-machine interfaces, and communication systems. In this paper, we propose a method for tracking dynamic changes in graph topologies. Our approach builds on a representation of the dynamics as a graph-based nonlinear state-space model (SSM), where the observations are graph signals generated through graph filtering, and the underlying evolving topology serves as the latent states. In our formulation, the graph Laplacian matrix is parameterized using the incidence matrix and edge weights, enabling a structured representation of the state. In order to track the evolving topology in the resulting SSM, we develop a sparsity-aware extended Kalman filter (EKF) that integrates $\ell_1$-regularized updates within the filtering process. Furthermore, a dynamic programming scheme to efficiently compute the Jacobian of the graph filter is introduced. Our numerical study demonstrates the ability of the proposed method to accurately track sparse and time-varying graphs under realistic conditions, with highly nonlinear measurements, various noise levels, and different change rates, while maintaining low computational complexity.
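The structured state representation mentioned above can be written in a few lines: with incidence matrix $\mathbf{B}$ and edge weights $\mathbf{w}$, the Laplacian is $\mathbf{L} = \mathbf{B}\,\mathrm{diag}(\mathbf{w})\,\mathbf{B}^T$, so the filter only needs to track the (sparse) weight vector. A minimal sketch:

```python
import numpy as np

def laplacian_from_edges(incidence, weights):
    """Graph Laplacian L = B diag(w) B^T from incidence matrix B
    (nodes x edges) and edge weights w, so the EKF state is just w."""
    return incidence @ np.diag(weights) @ incidence.T

# Triangle graph with edges (0,1), (1,2), (0,2).
B = np.array([[ 1,  0,  1],
              [-1,  1,  0],
              [ 0, -1, -1]])
L = laplacian_from_edges(B, np.array([1.0, 2.0, 0.5]))
```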
The angular droop control is a grid-forming control strategy that exploits the idea of power-to-angle droop to achieve exact frequency synchronization with no stringent separation between primary and secondary frequency control. In this work, we conduct hardware experiments in the Smart Energy System Control Laboratory at Karlsruhe Institute of Technology (KIT) to test and validate the angular droop control for low voltage power grids in two different test scenarios. First, we verify its grid-forming capabilities after a major event, e.g., following a blackout, demonstrated via power-to-angle droop behavior. For this, we propose two implementation schemes that rely either on direct or indirect actuation of the modulation signal and draw a comparison between them. Second, we investigate the plug-and-play capabilities, i.e., local stability and power sharing for a two-converter system and provide suitable tuning for the control gains. Our experimental findings illustrate the usefulness of hardware test and validation for DC/AC converter control, the practical challenges entailed and the proposed remedies.
Uncertainties influencing dynamical systems pose a significant challenge in estimating the achievable performance of a controller for such uncertain systems. When the uncertainties are stochastic in nature, hard guarantees for the robustness of a controller hedging against the uncertainty cannot be obtained. This issue set the stage for the development of probabilistic robust control approaches. In this work, we utilise the gap metric between the known nominal model and the unknown perturbed model of the uncertain system as a tool to gauge the robustness of a controller, and we formulate the gap as a random variable in the setting with stochastic uncertainties. The main results of this paper include a probabilistic bound on the gap exceeding a known threshold, followed by bounds on the expected gap value and probabilistic robust stability in terms of the gap metric. Further, we provide a probabilistic certificate of controller performance under gap uncertainty and a probabilistic guarantee on the achievable $\mathcal{H}_{\infty}$ robustness. Numerical simulations are provided throughout to demonstrate the proposed approach.
Components of electrical power systems are susceptible to failures caused by lightning strikes, aging, or human errors. These faults can damage equipment, affect system reliability, and result in expensive repairs. As electric power systems become more complex, traditional protection methods face limitations and shortcomings. Faults in power systems can occur at any time and anywhere, can be caused by a natural disaster or an accident, and can hardly be predicted or avoided; therefore, it is crucial to accurately estimate the fault location and quickly restore service. The development of methods capable of accurately detecting, locating, and removing faults is essential: fast isolation of faults is necessary to maintain system stability at transmission levels, while accurate and fast detection and location of faults are essential for increasing reliability and customer satisfaction at distribution levels. This has motivated the development of new and more efficient methods. Methods developed to detect and locate faults in power systems can be divided into two categories: conventional and artificial intelligence-based techniques. Although artificial intelligence (AI) techniques offer tremendous potential, they can be challenging and time consuming (e.g., many AI techniques require training data for processing). This paper presents a survey of the application of AI techniques to fault diagnosis (detection, classification, and location of faults) of lines and cables of power systems at both transmission and distribution levels. The paper provides a short introduction to AI concepts, a brief summary of the application of AI techniques to power system analysis and design, and a discussion of AI-based fault diagnosis methods.
This paper proposes a deep learning-based beamforming design framework that directly maps a target beam pattern to optimal beamforming vectors across multiple antenna array architectures, including digital, analog, and hybrid beamforming. The proposed method employs a lightweight encoder-decoder network where the encoder compresses the complex beam pattern into a low-dimensional feature vector and the decoder reconstructs the beamforming vector while satisfying hardware constraints. To address training challenges under diverse and limited channel state information (CSI) conditions, a two-stage training process is introduced, which consists of an offline pre-training for robust feature extraction using an auxiliary module, followed by online training of the decoder with a composite loss function that ensures alignment between the synthesized and target beam patterns in terms of the main lobe shape and side lobe suppression. Simulation results based on NYUSIM-generated channels show that the proposed method can achieve spectral efficiency close to that of fully digital beamforming under limited CSI and outperforms representative existing methods.
We study the optimal placement of an unlimited-capacity battery in power grids under a centralized market model, where the independent system operator (ISO) aims to minimize total generation costs through load shifting. The optimal battery placement is not well understood in the existing literature, especially regarding the influence of network topology on minimizing generation costs. Our work starts with decomposing the Mixed-Integer Linear Programming (MILP) problem into a series of Linear Programming (LP) formulations. For power grids with sufficiently large generation capacity or tree topologies, we derive analytical cost expressions demonstrating that, under reasonable assumptions, the weighted degree is the only topological factor for optimal battery placement. We also discuss the minor impact of higher-order topological conditions on tree-topology networks. To characterize the localized nature of a single battery's impact, we establish that the relative cost-saving benefit of a single battery decreases as the network scales. Furthermore, we design a low-complexity algorithm for weakly-cyclic networks. Numerical experiments show that our algorithm is not only approximately 100 times faster than commercial solvers but also maintains high accuracy even when some theoretical assumptions are relaxed.
Decomposing multivariate time series with certain basic dynamics is crucial for understanding, predicting and controlling nonlinear spatiotemporally dynamic systems such as the brain. Dynamic mode decomposition (DMD) is a method for decomposing nonlinear spatiotemporal dynamics into several basic dynamics (dynamic modes; DMs) with intrinsic frequencies and decay rates. In particular, unlike Fourier transform-based methods, which are used to decompose a single-channel signal into the amplitudes of sinusoidal waves with discrete frequencies at a regular interval, DMD can derive the intrinsic frequencies of a multichannel signal on the basis of the available data; furthermore, it can capture nonstationary components such as alternations between states with different intrinsic frequencies. Here, we propose the use of the distribution of intrinsic frequencies derived from DMDs (DM frequencies) to characterise neural activities. The distributions of DM frequencies in the electroencephalograms of healthy subjects and patients with dementia or Parkinson's disease in a resting state were evaluated. By using the distributions, these patients were distinguished from healthy subjects with significantly greater accuracy than when using amplitude spectra derived by discrete Fourier transform. This finding suggests that the distribution of DM frequencies exhibits distinct behaviour from amplitude spectra, and therefore, the distribution may serve as a new biomarker by characterising the nonlinear spatiotemporal dynamics of electrophysiological signals.
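For readers unfamiliar with DMD, the following is a standard exact-DMD sketch (not the authors' code) that recovers intrinsic frequencies and decay rates from a channels-by-time data matrix via a rank-r SVD:

```python
import numpy as np

def dmd_frequencies(X, dt, r):
    """Exact DMD: from a channels x time matrix X, fit a rank-r linear
    map between successive snapshots and return the intrinsic frequencies
    (Hz) and decay rates of its eigenvalues."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals = np.linalg.eigvals(A_tilde).astype(complex)
    omega = np.log(eigvals) / dt                 # continuous-time exponents
    return omega.imag / (2 * np.pi), omega.real  # frequencies, decay rates
```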
Acoustic beamforming models typically assume wide-sense stationarity of speech signals within short time frames. However, voiced speech is better modeled as a cyclostationary (CS) process, a random process whose mean and autocorrelation are $T_1$-periodic, where $\alpha_1=1/T_1$ corresponds to the fundamental frequency of vowels. Higher harmonic frequencies are found at integer multiples of the fundamental. This work introduces a cyclic multichannel Wiener filter (cMWF) for speech enhancement derived from a cyclostationary model. This beamformer exploits spectral correlation across the harmonic frequencies of the signal to further reduce the mean-squared error (MSE) between the target and the processed input. The proposed cMWF is optimal in the MSE sense and reduces to the MWF when the target is wide-sense stationary. Experiments demonstrate considerable improvements in scale-invariant signal-to-distortion ratio (SI-SDR) on synthetic data but also indicate high sensitivity to the accuracy of the estimated fundamental frequency $\alpha_1$, which limits effectiveness on real data.
This letter investigates the potential of pinching-antenna systems for enhancing physical layer security. By pre-installing multiple pinching antennas at discrete positions along a waveguide, the capability of the considered system to perform amplitude and phase adjustment is validated through the formulation of a secrecy rate maximization problem. Specifically, amplitude control is applied to enhance the signal quality at the legitimate user, while phase alignment is designed to degrade the received signal quality at the eavesdropper. This cooperation among pinching antennas is modeled as a coalitional game, and a corresponding antenna activation algorithm is proposed. The individual impact of each antenna is quantified based on the Shapley value and marginal contribution, providing a fair and efficient method for performance evaluation. Simulation results show that the considered pinching-antenna system achieves significant improvements in secrecy rate, and that the Shapley value based algorithm outperforms conventional coalition value based solutions.
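The Shapley value computation itself is standard; a brute-force sketch over all coalitions, feasible for the small antenna counts considered, is shown below, with `value` standing in for the achieved secrecy rate of an activated antenna subset (a hypothetical callable):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by coalition enumeration; `value` maps a
    frozenset of activated antennas to the achieved secrecy rate."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(frozenset(S) | {p}) - value(frozenset(S)))
        phi[p] = total
    return phi

# Toy super-additive rate function over three antennas.
rate = lambda S: 0.1 * len(S) ** 2
print(shapley_values([0, 1, 2], rate))
```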
The aim of this letter is to explore the capability of pinching-antenna systems to construct line-of-sight (LoS) links in the presence of LoS blockages. Specifically, pinching antennas are pre-installed at preconfigured positions along waveguides and can be selectively activated to create LoS links for enhancing desired signals and non-line-of-sight (NLoS) links for eliminating inter-user interference. On this basis, a sum-rate maximization problem is formulated by jointly optimizing waveguide assignment and antenna activation. To solve this problem, a matching based algorithm is proposed using two distinct preference designs. Simulation results demonstrate that the considered pinching-antenna system and proposed solutions can dynamically establish LoS links and effectively exploit LoS blockages to mitigate interference, thereby significantly improving system throughput.
Speech processing algorithms often rely on statistical knowledge of the underlying process. Despite many years of research, however, the debate on the most appropriate statistical model for speech still continues. Speech is commonly modeled as a wide-sense stationary (WSS) process. However, the use of the WSS model for spectrally correlated processes is fundamentally wrong, as WSS implies spectral uncorrelation. In this paper, we demonstrate that voiced speech can be more accurately represented as a cyclostationary (CS) process. By employing the CS rather than the WSS model for processes that are inherently correlated across frequency, it is possible to improve the estimation of cross-power spectral densities (PSDs), source separation, and beamforming. We illustrate how the correlation between harmonic frequencies of CS processes can enhance system identification, and validate our findings using both simulated and real speech data.
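The spectral correlation at issue can be probed with a simple estimator of the cyclic autocorrelation; in the textbook sketch below (not the paper's implementation), nonzero values at cycle frequencies equal to integer multiples of the fundamental would indicate the cyclostationarity of voiced speech:

```python
import numpy as np

def cyclic_autocorrelation(x, alpha, tau, fs):
    """Estimate R_x(alpha, tau): the lag-tau correlation of x demodulated
    at cycle frequency alpha (Hz). A WSS signal would give (near) zero
    for all alpha != 0."""
    t = np.arange(len(x) - tau) / fs
    return np.mean(x[: len(x) - tau] * np.conj(x[tau:]) * np.exp(-2j * np.pi * alpha * t))
```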
Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions - expressed in the form of can-do descriptors - originally intended for human examiners, aiming to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
We show how a recently published 2d model for traffic flow can be further improved. Besides other improvements and simplifications, we present not only a method to compute the necessary time step restrictions, but also a subcycling scheme for the inflow and outflow. This drastically reduces the computational cost on large domains with coarse grids, i.e., for simulations of a whole region instead of a small part of a city or town.
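The subcycling idea can be sketched as follows, assuming the interior update tolerates a larger CFL-limited step than the inflow/outflow treatment; `flux_step` and `io_step` are hypothetical stand-ins for the model's interior and boundary updates:

```python
import numpy as np

def advance_with_subcycling(state, flux_step, io_step, dt_interior, dt_io, t_end):
    """Sketch: the interior update runs at the CFL-limited step
    dt_interior, while the more restrictive inflow/outflow update is
    subcycled at dt_io <= dt_interior, so the coarse-grid interior step
    is not throttled by the boundaries."""
    n_sub = int(np.ceil(dt_interior / dt_io))
    dt_sub = dt_interior / n_sub
    t = 0.0
    while t < t_end:
        state = flux_step(state, dt_interior)   # one large interior step
        for _ in range(n_sub):                  # several small boundary steps
            state = io_step(state, dt_sub)
        t += dt_interior
    return state
```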
Accurate and timely cancer diagnosis from histopathological slides is vital for effective clinical decision-making. This paper introduces DepViT-CAD, a deployable AI system for multi-class cancer diagnosis in histopathology. At its core is MAViT, a novel Multi-Attention Vision Transformer designed to capture fine-grained morphological patterns across diverse tumor types. MAViT was trained on expert-annotated patches from 1008 whole-slide images, covering 11 diagnostic categories, including 10 major cancers and non-tumor tissue. DepViT-CAD was validated on two independent cohorts: 275 WSIs from The Cancer Genome Atlas and 50 routine clinical cases from pathology labs, achieving diagnostic sensitivities of 94.11% and 92%, respectively. By combining state-of-the-art transformer architecture with large-scale real-world validation, DepViT-CAD offers a robust and scalable approach for AI-assisted cancer diagnostics. To support transparency and reproducibility, software and code will be made publicly available on GitHub.
In this paper, we introduce ASDKit, a toolkit for the anomalous sound detection (ASD) task. Our aim is to facilitate ASD research by providing an open-source framework that collects and carefully evaluates various ASD methods. First, ASDKit provides training and evaluation scripts for a wide range of ASD methods, all handled within a unified framework. For instance, it includes the autoencoder-based official DCASE baseline, representative discriminative methods, and self-supervised learning-based methods. Second, it supports comprehensive evaluation on the DCASE 2020--2024 datasets, enabling careful assessment of ASD performance, which is highly sensitive to factors such as datasets and random seeds. In our experiments, we re-evaluate various ASD methods using ASDKit and identify consistently effective techniques across multiple datasets and trials. We also demonstrate that ASDKit reproduces state-of-the-art-level performance on the considered datasets.
Digital twins are increasingly applied in transportation modelling to replicate real-world traffic dynamics and evaluate mobility and energy efficiency. This study presents a SUMO-based digital twin that simulates mixed internal combustion engine vehicle (ICEV) and electric vehicle (EV) traffic on a major motorway segment, leveraging multi-sensor data fusion from inductive loops, GPS probes, and toll records. The model is validated under both complete and partial information scenarios, achieving 93.1% accuracy in average speed estimation and 97.1% in average trip length estimation. Statistical metrics, including KL Divergence and Wasserstein Distance, demonstrate strong alignment between simulated and observed traffic patterns. Furthermore, CO2 emissions were overestimated by only 0.8-2.4%, and EV power consumption underestimated by 1.0-5.4%, highlighting the model's robustness even with incomplete vehicle classification information.
This paper investigates downlink transmission in 5G Integrated Satellite-Terrestrial Networks (ISTNs) supporting automotive users (UEs) in urban environments, where base stations (BSs) and Low Earth Orbit (LEO) satellites (LSats) cooperate to serve moving UEs over shared C-band frequency carriers. Urban settings, characterized by dense obstructions, together with UE mobility and the dynamic movement and coverage of LSats, pose significant challenges to user association and resource allocation. To address these challenges, we formulate a multi-objective optimization problem designed to improve both throughput and seamless handover (HO). Particularly, the formulated problem balances sum-rate (SR) maximization and connection change (CC) minimization through a weighted trade-off by jointly optimizing power allocation and BS-UE/LSat-UE associations over a given time window. This is a mixed-integer and non-convex problem which is inherently difficult to solve. To solve this problem efficiently, we propose an iterative algorithm based on the Successive Convex Approximation (SCA) technique. Furthermore, we introduce a practical prediction-based algorithm capable of providing efficient solutions in real-world implementations. Notably, the simulations use a realistic 3D map of London and UE routes obtained from the Google Navigator application to ensure a practical examination. Thanks to these realistic data, the simulation results offer valuable insights into link budget assessment in urban areas, capturing the impact of buildings on transmission links through blockage, reflection, and diffraction effects. Furthermore, the numerical results demonstrate the effectiveness of our proposed algorithms in terms of SR and the number of CCs compared to greedy and benchmark algorithms.
This work presents several improvements to the closed-loop stability verification framework using semialgebraic sets and convex semidefinite programming to examine neural-network-based control systems regulating nonlinear dynamical systems. First, the utility of the framework is greatly expanded: two semialgebraic functions mimicking common, smooth activation functions are presented and compatibility with control systems incorporating Recurrent Equilibrium Networks (RENs) and thereby Recurrent Neural Networks (RNNs) is established. Second, the validity of the framework's state-of-the-art stability analyses is established via an alternate proof. Third, based on this proof, two new optimization problems simplifying the analysis of local stability properties are presented. To simplify the analysis of a closed-loop system's Region of Attraction (RoA), the first problem explicitly parameterizes a class of candidate Lyapunov functions larger than in previous works. The second problem utilizes the unique guarantees available under the condition of invariance to further expand the set of candidate Lyapunov functions and directly determine whether an invariant set forms part of the system's RoA. These contributions are successfully demonstrated in two numerical examples and suggestions for future research are provided.
Lewy Body Disease (LBD) is a common yet understudied form of dementia that imposes a significant burden on public health. It shares clinical similarities with Alzheimer's disease (AD), as both progress through stages of normal cognition, mild cognitive impairment, and dementia. A major obstacle in LBD diagnosis is data scarcity, which limits the effectiveness of deep learning. In contrast, AD datasets are more abundant, offering potential for knowledge transfer. However, LBD and AD data are typically collected from different sites using different machines and protocols, resulting in a distinct domain shift. To effectively leverage AD data while mitigating domain shift, we propose a Transferability Aware Transformer (TAT) that adapts knowledge from AD to enhance LBD diagnosis. Our method utilizes structural connectivity (SC) derived from structural MRI as training data. Built on the attention mechanism, TAT adaptively assigns greater weights to disease-transferable features while suppressing domain-specific ones, thereby reducing domain shift and improving diagnostic accuracy with limited LBD data. The experimental results demonstrate the effectiveness of TAT. To the best of our knowledge, this is the first study to explore domain adaptation from AD to LBD under conditions of data scarcity and domain shift, providing a promising framework for domain-adaptive diagnosis of rare diseases.
Air traffic control (ATC) demands multi-tasking under time pressure with high consequences of an error. This can induce stress. Detecting stress is a key point in maintaining the high safety standards of ATC. However, processing ATC voice data entails privacy restrictions, e.g. the General Data Protection Regulation (GDPR) law. Anonymizing the ATC voice data is one way to comply with these restrictions. In this paper, different architectures for stress detection for anonymized ATCO speech are evaluated. Our best networks reach a stress detection accuracy of 93.6% on an anonymized version of the Speech Under Simulated and Actual Stress (SUSAS) dataset and an accuracy of 80.1% on our anonymized ATC simulation dataset. This shows that privacy does not have to be an impediment in building well-performing deep-learning-based models.
Beam alignment (BA) is a crucial process in millimeter-wave (mmWave) communications, enabling precise directional transmission and efficient link establishment. However, due to characteristics like omnidirectional exposure and the broadcast nature of the BA phase, it is particularly vulnerable to eavesdropping and identity impersonation attacks. To this end, this paper proposes a novel secure framework named CovertAuth, designed to enhance the security of the BA phase against such attacks. In particular, to combat eavesdropping attacks, the closed-form expressions of successful BA probability and covert transmission rate are first derived. Then, a covert communication problem aimed at jointly optimizing beam training budget and transmission power is formulated to maximize covert communication rate, subject to the covertness requirement. An alternating optimization algorithm combined with successive convex approximation is employed to iteratively achieve optimal results. To combat impersonation attacks, the mutual coupling effect of antenna array impairments is explored as a device feature to design a weighted-sum energy detector based physical layer authentication scheme. Moreover, theoretical models for authentication metrics like detection and false alarm probabilities are also provided to conduct performance analysis. Based on these models, an optimization problem is constructed to determine the optimal weight value that maximizes authentication accuracy. Finally, simulation results demonstrate that CovertAuth presents improved detection accuracy under the same covertness requirement compared to existing works.
Hardware accelerators like GPUs are now ubiquitous in data centers, but are not fully supported by common cloud abstractions such as Functions as a Service (FaaS). Many popular and emerging FaaS applications such as machine learning and scientific computing can benefit from GPU acceleration. However, FaaS frameworks (such as OpenWhisk) are not capable of providing this acceleration because of the impedance mismatch between GPUs and the FaaS programming model, which requires virtualization and sandboxing of each function. The challenges are amplified due to the highly dynamic and heterogeneous FaaS workloads. This paper presents the design and implementation of a FaaS system for providing GPU acceleration in a black-box manner (without modifying function code). Running small functions in containerized sandboxes is challenging due to limited GPU concurrency and high cold-start overheads, resulting in heavy queueing of function invocations. We show how principles from I/O scheduling, such as fair queuing and anticipatory scheduling, can be translated to function scheduling on GPUs. We develop MQFQ-Sticky, an integrated fair queueing and GPU memory management approach, which balances the tradeoffs between locality, fairness, and latency. Empirical evaluation on a range of workloads shows that it reduces function latency by 2x to 20x compared to existing GPU and CPU queueing policies.
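A minimal start-time fair queueing core, the classical building block that MQFQ-Sticky extends with locality ("stickiness") and GPU memory management, can be sketched as follows (a simplified, hypothetical single-dispatcher version):

```python
import heapq

class FairQueue:
    """Minimal start-time fair queueing: invocations are stamped with
    virtual start times so dispatch interleaves functions fairly; the
    locality and memory-management aspects of MQFQ-Sticky are omitted."""
    def __init__(self):
        self.vtime = 0.0        # global virtual time
        self.finish = {}        # last virtual finish time per function
        self.heap = []          # (virtual start, seq, function, invocation)
        self.seq = 0

    def enqueue(self, fn, invocation, cost=1.0):
        start = max(self.vtime, self.finish.get(fn, 0.0))
        self.finish[fn] = start + cost
        heapq.heappush(self.heap, (start, self.seq, fn, invocation))
        self.seq += 1

    def dispatch(self):
        start, _, fn, invocation = heapq.heappop(self.heap)
        self.vtime = max(self.vtime, start)     # advance virtual time
        return fn, invocation
```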
We consider solutions to the linear quadratic Gaussian (LQG) regulator problem via policy gradient (PG) methods. Although PG methods have demonstrated strong theoretical guarantees in solving the linear quadratic regulator (LQR) problem, despite its nonconvex landscape, their theoretical understanding in the LQG setting remains limited. Notably, the LQG problem lacks gradient dominance in the classical parameterization, i.e., with a dynamic controller, which hinders global convergence guarantees. In this work, we study PG for the LQG problem by adopting an alternative parameterization of the set of stabilizing controllers and employing a lifting argument. We refer to this parameterization as a history representation of the control input, since the control is parameterized by past input and output data from the previous $p$ time steps. This representation enables us to establish gradient dominance and approximate smoothness for the LQG cost. We prove global convergence and per-iteration stability guarantees for policy gradient LQG in model-based and model-free settings. Numerical experiments on an open-loop unstable system are provided to support the global convergence guarantees and to illustrate convergence under different history lengths.
Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with such capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of ``expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how ``exploratory'' the expert's behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ``expert-like'' exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
We study the problem of imitating an expert demonstrator in a continuous state-and-action dynamical system. While imitation learning in discrete settings such as autoregressive language modeling has seen immense success and popularity in recent years, imitation in physical settings such as autonomous driving and robot learning has proven comparably more complex due to the compounding errors problem, often requiring elaborate set-ups to perform stably. Recent work has demonstrated that even in benign settings, exponential compounding errors are unavoidable when learning solely from expert-controlled trajectories, suggesting the need for more advanced policy parameterizations or data augmentation. To this end, we present minimal interventions that provably mitigate compounding errors in continuous state-and-action imitation learning. When the system is open-loop stable, we prescribe "action chunking," i.e., predicting and playing sequences of actions in open-loop; when the system is possibly unstable, we prescribe "noise injection," i.e., adding noise during expert demonstrations. These interventions align with popular choices in modern robot learning, though the benefits we derive are distinct from the effects they were designed to target. Our results draw insights and tools from both control theory and reinforcement learning; however, our analysis reveals novel considerations that do not naturally arise when either literature is considered in isolation.
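The action-chunking intervention is simple to state in code: predict a sequence of actions from one observation and play it open-loop before re-querying the policy. A hedged sketch, with `policy` and `env_step` as hypothetical stand-ins:

```python
def rollout_with_chunking(policy, env_step, obs, horizon, chunk=8):
    """Action chunking: predict `chunk` actions from one observation and
    play them open-loop before re-querying the policy; `policy` and
    `env_step` stand in for a trained model and the environment transition."""
    t = 0
    while t < horizon:
        actions = policy(obs)                   # shape (chunk, action_dim)
        for a in actions[: min(chunk, horizon - t)]:
            obs = env_step(a)                   # no re-planning inside a chunk
            t += 1
    return obs
```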
We generalize low-rank decomposition problems, such as principal and independent component analysis (PCA, ICA), to continuous-time vector-valued signals and provide a model-agnostic implicit neural signal representation framework to learn numerical approximations that solve the problem. Modeling signals as continuous-time stochastic processes, we unify the approaches to both the PCA and ICA problems in the continuous setting through a contrast function term in the network loss, enforcing the desired statistical properties of the source signals (decorrelation, independence) learned in the decomposition. This extension to a continuous domain allows the application of such decompositions to point clouds and irregularly sampled signals where standard techniques are not applicable.
In this paper, a novel three-dimensional (3D) positioning framework for fluid antenna system (FAS)-enabled unmanned aerial vehicles (UAVs) is developed. In the proposed framework, a set of controlled UAVs cooperatively estimate the real-time 3D position of a target UAV. Here, the active UAV transmits a measurement signal to the passive UAVs via the reflection from the target UAV. Each passive UAV estimates the distance of the active-target-passive UAV link and selects an antenna port to share the distance information with the base station (BS) that calculates the real-time position of the target UAV. As the target UAV is moving due to its task operation, the controlled UAVs must optimize their trajectories and select optimal antenna ports, aiming to estimate the real-time position of the target UAV. We formulate this problem as an optimization problem to minimize the target UAV positioning error via optimizing the trajectories of all controlled UAVs and the antenna port selection of passive UAVs. Here, an attention-based recurrent multi-agent reinforcement learning (AR-MARL) scheme is proposed, which enables each controlled UAV to use the local Q function to determine its trajectory and antenna port while optimizing the target UAV positioning performance without knowing the trajectories and antenna port selections of other controlled UAVs. Different from current MARL methods, the proposed method uses a recurrent neural network (RNN) that incorporates historical state-action pairs of each controlled UAV, and an attention mechanism to analyze the importance of these historical state-action pairs, thus improving the global Q function approximation accuracy and the target UAV positioning accuracy. Simulation results show that the proposed AR-MARL scheme can reduce the average positioning error by up to 17.5% and 58.5% compared to the VD-MARL scheme and the proposed method without FAS, respectively.
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusion, and noise. Our benchmark, RoHOI, includes 20 corruption types based on HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the related field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at this https URL.
Despite substantial improvements in ASR, performance tends to degrade when faced with adverse conditions such as speaker accents. Generative error correction (GER) leverages the rich linguistic knowledge and exceptional reasoning ability of LLMs, significantly outperforming typical LM methods. However, it lacks specificity in accented speech scenarios. In this study, we leverage GER to improve the accuracy of transcription predictions by addressing the two primary features of accented speech recognition. To fully leverage pronunciation information, we propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level information related to pronunciation. These two methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through LoRA fine-tuning. On the one hand, we employ a three-stage training strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge multiple mono-accent LoRA experts within a single multi-modal GER to overcome the challenges posed by accent diversity. On the other hand, multi-granularity GER leverages the N-best word-level and phoneme-level hypotheses generated by the HDMoLE model to predict the final accented speech transcriptions. Experimental results on the multi-accent English dataset demonstrate the efficacy of our proposed methods. Our methods achieve a remarkable relative WER reduction of 67.35% compared to the Whisper-large-v3 baseline.
In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD) towards spatial scene understanding and reasoning. First, we curate and release fine-grained spatio-temporal textual descriptions for the STARSS23 dataset using a rule-based approach, and further enhance linguistic diversity using large language model (LLM)-based rephrasing. We also introduce a QA dataset aligned with the STARSS23 scenes, covering various aspects such as event presence, localization, spatial, and temporal relationships. To increase language variety, we again leverage LLMs to generate multiple rephrasings per question. Finally, we develop a baseline spatial audio QA model that takes FOA signals and natural language questions as input and provides answers regarding various occurrences, temporal, and spatial relationships of sound events in the scene formulated as a classification task. Despite being trained solely with scene-level question answering supervision, our model achieves performance that is comparable to a fully supervised sound event localization and detection model trained with frame-level spatiotemporal annotations. The results highlight the potential of language-guided approaches for spatial audio understanding and open new directions for integrating linguistic supervision into spatial scene analysis.
Patch-based transformer surrogates have become increasingly effective for modeling spatiotemporal dynamics, but their fixed patch size is a major limitation for budget-conscious deployment in production. We introduce two lightweight, architecture-agnostic modules, the Convolutional Kernel Modulator (CKM) and the Convolutional Stride Modulator (CSM), that enable dynamic patch-size control at inference in patch-based models, without retraining or accuracy loss. Combined with a cyclic patch-size rollout, our method mitigates patch artifacts and improves long-term stability for video-like prediction tasks. Applied to a range of challenging 2D and 3D PDE benchmarks, our approach improves rollout fidelity and runtime efficiency. To our knowledge, this is the first framework to enable inference-time patch-size tunability in patch-based PDE surrogates. Its plug-and-play design makes it broadly applicable across architectures, establishing a general foundation for compute-adaptive modeling in PDE surrogate tasks.
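Stride modulation in particular admits a very small sketch: a trained patch-embedding convolution is reused with a different stride at inference, changing the effective token count without retraining (an illustration of the idea, not the authors' CSM module):

```python
import torch
import torch.nn.functional as F

def modulated_patch_embed(x, weight, bias, stride):
    """Reuse a trained patch-embedding convolution with a different stride
    at inference, changing the effective patch size / token count."""
    return F.conv2d(x, weight, bias, stride=stride)

w = torch.randn(96, 3, 16, 16)     # a (hypothetical) trained 16x16 patch embedding
b = torch.zeros(96)
x = torch.randn(1, 3, 224, 224)
fine = modulated_patch_embed(x, w, b, stride=16)    # 14x14 tokens
coarse = modulated_patch_embed(x, w, b, stride=32)  # 7x7 tokens
```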
Dementia, a neurodegenerative disease, alters speech patterns, creating communication barriers and raising privacy concerns. Current speech technologies, such as automatic speech recognition (ASR), struggle with dementia-affected and atypical speech, further challenging accessibility. This paper presents ClaritySpeech, a novel framework for dementia obfuscation in speech that integrates ASR, text obfuscation, and zero-shot text-to-speech (TTS) to correct dementia-affected speech while preserving speaker identity in low-data environments without fine-tuning. Results show a 16% and 10% drop in mean F1 score across various adversarial settings and modalities (audio, text, fusion) for ADReSS and ADReSSo, respectively, while maintaining 50% speaker similarity. We also find that our system improves WER (from 0.73 to 0.08 for ADReSS and to 0.15 for ADReSSo) and speech quality from 1.65 to ~2.15, enhancing privacy and accessibility.
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), which incorporates the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF explores the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths, with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
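The central mechanism here, penalizing the angle between successive moves during node expansion, can be made concrete with a small sketch. The grid A* below adds a turn-angle term to the step cost; the 8-connected move set, the penalty form, and the weight `w_angle` are illustrative assumptions of this sketch, not the paper's PAF formulation.

```python
# Hypothetical sketch: A* over (cell, incoming-direction) states whose step
# cost trades off path length against turn angle, i.e. path smoothness.
import heapq, itertools, math

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1),
         (-1, -1), (-1, 1), (1, -1), (1, 1)]

def angular_a_star(grid, start, goal, w_angle=0.3):
    """grid: 2D list, 0 = free, 1 = obstacle; start/goal: (row, col)."""
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    tie = itertools.count()                      # heap tiebreaker
    start_state = (start, None)                  # (cell, incoming direction)
    heap = [(h(start), next(tie), 0.0, start_state)]
    g_best, parent = {start_state: 0.0}, {}
    while heap:
        _, _, g, state = heapq.heappop(heap)
        cell, d_in = state
        if g > g_best.get(state, math.inf):
            continue                             # stale heap entry
        if cell == goal:                         # reconstruct the path
            path = [cell]
            while state in parent:
                state = parent[state]
                path.append(state[0])
            return path[::-1]
        for d in MOVES:
            nxt = (cell[0] + d[0], cell[1] + d[1])
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue
            step = math.hypot(*d)
            turn = 0.0                           # angle between successive moves
            if d_in is not None:
                cosang = (d[0] * d_in[0] + d[1] * d_in[1]) / (step * math.hypot(*d_in))
                turn = math.acos(max(-1.0, min(1.0, cosang)))
            g2 = g + step + w_angle * turn       # length + smoothness trade-off
            nstate = (nxt, d)
            if g2 < g_best.get(nstate, math.inf):
                g_best[nstate] = g2
                parent[nstate] = (cell, d_in)
                heapq.heappush(heap, (g2 + h(nxt), next(tie), g2, nstate))
    return None
```

Since the turn penalty is non-negative, the Euclidean heuristic remains admissible, so the search still returns an optimal path for the augmented cost.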
Text-to-Speech (TTS) systems in the Lombard speaking style can improve the overall intelligibility of speech, which is useful for listeners with hearing loss and in noisy conditions. However, training such models requires a large amount of data, and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation technique for training TTS systems in the absence of recorded data from the target speaker in the target speaking style. In this paper, we are concerned with Lombard speaking style transfer. Our goal is to convert speaker identity while preserving the acoustic attributes that define the Lombard speaking style. We compare voice conversion models with implicit and explicit acoustic feature conditioning. We observe that our proposed implicit conditioning strategy achieves an intelligibility gain comparable to the model conditioned on explicit acoustic features, while also preserving speaker similarity.
There is a major shortage of Speech-to-Speech Translation (S2ST) datasets for high-resource-to-low-resource language pairs such as English-to-Yoruba. In this study, we therefore curated the Bilingual English-to-Yoruba Speech-to-Speech Translation Corpus Version 1 (BENYO-S2ST-Corpus-1). The corpus is based on a hybrid architecture we developed for large-scale direct S2ST corpus creation at reduced cost. To achieve this, we leveraged real-time Standard Yoruba (SY) audios and transcripts from the YORULECT Corpus, which is not a speech-to-speech resource, together with the corresponding Standard English (SE) transcripts. The YORULECT Corpus is small scale (1,504 samples) and lacks paired English audios, so we generated the SE audios using pre-trained AI models (i.e., Facebook MMS). We also developed an audio augmentation algorithm named AcoustAug, based on three latent acoustic features, to generate augmented audios from the raw audios of the two languages. BENYO-S2ST-Corpus-1 has 12,032 audio samples per language, a total of 24,064 samples, with a combined audio duration of 41.20 hours. Beyond building S2ST models, BENYO-S2ST-Corpus-1 can be used to build pretrained models or improve existing ones. As a proof of concept, the corpus and the Coqui framework were used to build a pretrained Yoruba TTS model named YoruTTS-0.5, which achieved an F0 RMSE of 63.54 after 1,000 epochs, indicating moderate fundamental-pitch similarity with the reference real-time audio. Ultimately, the corpus architecture in this study can be leveraged by researchers and developers to curate datasets for other high-resource-to-low-resource African language pairs, helping to bridge the digital divide in translation between high- and low-resource languages. BENYO-S2ST-Corpus-1 and YoruTTS-0.5 are publicly available at (this https URL).
Autonomous systems across diverse domains have underscored the need for drift-resilient state estimation. Although satellite-based positioning and cameras are widely used, they often suffer from limited availability in many environments. As a result, positioning must rely solely on inertial sensors, leading to rapid accuracy degradation over time due to sensor biases and noise. To counteract this, alternative update sources, referred to as information aiding, serve as anchors of certainty. Among these, the zero-velocity update (ZUPT) is particularly effective in providing accurate corrections during stationary intervals, though it is restricted to surface-bound platforms. This work introduces a controlled ZUPT (C-ZUPT) approach for aerial navigation and control, independent of surface contact. By defining an uncertainty threshold, C-ZUPT identifies quasi-static equilibria to deliver precise velocity updates to the estimation filter. Extensive validation confirms that these opportunistic, high-quality updates significantly reduce inertial drift and control effort. As a result, C-ZUPT mitigates filter divergence and enhances navigation stability, enabling more energy-efficient hovering and substantially extending sustained flight, which are key advantages for resource-constrained aerial systems.
Mobile Edge Computing (MEC) enables low-latency applications by bringing computation closer to the user, but dynamic task arrivals and communication threats such as jamming complicate reliable task offloading and resource allocation. In this paper, we formulate a dynamic MEC framework with transmission diversity that jointly addresses task scheduling and resource block (RB) assignment in the presence of jamming. First, we define and evaluate key network metrics, including dropped-task ratio and bandwidth utilization, while maintaining service continuity by accounting for the edge server's existing commitments to previously offloaded tasks. Then, we propose a jamming-aware offloading and RB allocation framework that leverages transmission diversity and optimal scheduling across distributed gNBs. The proposed solution is compared to a similar scenario without transmission diversity and to two baseline strategies, first-come-first-served (FCFS) and shortest task first (STF). The proposed algorithm effectively mitigates the impact of jamming while enhancing resource utilization and minimizing task drop rates, making it well suited for mission-critical MEC applications. At a signal-to-jamming-and-noise ratio (SJNR) of 4 dB, the proposed method achieves a task drop rate of $0.26$, outperforming the scenario without transmission diversity (0.50) and the STF and FCFS strategies (0.52 and 0.63, respectively).
Accurate sound propagation simulation is essential for delivering immersive experiences in virtual applications, yet industry methods for acoustic modeling often do not account for the full breadth of acoustic wave phenomena. This paper proposes a novel two-dimensional (2D) finite-difference time-domain (FDTD) framework that simulates sound propagation as a wave-based model in Unreal Engine, with an emphasis on capturing lower frequency wave phenomena, embedding occlusion, diffraction, reflection and interference in generated impulse responses. The process begins by discretizing the scene geometry into a 2D grid via a top-down projection from which obstacle masks and boundary conditions are derived. A Python-based FDTD solver injects a sine sweep at a source position, and virtual quadraphonic microphone arrays record pressure field responses at pre-defined listener positions. De-convolution of the pressure responses yields multi-channel impulse responses that retain spatial directionality which are then integrated into Unreal Engine's audio pipeline for dynamic playback. Benchmark tests confirm agreement with analytical expectations, and the paper outlines hybrid extensions aimed at commercial viability.
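As a rough illustration of the solver's core loop, the following is a minimal 2D FDTD update for the scalar wave equation with an obstacle mask and a sinusoidal source; the grid size, CFL factor, source frequency, periodic edges, and pressure-release obstacle treatment are assumptions of this sketch, not the paper's implementation.

```python
# Minimal 2D acoustic FDTD sketch: leapfrog update of the scalar wave equation.
import numpy as np

nx, ny, steps = 200, 200, 400
c, dx = 343.0, 0.05                  # speed of sound (m/s), grid spacing (m)
dt = dx / (c * np.sqrt(2)) * 0.99    # satisfy the 2D CFL stability condition

p_prev = np.zeros((nx, ny))          # pressure at t-1
p      = np.zeros((nx, ny))          # pressure at t
free   = np.ones((nx, ny), bool)     # False where geometry blocks the field
free[80:120, 100:103] = False        # an illustrative wall

coef = (c * dt / dx) ** 2
for n in range(steps):
    # 5-point Laplacian (np.roll gives periodic edges, kept for brevity)
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
    p_next = 2 * p - p_prev + coef * lap
    p_next[~free] = 0.0              # pressure-release obstacle boundary
    p_next[20, 20] += np.sin(2 * np.pi * 500 * n * dt)  # 500 Hz source injection
    p_prev, p = p, p_next

# Recording `p` at listener cells over time yields the pressure responses that
# are later deconvolved against the sweep to obtain impulse responses.
```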
Cell-free massive multiple-input multiple-output (MIMO)-aided integrated sensing and communication (ISAC) systems are investigated where distributed access points jointly serve users and sensing targets. We demonstrate that only a subset of access points (APs) has to be activated for both tasks, while deactivating redundant APs is essential for power savings. This motivates joint active AP selection and power control for optimizing energy efficiency. The resultant problem is a mixed-integer nonlinear program (MINLP). To address this, we propose a model-based Branch-and-Bound approach as a strong baseline to guide a semi-supervised heterogeneous graph neural network (HetGNN) for selecting the best active APs and the power allocation. Comprehensive numerical results demonstrate that the proposed HetGNN reduces power consumption by 20-25\% and runs nearly 10,000 times faster than model-based benchmarks.
Unmanned Aerial Vehicles (UAVs) have emerged as versatile platforms, driving the demand for accurate modeling to support developmental testing. This paper proposes data-driven modeling software for UAVs that emphasizes the use of cost-effective sensors to obtain orientation and location data, which are subsequently processed through data-filtering algorithms and sensor-fusion techniques to improve data quality for precise model visualization in the software. The UAV's orientation is obtained from processed Inertial Measurement Unit (IMU) data and represented using quaternions to avoid the gimbal-lock problem. The UAV's location is determined by combining data from the Global Positioning System (GPS), which provides stable geographic coordinates but a slower update frequency, and the accelerometer, which has a higher update frequency but yields unstable position estimates when integrated, due to accumulated error. By combining data from these two sensors, the software is able to calculate and continuously update the UAV's real-time position during flight operations. The results show that the software effectively renders UAV orientation and position with a high degree of accuracy and fluidity.
Multi-agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non-stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus-response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi-agent intelligence through generative AI-based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi-agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative-RL agents can model environment evolution, predict other agents' behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long-term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision-making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human-AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.
Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding obtained from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of our proposed methods in advancing TSE performance. A speech demo is available online.\footnote{this https URL}
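The abstract does not give the exact form of the centroid-based loss, so the following PyTorch sketch shows one plausible reading: pull each extracted-speech embedding toward the centroid of its speaker's enrollment embeddings. The batch layout and the 1 - cosine distance are assumptions, not the paper's definition.

```python
# Hedged sketch of a centroid-based speaker consistency loss.
import torch
import torch.nn.functional as F

def speaker_consistency_loss(extracted_emb, enrolled_emb, speaker_ids):
    """
    extracted_emb: (B, D) embeddings of the extracted speech
    enrolled_emb:  (B, D) embeddings of the enrollment utterances
    speaker_ids:   (B,)   integer speaker labels within the batch
    """
    loss, count = extracted_emb.new_zeros(()), 0
    for spk in speaker_ids.unique():
        mask = speaker_ids == spk
        centroid = enrolled_emb[mask].mean(dim=0)        # speaker centroid
        cos = F.cosine_similarity(extracted_emb[mask],   # similarity of each
                                  centroid.unsqueeze(0), dim=-1)
        loss = loss + (1.0 - cos).mean()                 # 1 - cos as a distance
        count += 1
    return loss / max(count, 1)
```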
Stacked intelligent metasurfaces (SIMs), which integrate multiple programmable metasurface layers, have recently emerged as a promising technology for advanced wave-domain signal processing. SIMs benefit from flexible spatial degrees of freedom (DoF) while reducing the need for costly radio-frequency (RF) chains. However, current state-of-the-art SIM designs face challenges such as complex phase shift optimization and energy attenuation across multiple layers. To address these aspects, we propose incorporating meta-fibers into SIMs, with the aim of reducing the number of layers and enhancing energy efficiency. First, we introduce a meta-fiber-connected two-layer SIM that exhibits the same flexible signal processing capabilities as conventional multi-layer structures, and explain its operating principle. Subsequently, we formulate and solve the optimization problem of minimizing the mean square error (MSE) between the SIM channel and the desired channel matrices. Specifically, by designing the phase shifts of the meta-atoms associated with the transmitting SIM and receiving SIM, a non-interfering system with parallel subchannels is established. To reduce the computational complexity, a closed-form expression for each phase shift at each iteration of an alternating optimization (AO) algorithm is derived. We show that the proposed algorithm is applicable to conventional multi-layer SIMs. The channel capacity bound and computational complexity are analyzed to provide design insights. Finally, numerical results demonstrate that the proposed two-layer SIM with meta-fibers achieves over a 25% improvement in channel capacity while reducing the total number of meta-atoms by 59% compared with a conventional seven-layer SIM.
Sound event detection (SED) has made strong progress in controlled environments with clear event categories. However, real-world applications often take place in open environments, where current methods tend to produce overconfident predictions and lack principled ways to quantify uncertainty, limiting their ability to adapt and perform well in new situations. To address this problem, we are, to our knowledge, the first to use ensemble methods in SED to improve robustness against out-of-domain (OOD) inputs. We propose a confidence calibration method called Energy-based Open-World Softmax (EOW-Softmax), which helps the system better handle uncertainty in unknown scenes. We further apply EOW-Softmax to sound occurrence and overlap detection (SOD) by adjusting the prediction, so that the model becomes more adaptable while retaining its ability to detect overlapping events. Experiments show that our method improves performance in open environments, reducing overconfidence and increasing the ability to handle OOD situations.
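The EOW-Softmax construction itself is not spelled out in the abstract; as background, the sketch below shows only the standard energy score commonly used for energy-based OOD detection (Liu et al., 2020), which this family of methods builds on. The temperature and threshold are placeholders.

```python
# Energy score for OOD detection: lower energy ~ in-distribution,
# higher energy ~ out-of-distribution. Threshold is tuned on validation data.
import torch

def energy_score(logits, T=1.0):
    # E(x) = -T * logsumexp(logits / T)
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(4, 10)            # (batch, num_event_classes), toy values
is_ood = energy_score(logits) > 0.0    # 0.0 is an illustrative threshold
```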
Today, Wi-Fi is over 25 years old. Yet, despite sharing the same branding name, today's Wi-Fi boasts entirely new capabilities that were not even on the roadmap 25 years ago. This article aims to provide a holistic and comprehensive technical and historical tutorial on Wi-Fi, beginning with IEEE 802.11b (Wi-Fi 1) and looking forward to IEEE 802.11bn (Wi-Fi 8). This is the first tutorial article to span these eight generations. Rather than a generation-by-generation exposition, we describe the key mechanisms that have advanced Wi-Fi. We begin by discussing spectrum allocation and coexistence, and detailing the IEEE 802.11 standardization cycle. Second, we provide an overview of the physical layer and describe key elements that have enabled data rates to increase by over 1,000x. Third, we describe how Wi-Fi Medium Access Control has been enhanced from the original Distributed Coordination Function to now include capabilities spanning from frame aggregation to wideband spectrum access. Fourth, we describe how Wi-Fi 5 first broke the one-user-at-a-time paradigm and introduced multi-user access. Fifth, given the increasing use of mobile, battery-powered devices, we describe Wi-Fi's energy-saving mechanisms over the generations. Sixth, we discuss how Wi-Fi was enhanced to seamlessly aggregate spectrum across 2.4 GHz, 5 GHz, and 6 GHz bands to improve throughput, reliability, and latency. Finally, we describe how Wi-Fi enables nearby Access Points to coordinate in order to improve performance and efficiency. In the Appendix, we further discuss Wi-Fi developments beyond 802.11bn, including integrated mmWave operations, sensing, security and privacy extensions, and the adoption of AI/ML.
We present the first sizeable corpus for Thai speech emotion recognition, THAI-SER, containing 41 hours and 36 minutes (27,854 utterances) from 100 recordings made in different environments: Zoom and two studio setups. The recordings contain both scripted and improvised sessions, acted by 200 professional actors (112 female and 88 male, aged 18 to 55) and directed by professional directors. Five primary emotions (neutral, angry, happy, sad, and frustrated) were assigned to the actors when recording utterances. The utterances are annotated with an emotional category using crowdsourcing. To control annotation quality, we design an extensive filtering and quality control scheme that ensures the majority agreement score remains above 0.71. We evaluate our annotated corpus using two metrics: inter-annotator reliability and human recognition accuracy. Inter-annotator reliability was calculated using Krippendorff's alpha; after filtering, our corpus achieves an alpha of 0.692, above the recommended threshold of 0.667. For human recognition accuracy, our corpus scores up to 0.772 post-filtering. We also provide results for models trained on the corpus, evaluated in both in-corpus and cross-corpus setups. The corpus is publicly available under a Creative Commons BY-SA 4.0 license, along with our experiment code.
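The reliability computation described here can be reproduced with the third-party `krippendorff` package (pip install krippendorff); the toy rating matrix below is illustrative, not THAI-SER data.

```python
# Krippendorff's alpha over crowdsourced categorical labels.
import numpy as np
import krippendorff

# rows = annotators, columns = utterances; np.nan marks a missing rating
ratings = np.array([
    [0, 1, 2, 2, np.nan],
    [0, 1, 2, 1, 4],
    [0, np.nan, 2, 2, 4],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.3f}")  # the paper keeps data with alpha above 0.667
```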
Artificial intelligence (AI) systems often interact with multiple agents. The regulation of such AI systems often requires that {\em a priori\/} guarantees of fairness and robustness be satisfied. With stochastic models of agents' responses to the outputs of AI systems, such {\em a priori\/} guarantees require non-trivial reasoning about the corresponding stochastic systems. Here, we present an open-source PyTorch-based toolkit for the use of stochastic control techniques in modelling interconnections of AI systems and properties of their repeated uses. It models robustness and fairness desiderata in a closed-loop fashion, and provides {\em a priori\/} guarantees for these interconnections. The PyTorch-based toolkit removes much of the complexity associated with the provision of fairness guarantees for closed-loop models of multi-agent systems.
High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, limiting its conversion to a seamless DEM. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30 m to 30 cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: this https URL.
Large-area microscopy with submicron resolution is limited by tradeoffs between field of view (FOV), resolution, and imaging speed. Samples are rarely flat across centimeter-scale FOV, which often requires existing solutions to use mechanical scanning to ensure focused capture at reduced throughput. Here, we present PANORAMA, a single-shot, re-imaging microscope that achieves seamless, gigapixel imaging over a 16.3$\times$18.8 $\text{mm}^2$ FOV at 0.84 $\mu$m resolution without mechanical scanning. By using a telecentric photolithography lens, a large-aperture tube lens, and a flat micro-camera array with adaptive per-camera focus control, PANORAMA maintains submicron focus across flat, curved or uneven samples that span centimeters. This approach improves imaging throughput and adaptability, enabling gigapixel multi-modal microscopy of large flat and non-flat samples in one shot, thus broadening its applications in biomedical and materials imaging.
This paper presents IteraOptiRacing, a unified planning-control strategy for competing with other racing cars in autonomous racing environments. The strategy is based on the Iterative Linear Quadratic Regulator for Iterative Tasks (i2LQR) and improves lap time performance in the presence of surrounding racing obstacles. By iteratively using the ego car's historical data, both obstacle avoidance for multiple moving cars and time cost optimization are considered in this unified strategy, resulting in collision-free and time-optimal trajectories. The algorithm's consistently low computational burden and suitability for parallel computing enable real-time operation in competitive racing scenarios. To validate its performance, simulations are conducted in a high-fidelity simulator with multiple randomly generated dynamic agents on the track. Results show that the proposed strategy outperforms existing methods across all randomly generated autonomous racing scenarios, enabling enhanced maneuvering for the ego racing car.
Real-world data is often represented through the relationships between data samples, forming a graph structure. In many applications, it is necessary to learn this graph structure from the observed data. Current graph learning research has primarily focused on unsigned graphs, which consist only of positive edges. However, many biological and social systems are better described by signed graphs that account for both positive and negative interactions, capturing similarity and dissimilarity between samples. In this paper, we develop a method for learning signed graphs from a set of smooth signed graph signals. Specifically, we employ the net Laplacian as a graph shift operator (GSO) and define smooth signed graph signals as the outputs of a low-pass signed graph filter built on the net Laplacian. The signed graph is then learned by formulating a non-convex optimization problem in which the total variation of the observed signals is minimized with respect to the net Laplacian. The proposed problem is solved using the alternating direction method of multipliers (ADMM), and a fast algorithm that reduces the per-iteration ADMM complexity from quadratic to linear in the number of nodes is introduced. Furthermore, we provide theoretical proofs of convergence for the algorithm and a bound on the estimation error of the learned net Laplacian as a function of sample size, number of nodes, and graph topology. Finally, the proposed method is evaluated on simulated data and a gene regulatory network inference problem, and compared to existing signed graph learning methods.
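The net Laplacian and the total-variation smoothness measure at the heart of this formulation are easy to state concretely. The sketch below uses the standard definitions on a toy signed graph; note that, unlike the usual Laplacian, the net Laplacian is not positive semidefinite, so the total variation can be negative.

```python
# Net Laplacian of a signed graph and the total variation of a graph signal.
import numpy as np

def net_laplacian(W):
    # The net degree sums *signed* weights, so the degree matrix can have
    # negative entries; L_net = D_net - W.
    d_net = W.sum(axis=1)
    return np.diag(d_net) - W

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (5, 5))
W = np.triu(W, 1); W = W + W.T      # symmetric, zero diagonal, signed weights

L_net = net_laplacian(W)
x = rng.standard_normal(5)          # a graph signal on the 5 nodes
tv = x @ L_net @ x                  # total variation: small for signals smooth
print(tv)                           # w.r.t. the signed structure
```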
We investigate the effects of four strategies for improving the ecological validity of synthetic room impulse response (RIR) datasets for monaural Speech Enhancement (SE). We implement three features on top of traditional image source method (ISM) shoebox RIRs: multiband absorption coefficients, source directivity, and receiver directivity. We additionally consider mesh-based RIRs from the SoundSpaces dataset. We then train a DeepFilterNet3 model on each RIR dataset and evaluate the performance on a test set of real RIRs, both objectively and subjectively. We find that RIRs using frequency-dependent acoustic absorption coefficients (MB-RIRs) yield a +0.51 dB SDR improvement and a +8.9 MUSHRA score improvement when evaluated on real RIRs. The MB-RIRs dataset is publicly available for free download.
In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget, and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use cases, we design a neural network architecture for speech separation capable of early exit, and we propose an uncertainty-aware probabilistic framework that jointly models the clean speech signal and error variance, which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, and we show that a single early-exit model can be competitive with state-of-the-art models trained at many compute and parameter budgets. Our framework enables fine-grained dynamic compute-scaling of speech separation networks while achieving state-of-the-art performance and interpretable exit conditions.
Navigation in dynamic environments requires autonomous systems to reason about uncertainties in the behavior of other agents. In this paper, we introduce a unified framework that combines trajectory planning with multimodal predictions and active probing to enhance decision-making under uncertainty. We develop a novel risk metric that seamlessly integrates multimodal prediction uncertainties through mixture models. When these uncertainties follow a Gaussian mixture distribution, we prove that our risk metric admits a closed-form solution, and is always finite, thus ensuring analytical tractability. To reduce prediction ambiguity, we incorporate an active probing mechanism that strategically selects actions to improve its estimates of behavioral parameters of other agents, while simultaneously handling multimodal uncertainties. We extensively evaluate our framework in autonomous navigation scenarios using the MetaDrive simulation environment. Results demonstrate that our active probing approach successfully navigates complex traffic scenarios with uncertain predictions. Additionally, our framework shows robust performance across diverse traffic agent behavior models, indicating its broad applicability to real-world autonomous navigation challenges. Code and videos are available at this https URL.
Autonomous vehicles (AVs) are becoming increasingly popular, with their applications now extending beyond just a mode of transportation to serving as mobile actuators of a traffic flow to control flow dynamics. This contrasts with traditional fixed-location actuators, such as traffic signals, and is referred to as Lagrangian traffic control. However, designing effective Lagrangian traffic control policies for AVs that generalize across traffic scenarios introduces a major challenge. Real-world traffic environments are highly diverse, and developing policies that perform robustly across such diverse traffic scenarios is challenging. It is further compounded by the joint complexity of the multi-agent nature of traffic systems, mixed motives among participants, and conflicting optimization objectives subject to strict physical and external constraints. To address these challenges, we introduce Multi-Residual Mixture of Expert Learning (MRMEL), a novel framework for Lagrangian traffic control that augments a given suboptimal nominal policy with a learned residual while explicitly accounting for the structure of the traffic scenario space. In particular, taking inspiration from residual reinforcement learning, MRMEL augments a suboptimal nominal AV control policy by learning a residual correction, but at the same time dynamically selects the most suitable nominal policy from a pool of nominal policies conditioned on the traffic scenarios and modeled as a mixture of experts. We validate MRMEL using a case study in cooperative eco-driving at signalized intersections in Atlanta, Dallas Fort Worth, and Salt Lake City, with real-world data-driven traffic scenarios. The results show that MRMEL consistently yields superior performance-achieving an additional 4%-9% reduction in aggregate vehicle emissions relative to the strongest baseline in each setting.
In unmanned aerial vehicle (UAV) networks, communication protocols and algorithms are essential for cooperation and collaboration between UAVs. Simulation provides a cost-effective solution for prototyping, debugging, and analyzing protocols and algorithms, avoiding the prohibitive expense of field experiments. In this paper, we present ``UavNetSim-v1'', an open-source Python-based simulation platform designed for rapid development, testing, and evaluation of protocols and algorithms in UAV networks. ``UavNetSim-v1'' provides most of the functionality developers may need, including routing and medium access control (MAC) protocols, topology control algorithms, and mobility and energy models, while maintaining ease of use. Furthermore, the platform supports comprehensive performance evaluation and features an interactive visualization interface for in-depth algorithm analysis. In short, ``UavNetSim-v1'' lends itself to both rapid prototyping and educational purposes, and can serve as a lightweight yet powerful alternative to mature network simulators for UAV communication research.
Designing satellite constellation systems involves complex multidisciplinary optimization in which coverage serves as a primary driver of overall system cost and performance. Among the various design considerations, constellation configuration -- how satellites are placed and distributed in space relative to each other -- predominantly determines the resulting coverage. In constellation configuration design, coverage can be considered either as an objective or a constraint, driven by mission objectives. State-of-the-art literature addresses each situation on a case-by-case basis, applying a unique set of assumptions, modeling, and solution methods. Although such a problem-based methodology is valuable, users often face implementation challenges when performing trade-off studies across different mission scenarios, as each scenario must be handled distinctly. In response, we propose a unifying framework consisting of five mixed-integer linear program formulations that are of practical significance, extensible to more complex mission narratives using additional constraints, and capable of obtaining provably optimal constellation configurations. It can handle various metrics and mission scenarios, such as percent coverage, average or maximum revisit times, fixed number of satellites, spatiotemporally varying coverage requirements, and ground-, aerial-, or space-based, static or mobile targets. The paper presents several add-ons, case studies, and comparative analyses to demonstrate the versatility of the proposed framework.
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: the audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data, to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: this https URL
The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
Evaluation of text-to-music systems is constrained by the cost and availability of expert assessors. Track 1 of the AudioMOS 2025 Challenge was created to automatically predict music impression (MI) as well as text alignment (TA) between the prompt and the generated musical piece. This paper reports our winning system, which uses a dual-branch architecture with pre-trained MuQ and RoBERTa models as audio and text encoders. A cross-attention mechanism fuses the audio and text representations. For training, we reframe MI and TA prediction as a classification task. To incorporate the ordinal nature of MOS scores, one-hot labels are converted to a soft distribution using a Gaussian kernel. On the official test set, a single model trained with this method achieves a system-level Spearman's Rank Correlation Coefficient (SRCC) of 0.991 for MI and 0.952 for TA, corresponding to relative improvements of 21.21\% in MI SRCC and 31.47\% in TA SRCC over the challenge baseline.
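The label-softening step is simple to reproduce. The sketch below converts integer MOS bins to a Gaussian soft distribution so that near-miss predictions are penalized less than distant ones; the bin count and kernel width are assumptions of this sketch.

```python
# One-hot MOS classes replaced by a Gaussian centered on the true bin.
import torch

def gaussian_soft_labels(target_bins, num_bins=20, sigma=1.0):
    """target_bins: (B,) integer class indices of the true MOS bin."""
    bins = torch.arange(num_bins).float()                     # (num_bins,)
    d = bins.unsqueeze(0) - target_bins.float().unsqueeze(1)  # (B, num_bins)
    soft = torch.exp(-0.5 * (d / sigma) ** 2)
    return soft / soft.sum(dim=1, keepdim=True)               # normalize to a pmf

targets = torch.tensor([3, 17])
soft = gaussian_soft_labels(targets)
# Training then uses soft cross-entropy against the model's log-probabilities:
# loss = -(soft * log_probs).sum(dim=1).mean()
```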
This letter investigates the optimal allocation of large language model (LLM) inference workloads across heterogeneous edge data centers (DCs) over time. Each DC features on-site renewable generation and faces dynamic electricity prices and spatiotemporal variability in renewable availability. The central question is: how can inference workloads be optimally distributed to the DCs to minimize energy consumption, carbon emissions, and water usage while enhancing user experience? This letter proposes a novel optimization model for LLM service providers to reduce operational costs and environmental impacts. Numerical results validate the efficacy of the proposed approach.
Our research uncovers a novel privacy risk associated with multimodal large language models (MLLMs): the ability to infer sensitive personal attributes from audio data -- a technique we term audio private attribute profiling. This capability poses a significant threat, as audio can be covertly captured without direct interaction or visibility. Moreover, compared to images and text, audio carries unique characteristics, such as tone and pitch, which can be exploited for more detailed profiling. However, two key challenges exist in understanding MLLM-employed private attribute profiling from audio: (1) the lack of audio benchmark datasets with sensitive attribute annotations and (2) the limited ability of current MLLMs to infer such attributes directly from audio. To address these challenges, we introduce AP^2, an audio benchmark dataset that consists of two subsets collected and composed from real-world data, and both are annotated with sensitive attribute labels. Additionally, we propose Gifts, a hybrid multi-agent framework that leverages the complementary strengths of audio-language models (ALMs) and large language models (LLMs) to enhance inference capabilities. Gifts employs an LLM to guide the ALM in inferring sensitive attributes, then forensically analyzes and consolidates the ALM's inferences, overcoming severe hallucinations of existing ALMs in generating long-context responses. Our evaluations demonstrate that Gifts significantly outperforms baseline approaches in inferring sensitive attributes. Finally, we investigate model-level and data-level defense strategies to mitigate the risks of audio private attribute profiling. Our work validates the feasibility of audio-based privacy attacks using MLLMs, highlighting the need for robust defenses, and provides a dataset and framework to facilitate future research.
Investment herding, a phenomenon where households mimic the decisions of others rather than relying on their own analysis, has significant effects on financial markets and household behavior. Excessive investment herding may reduce investments and lead to a depletion of household consumption, which is called the crowding-out effect. While existing research has qualitatively examined the impact of investment herding on consumption, quantitative studies in this area remain limited. In this work, we investigate the optimal investment and consumption decisions of households under the impact of investment herding. We formulate an optimization problem to model how investment herding influences household decisions over time. Based on the optimal control theory, we solve for the analytical solutions of optimal investment and consumption decisions. We theoretically analyze the impact of investment herding on household consumption decisions and demonstrate the existence of the crowding-out effect. We further explore how parameters, such as interest rate, excess return rate, and volatility, influence the crowding-out effect. Finally, we conduct a real data test to validate our theoretical analysis of the crowding-out effect. This study is crucial to understanding the impact of investment herding on household consumption and offering valuable insights for policymakers seeking to stimulate consumption and mitigate the negative effects of investment herding on economic growth.
The superimposed pilot transmission scheme offers substantial potential for improving spectral efficiency in MIMO-OFDM systems, but it presents significant challenges for receiver design due to pilot contamination and data interference. To address these issues, we propose an advanced iterative receiver based on joint channel estimation, detection, and decoding, which refines the receiver outputs through iterative feedback. The proposed receiver incorporates two adaptive channel estimation strategies to enhance robustness under time-varying and mismatched channel conditions. First, a variational message passing (VMP) method and its low-complexity variant (VMP-L) are introduced to perform inference without relying on time-domain correlation. Second, a deep learning (DL) based estimator is developed, featuring a convolutional neural network with a despreading module and an attention mechanism to extract and fuse relevant channel features. Extensive simulations under multi-stream and high-mobility scenarios demonstrate that the proposed receiver consistently outperforms conventional orthogonal pilot baselines in both throughput and block error rate. Moreover, over-the-air experiments validate the practical effectiveness of the proposed design. Among the methods, the DL based estimator achieves a favorable trade-off between performance and complexity, highlighting its suitability for real-world deployment in dynamic wireless environments.
Automated vehicles (AVs) face a critical need to adopt socially compatible behaviors and cooperate effectively with human-driven vehicles (HVs) in heterogeneous traffic environments. However, most existing lane-changing frameworks overlook HVs' dynamic trust levels, limiting their ability to accurately predict human driver behaviors. To address this gap, this study proposes a trust-aware game-theoretic lane-changing decision (TGLD) framework. First, we formulate a multi-vehicle coalition game, incorporating fully cooperative interactions among AVs and partially cooperative behaviors from HVs informed by real-time trust evaluations. Second, we develop an online trust evaluation method to dynamically estimate HVs' trust levels during lane-changing interactions, guiding AVs to select context-appropriate cooperative maneuvers. Lastly, social compatibility objectives are considered by minimizing disruption to surrounding vehicles and enhancing the predictability of AV behaviors, thereby ensuring human-friendly and context-adaptive lane-changing strategies. A human-in-the-loop experiment conducted in a highway on-ramp merging scenario validates our TGLD approach. Results show that AVs can effectively adjust strategies according to different HVs' trust levels and driving styles. Moreover, incorporating a trust mechanism significantly improves lane-changing efficiency, maintains safety, and contributes to transparent and adaptive AV-HV interactions.
Deep learning models incorporating linear state-space models (SSMs) have gained attention for capturing long-range dependencies in sequential data. However, their large parameter sizes pose challenges for deployment on resource-constrained devices. In this study, we propose an efficient parameter reduction method for these models by applying $H^{2}$ model order reduction techniques from control theory to their linear SSM components. In experiments on the LRA benchmark, model compression based on our proposed method outperforms an existing method based on balanced truncation, while successfully reducing the number of parameters in the SSMs to $1/32$ without sacrificing the performance of the original models.
The unscented Kalman filter is a nonlinear estimation algorithm commonly used in navigation applications. The prediction of the mean and covariance matrix is crucial to the stable behavior of the filter. This prediction is done by propagating the sigma points according to the dynamic model at hand. In this paper, we introduce an innovative method to propagate the sigma points according to the nonlinear dynamic model of the navigation error state vector. This improves the filter accuracy and navigation performance. We demonstrate the benefits of our proposed approach using real sensor data recorded by an autonomous underwater vehicle during several scenarios.
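For concreteness, the sketch below shows the standard sigma-point prediction step that the paper modifies: generate 2n+1 sigma points from a matrix square root of the covariance, push them through the dynamics, and recombine. The kappa-style weights and the toy constant-velocity model are illustrative choices, not the paper's propagation scheme.

```python
# Standard unscented-transform prediction of mean and covariance.
import numpy as np

def ut_predict(f, x, P, kappa=0.0):
    """Propagate mean x (n,) and covariance P (n, n) through dynamics f."""
    n = len(x)
    S = np.linalg.cholesky((n + kappa) * P)      # matrix square root of scaled P
    sigmas = np.vstack([x, x + S.T, x - S.T])    # 2n+1 sigma points
    w = np.full(2 * n + 1, 0.5 / (n + kappa))    # sigma-point weights
    w[0] = kappa / (n + kappa)
    Y = np.array([f(s) for s in sigmas])         # propagate through the model
    y = w @ Y                                    # predicted mean
    d = Y - y
    Py = (w[:, None] * d).T @ d                  # predicted covariance
    return y, Py

# toy constant-velocity model: state = [position, velocity], dt = 1
f = lambda s: np.array([s[0] + s[1], s[1]])
mean, cov = ut_predict(f, np.array([0.0, 1.0]), np.eye(2))
```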
Fast-response voltage regulation is essential for data-center Voltage Regulation Modules (VRMs) powering Artificial Intelligence (AI) workloads, which exhibit both small-amplitude fluctuations and abrupt full-load steps. This paper introduces a control scheme that integrates a linear controller and a nonlinear controller for variable-frequency Series-Capacitor Buck (SCB) converters. First, an accurate small-signal model is derived via a Switching-Synchronized Sampled State-Space (5S) framework, yielding discrete-time transfer functions and root-locus insights for direct digital design. A critical concern for SCB converters is series-capacitor oscillation during heavy load steps if the strict switching sequence is not maintained. To accelerate large-signal transients, a time-optimal control strategy based on Pontryagin's Maximum Principle (PMP) relaxes the switching constraints to compute time-optimal switching sequences. A transition logic is then proposed to integrate the high-bandwidth small-signal controller and the large-signal controller. Simulations demonstrate rapid output voltage recovery under a heavy load step-up, over ten times faster than a linear-controller-only design. Preliminary hardware tests indicate stable rejection of heavy load disturbances with zero steady-state error.
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
Cell-free massive multiple-input multiple-output (MIMO) systems are investigated with the support of a reconfigurable intelligent surface (RIS). The RIS phase shifts are designed for improved channel estimation in the presence of spatial correlation. Specifically, we formulate the channel estimate and estimation error expressions using linear minimum mean square error (LMMSE) estimation for the aggregated channels. An optimization problem is then formulated to minimize the average normalized mean square error (NMSE) subject to practical phase shift constraints. To circumvent the inherent nonconvexity of this problem, we conceive an enhanced version of the differential evolution (DE) algorithm that is capable of avoiding local minima by applying an augmentation operator to some high-performing DE individuals. Numerical results indicate that our proposed algorithm significantly improves channel estimation quality relative to state-of-the-art benchmarks.
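As a point of reference only, the sketch below runs SciPy's stock differential evolution on a toy phase-shift objective; the paper's augmentation operator and its actual NMSE expression are not reproduced, and the channel model, element count, and target here are purely illustrative.

```python
# Baseline DE over RIS phase shifts with a toy NMSE-style objective.
import numpy as np
from scipy.optimize import differential_evolution

M = 16                                   # number of RIS elements (assumed)
rng = np.random.default_rng(1)
H = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # toy channel
target = np.sqrt(M)                      # toy desired coherent gain

def nmse(theta):
    # normalized squared error between achieved and desired array gain
    gain = np.abs(np.sum(H * np.exp(1j * theta))) / np.linalg.norm(H)
    return (gain - target) ** 2 / target ** 2

res = differential_evolution(nmse, bounds=[(0, 2 * np.pi)] * M,
                             maxiter=200, seed=1, polish=False)
print(res.fun, res.x[:4])
```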
Inspection of complex underwater structures with tethered underwater vehicles is often hindered by the risk of tether entanglement. We propose REACT (real-time entanglement-aware coverage path planning for tethered underwater vehicles), a framework designed to overcome this limitation. REACT comprises a fast geometry-based tether model using the signed distance field (SDF) map for accurate, real-time simulation of taut tether configurations around arbitrary structures in 3D. This model enables an efficient online replanning strategy by enforcing a maximum tether length constraint, thereby actively preventing entanglement. By integrating REACT into a coverage path planning framework, we achieve safe and optimal inspection paths, previously challenging due to tether constraints. The complete REACT framework's efficacy is validated in a pipe inspection scenario, demonstrating safe, entanglement-free navigation and full-coverage inspection. Simulation results show that REACT achieves complete coverage while maintaining tether constraints and completing the total mission 20% faster than conventional planners, despite a longer inspection time due to proactive avoidance of entanglement that eliminates extensive post-mission disentanglement. Real-world experiments confirm these benefits, where REACT completes the full mission, while the baseline planner fails due to physical tether entanglement.
The Low-Power Wake-Up Signal (LP-WUS) and Low-Power Synchronization Signal (LP-SS), introduced in 3GPP 5G-Advanced Release 19, represent a major step forward in enabling power-efficient IoT communications. This paper presents a comprehensive overview of the LP-WUS and LP-SS procedures in the RRC_IDLE and RRC_INACTIVE states, and outlines key physical layer design choices. The LP-WUS is designed to be detected by a low-power energy detector (ED), allowing the main radio (MR) to remain switched off. This architecture enables power savings of up to 80% compared to conventional 5G paging mechanisms.
We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.
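The lifting operation itself is straightforward to illustrate. The sketch below replicates a 2D image along a new depth axis and runs it through a small 3D convolutional stack standing in for the paper's 3D U-Net; the depth D and the placeholder network are assumptions of this sketch.

```python
# Spatial-lifting sketch: 2D input lifted to 3D, processed by a 3D network.
import torch
import torch.nn as nn

D = 8                                    # size of the lifted dimension (assumed)
x2d = torch.randn(1, 3, 256, 256)        # (B, C, H, W) input image
x3d = x2d.unsqueeze(2).repeat(1, 1, D, 1, 1)   # lift: (B, C, D, H, W)

net3d = nn.Sequential(                   # placeholder for a 3D U-Net
    nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 1, 3, padding=1),
)
y = net3d(x3d)                           # (B, 1, D, H, W): one prediction per
print(y.shape)                           # depth slice, usable for dense
                                         # supervision and test-time agreement
```

Agreement between the D per-slice predictions is what enables the near-zero-cost quality assessment the abstract mentions: high disagreement along the lifted dimension flags unreliable outputs.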
Reliable in-vitro models are needed for optoelectronic device development, such as fluorescence detection devices for fluorescence-guided surgery of gliomas. A common approach is based on inducing gliomas in animal models, followed by a dose of 5-ALA to induce fluorescent Protoporphyrin IX (PpIX) in the glioma. Although these approaches excel in capturing key biomolecular and physiological features of the tumour, they are inherently indeterministic. This limits their use for preclinical device development, where consistent and controllable tumour reproduction across multiple animals is needed. Approaches using fluorescence markers in gelatine provide a simple replication but fail to capture the complexities of in-vivo models. In this study, we introduce an exogenous brain tumour model for assessing PpIX fluorescence detection. The model was developed by injecting a PpIX solution into the cortical region of a resected adult rat brain; the injection site simulated a tumoral region with elevated PpIX concentration. The tumoral region had a gradient of concentrations, peaking at the centre and decreasing towards the margins, akin to in-vivo gliomas. The fluorescence profile was compared to in-vivo conditions using 5-ALA and correlated well with other reported work, achieving $R^2 > 0.93$. The model's validity was tested by examining the effect of the solvent, DMSO, on the autofluorescence (AF) of the brain sample, and the short-term effect of storage on AF was analysed. These examinations confirmed that the solvent did not alter AF, and that the brain sample should be stored in Hanks' Balanced Salt Solution and refrigerated to maintain moisture and preserve AF. The model accurately replicates surgical fluorescence conditions and offers a suitable alternative to glioma induction, benefiting the development of fluorescence detection devices across design iterations.
In practice, navigation of mobile robots in confined environments is often done using a spatially discrete cost map to represent obstacles. Path following is a typical use case for model predictive control (MPC), but formulating constraints for obstacle avoidance is challenging in this setting. Typically, the cost and constraints of an MPC problem are defined as closed-form functions, and typical solvers work best with continuously differentiable functions. This is contrary to spatially discrete occupancy grid maps, in which a cell's value defines the cost associated with occupancy. This paper presents a way to overcome this compatibility issue by re-formulating occupancy grid maps as continuously differentiable functions that are embedded into the MPC scheme as constraints. Each obstacle is defined as a polygon, i.e., an intersection of half-spaces, and each half-space is a linear inequality representing one edge of the polygon. Using AND and OR operators, the combined set of all obstacles, and therefore the obstacle avoidance constraints, can be described. The key contribution of this paper is the use of fuzzy logic to re-formulate such constraints, which include logical operators, as inequality constraints compatible with standard MPC formulations. The resulting MPC-based trajectory planner is successfully tested in simulation. This concept is also applicable outside of navigation tasks, to implement logical or verbal constraints in MPC.
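The key reformulation can be sketched directly: membership in a polygon's exterior is an OR over edge inequalities, and a smooth maximum turns that OR into a single differentiable constraint. The log-sum-exp smoothing and the sharpness parameter rho below are assumptions of this sketch, not necessarily the paper's exact fuzzy operators.

```python
# Smooth-OR obstacle-avoidance constraint for a polygonal obstacle.
import numpy as np

def smooth_max(values, rho=20.0):
    # differentiable OR: approaches max(values) from above as rho grows
    return np.log(np.sum(np.exp(rho * np.asarray(values)))) / rho

def outside_polygon(p, A, b, rho=20.0):
    """
    Polygon interior is {x : A x <= b}. Point p is outside iff at least one
    edge inequality is violated, i.e. max_i (a_i @ p - b_i) >= 0 (an OR).
    The MPC constraint becomes the smooth inequality outside_polygon(p) >= 0.
    """
    margins = A @ p - b
    return smooth_max(margins, rho)

# unit-square obstacle: x in [0, 1]^2
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float)
b = np.array([1, 0, 1, 0], float)
print(outside_polygon(np.array([2.0, 0.5]), A, b))  # > 0: safely outside
print(outside_polygon(np.array([0.5, 0.5]), A, b))  # < 0: inside, infeasible
```

An AND across several obstacles (the robot must be outside all of them) composes analogously with a smooth minimum, i.e., a negated log-sum-exp of the negated values.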
We present a demo of DQLoRA, an Adapter-Guided Distillation framework for robust speech recognition under low-resource and noisy conditions. Our method employs a frozen Whisper model as the teacher to provide semantic supervision, and a lightweight Wav2Vec2 student equipped with QLoRA-based Adapters. Training is conducted on the FLEURS dataset augmented with DNS-style noise. The student is optimized by jointly minimizing CTC loss and KL-based distillation loss, enabling efficient adaptation while preserving recognition accuracy.
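The joint objective described in the demo can be sketched as follows; the tensor shapes, the distillation weight alpha, and the assumption that teacher and student share an aligned time axis and vocabulary are simplifications for illustration, not DQLoRA's exact configuration.

```python
# Joint CTC + KL-distillation loss for a student ASR model.
import torch
import torch.nn.functional as F

T_len, B, V = 50, 2, 32                       # frames, batch, vocab size
student_logits = torch.randn(T_len, B, V, requires_grad=True)
teacher_logits = torch.randn(T_len, B, V)     # from the frozen teacher
targets = torch.randint(1, V, (B, 12))        # label sequences (0 = blank)
in_lens = torch.full((B,), T_len)
tgt_lens = torch.full((B,), 12)

log_probs = F.log_softmax(student_logits, dim=-1)
ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)

kl = F.kl_div(log_probs,                      # student log-probabilities
              F.softmax(teacher_logits, dim=-1),  # teacher probabilities
              reduction="batchmean")

alpha = 0.5                                   # distillation weight (assumed)
loss = ctc + alpha * kl
loss.backward()
```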
The aerospace industry has experienced significant transformations over the last decade, driven by technological advancements and innovative solutions in goods and personal transportation. This evolution has spurred the emergence of numerous start-ups that now face challenges traditionally encountered by established aerospace companies. Among these challenges is the efficient processing of digital intra-device communication interfaces for onboard equipment - a critical component for ensuring seamless system integration and functionality. Addressing this challenge requires solutions that emphasize clear and consistent interface descriptions, automation of processes, and reduced labor-intensive efforts. This paper presents a novel process and toolchain designed to streamline the development of digital interfaces and onboard software, which our team has successfully applied in several completed projects. The proposed approach focuses on automation and flexibility while maintaining compliance with design assurance requirements.
Federated learning (FL) enables decentralized model training without centralizing raw data. However, practical FL deployments often face a key challenge: clients participate intermittently in server aggregation, with unknown and possibly biased participation probabilities. Most existing convergence results either assume full-device participation or rely on knowledge of the client availability distribution (often assumed uniform) -- assumptions that rarely hold in practice. In this work, we characterize the optimization problem that is consistently solved by the stochastic dynamics of the well-known \emph{agnostic Federated Averaging (FedAvg)} algorithm under random (and variably-sized) client availability, and we rigorously establish its convergence for convex, possibly nonsmooth losses at a standard rate of order $\mathcal{O}(1/\sqrt{T})$, where $T$ denotes the aggregation horizon. Our analysis provides the first convergence guarantees for agnostic FedAvg under general, non-uniform, stochastic client participation, without knowledge of the participation distribution. We also empirically demonstrate that agnostic FedAvg outperforms common (and suboptimal) weighted-aggregation FedAvg variants, even when the latter have server-side knowledge of the participation weights.
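The agnostic aggregation rule itself is simple; this NumPy simulation sketches it under hidden, non-uniform participation probabilities (the local update is a placeholder, not the paper's training procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 5, 20
w = np.zeros(dim)
# True participation probabilities: non-uniform and hidden from the server.
p = rng.uniform(0.1, 0.9, size=n_clients)

def local_update(w, rng):
    """Placeholder for a client's local optimization steps."""
    return w - 0.1 * rng.normal(size=w.shape)

for t in range(100):
    active = rng.random(n_clients) < p               # random availability
    updates = [local_update(w, rng) for flag in active if flag]
    if updates:                                      # skip empty rounds
        w = np.mean(updates, axis=0)                 # equal-weight average
```

The key point of the agnostic rule is the last line: participating clients are averaged with equal weights, with no attempt to estimate or invert the participation distribution.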
Earth observation satellites (EOS) play a pivotal role in capturing and analyzing planetary phenomena, ranging from natural disasters to societal development. The EOS scheduling problem (EOSSP), which optimizes the schedule of EOS, is often solved with respect to nadir-directional EOS systems, thus restricting the observation time of targets and, consequently, the effectiveness of each EOS. This paper leverages state-of-the-art constellation reconfigurability to develop the reconfigurable EOS scheduling problem (REOSSP), wherein EOS are assumed to be maneuverable, re-forming the constellation into an improved configuration at multiple opportunities during a schedule. This paper develops a novel mixed-integer linear programming formulation of the REOSSP that solves the scheduling problem optimally for given parameters. Additionally, since the REOSSP can be computationally expensive for large-scale problems, a rolling horizon procedure (RHP) solution method is developed. The REOSSP is benchmarked against the baseline EOSSP through a set of random instances in which problem characteristics are varied, and through a case study using Hurricane Sandy to demonstrate realistic performance. These experiments demonstrate the value of constellation reconfigurability for EOS scheduling, yielding improved solutions, while the RHP reduces computational runtime for large-scale REOSSP instances.
With the rapid advancement of generative audio models, distinguishing between human-composed and generated music is becoming increasingly challenging. In response, models for detecting fake music have been proposed. In this work, we explore the robustness of such systems under audio augmentations. To evaluate model generalization, we construct a dataset consisting of both real music and synthetic music generated by several systems. We then apply a range of audio transformations and analyze how they affect classification accuracy. Testing a recent state-of-the-art musical deepfake detection model under these augmentations, we find that its performance degrades significantly even when only light augmentations are introduced.
Non-metric music forms the core of the repertoire in Iranian classical music. Dastgahi music serves as the underlying theoretical system for both Iranian art music and certain folk traditions. At the heart of Iranian classical music lies the radif, a foundational repertoire that organizes the melodic material central to performance and pedagogy. In this study, we introduce the first digital corpus representing the complete non-metrical radif repertoire, covering all 13 existing components of this repertoire. We provide MIDI files (about 281 minutes in total) and data spreadsheets describing notes, note durations, intervals, and hierarchical structures for 228 pieces of music. We faithfully represent the tonality, including quarter-tones, as well as the non-metric aspect. Furthermore, we provide basic supporting statistics and measures of complexity and similarity over the corpus. Our corpus provides a platform for computational studies of Iranian classical music. Researchers might employ it to study melodic patterns, to investigate improvisational styles, or for other tasks in music information retrieval, music theory, and computational (ethno)musicology.
Pansharpening refers to the process of integrating a high-resolution panchromatic (PAN) image with a lower-resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.
Masked Autoencoders (MAEs) trained on audio spectrogram patches have emerged as a prominent approach for learning self-supervised audio representations. While several recent papers have evaluated key aspects of training MAEs on audio data, the majority of these approaches still leverage vanilla transformer building blocks, whereas the transformer community has seen steady integration of newer architectural advancements. In this work, we propose AudioMAE++, a revamped audio masked autoencoder with two such enhancements, namely macaron-style transformer blocks and gated linear units. When pretrained on the AudioSet dataset, the proposed AudioMAE++ models outperform existing MAE-based approaches on 10 diverse downstream tasks, demonstrating excellent performance on audio classification and speech-based benchmarks. The proposed AudioMAE++ models also demonstrate excellent scaling characteristics, outperforming directly comparable standard MAE baselines that have up to 4x more parameters.
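A hedged PyTorch sketch of the two named enhancements, combining a macaron-style block (half-step feed-forward, attention, half-step feed-forward) with a GEGLU-style gated feed-forward; widths, norms, and head counts are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Feed-forward with a gated linear unit (GEGLU-style gate)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):
        a, g = self.proj(x).chunk(2, dim=-1)
        return self.out(a * torch.nn.functional.gelu(g))

class MacaronBlock(nn.Module):
    """Macaron-style block: half-step FFN, attention, half-step FFN."""
    def __init__(self, dim=768, heads=12, hidden=3072):
        super().__init__()
        self.ff1 = GLUFeedForward(dim, hidden)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = GLUFeedForward(dim, hidden)
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x):
        x = x + 0.5 * self.ff1(self.n1(x))      # first half-step FFN
        h = self.n2(x)
        a, _ = self.attn(h, h, h)               # self-attention
        x = x + a
        return x + 0.5 * self.ff2(self.n3(x))   # second half-step FFN
```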
In off-axis Quantitative Phase Imaging (QPI), artificial neural networks have recently been applied to phase retrieval with aberration compensation and phase unwrapping. However, the neural network architectures involved are largely unoptimized and inefficient, with low inference speed, which hinders real-time imaging. Here, we propose a Neural Architecture Search (NAS) generated Phase Retrieval Net (NAS-PRNet) for accurate and fast phase retrieval. NAS-PRNet is an encoder-decoder style neural network, automatically found from a large neural network architecture search space through NAS. By modifying the differentiable NAS scheme from SparseMask, we learn the optimized skip connections through gradient descent. Specifically, we implement MobileNet-v2 as the encoder and define a synthesized loss that incorporates a phase reconstruction loss and a network sparsity loss. NAS-PRNet achieves high-fidelity phase retrieval, attaining a peak Signal-to-Noise Ratio (PSNR) of 36.7 dB and a Structural SIMilarity (SSIM) of 86.6% when tested on interferograms of biological cells. Notably, NAS-PRNet completes phase retrieval in only 31 ms, a 15x speedup over the most recent Mamba-UNet, with only slightly lower phase retrieval accuracy.
The development of signal unmixing algorithms is essential for leveraging multimodal datasets acquired through a wide array of scientific imaging technologies, including hyperspectral or time-resolved acquisitions. In experimental physics, enhancing the spatio-temporal resolution or expanding the number of detection channels often leads to diminished sampling rate and signal-to-noise ratio, significantly affecting the efficacy of signal unmixing algorithms. We propose Latent Unmixing, a new approach which applies bandpass filters to the latent space of a multidimensional convolutional neural network to disentangle overlapping signal components. It enables better isolation and quantification of individual signal contributions, especially in the context of undersampled distributions. Using multidimensional convolution kernels to process all dimensions simultaneously enhances the network's ability to extract information from adjacent pixels, and time or spectral bins. This approach enables more effective separation of components in cases where individual pixels do not provide clear, well-resolved information. We showcase the method's practical use in experimental physics through two test cases that highlight the versatility of our approach: fluorescence lifetime microscopy and mode decomposition in optical fibers. The latent unmixing method extracts valuable information from complex signals that cannot be resolved by standard methods. It opens up new possibilities in optics and photonics for multichannel separation at an increased sampling rate.
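The core operation, bandpass filtering applied to latent activations, can be sketched in a few lines of NumPy; the hard FFT mask and single filtered band below are simplifications of the paper's multidimensional architecture:

```python
import numpy as np

def bandpass_latent(latent, low, high, axis=-1):
    """Apply a hard bandpass filter along one latent dimension via the
    FFT, keeping only normalized frequencies in [low, high]."""
    spec = np.fft.rfft(latent, axis=axis)
    freqs = np.fft.rfftfreq(latent.shape[axis])
    mask = (freqs >= low) & (freqs <= high)
    shape = [1] * latent.ndim
    shape[axis] = mask.size
    return np.fft.irfft(spec * mask.reshape(shape),
                        n=latent.shape[axis], axis=axis)

# Example: filter the temporal axis of a (channels, height, width, time)
# latent tensor to isolate one component's characteristic frequency band.
latent = np.random.randn(16, 8, 8, 128)
component = bandpass_latent(latent, low=0.05, high=0.15, axis=-1)
```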
Hemodynamic parameters are often estimated assuming a constant Newtonian viscosity, even though blood exhibits shear-thinning behavior. This article investigates the influence of blood rheology and hematocrit (Hct) percentage on the estimation of Wall Shear Stress (WSS), rate of viscous Energy Loss ($\dot{E}_L$) at different points in the cardiac cycle, and the Oscillatory Shear Index (OSI). We focus on a hematocrit-dependent power-law non-Newtonian model, considering a wide range of Hct values at physiological temperature, with rheological parameters obtained from previously reported experimental data. In all cases, we systematically compared WSS, $\dot{E}_L$, and OSI using both Newtonian and power-law models, underscoring the crucial role of blood rheology in accurately assessing cardiovascular diseases. Our results show that, in in-silico experiments, differences in WSS and $\dot{E}_L$ across a wide range of Hct values can reach as high as 190\% and 113\% at systole, and as low as -72\% and -74\% at diastole, respectively. In in-vivo data, differences in WSS and $\dot{E}_L$ can reach up to -45\% and -60\% at systole, and range from -69\% to 73\% at diastole. This study enhances our understanding of the impact of blood rheology on hemodynamic parameter estimations using both in-silico and in-vivo aortic 4D Flow MRI data.
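The two viscosity models can be contrasted with a short sketch; the power-law parameters k and n below are illustrative stand-ins for values fitted to rheometry data at a given hematocrit, not the paper's coefficients:

```python
import numpy as np

def wss_newtonian(gamma_dot, mu=3.5e-3):
    """Wall shear stress [Pa] under a constant-viscosity assumption
    (3.5 mPa*s is a commonly used Newtonian value for blood)."""
    return mu * gamma_dot

def wss_power_law(gamma_dot, k, n):
    """Wall shear stress [Pa] with a shear-thinning power law:
    mu(gamma_dot) = k * gamma_dot**(n - 1), hence tau = k * gamma_dot**n.
    k [Pa*s^n] and n < 1 would be fitted per hematocrit from rheometry."""
    return k * gamma_dot**n

gamma_dot = np.logspace(0, 3, 4)  # shear rates from 1 to 1000 1/s
print(wss_newtonian(gamma_dot))
print(wss_power_law(gamma_dot, k=0.017, n=0.708))  # illustrative parameters
```

At low shear rates the power-law stress exceeds the Newtonian one, while at high shear rates the ordering reverses, which is the mechanism behind the systole/diastole sign changes reported in the abstract.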
This letter proposes a method to integrate auxiliary actuators that enhance the task-space capabilities of commercial underactuated systems, while leaving the internal certified low-level controller untouched. The additional actuators are combined with a feedback-linearizing outer-loop controller, enabling full-pose tracking. We provide conditions under which legacy high-level commands and new actuator inputs can be cohesively coordinated to achieve decoupled control of all degrees of freedom. A comparative study with a standard quadrotor (originally not designed for physical interaction) demonstrates that the proposed modified platform remains stable under contact, while the baseline system diverges. Additionally, simulation results under parameter uncertainty illustrate the robustness of the proposed approach.
Photovoltaic (PV) systems allow us to tap into abundant solar energy; however, they require regular maintenance to sustain high efficiency and prevent degradation. Traditional manual health checks using Electroluminescence (EL) imaging are expensive and logistically challenging, which makes automated defect detection essential. Current automation approaches require extensive manual expert labeling, which is time-consuming, expensive, and prone to errors. We propose PV-S3 (Photovoltaic-Semi-supervised Semantic Segmentation), a Semi-Supervised Learning approach for semantic segmentation of defects in EL images that reduces reliance on extensive labeling. PV-S3 is an artificial intelligence (AI) model trained using a few labeled images along with numerous unlabeled images. We introduce a novel Semi Cross-Entropy loss function to deal with class imbalance. We evaluate PV-S3 on multiple datasets and demonstrate its effectiveness and adaptability. With merely 20% labeled samples, we achieve an absolute improvement of 9.7% in mean Intersection-over-Union (mIoU), 13.5% in Precision, 29.15% in Recall, and 20.42% in F1-Score over the prior state-of-the-art supervised method (which uses 100% labeled samples) on the University of Central Florida-Electroluminescence (UCF-EL) dataset (the largest dataset available for semantic segmentation of EL images), improving performance while reducing annotation costs by 80%. For more details, visit our GitHub repository: this https URL.
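A generic PyTorch sketch of a class-weighted semi-supervised segmentation loss, combining weighted cross-entropy on labeled pixels with confidence-thresholded pseudo-labels on unlabeled pixels; the paper's Semi Cross-Entropy loss differs in its details:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_l, labels, logits_u, class_weights,
                         threshold=0.9, lam=1.0):
    """logits_l: (B, C, H, W) predictions on labeled images;
    labels: (B, H, W) ground-truth class indices;
    logits_u: (B, C, H, W) predictions on unlabeled images;
    class_weights: (C,) weights countering class imbalance."""
    sup = F.cross_entropy(logits_l, labels, weight=class_weights)

    # Pseudo-labels: keep only high-confidence unlabeled pixels.
    with torch.no_grad():
        probs = F.softmax(logits_u, dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf > threshold

    unsup = F.cross_entropy(logits_u, pseudo, weight=class_weights,
                            reduction="none")
    unsup = (unsup * mask).sum() / mask.sum().clamp(min=1)
    return sup + lam * unsup
```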
We investigate the real-time voltage regulation problem in distribution systems employing online feedback optimization (OFO) with short-range communication between physical neighbours. OFO does not need an accurate grid model nor estimated consumption of non-controllable loads, affords fast calculations, and demonstrates robustness to uncertainties and disturbances, which render it particularly suitable for real-time distribution system applications. However, many OFO controllers require centralized communication, making them susceptible to single-point failures. This paper proposes a distributed OFO design based on a nested feedback optimization strategy and analyzes its convergence. The strategy preserves end-users' privacy by keeping voltage data local. Numerical study results demonstrate that the proposed design achieves effective voltage regulation and outperforms other distributed and local approaches.
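The basic OFO update can be sketched as a projected gradient step driven by measured voltages; this centralized single-step version only illustrates the mechanism, whereas the paper's design is distributed and nested:

```python
import numpy as np

def ofo_step(u, v_meas, v_ref, S, alpha=0.1, u_min=-1.0, u_max=1.0):
    """One feedback-optimization step: gradient of the tracking cost
    0.5 * ||v(u) - v_ref||^2 through an approximate sensitivity matrix
    S (dv/du), followed by projection onto actuator limits. Measured
    voltages v_meas stand in for an accurate grid model."""
    grad = S.T @ (v_meas - v_ref)
    return np.clip(u - alpha * grad, u_min, u_max)

# Toy closed loop: v(u) = v0 + S u, regulated towards 1.0 p.u.
S = np.array([[0.8, 0.2], [0.2, 0.8]])   # illustrative sensitivities
v0 = np.array([1.05, 0.93])              # uncontrolled voltages
u = np.zeros(2)
for _ in range(50):
    v = v0 + S @ u                        # "measurement" from the grid
    u = ofo_step(u, v, np.ones(2), S)
print(v)                                  # close to [1.0, 1.0]
```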
In this study, we introduce MGA-Net, a novel mask-guided attention neural network, which extends the U-net model for precision neonatal brain imaging. MGA-Net is designed to extract the brain from other structures and reconstruct high-quality brain images. The network employs a common encoder and two decoders: one for brain mask extraction and the other for brain region reconstruction. A key feature of MGA-Net is its high-level mask-guided attention module, which leverages features from the brain mask decoder to enhance image reconstruction. To enable the same encoder and decoder to process both MRI and ultrasound (US) images, MGA-Net integrates sinusoidal positional encoding. This encoding assigns distinct positional values to MRI and US images, allowing the model to effectively learn from both modalities. Consequently, features learned from a single modality can aid in learning a modality with less available data, such as US. We extensively validated the proposed MGA-Net on diverse and independent datasets from varied clinical settings and neonatal age groups. The metrics used for assessment included the DICE similarity coefficient, recall, and accuracy for image segmentation; structural similarity for image reconstruction; and root mean squared error for total brain volume estimation from 3D ultrasound images. Our results demonstrate that MGA-Net significantly outperforms traditional methods, offering superior performance in brain extraction and segmentation while achieving high precision in image reconstruction and volumetric analysis. Thus, MGA-Net represents a robust and effective preprocessing tool for MRI and 3D ultrasound images, marking a significant advance in neuroimaging that enhances both research and clinical diagnostics in the neonatal period and this http URL code is available at this https URL
Infrared thermography, which spread widely during the COVID-19 period, has been used effectively for research on health monitoring and emotion estimation. Nevertheless, detecting minute temperature changes with thermography is challenging, as measurements are disturbed not only by noise but also by the ambient temperature surrounding the object. In this study, we demonstrate detection of facial temperature variations by implementing lock-in thermography using heartbeat signals as a reference. This allows us to detect minute temperature changes, as low as $\sim$10 mK, on the forehead with a commercially available thermal camera. The proposed approach enables stable measurement of body temperature variation, showing potential for non-contact emotion estimation.
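Lock-in demodulation against a heartbeat reference reduces to mixing with quadrature references and low-pass filtering; the NumPy sketch below uses a synthetic 10 mK modulation and simple averaging in place of a proper low-pass filter:

```python
import numpy as np

def lock_in_amplitude(signal, ref_phase):
    """Demodulate a per-pixel thermal time series against a
    heartbeat-derived reference phase (radians): mix with quadrature
    references and low-pass by averaging over the whole record. A real
    pipeline would use a proper low-pass filter and handle drift."""
    i = np.mean(signal * np.cos(ref_phase))
    q = np.mean(signal * np.sin(ref_phase))
    return 2.0 * np.hypot(i, q)

fs = 30.0                                   # camera frame rate, Hz
t = np.arange(0, 60, 1 / fs)
ref_phase = 2 * np.pi * 1.2 * t             # ~72 bpm reference signal
pixel = 0.010 * np.sin(ref_phase) + 0.005 * np.random.randn(t.size)
print(lock_in_amplitude(pixel, ref_phase))  # recovers ~0.010 (kelvin)
```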
Monitoring electricity consumption at the appliance level is crucial for increasing energy efficiency in residential and commercial buildings. Using a single meter, non-intrusive load monitoring (NILM) breaks household consumption down to the appliance level, providing comprehensive insights into end-user electricity behavior. NILM models are trained on a household's total power consumption paired with submetered appliance labels. When sampled at high frequencies ($\geq$ 1 kHz), these datasets capture the full waveform characteristics, significantly improving disaggregation accuracy and model generalization. Nevertheless, such datasets are scarce, collected from a limited number of households, and rarely include labels for power estimation, which complicates their use for model training, evaluation, or debugging. We propose HiFAKES, a pre-trained synthetic data generator that can instantly generate unlimited amounts of fully labeled high-frequency NILM data, including aggregated and submetered current signatures. The data is ready to use and annotated for load identification (classification) and power estimation (regression). It allows simulating seen and completely unseen scenarios of appliance behavior with full control over the number of appliance classes, operational modes, class similarity, brand diversity, and the number of concurrently running devices. We propose a structured methodology to test the generalization of NILM models on simulated unseen households. The reliability of the HiFAKES synthetic data is assessed using a domain-agnostic 3-dimensional metric. The generated signatures achieve high realism (93\% authenticity), closely resemble real-world data (84\% fidelity), and include a reasonable portion of unseen signatures (5\%).
Variable renewable energy droughts, also called "Dunkelflaute", emerge as a challenge for climate-neutral energy systems based on variable renewables. Drawing on 38 historic weather years and an advanced identification method, we characterize European drought events for on- and offshore wind power, solar photovoltaics, and renewable technology portfolios. We show that their characteristics heavily depend on the chosen drought threshold, questioning the usefulness of single-threshold analyses. Applying a multi-threshold framework, we quantify how the complementarity of wind and solar power temporally and spatially alleviates drought frequency, duration, and severity within (portfolio effect) and across countries (balancing effect). We identify the most extreme droughts and show how these drive major discharging periods of long-duration storage in a fully renewable European energy system, based on a policy-relevant decarbonization scenario. Such events comprise sequences of shorter droughts of varying severity. The most extreme event occurred in winter 1996/97 and lasted 55 days in a perfectly interconnected setting. While the average renewable availability during this period was still 47% of its long-run mean, we argue that system planners must consider such events when planning for storage and other flexibility technologies. Methodologically, we conclude that using single calendar years is not suitable for modeling weather-resilient energy scenarios.
The prediction-based nonlinear reference governor (PRG) is an add-on algorithm that enforces constraints on pre-stabilized nonlinear systems by modifying, whenever necessary, the reference signal. The implementation of PRG carries a heavy computational burden, as it may require multiple numerical simulations of the plant model at each sample time. To address this, this paper proposes an alternative approach based on machine learning, where we first use a regression neural network (NN) to approximate the input-output map of the PRG from a set of training data. During real-time operation, at each sample time, we use the trained NN to compute a nominal reference command, which may not be constraint admissible due to training errors and limited data. We adopt a novel sensitivity-based approach to minimally adjust the nominal reference while ensuring constraint enforcement. We refer to the resulting control strategy as the modified neural network reference governor (MNN-RG), which is significantly more computationally efficient than the PRG. The computational and theoretical properties of the MNN-RG are presented. Finally, the effectiveness and limitations of the proposed method are studied by applying it as a load governor for constraint management in automotive fuel cell systems through simulation-based case studies.
This paper introduces a novel parameterization to characterize unknown linear time-invariant systems from noisy data. The presented parameterization describes exactly the set of all systems consistent with the available data. We then derive verifiable conditions under which the consistency constraint reduces this set to the true system, and under which it has no impact. Furthermore, we demonstrate how to use this parameterization to perform a direct data-driven estimator synthesis with guarantees on the $H_{\infty}$-norm. Lastly, we conduct numerical experiments to compare our approach with existing methods.
Despite their simple and robust structure, low cost, and simple cooling requirements, switched reluctance motors (SRMs) face the challenge of low mean torque. A possible solution is to change the structure of the SRM. This article introduces an innovative combination of rotor and stator tooth counts for a two-phase switched reluctance motor (TPSRM), with eight stator teeth and fourteen rotor teeth. Thanks to its unique design, which provides a short path for the main flux, it requires less magnetomotive force; this reduces core and copper losses, resulting in increased efficiency. Each stator tooth of a phase develops positive torque during rotor rotation, which increases the mean torque of the proposed TPSRM. Current hysteresis control (CHC) with a 15 A reference at the nominal speed of 600 rpm is simulated by 2D FEM for both the proposed 8/14 TPSRM and the conventional 8/12 TPSRM under the same mechanical load on the shaft. To verify the novelty and advantages of the proposed TPSRM, it is compared with the conventional 8/12 TPSRM in terms of mean and peak torque, torque density, and core and copper losses. The proposed 8/14 TPSRM is shown to deliver better performance than the conventional 8/12 TPSRM.
This paper proposes a differentially private gradient-tracking-based distributed stochastic optimization algorithm over directed graphs. In particular, privacy noises are incorporated into each agent's state and tracking variable to mitigate information leakage, after which the perturbed states and tracking variables are transmitted to neighbors. We design two novel schemes for the step-sizes and the sampling number within the algorithm. The sampling parameter-controlled subsampling method employed by both schemes enhances the differential privacy level, and ensures a finite cumulative privacy budget even over infinite iterations. The algorithm achieves both almost sure and mean square convergence for nonconvex objectives. Furthermore, when nonconvex objectives satisfy the Polyak-Lojasiewicz condition, Scheme (S1) achieves a polynomial mean square convergence rate, and Scheme (S2) achieves an exponential mean square convergence rate. The trade-off between privacy and convergence is presented. The effectiveness of the algorithm and its superior performance compared to existing works are illustrated through numerical examples of distributed training on the benchmark datasets "MNIST" and "CIFAR-10".
Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper, we propose a novel framework, WaveNet-SF, to enhance retinal disease detection by integrating the spatial-domain and frequency-domain learning. The framework utilizes wavelet transforms to decompose OCT images into low- and high-frequency components, enabling the model to extract both global structural features and fine-grained details. To improve lesion detection, we introduce a Multi-Scale Wavelet Spatial Attention (MSW-SA) module, which enhances the model's focus on regions of interest at multiple scales. Additionally, a High-Frequency Feature Compensation (HFFC) block is incorporated to recover edge information lost during wavelet decomposition, suppress noise, and preserve fine details crucial for lesion detection. Our approach achieves state-of-the-art (SOTA) classification accuracies of 97.82% and 99.58% on the OCT-C8 and OCT2017 datasets, respectively, surpassing existing methods. These results demonstrate the efficacy of WaveNet-SF in addressing the challenges of OCT image analysis and its potential as a powerful tool for retinal disease diagnosis.
Multi-modal brain MRI provides essential complementary information for clinical diagnosis. However, acquiring all modalities in practice is often constrained by time and cost. To address this, various methods have been proposed to generate missing modalities from available ones. Traditional approaches can be broadly categorized into two main types: paired and unpaired methods. While paired methods for synthesizing missing modalities achieve high accuracy, obtaining large-scale paired datasets is typically impractical. In contrast, unpaired methods, though scalable, often fail to preserve critical anatomical features, such as lesions. In this paper, we propose the Fully Guided Schrödinger Bridge (FGSB), a novel framework designed to overcome these limitations by enabling high-fidelity generation with extremely limited paired data. Furthermore, when provided with lesion-specific information such as expert annotations, segmentation tools, or simple intensity thresholds for critical regions, FGSB can generate missing modalities while preserving these significant lesions, with reduced data requirements. Our model comprises two phases: 1) a generation phase, which iteratively refines synthetic images using the paired target image and Gaussian noise; and 2) a training phase, which learns optimal transformation pathways from the source to the target modality by mapping all intermediate states, ensuring consistent and high-fidelity synthesis. Experimental results across multiple datasets demonstrate that FGSB achieves performance comparable to models trained on large datasets while using only two subjects. Incorporating lesion-specific priors further improves the preservation of clinical features.
An efficient framework is conceived for fractional matrix programming (FMP) optimization problems (OPs) namely for minimization and maximization. In each generic OP, either the objective or the constraints are functions of multiple arbitrary continuous-domain fractional functions (FFs). This ensures the framework's versatility, enabling it to solve a broader range of OPs than classical FMP solvers, like Dinkelbach-based algorithms. Specifically, the generalized Dinkelbach algorithm can only solve multiple-ratio FMP problems. By contrast, our framework solves OPs associated with a sum or product of multiple FFs as the objective or constraint functions. Additionally, our framework provides a single-loop solution, while most FMP solvers require twin-loop algorithms. Many popular performance metrics of wireless communications are FFs. For instance, latency has a fractional structure, and minimizing the sum delay leads to an FMP problem. Moreover, the mean square error (MSE) and energy efficiency (EE) metrics have fractional structures. Thus, optimizing EE-related metrics such as the sum or geometric mean of EEs and enhancing the metrics related to spectral-versus-energy-efficiency tradeoff yield FMP problems. Furthermore, both the signal-to-interference-plus-noise ratio and the channel dispersion are FFs. In this paper, we also develop resource allocation schemes for multi-user multiple-input multiple-output (MU-MIMO) systems, using finite block length (FBL) coding, demonstrating attractive practical applications of FMP by optimizing the aforementioned metrics.
The application of machine learning in wireless communications has been extensively explored, with deep unfolding emerging as a powerful model-based technique. Deep unfolding enhances interpretability by transforming complex iterative algorithms into structured layers of deep neural networks (DNNs). This approach seamlessly integrates domain knowledge with deep learning (DL), leveraging the strengths of both methods to simplify complex signal processing tasks in communication systems. To provide a solid foundation, we first present a brief overview of DL and deep unfolding. We then explore the applications of deep unfolding in key areas, including signal detection, channel estimation, beamforming design, decoding for error-correcting codes, sensing and communication, power allocation, and security. Each section focuses on a specific task, highlighting its significance in emerging 6G technologies and reviewing recent advancements in deep unfolding-based solutions. Finally, we discuss the challenges associated with developing deep unfolding techniques and propose potential improvements to enhance their applicability across diverse wireless communication scenarios.
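As a concrete illustration of deep unfolding, the classic learned ISTA (LISTA) construction turns each iteration of a sparse-recovery algorithm into a trainable layer; this textbook PyTorch sketch is generic, not taken from any specific surveyed paper:

```python
import torch
import torch.nn as nn

class UnfoldedISTA(nn.Module):
    """ISTA for sparse recovery (solve y = A x with sparse x) unfolded
    into K layers: each iteration becomes a layer whose matrices and
    soft-threshold are learned from data instead of fixed by A."""
    def __init__(self, m, n, K=10):
        super().__init__()
        self.W = nn.ModuleList(nn.Linear(m, n, bias=False) for _ in range(K))
        self.S = nn.ModuleList(nn.Linear(n, n, bias=False) for _ in range(K))
        self.theta = nn.Parameter(torch.full((K,), 0.1))  # thresholds

    def forward(self, y):
        x = torch.zeros(y.shape[0], self.S[0].in_features, device=y.device)
        for k, (W, S) in enumerate(zip(self.W, self.S)):
            z = W(y) + S(x)
            # Soft-thresholding: the nonlinearity of each unfolded layer.
            x = torch.sign(z) * torch.relu(z.abs() - self.theta[k])
        return x
```

The same recipe, replacing an iterative algorithm's fixed parameters with layer-wise learnable ones, underlies the detection, estimation, and beamforming designs surveyed in this paper.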
The increasing penetration of Distributed Energy Resources (DERs) in the distribution system has led to the emergence of a new market actor: the aggregator. The aggregator acts as a facilitator, enabling flexibility asset owners to access different markets. Among aggregators, EV aggregators are gaining attention due to the expanding use of EVs and their potential to provide services in various types of markets, particularly the reserve market. Currently, the TSO indirectly utilizes these resources under the management of the distribution system operator (DSO), which can negatively impact the distribution grid. Conversely, adjustments by DSOs can impair service provision to the TSO because information about TSO usage is lacking. These factors highlight the importance of evaluating service provision from aggregators under different TSO-DSO coordination schemes. This paper focuses on the provision of flexibility by electric vehicle (EV) aggregators for balancing services under a TSO-DSO hybrid-managed coordination scheme and compares it with a DSO-managed scheme. The behavior of aggregators reacting to price fluctuations and TSO requests under different coordination schemes and simulation scenarios is thoroughly evaluated. Additionally, their impact on the grid is analyzed through the DSO's congestion management process and validated using data from a real portion of the Dutch distribution network. The results show that the hybrid-managed coordination scheme benefits the aggregator more than the DSO-managed scheme, and that the EV aggregator earns more profit in winter than in summer because more upward regulation service is needed.
Deception jamming has long been a significant threat to radar systems, interfering with search, acquisition, and tracking by introducing false information that diverts attention from the targets of interest. As deception strategies become more sophisticated, the vulnerability of radar systems to these attacks continues to escalate. This paper offers a comprehensive review of the evolution of anti-deception jamming techniques, starting with legacy solutions and progressing to the latest advancements. Current research is categorized into three key areas: prevention strategies, which hinder the ability of jammers to alter radar processing; detection strategies, which alert the system to deception and may classify the type of attack; and mitigation strategies, which aim to reduce or suppress the impact of jamming. Additionally, key avenues for further research are highlighted, with a particular emphasis on distributed, cognitive, and AI-enabled radar systems. We envision this paper as a gateway to the existing literature on anti-deception jamming, a critical area for safeguarding radar systems against evolving threats.
Diabetic retinopathy (DR) is a leading cause of vision loss, requiring early and accurate assessment to prevent irreversible damage. Spectral Domain Optical Coherence Tomography (SD-OCT) enables high-resolution retinal imaging, but automated segmentation performance varies, especially in cases with complex fluid and hyperreflective foci (HRF) patterns. This study proposes an active-learning-based deep learning pipeline for automated segmentation of retinal layers, fluid, and HRF, using four state-of-the-art models: U-Net, SegFormer, SwinUNETR, and VM-UNet, trained on expert-annotated SD-OCT volumes. Segmentation accuracy was evaluated with five-fold cross-validation, and retinal thickness was quantified using a K-nearest neighbors algorithm and visualized with Early Treatment Diabetic Retinopathy Study (ETDRS) maps. SwinUNETR achieved the highest overall accuracy (DSC = 0.7719; NSD = 0.8149), while VM-UNet excelled in specific layers. Structural differences were observed between non-proliferative and proliferative DR, with layer-specific thickening correlating with visual acuity impairment. The proposed framework enables robust, clinically relevant DR assessment while reducing the need for manual annotation, supporting improved disease monitoring and treatment planning.
When detecting anomalous sounds in complex environments, one of the main difficulties is that trained models must be sensitive to subtle differences in monitored target signals, while many practical applications also require them to be insensitive to changes in acoustic domains. Examples of such domain shifts include changing the type of microphone or the location of acoustic sensors, which can have a much stronger impact on the acoustic signal than subtle anomalies themselves. Moreover, users typically aim to train a model only on source domain data, which they may have a relatively large collection of, and they hope that such a trained model will be able to generalize well to an unseen target domain by providing only a minimal number of samples to characterize the acoustic signals in that domain. In this work, we review and discuss recent publications focusing on this domain generalization problem for anomalous sound detection in the context of the DCASE challenges on acoustic machine condition monitoring.
SARS-CoV-2, the causative agent of COVID-19, remains a global health concern due to its high transmissibility and evolving variants. Although vaccination efforts and therapeutic advancements have mitigated disease severity, emerging mutations continue to challenge diagnostics and containment strategies. As of mid-February 2025, global test positivity has risen to 11%, marking the highest level in over six months despite widespread immunization efforts. Newer variants demonstrate enhanced host cell binding, increasing both infectivity and diagnostic complexity. This study evaluates the effectiveness of deep transfer learning in delivering rapid, accurate, and mutation-resilient COVID-19 diagnosis from medical imaging, with a focus on scalability and accessibility. We developed an automated detection system using state-of-the-art CNNs, including VGG16, ResNet50, ConvNeXt-Tiny, MobileNet, NASNetMobile, and DenseNet121, among others, to detect COVID-19 from chest X-ray and CT images. Among all the models evaluated, DenseNet121 emerged as the best-performing architecture for COVID-19 diagnosis using CT and X-ray images. It achieved an accuracy of 98%, with 96.9% precision, 98.9% recall, a 97.9% F1-score, and a 99.8% AUC, indicating a high degree of consistency and reliability in detecting both positive and negative cases. The confusion matrix showed minimal false positives and false negatives, underscoring the model's robustness in real-world diagnostic scenarios.
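A hedged sketch of the transfer-learning setup with DenseNet121; the abstract does not state the framework used, so torchvision (>= 0.13) is used here purely for illustration:

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet weights, freeze the backbone, and replace the classifier
# head with a binary COVID-19 / non-COVID output.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                      # keep pretrained features
model.classifier = nn.Linear(model.classifier.in_features, 2)
# Only model.classifier.parameters() would be passed to the optimizer;
# unfreezing deeper blocks for fine-tuning is a common variation.
```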
We address the real-time remote tracking problem in a status update system comprising two sensors, two independent information sources, and a remote monitor. The status updating follows pull-based communication, where the monitor commands/pulls the sensors for status updates, i.e., the actual states of the sources. We consider correlated observations, meaning that the data sent by each sensor could also include the state of the other source due to, e.g., inter-sensor communication or proximity-based monitoring. The effectiveness of data communication is measured by a generic distortion capturing the underlying application goal. We provide optimal command/pulling policies for the monitor that minimize the average weighted sum of distortion and transmission cost. Since the monitor cannot fully observe the exact state of each source, we formulate a partially observable Markov decision process (POMDP) and reformulate it as a belief MDP problem. We then effectively truncate the infinite belief space and transform the problem into a finite-state MDP, which is solved via relative value iteration. Simulation results show the effectiveness of the derived policy over age-based and deep Q-network baseline policies.
Positron emission tomography (PET) image denoising, along with lesion and organ segmentation, are critical steps in PET-aided diagnosis. However, existing methods typically treat these tasks independently, overlooking inherent synergies between them as correlated steps in the analysis pipeline. In this work, we present the anatomically and metabolically informed diffusion (AMDiff) model, a unified framework for denoising and lesion/organ segmentation in low-count PET imaging. By integrating multi-task functionality and exploiting the mutual benefits of these tasks, AMDiff enables direct quantification of clinical metrics, such as total lesion glycolysis (TLG), from low-count inputs. The AMDiff model incorporates a semantic-informed denoiser based on diffusion strategy and a denoising-informed segmenter utilizing nnMamba architecture. The segmenter constrains denoised outputs via a lesion-organ-specific regularizer, while the denoiser enhances the segmenter by providing enriched image information through a denoising revision module. These components are connected via a warming-up mechanism to optimize multi-task interactions. Experiments on multi-vendor, multi-center, and multi-noise-level datasets demonstrate the superior performance of AMDiff. For test cases below 20% of the clinical count levels from participating sites, AMDiff achieves TLG quantification biases of -21.60%, outperforming its ablated versions which yield biases of -30.83% (without the lesion-organ-specific regularizer) and -35.63% (without the denoising revision module).
We derive novel deterministic bounds on the approximation error of data-based bilinear surrogate models for unknown nonlinear systems. The surrogate models are constructed using kernel-based extended dynamic mode decomposition to approximate the Koopman operator in a reproducing kernel Hilbert space. Unlike previous methods that require restrictive assumptions on the invariance of the dictionary, our approach leverages kernel-based dictionaries that allow us to control the projection error via pointwise error bounds, overcoming a significant limitation of existing theoretical guarantees. The derived state- and input-dependent error bounds allow for direct integration into Koopman-based robust controller designs with closed-loop guarantees for the unknown nonlinear system. Numerical examples illustrate the effectiveness of the proposed framework.
Variations in Magnetic Resonance Imaging (MRI) scanners and acquisition protocols cause distribution shifts that degrade reconstruction performance on unseen data. Test-time adaptation (TTA) offers a promising solution to address this discrepancy. However, previous single-shot TTA approaches are inefficient due to repeated training and suboptimal distributional models. Self-supervised learning methods may also risk over-smoothing in scarce-data scenarios. To address these challenges, we propose a novel Dual-Stage Distribution and Slice Adaptation (D2SA) method via MRI implicit neural representation (MR-INR) to improve MRI reconstruction performance and efficiency, featuring two stages. In the first stage, an MR-INR branch performs patient-wise distribution adaptation by learning shared representations across slices and modelling patient-specific shifts with mean and variance adjustments. In the second stage, single-slice adaptation refines the output from frozen convolutional layers with a learnable anisotropic diffusion module, preventing over-smoothing and reducing computation. Experiments across five MRI distribution shifts demonstrate that our method integrates well with various self-supervised learning (SSL) frameworks, improving performance and accelerating convergence under diverse conditions.
Sensing-assisted communication schemes have recently garnered significant research attention. In this work, we design a dual-function reconfigurable intelligent surface (RIS), integrating both active and passive elements, referred to as the reconfigurable intelligent sensing surface (RISS), to enhance communication. By leveraging sensing results from the active elements, we propose communication enhancement and robust interference suppression schemes for both near-field and far-field models, implemented through the passive elements. These schemes remove the need for base station (BS) feedback for RISS control, simplifying the communication process by replacing traditional channel state information (CSI) feedback with real-time sensing from the active elements. The proposed schemes are theoretically analyzed and then validated using software-defined radio (SDR). Experimental results demonstrate the effectiveness of the sensing algorithms in real-world scenarios, such as direction of arrival (DOA) estimation and radio frequency (RF) identification recognition. Moreover, the RISS-assisted communication system shows strong performance in communication enhancement and interference suppression, particularly in near-field models.
Dynamic metabolic control allows key metabolic fluxes to be modulated in real time, enhancing bioprocess flexibility and expanding the available optimization degrees of freedom. This is achieved, e.g., via targeted modulation of metabolic enzyme expression. However, identifying optimal dynamic control policies is challenging due to the generally high-dimensional solution space and the need to manage the metabolic burden and cytotoxic effects arising from inducible enzyme expression. The task is further complicated by stochastic dynamics, which reduce bioprocess reproducibility. We propose a reinforcement learning framework to derive optimal policies by allowing an agent (the controller) to interact with a surrogate dynamic model. To promote robustness, we apply domain randomization, enabling the controller to generalize across uncertainties. When transferred to an experimental system, the agent can in principle continue fine-tuning the policy. Our framework provides an alternative to conventional model-based control such as model predictive control, which requires differentiating the model with respect to the decision variables, often impractical for complex stochastic, nonlinear, stiff, and piecewise-defined dynamics. In contrast, our approach relies only on forward integration of the model, thereby simplifying the task. We demonstrate the framework on two $\textit{Escherichia coli}$ bioprocesses: dynamic control of acetyl-CoA carboxylase for fatty-acid synthesis and of adenosine triphosphatase for lactate synthesis.
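A generic sketch of the training loop with domain randomization: each episode draws surrogate-model parameters from uncertainty ranges, so a single policy is trained against the whole family of plausible models. The parameter names, ranges, and placeholder simulator below are illustrative, not the paper's model:

```python
import numpy as np

def simulate_surrogate(params, rng):
    """Placeholder for stochastic forward integration of the surrogate
    bioprocess model (the real dynamics are omitted here)."""
    reward = params["growth_rate"] - params["burden_coeff"]
    return {"reward": reward + rng.normal(scale=params["noise_scale"])}

def train_with_domain_randomization(policy_update, episodes=1000, seed=0):
    """Draw fresh model parameters each episode so the learned policy
    must perform well across model uncertainty, not just one instance."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        params = {
            "growth_rate": rng.uniform(0.3, 0.7),   # illustrative ranges
            "burden_coeff": rng.uniform(0.05, 0.2),
            "noise_scale": rng.uniform(0.01, 0.1),
        }
        trajectory = simulate_surrogate(params, rng)  # forward simulation only
        policy_update(trajectory)                     # e.g., a policy-gradient step

train_with_domain_randomization(lambda traj: None)    # no-op agent for illustration
```

Note that the loop needs only forward simulation of the model, which is the practical advantage over differentiation-based model predictive control highlighted in the abstract.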
Infrared Small Target Detection (IRSTD) aims to identify small targets in complex backgrounds. Applying traditional Convolutional Neural Networks (CNNs) to IRSTD is challenging because the convolution operation often extracts insufficient features from small targets, resulting in the loss of critical information. To address these issues, we propose a dynamic content-guided attention multiscale feature aggregation network (DCGANet), which adheres to the attention principle of 'coarse-to-fine' and achieves high detection accuracy. First, we propose a selective variable convolution (SVC) module that integrates the benefits of standard convolution, irregular deformable convolution, and multi-rate dilated convolution. This module is designed to expand the receptive field and enhance non-local features, thereby effectively improving the discrimination between targets and backgrounds. Second, the core component of DCGANet is a two-stage content-guided attention module. This module employs a two-stage attention mechanism to first direct the network's focus to salient regions within the feature maps and then determine whether these regions correspond to targets or background interference. By retaining only the most significant responses, this mechanism effectively suppresses false alarms. Additionally, we propose an Adaptive Dynamic Feature Fusion (ADFF) module to replace static feature cascading. This dynamic feature fusion strategy enables DCGANet to adaptively integrate contextual features, thereby enhancing its ability to discriminate true targets from false alarms. DCGANet sets new benchmarks across multiple datasets.
This paper investigates prime and co-prime integer matrices and their properties. It characterizes all pairwise co-prime integer matrices that are also prime integer matrices. This provides a simple way to construct families of pairwise co-prime integer matrices, that may have applications in multidimensional co-prime sensing and multidimensional Chinese remainder theorem.
Wireless signals are integral to modern society, enabling both communication and, increasingly, environmental sensing. While various propagation models exist, ranging from empirical methods to full-wave simulations, the phenomenon of electromagnetic diffraction is often treated as a secondary effect or a correction factor. This paper positions diffraction as a fundamentally important and underutilized mechanism that is rich with information about the physical environment. Specifically, diffraction-inducing elements generate distinct signatures that encode their underlying properties, such as their geometries. We then argue that by understanding and exploiting these relationships, diffraction can be harnessed strategically. We introduce a general optimization framework to formalize this concept, illustrating how diffraction can be leveraged both for inverse problems (sensing scene details such as object geometries from measured fields) and for design problems (shaping radio frequency (RF) fields for communication objectives by configuring diffracting elements). Focusing primarily on edge diffraction and Keller's Geometrical Theory of Diffraction (GTD), we discuss specific applications in RF sensing for scene understanding and in communications for RF field programming, drawing upon recent work. Overall, this paper lays out a vision for systematically incorporating diffraction into the design and operation of future wireless systems, paving the way for enhanced sensing capabilities and more robust communication strategies.
Millimeter-wave (mmWave) communication enables high data rates for cellular-connected Unmanned Aerial Vehicles (UAVs). However, robust beam management remains challenging due to significant path loss and the dynamic mobility of UAVs, which can destabilize the UAV-base station (BS) link. This research presents a GPS-aided deep learning (DL) model that simultaneously predicts current and future optimal beams for UAV mmWave communications, maintaining a Top-1 prediction accuracy exceeding 70% and an average power loss below 0.6 dB across all prediction steps. These outcomes stem from a proposed dataset splitting method ensuring balanced label distribution, paired with a GPS preprocessing technique that extracts key positional features, and a DL architecture that maps sequential position data to beam index predictions. The model reduces overhead by approximately 93% (training 2 to 3 beams instead of 32) while guaranteeing 95% beam prediction accuracy, and ensures that 94% to 96% of predictions exhibit a mean power loss not exceeding 1 dB.
This paper investigates the robustness of the Lur'e problem under positivity constraints, drawing on results from the positive Aizerman conjecture and the robustness properties of Metzler matrices. Specifically, we consider a control system of Lur'e type in which not only the linear part includes parametric uncertainty but also the nonlinear sector bound is unknown. We leverage tools from positive linear systems to solve these problems for complicated, uncertain nonlinear systems effectively. By exploiting the positivity of the system, we derive an explicit formula for the stability radius of Lur'e systems. Furthermore, we extend our analysis to systems with neural network (NN) feedback loops. Building on this approach, we also propose a refinement method for the sector bounds of NNs. This study introduces a scalable and efficient approach for the robustness analysis of both Lur'e and NN-controlled systems. Finally, the proposed results are supported by illustrative examples.
This work presents a fully-digital, high-accuracy, real-time calibration procedure for frequency and time alignment of open-loop wirelessly coordinated coherent distributed antenna array (CDA) modems, enabling radio frequency (RF) phase coherence of spatially separated commercial off-the-shelf (COTS) software-defined radios (SDRs) without cables or external references such as the global navigation satellite system (GNSS). Building on previous work using high-accuracy spectrally-sparse time of arrival (ToA) waveforms and a multi-step ToA refinement process, a high-accuracy two-way time transfer (TWTT)-based time-frequency coordination approach is demonstrated. Due to the two-way nature of the high-accuracy TWTT approach, the time and frequency estimates are Doppler- and multipath-tolerant, so long as the channel is reciprocal over the synchronization epoch. This technique is experimentally verified using COTS SDRs in a lab environment, in static and dynamic scenarios and with significant multipath scatterers. Time, frequency, and phase stability were evaluated by beamforming over coaxial cables to an oscilloscope, achieving time and phase precisions of approximately 60 ps to 70 ps, median coherent gains above 99% using optimized coordination parameters, and a beamforming frequency root-mean-square error (RMSE) of 3.73 ppb in a dynamic scenario. Finally, experiments were conducted to compare the performance of this technique with previous works using an analog continuous-wave two-tone (CWTT) frequency reference technique in both static and dynamic settings.
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts, and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2, with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: this https URL.
Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has become a major metric for evaluating the relevance between audio and text in text-to-audio generation. However, the relationship between CLAPScore and human subjective evaluation scores has not been clarified. We show that CLAPScore correlates poorly with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP, called Human-CLAP, obtained by training a contrastive language-audio model using subjective evaluation scores. In our experiments, Human-CLAP improved the Spearman's rank correlation coefficient (SRCC) between the CLAPScore and subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
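In its general form, CLAPScore is the cosine similarity between CLAP audio and text embeddings, as in this minimal sketch (the exact score depends on the CLAP checkpoint used):

```python
import numpy as np

def clap_score(audio_emb, text_emb):
    """Cosine similarity between L2-normalized CLAP embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)

# With Human-CLAP, the same score is computed from the embeddings of a
# model fine-tuned on subjective ratings; the formula itself is unchanged.
```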
We present our solution for the Multi-Source COVID-19 Detection Challenge, which classifies chest CT scans from four distinct medical centers. To address multi-source variability, we employ the Spatial-Slice Feature Learning (SSFL) framework with Kernel-Density-based Slice Sampling (KDS). Our preprocessing pipeline combines lung region extraction, quality control, and adaptive slice sampling to select eight representative slices per scan. We compare EfficientNet and Swin Transformer architectures on the validation set. The EfficientNet model achieves an F1-score of 94.68%, compared to the Swin Transformer's 93.34%. The results demonstrate the effectiveness of our KDS-based pipeline on multi-source data and highlight the importance of dataset balance in multi-institutional medical imaging evaluation.
Colorectal polyp segmentation is critical for early detection of colorectal cancer, yet weak and low contrast boundaries significantly limit automated accuracy. Existing deep models either blur fine edge details or rely on handcrafted filters that perform poorly under variable imaging conditions. We propose MEGANet-W, a Wavelet Driven Edge Guided Attention Network that injects directional, parameter free Haar wavelet edge maps into each decoder stage to recalibrate semantic features. Our two main contributions are: (1) a two-level Haar wavelet head for multi orientation edge extraction; and (2) Wavelet Edge Guided Attention (WEGA) modules that fuse wavelet cues with boundary and input branches. On five public polyp datasets, MEGANet-W consistently outperforms existing methods, improving mIoU by up to 2.3% and mDice by 1.2%, while introducing no additional learnable parameters.
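The parameter-free Haar edge head can be sketched with plain 2x2 averaging and differencing; this shows the directional sub-bands only, with the attention-based fusion (WEGA) omitted, and assumes even image sides:

```python
import numpy as np

def haar_level(img):
    """One Haar level: (LL, LH, HL, HH) sub-bands from 2x2
    averaging/differencing; entirely parameter-free."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4
    lh = (a + b - c - d) / 4   # horizontal edges
    hl = (a - b + c - d) / 4   # vertical edges
    hh = (a - b - c + d) / 4   # diagonal edges
    return ll, lh, hl, hh

def two_level_edge_maps(img):
    """Two-level directional edge magnitudes, one map per scale."""
    ll1, lh1, hl1, hh1 = haar_level(img)
    _,   lh2, hl2, hh2 = haar_level(ll1)
    e1 = np.sqrt(lh1**2 + hl1**2 + hh1**2)
    e2 = np.sqrt(lh2**2 + hl2**2 + hh2**2)
    return e1, e2

img = np.random.rand(256, 256)
e1, e2 = two_level_edge_maps(img)  # injected at matching decoder stages
```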
Music source separation aims to extract individual sound sources (e.g., vocals, drums, guitar) from a mixed music recording. However, evaluating the quality of separated audio remains challenging, as commonly used metrics like the source-to-distortion ratio (SDR) do not always align with human perception. In this study, we conducted a large-scale listener evaluation on the MUSDB18 test set, collecting approximately 30 ratings per track from seven distinct listener groups. We compared several objective energy-ratio metrics, including legacy measures (BSSEval v4, SI-SDR variants), and embedding-based alternatives (Frechet Audio Distance using CLAP-LAION-music, EnCodec, VGGish, Wave2Vec2, and HuBERT). While SDR remains the best-performing metric for vocal estimates, our results show that the scale-invariant signal-to-artifacts ratio (SI-SAR) better predicts listener ratings for drums and bass stems. Frechet Audio Distance (FAD) computed with the CLAP-LAION-music embedding also performs competitively--achieving Kendall's tau values of 0.25 for drums and 0.19 for bass--matching or surpassing energy-based metrics for those stems. However, none of the embedding-based metrics, including CLAP, correlate positively with human perception for vocal estimates. These findings highlight the need for stem-specific evaluation strategies and suggest that no single metric reliably reflects perceptual quality across all source types. We release our raw listener ratings to support reproducibility and further research.
We present a metasurface camera that jointly performs high-dynamic range (HDR) and hyperspectral imaging in a snapshot. The system integrates exposure bracketing and computed tomography imaging spectrometry (CTIS) by simultaneously forming multiple spatially multiplexed projections with unique power ratios and chromatic aberrations on a photosensor. The measurements are subsequently processed through a deep reconstruction model to generate an HDR image and a hyperspectral datacube. Our simulation studies show that the proposed system achieves higher reconstruction accuracy than previous snapshot hyperspectral imaging methods on benchmark datasets. We assemble a working prototype and demonstrate snapshot reconstruction of 60 dB dynamic range and 10 nm spectral resolution from 600 nm to 700 nm on real-world scenes from a monochrome photosensor.
We study infinite horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains: (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
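A toy sketch of the lower-level step: freezing the slow state yields one small fast-state MDP per slow value, each solvable by standard value iteration with a milder discount than the near-one discount of the full problem. All dynamics below are random placeholders, not the paper's benchmark domains:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Standard value iteration. P: (A, S, S) transitions, R: (S, A) rewards."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)  # one-step lookahead
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical fast-slow MDP: the full state is (slow, fast). Freezing the
# slow state s_bar gives one small fast-only MDP per slow value, solved with
# a smaller effective discount than e.g. gamma = 0.999 at decision frequency.
rng = np.random.default_rng(1)
n_slow, n_fast, n_act = 3, 10, 4
gamma_lower = 0.9

policies = {}
for s_bar in range(n_slow):
    P = rng.dirichlet(np.ones(n_fast), size=(n_act, n_fast))  # fast dynamics
    R = rng.uniform(size=(n_fast, n_act))  # reward given the frozen s_bar
    V, pi = value_iteration(P, R, gamma_lower)
    policies[s_bar] = pi  # lower-level policy used while s_bar stays frozen
print({k: v[:5] for k, v in policies.items()})
```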
Multimodal learning on video and text has seen significant progress, particularly in tasks like text-to-video retrieval, video-to-text retrieval, and video captioning. However, most existing methods and datasets focus exclusively on English. Despite Indonesian being one of the most widely spoken languages, multimodal research in Indonesian remains under-explored, largely due to the lack of benchmark datasets. To address this gap, we introduce the first public Indonesian video-text dataset by translating the English captions in the MSVD dataset into Indonesian. Using this dataset, we evaluate neural network models that were developed for English video-text datasets on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. Most existing models rely on feature extractors pretrained on English vision-language datasets, raising concerns about their applicability to Indonesian, given the scarcity of large-scale pretraining resources in the language. We apply a cross-lingual transfer learning approach by leveraging English-pretrained extractors and fine-tuning models on our Indonesian dataset. Experimental results demonstrate that this strategy improves performance across all tasks and metrics. We release our dataset publicly to support future research and hope it will inspire further progress in Indonesian multimodal learning.
Neural compression has brought tremendous progress in designing lossy compressors with good rate-distortion (RD) performance at low complexity. Thus far, neural compression design has involved transforming the source into a latent vector, which is then rounded to integers and entropy coded. While this approach has been shown to be optimal on a few specific sources, we show that it can be highly sub-optimal on synthetic sources whose intrinsic dimensionality is greater than one. With integer rounding in the latent space, the quantization regions induced by neural transformations remain square-like and fail to match those of optimal vector quantization. We demonstrate that this phenomenon is due to the choice of scalar quantization in the latent space, not the transform design. By employing lattice quantization instead, we propose Lattice Transform Coding (LTC) and show that it approximately recovers optimal vector quantization at reasonable complexity. On real-world sources, LTC improves upon standard neural compressors. LTC also provides a framework that can integrate structurally (near-)optimal information-theoretic designs into lossy compression; examples include block coding, which yields coding gain over optimal one-shot coding and approaches the asymptotically achievable rate-distortion function, as well as nested lattice quantization for low-complexity fixed-rate coding.
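To see why square-like cells are suboptimal, one can compare plain integer rounding ($\mathbb{Z}^4$) with the classic Conway-Sloane closest-point rule for the $D_4$ lattice at matched cell volume. This is a generic illustration of lattice versus scalar quantization, separate from LTC's learned transforms:

```python
import numpy as np

def quantize_Zn(x):
    """Scalar quantization: round each coordinate (the Z^n lattice)."""
    return np.round(x)

def quantize_Dn(x):
    """Closest point in D_n (integer vectors with even coordinate sum),
    via Conway-Sloane: round all coordinates; if the sum is odd, re-round
    the coordinate with the largest rounding error the other way."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        err = x - f
        k = np.argmax(np.abs(err))
        f[k] += 1.0 if err[k] >= 0 else -1.0
    return f

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=(20000, 4))  # smooth source, granular regime
s = 2.0 ** (-0.25)                         # match D_4 cell volume to Z^4's
q_z = quantize_Zn(x)
q_d = s * np.apply_along_axis(quantize_Dn, 1, x / s)
print(f"Z^4 MSE/dim: {np.mean((x - q_z) ** 2):.4f}")  # ~1/12 = 0.0833
print(f"D_4 MSE/dim: {np.mean((x - q_d) ** 2):.4f}")  # ~0.0766: rounder cells win
```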
Recent advancements in online optimization and control have provided novel tools to study online linear quadratic regulator (LQR) problems, where cost matrices are time-varying and unknown in advance. In this work, we study the online linear quadratic Gaussian (LQG) problem over the manifold of stabilizing controllers that are linearly constrained to impose physical conditions such as sparsity. By adopting a Riemannian perspective, we propose the online Newton on manifold (ONM) algorithm, which generates an online controller on the fly based on the second-order information of the cost function sequence. To quantify the algorithm's performance, we use the notion of regret, defined as the sub-optimality of the algorithm's cumulative cost against that of a (locally) minimizing controller sequence. We establish a regret bound in terms of the path length of the benchmark minimizer sequence, and we further verify the effectiveness of ONM via simulations.
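In the usual dynamic-regret formalism (generic notation that may differ from the paper's), the relevant quantities are
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t(K_t) \;-\; \sum_{t=1}^{T} f_t(K_t^\ast),
\qquad
P_T \;=\; \sum_{t=2}^{T} \bigl\lVert K_t^\ast - K_{t-1}^\ast \bigr\rVert,
\]
where $f_t$ is the stage cost, $K_t$ the controller produced by ONM, and $K_t^\ast$ a (locally) minimizing controller; a path-length bound is then of the form $\mathrm{Regret}_T \le c_1 + c_2 P_T$ for problem-dependent constants $c_1, c_2$.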
Accurately segmenting 3D curvilinear structures in medical imaging remains challenging due to their complex geometry and the scarcity of diverse, large-scale datasets for algorithm development and evaluation. In this paper, we use dendritic spine segmentation as a case study and address these challenges by introducing a novel Frenet--Serret Frame-based Decomposition, which decomposes 3D curvilinear structures into a globally \( C^2 \) continuous curve that captures the overall shape, and a cylindrical primitive that encodes local geometric properties. This approach leverages Frenet--Serret Frames and arc length parameterization to preserve essential geometric features while reducing representational complexity, facilitating data-efficient learning, improved segmentation accuracy, and generalization on 3D curvilinear structures. To rigorously evaluate our method, we introduce two datasets: CurviSeg, a synthetic dataset for 3D curvilinear structure segmentation that validates our method's key properties, and DenSpineEM, a benchmark for dendritic spine segmentation, which comprises 4,476 manually annotated spines from 70 dendrites across three public electron microscopy datasets, covering multiple brain regions and species. Our experiments on DenSpineEM demonstrate exceptional cross-region and cross-species generalization: models trained on the mouse somatosensory cortex subset achieve 91.9\% Dice, maintaining strong performance in zero-shot segmentation on both mouse visual cortex (94.1\% Dice) and human frontal lobe (81.8\% Dice) subsets. Moreover, we test the generalizability of our method on the IntrA dataset, where it achieves 77.08\% Dice (5.29\% higher than prior art) on intracranial aneurysm segmentation. These findings demonstrate the potential of our approach for accurately analyzing complex curvilinear structures across diverse medical imaging fields.
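A minimal sketch of discrete Frenet--Serret frames along a sampled 3D centerline; the finite-difference and re-orthogonalization choices below are one common discretization, not necessarily the paper's, and the helix is a stand-in for a real centerline:

```python
import numpy as np

def frenet_frames(curve: np.ndarray):
    """Discrete Frenet-Serret frames along a 3D curve (N x 3 array).

    Tangent T from finite differences, normal N from the change of T with
    its tangential component removed, binormal B = T x N."""
    d = np.gradient(curve, axis=0)
    T = d / np.linalg.norm(d, axis=1, keepdims=True)
    dT = np.gradient(T, axis=0)
    dT -= (dT * T).sum(axis=1, keepdims=True) * T  # keep N orthogonal to T
    N = dT / np.maximum(np.linalg.norm(dT, axis=1, keepdims=True), 1e-12)
    B = np.cross(T, N)
    return T, N, B

# Hypothetical centerline: a helix sampled at 200 points.
t = np.linspace(0, 4 * np.pi, 200)
helix = np.stack([np.cos(t), np.sin(t), 0.3 * t], axis=1)
T, N, B = frenet_frames(helix)
print(np.abs((T * N).sum(axis=1)).max())  # ~0: frames stay orthonormal
```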
The capacity of a discrete-time channel with correlated phase noises is investigated. In particular, the electro-optic frequency comb system is considered, where the phase noise of each subchannel is a combination of two independent Wiener phase-noise sources. Capacity upper and lower bounds are derived for this channel and are compared with lower bounds obtained by numerically evaluating the achievable information rates using quadrature amplitude modulation constellations. Capacity upper and lower bounds are provided for the high signal-to-noise ratio (SNR) regime. The multiplexing gain (pre-log) is shown to be $M-1$, where $M$ represents the number of subchannels. A constant gap between the asymptotic upper and lower bounds is observed, which depends on the number of subchannels $M$. For the specific case of $M=2$, capacity is characterized up to a term that vanishes as the SNR grows large.
Phase-only compressed sensing (PO-CS) concerns the recovery of sparse signals from the phases of complex measurements. Recent results show that sparse signals in the standard sphere $\mathbb{S}^{n-1}$ can be exactly recovered from complex Gaussian phases by a linearization procedure, which recasts PO-CS as linear compressed sensing and then applies (quadratically constrained) basis pursuit to obtain $\mathbf{x}^\sharp$. This paper focuses on the instance optimality and robustness of $\mathbf{x}^{\sharp}$. First, we strengthen the nonuniform instance optimality of Jacques and Feuillen (2021) to a uniform one over the entire signal space. We show the existence of some universal constant $C$ such that $\|\mathbf{x}^\sharp-\mathbf{x}\|_2\le Cs^{-1/2}\sigma_{\ell_1}(\mathbf{x},\Sigma^n_s)$ holds for all $\mathbf{x}$ in the unit Euclidean sphere, where $\sigma_{\ell_1}(\mathbf{x},\Sigma^n_s)$ is the $\ell_1$ distance of $\mathbf{x}$ to its closest $s$-sparse signal. This is achieved by showing that the new sensing matrices corresponding to all approximately sparse signals simultaneously satisfy the restricted isometry property (RIP). Second, we investigate the estimator's robustness to noise and corruption. We show that dense noise with entries bounded by some small $\tau_0$, appearing either before or after the phases are retained, increases $\|\mathbf{x}^\sharp-\mathbf{x}\|_2$ by $O(\tau_0)$. This is near-optimal (up to log factors) for any algorithm. On the other hand, adversarial corruption, which changes an arbitrary $\zeta_0$-fraction of the measurements to arbitrary phase-only values, increases $\|\mathbf{x}^\sharp-\mathbf{x}\|_2$ by $O(\sqrt{\zeta_0\log(1/\zeta_0)})$. These developments are then combined to yield a robust instance-optimal guarantee that resembles the standard one in linear compressed sensing.
Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets. Yet, it is difficult to surpass the performance of conventional unsupervised GAN-based frameworks without significantly modifying model structures or increasing computational complexity. To address these issues, we propose a self-collaboration (SC) strategy for existing restoration models. This strategy uses information from the previous stage as feedback to guide subsequent stages, achieving significant performance improvement without increasing the framework's inference complexity. The SC strategy comprises a prompt learning (PL) module and a restorer ($Res$). It iteratively replaces the previous, less powerful fixed restorer $\overline{Res}$ in the PL module with the more powerful current $Res$. The enhanced PL module then generates better pseudo-degraded/clean image pairs, leading to a more powerful $Res$ for the next iteration. Our SC strategy can improve the $Res$'s performance by over 1.5 dB without adding extra parameters or computational complexity during inference. Meanwhile, the existing self-ensemble (SE) strategy and our SC strategy enhance the performance of pre-trained restorers from different perspectives. Since SE increases computational complexity during inference, we propose a re-boosted SC (Reb-SC) module that further improves the SC strategy by incorporating SE into SC without increasing inference time. This approach further enhances the restorer's performance by approximately 0.3 dB. Extensive experimental results on restoration tasks demonstrate that the proposed model performs favorably against existing state-of-the-art unsupervised restoration methods. Source code and trained models are publicly available at: this https URL.
Molecular assays are standard of care for detecting genomic alterations in cancer prognosis and therapy selection but are costly, tissue-destructive and time-consuming. Artificial intelligence (AI) applied to routine hematoxylin and eosin (H&E)-stained whole slide images (WSIs) offers a fast and economical alternative for screening molecular biomarkers. We introduce OmniScreen, a high-throughput AI-based system leveraging Virchow2 embeddings extracted from 60,529 cancer patients with paired 489-gene MSK-IMPACT targeted biomarker panel and WSIs. Unlike conventional approaches that train separate models for each biomarker, OmniScreen employs a unified model to predict a broad range of clinically relevant biomarkers across cancers, including low-prevalence targets impractical to model individually. OmniScreen reliably identifies therapeutic targets and shared phenotypic features across common and rare tumors. We investigate the biomarker prediction probabilities and accuracies of OmniScreen in relation to tumor area, cohort size, histologic subtype alignment, and pathway-level morphological patterns. These findings underscore the potential of OmniScreen for routine clinical screening.
Safe and efficient multi-agent navigation in dynamic environments remains inherently challenging, particularly when real-time decision-making is required on resource-constrained platforms. Ensuring collision-free trajectories while adapting to uncertainties without relying on pre-built maps further complicates real-world deployment. To address these challenges, we propose LSTP-Nav, a lightweight end-to-end policy for multi-agent navigation that enables map-free collision avoidance in complex environments by directly mapping raw LiDAR point clouds to motion commands. At the core of this framework lies LSTP-Net, an efficient network that processes raw LiDAR data using a GRU architecture, enhanced with attention mechanisms to dynamically focus on critical environmental features while minimizing computational overhead. Additionally, a novel HS reward optimizes collision avoidance by incorporating angular velocity, prioritizing obstacles along the predicted heading, and enhancing training stability. To narrow the sim-to-real gap, we develop PhysReplay-Simlab, a physics-realistic multi-agent simulator that employs localized replay to mine near-failure experiences. Relying solely on LiDAR, LSTP-Nav achieves efficient zero-shot sim-to-real transfer on a CPU-only robotic platform, enabling robust navigation in dynamic environments while maintaining computation frequencies above 40 Hz. Extensive experiments demonstrate that LSTP-Nav outperforms baselines with a 9.58\% higher success rate and a 12.30\% lower collision rate, underscoring its practicality and robustness for real-world applications.
In this paper, we delve into the realm of 4-D light fields (LFs) to enhance underwater imaging plagued by light absorption, scattering, and other challenges. Contrasting with conventional 2-D RGB imaging, 4-D LF imaging excels in capturing scenes from multiple perspectives, thereby indirectly embedding geometric information. This intrinsic property is anticipated to effectively address the challenges associated with underwater imaging. By leveraging both explicit and implicit depth cues present in 4-D LF images, we propose a progressive, mutually reinforcing framework for underwater 4-D LF image enhancement and depth estimation. Specifically, our framework explicitly utilizes estimated depth information alongside implicit depth-related dynamic convolutional kernels to modulate output features. The entire framework decomposes this complex task, iteratively optimizing the enhanced image and depth information to progressively achieve optimal enhancement results. More importantly, we construct the first 4-D LF-based underwater image dataset for quantitative evaluation and supervised training of learning-based methods, comprising 75 underwater scenes and 3675 high-resolution 2K pairs. To craft vibrant and varied underwater scenes, we build underwater environments with various objects and adopt several types of degradation. Through extensive experimentation, we showcase the potential and superiority of 4-D LF-based underwater imaging vis-a-vis traditional 2-D RGB-based approaches. Moreover, our method effectively corrects color bias and achieves state-of-the-art performance. The dataset and code will be publicly available at this https URL.
Understanding land cover holds considerable potential for a myriad of practical applications, particularly as data accessibility transitions from being exclusive to governmental and commercial entities to now including the broader research community. Nevertheless, although the data is accessible to any community member interested in exploration, there exists a formidable learning curve and no standardized process for accessing, pre-processing, and leveraging the data for subsequent tasks. In this study, we democratize this data by presenting a flexible and efficient end-to-end pipeline for working with the Dynamic World dataset, a cutting-edge near-real-time land use/land cover (LULC) dataset. This includes a pre-processing and representation framework that tackles noise removal, efficient extraction of large amounts of data, and re-representation of LULC data in a format well suited for several downstream tasks. To demonstrate the power of our pipeline, we use it to extract data for an urbanization prediction problem and build a suite of machine learning models with excellent performance. This task is easily generalizable to the prediction of any type of land cover, and our pipeline is also compatible with a series of other downstream tasks.
The combination of Large Language Models (LLMs) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant, enabling audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and have substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like the NVIDIA Jetson Orin (8GB RAM), achieving a 50x training-time speedup while improving alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.
Data-driven techniques for analysis, modeling, and control of complex dynamical systems are gaining traction. Koopman theory provides the theoretical foundation for the popular kernel extended dynamic mode decomposition (kEDMD). In this work, we propose a novel kEDMD scheme to approximate nonlinear control systems, accompanied by an in-depth error analysis. Key features are regularization-based robustness and an adroit decomposition into micro and macro grids enabling flexible sampling. But foremost, we prove proportionality, i.e., explicit dependence on the distance to the (controlled) equilibrium, of the derived bound on the full approximation error. Leveraging this key property, we rigorously show that asymptotic stability of the data-driven surrogate (control) system implies asymptotic stability of the original (control) system and vice versa.
In recent years, Unmanned Aerial Vehicles (UAVs) have been utilized as effective platforms for carrying Wi-Fi Access Points (APs) and cellular Base Stations (BSs), enabling low-cost, agile, and flexible wireless networks with high Quality of Service (QoS). The next generation of wireless communications will rely on increasingly higher frequencies, which are easily obstructed by obstacles. One of the most critical concepts yet to be fully addressed is positioning the UAV at optimal coordinates while accounting for obstacles. To ensure a line of sight (LoS) between UAVs and user equipment (UE), improve QoS, and establish reliable wireless links with maximum coverage, obstacles must be integrated into the proposed placement algorithms. This paper introduces a simulation-based measurement approach for characterizing an air-to-ground (AG) channel in a simple scenario. By considering obstacles, we present a novel perspective on channel characterization. The results, in terms of throughput, packet delivery, packet loss, and delay, are compared using the proposed positioning approach.
Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising $3,382$ AGAVs from $16$ VTA methods. AGAVQA-3k includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA-3k, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The dataset and code are available at this https URL.
Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of `similar images' across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.
Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific or a narrow set of degradations, which often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model's accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video, code, and dataset of this project will be released at this http URL.
We investigate the multiuser scheduling problem in multiple-input multiple-output (MIMO) systems using orthogonal frequency division multiplexing (OFDM) and hybrid beamforming, in which a base station (BS) communicates with multiple users over millimeter wave (mmWave) channels in the downlink. Improved scheduling is critical for enhancing spectral efficiency and the long-term performance of the system from the perspective of the proportional fairness (PF) metric in hybrid beamforming systems, due to their limited multiplexing gain. Our objective is to maximize PF by properly designing the analog and digital precoders within the hybrid beamforming architecture and selecting users subject to the limited number of radio frequency (RF) chains. Leveraging the characteristics of mmWave channels, we apply a two-timescale protocol. On a long timescale, we assign an analog beam to each user. On a short timescale, we schedule the users and design the digital precoder accordingly. To conduct scheduling, we propose combinatorial solutions, such as greedy and sorting algorithms, followed by a machine learning (ML) approach. Our numerical results highlight the trade-off between the performance and complexity of the proposed approaches. Consequently, we show that the choice of approach depends on the specific criteria within a given scenario.
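As a simplified illustration of the PF objective on the short timescale, the greedy rule below ranks users by the classic $r_k/T_k$ metric; the paper's schedulers additionally account for the digital precoder and inter-user interference, which this sketch omits:

```python
import numpy as np

def pf_schedule(inst_rates, avg_thpt, n_rf):
    """Greedy proportional-fair user selection for one scheduling slot.

    inst_rates: achievable rate of each user this slot (given its analog beam).
    avg_thpt:   exponentially averaged past throughput per user.
    n_rf:       number of RF chains, i.e., max users served simultaneously.
    Picks the users with the largest PF metric r_k / T_k."""
    pf_metric = inst_rates / np.maximum(avg_thpt, 1e-9)
    return np.argsort(pf_metric)[::-1][:n_rf]

rng = np.random.default_rng(0)
rates = rng.uniform(10, 100, size=16)   # Mbps achievable this slot
thpt = rng.uniform(20, 80, size=16)     # long-term average throughputs
print("scheduled users:", pf_schedule(rates, thpt, n_rf=4))
# The averages would then be updated, e.g. T <- (1 - 1/tc) T + (1/tc) r
# for served users, so starved users gradually gain priority.
```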
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices, marking a step forward toward fully multimodal, emotionally aware digital humans. Project page: this https URL
Partial audio deepfake localization poses unique challenges and remains underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of the Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold-dependent metrics such as accuracy, precision, recall, and F1-score, which better reflect real-world behavior. Specifically, we analyze the performance of the open-source Coarse-to-Fine Proposal Refinement Framework (CFPRF), which achieves a 20-ms EER of 7.61% on the in-domain PartialSpoof evaluation set, but 43.25% and 27.59% on the LlamaPartialSpoof and Half-Truth out-of-domain test sets. Interestingly, our reproduced version of the same model performs worse on in-domain data (9.84%) but better on the out-of-domain sets (41.72% and 14.98%, respectively). This highlights the risks of over-optimizing for in-domain EER, which can lead to models that perform poorly in real-world scenarios. It also suggests that while deep learning models can be effective on in-domain data, they generalize poorly to out-of-domain scenarios, failing to detect novel synthetic samples and misclassifying unfamiliar bona fide audio. Finally, we observe that adding more bona fide or fully synthetic utterances to the training data often degrades performance, whereas adding partially fake utterances improves it.
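The threshold-dependent evaluation advocated here is straightforward to compute at the frame level. A sketch with synthetic 20-ms frame scores; the threshold and score distributions are illustrative, not taken from the paper:

```python
import numpy as np

def frame_metrics(scores, labels, threshold=0.5):
    """Threshold-dependent metrics for frame-level fake detection.

    scores: per-frame spoof scores in [0, 1]; labels: 1 = fake, 0 = bona fide."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    accuracy = (tp + tn) / len(labels)
    return dict(precision=precision, recall=recall, f1=f1, accuracy=accuracy)

# Synthetic stream: mostly bona fide frames with a 10% partial-fake region.
rng = np.random.default_rng(0)
labels = (rng.uniform(size=5000) < 0.1).astype(int)
scores = np.clip(labels * 0.6 + rng.normal(0.3, 0.2, 5000), 0, 1)
print(frame_metrics(scores, labels, threshold=0.5))
```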
We show that the mutual information between the targets in a Gray-Wyner network acts as a bound separating Wyner's lossy common information from the Gács-Körner lossy common information. The results generalize the lossless case presented by Wyner (1975).
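For reference, in the lossless setting the ordering being generalized is the classical sandwich between the two notions of common information (notation: $K_{\mathrm{GK}}$ for Gács-Körner, $C_{\mathrm{W}}$ for Wyner):
\[
K_{\mathrm{GK}}(X;Y) \;\le\; I(X;Y) \;\le\; C_{\mathrm{W}}(X;Y),
\qquad
C_{\mathrm{W}}(X;Y) \;=\; \min_{W:\, X \perp Y \mid W} I(X,Y;W),
\]
and the letter establishes the analogous separation when both quantities are taken in their lossy forms.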
Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods relying on global video information struggle with complex scenes and generating audio tailored to specific objects. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), to enhance the model's sensitivity to segmented objects. To measure audio-visual correspondence, we designed a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improves generation performance across various metrics. Project Page: this https URL
In recent years, the advancement of imitation learning has led to increased interest in teleoperating low-cost manipulators to collect demonstration data. However, most existing systems rely on unilateral control, which only transmits target position values. While this approach is easy to implement and suitable for slow, non-contact tasks, it struggles with fast or contact-rich operations due to the absence of force feedback. This work demonstrates that fast teleoperation with force feedback is feasible even with force-sensorless, low-cost manipulators by leveraging 4-channel bilateral control. Based on accurately identified manipulator dynamics, our method integrates nonlinear terms compensation, velocity and external force estimation, and variable gain corresponding to inertial variation. Furthermore, using data collected by 4-channel bilateral control, we show that incorporating force information into both the input and output of learned policies improves performance in imitation learning. These results highlight the practical effectiveness of our system for high-fidelity teleoperation and data collection on affordable hardware.
This article presents a novel stream function-based navigational control system for obstacle avoidance, where obstacles are represented as two-dimensional (2D) rigid surfaces in inviscid, incompressible flows. The approach leverages the vortex panel method (VPM) and incorporates safety margins to control the stream function and flow properties around virtual surfaces, enabling navigation in complex, partially observed environments using real-time sensing. To address the limitations of the VPM in managing relative distance and avoiding rapidly accelerating obstacles at close proximity, the system integrates a model predictive controller (MPC) based on higher-order control barrier functions (HOCBFs). This integration incorporates VPM trajectory generation, state estimation, and constraint handling into a receding-horizon optimization problem. The 2D rigid surfaces are enclosed using minimum bounding ellipses (MBEs), while an adaptive Kalman filter (AKF) captures and predicts obstacle dynamics, propagating these estimates into the MPC-HOCBF for rapid avoidance maneuvers. Evaluation is conducted using a PX4-powered Clover drone in the Gazebo simulator and through real-time experiments involving a COEX Clover quadcopter equipped with a 360-degree LiDAR sensor.
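For context, the HOCBF machinery referenced here follows the standard recursive construction, with $h$ a distance-type function to the bounding ellipse, $\alpha_i$ class-$\mathcal{K}$ functions, and $m$ the relative degree (the symbols are generic rather than the paper's exact choices):
\[
\psi_0(x) = h(x), \qquad
\psi_i(x) = \dot{\psi}_{i-1}(x) + \alpha_i\bigl(\psi_{i-1}(x)\bigr), \quad i = 1, \dots, m,
\]
and the receding-horizon problem enforces $\psi_m \ge 0$ along the horizon, which renders the safe set $\{x : \psi_i(x) \ge 0,\; i = 0, \dots, m-1\}$ forward invariant.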
Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.