New articles on Electrical Engineering and Systems Science

[1] 2407.14564

APS-USCT: Ultrasound Computed Tomography on Sparse Data via AI-Physic Synergy

Ultrasound computed tomography (USCT) is a promising technique that achieves superior medical imaging reconstruction resolution by fully leveraging waveform information, outperforming conventional ultrasound methods. Despite its advantages, high-quality USCT reconstruction relies on extensive data acquisition by a large number of transducers, leading to increased costs, computational demands, extended patient scanning times, and manufacturing complexities. To mitigate these issues, we propose a new USCT method called APS-USCT, which facilitates imaging with sparse data, substantially reducing dependence on high-cost dense data acquisition. Our APS-USCT method consists of two primary components: APS-wave and APS-FWI. The APS-wave component, an encoder-decoder system, preprocesses the waveform data, converting sparse data into dense waveforms to augment sample density prior to reconstruction. The APS-FWI component, utilizing the InversionNet, directly reconstructs the speed of sound (SOS) from the ultrasound waveform data. We further improve the model's performance by incorporating Squeeze-and-Excitation (SE) Blocks and source encoding techniques. Testing our method on a breast cancer dataset yielded promising results. It demonstrated outstanding performance with an average Structural Similarity Index (SSIM) of 0.8431. Notably, over 82% of samples achieved an SSIM above 0.8, with nearly 61% exceeding 0.85, highlighting the significant potential of our approach in improving USCT image reconstruction by efficiently utilizing sparse data.

[2] 2407.14616

Deep Learning-based 3D Coronary Tree Reconstruction from Two 2D Non-simultaneous X-ray Angiography Projections

Cardiovascular diseases (CVDs) are the most common cause of death worldwide. Invasive x-ray coronary angiography (ICA) is one of the most important imaging modalities for the diagnosis of CVDs. ICA typically acquires only two 2D projections, which makes the 3D geometry of coronary vessels difficult to interpret, thus requiring 3D coronary tree reconstruction from two projections. State-of-the-art approaches require significant manual interactions and cannot correct the non-rigid cardiac and respiratory motions between non-simultaneous projections. In this study, we propose a novel deep learning pipeline. We leverage the Wasserstein conditional generative adversarial network with gradient penalty, latent convolutional transformer layers, and a dynamic snake convolutional critic to implicitly compensate for the non-rigid motion and provide 3D coronary tree reconstruction. Through simulating projections from coronary computed tomography angiography (CCTA), we achieve the generalisation of 3D coronary tree reconstruction on real non-simultaneous ICA projections. We incorporate an application-specific evaluation metric to validate our proposed model on both a CCTA dataset and a real ICA dataset, together with Chamfer L1 distance. The results demonstrate the good performance of our model in vessel topology preservation, recovery of missing features, and generalisation ability to real ICA data. To the best of our knowledge, this is the first study that leverages deep learning to achieve 3D coronary tree reconstruction from two real non-simultaneous x-ray angiography projections.
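
The abstract names the Chamfer L1 distance without defining it. The sketch below uses one common convention, assumed here rather than taken from the paper: the symmetric mean of nearest-neighbour Euclidean distances between two point sets. The function name and the brute-force search are illustrative only.

```python
import math


def chamfer_l1(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed convention: average the unsquared nearest-neighbour distance
    in each direction, then take the mean of the two directions.
    """
    def mean_nearest(src, dst):
        # For every point in src, distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_l1(reconstructed, reference)
```

The brute-force search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual choice for dense vessel point clouds.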

[3] 2407.14625

Benchmarking deep learning models for bearing fault diagnosis using the CWRU dataset: A multi-label approach

This paper proposes a novel approach for modeling the problem of fault diagnosis using the Case Western Reserve University (CWRU) bearing fault dataset. Although the dataset is considered a standard reference for testing new algorithms, the typical dataset division suffers from data leakage, as shown by Hendriks et al. (2022) and Abburi et al. (2023), leading to papers reporting over-optimistic results. While their proposed division significantly mitigates this issue, it does not eliminate it entirely. Moreover, their proposed multi-class classification task can still lead to an unrealistic scenario by excluding the possibility of more than one fault type occurring at the same or different locations. As advocated in this paper, a multi-label formulation (detecting the presence of each type of fault for each location) can solve both issues, leading to a scenario closer to reality. Additionally, this approach mitigates the heavy class imbalance of the CWRU dataset, where faulty cases appear much more frequently than healthy cases, even though the opposite is more likely to occur in practice. A multi-label formulation also enables a more precise evaluation using prevalence-independent evaluation metrics for binary classification, such as the ROC curve. Finally, this paper proposes a more realistic dataset division that allows for more diversity in the training dataset while keeping the division free from data leakage. The results show that this new division can significantly improve performance while enabling a fine-grained error analysis. As an application of our approach, a comparative benchmark is performed using several state-of-the-art deep learning models applied to 1D and 2D signal representations in time and/or frequency domains.
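
As a rough illustration of the multi-label formulation described above, each recording can be encoded as one binary flag per (location, fault type) pair instead of a single class index. The location and fault names below are assumptions chosen for the example, not the paper's label set.

```python
# Hypothetical label vocabulary; the paper's exact names may differ.
LOCATIONS = ("drive_end", "fan_end")
FAULT_TYPES = ("inner_race", "outer_race", "ball")
LABELS = [(loc, fault) for loc in LOCATIONS for fault in FAULT_TYPES]


def encode_faults(faults):
    """Return a multi-hot vector with one entry per (location, fault type).

    A healthy bearing is the all-zero vector, and several simultaneous
    faults simply set several entries to 1.
    """
    present = set(faults)
    unknown = present - set(LABELS)
    if unknown:
        raise ValueError(f"unknown fault labels: {sorted(unknown)}")
    return [1 if label in present else 0 for label in LABELS]


healthy = encode_faults([])
combined = encode_faults([("drive_end", "ball"), ("fan_end", "inner_race")])
```

Because every output is an independent binary decision, per-label ROC curves can be computed directly, which is the prevalence-independent evaluation the abstract refers to.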

[4] 2407.14651

Improving Representation of High-frequency Components for Medical Foundation Models

Foundation models have recently attracted significant attention for their impressive generalizability across diverse downstream tasks. However, these models have been shown to exhibit significant limitations in representing high-frequency components and fine-grained details. In many medical imaging tasks, the precise representation of such information is crucial due to the inherently intricate anatomical structures, sub-visual features, and complex boundaries involved. Consequently, the limited representation of prevalent foundation models can result in significant performance degradation or even failure in these tasks. To address these challenges, we propose a novel pretraining strategy, named Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking and low-frequency perturbation combined with adversarial learning, Frepa encourages the encoder to effectively represent and preserve high-frequency components in the image embeddings. Additionally, we introduce an innovative histogram-equalized image masking strategy, extending the Masked Autoencoder approach beyond ViT to other architectures such as Swin Transformer and convolutional networks. We develop Frepa across nine medical modalities and validate it on 32 downstream tasks for both 2D images and 3D volume data. Without fine-tuning, Frepa can outperform other self-supervised pretraining methods and, in some cases, even surpass task-specific trained models. This improvement is particularly significant for tasks involving fine-grained details, such as achieving up to a +15% increase in DSC for retina vessel segmentation and a +7% increase in IoU for lung nodule detection. Further experiments quantitatively reveal that Frepa enables superior high-frequency representations and preservation in the embeddings, underscoring its potential for developing more generalized and universal medical image foundation models.

[5] 2407.14712

Multi-label audio classification with a noisy zero-shot teacher

We propose a novel training scheme using self-label correction and data augmentation methods designed to deal with noisy labels and improve real-world accuracy on a polyphonic audio content detection task. The augmentation method reduces label noise by mixing multiple audio clips and joining their labels, while being compatible with multiple active labels. We additionally show that performance can be improved by a self-label correction method using the same pretrained model. Finally, we show that it is feasible to use a strong zero-shot model such as CLAP to generate labels for unlabeled data and improve the results using the proposed training and label enhancement methods. The resulting model performs similarly to CLAP while using an efficient, mobile-device-friendly architecture, and it can be quickly adapted to unlabeled sound classes.
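
The mixing augmentation can be sketched as follows, under the assumption that clips are combined by a weighted sum of samples and that "joining" labels means taking their element-wise union. The function name, the equal default gains, and the truncation to the shortest clip are illustrative choices, not details from the paper.

```python
def mix_clips(clips, labels, gains=None):
    """Mix several waveforms and join their multi-hot label vectors.

    clips  : list of equal-rate waveforms (lists of float samples)
    labels : list of multi-hot label vectors, one per clip
    gains  : optional per-clip mixing weights (defaults to equal weights)
    """
    count = len(clips)
    if gains is None:
        gains = [1.0 / count] * count
    # Truncate to the shortest clip so every sample index is valid.
    length = min(len(clip) for clip in clips)
    mixed = [sum(g * clip[t] for g, clip in zip(gains, clips)) for t in range(length)]
    # A class is active in the mixture if it is active in any source clip.
    joined = [max(column) for column in zip(*labels)]
    return mixed, joined


mixed, joined = mix_clips(
    clips=[[1.0, -1.0], [0.0, 1.0]],
    labels=[[1, 0, 0], [0, 0, 1]],
)
```

The union keeps the target consistent with a polyphonic task, since the mixture can legitimately contain several active classes at once.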

[6] 2407.14719

Universal Medical Imaging Model for Domain Generalization with Data Privacy

Achieving domain generalization in medical imaging poses a significant challenge, primarily due to the limited availability of publicly available labeled datasets in this domain. This limitation arises from concerns related to data privacy and the necessity for medical expertise to accurately label the data. In this paper, we propose a federated learning approach to transfer knowledge from multiple local models to a global model, eliminating the need for direct access to the local datasets used to train each model. The primary objective is to train a global model capable of performing a wide variety of medical imaging tasks. This is done while ensuring the confidentiality of the private datasets utilized during the training of these models. To validate the effectiveness of our approach, extensive experiments were conducted on eight datasets, each corresponding to a different medical imaging application. The clients' data distributions in our experiments vary significantly, as the datasets originate from diverse domains. Despite this variation, we demonstrate a statistically significant improvement over a state-of-the-art baseline utilizing masked image modeling over a diverse pre-training dataset that spans different body parts and scanning types. This improvement is achieved by curating information learned from clients without accessing any labeled dataset on the server.

[7] 2407.14754

Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures

Accurate segmentation of long and thin tubular structures is required in a wide variety of areas such as biology, medicine, and remote sensing. The complex topology and geometry of such structures often pose significant technical challenges. A fundamental property of such structures is their topological self-similarity, which can be quantified by fractal features such as fractal dimension (FD). In this study, we incorporate fractal features into a deep learning model by extending FD to the pixel-level using a sliding window technique. The resulting fractal feature maps (FFMs) are then incorporated as additional input to the model and additional weight in the loss function to enhance segmentation performance by utilizing the topological self-similarity. Moreover, we extend the U-Net architecture by incorporating an edge decoder and a skeleton decoder to improve boundary accuracy and skeletal continuity of segmentation, respectively. Extensive experiments on five tubular structure datasets validate the effectiveness and robustness of our approach. Furthermore, the integration of FFMs with other popular segmentation models such as HR-Net also yields performance enhancement, suggesting FFM can be incorporated as a plug-in module with different model architectures. Code and data are openly accessible at
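
A pixel-level fractal feature map can be sketched by estimating a fractal dimension inside a window centred on each pixel. The box-counting estimator, the window size, and the zero padding used below are assumptions for illustration; the abstract only states that FD is extended to the pixel level with a sliding window.

```python
import math


def box_count_dimension(patch):
    """Box-counting fractal dimension of a square binary patch."""
    n = len(patch)
    log_inv_sizes, log_counts = [], []
    size = 1
    while size < n:
        # Count boxes of side `size` that contain at least one foreground pixel.
        count = sum(
            1
            for i in range(0, n, size)
            for j in range(0, n, size)
            if any(
                patch[y][x]
                for y in range(i, min(i + size, n))
                for x in range(j, min(j + size, n))
            )
        )
        if count:
            log_inv_sizes.append(math.log(1.0 / size))
            log_counts.append(math.log(count))
        size *= 2
    if len(log_counts) < 2:
        return 0.0  # empty or too-small patch: no slope to fit
    # Least-squares slope of log(count) against log(1/size).
    mx = sum(log_inv_sizes) / len(log_inv_sizes)
    my = sum(log_counts) / len(log_counts)
    num = sum((x - mx) * (y - my) for x, y in zip(log_inv_sizes, log_counts))
    den = sum((x - mx) ** 2 for x in log_inv_sizes)
    return num / den


def fractal_feature_map(mask, window=4):
    """Apply the estimator in a zero-padded window around every pixel."""
    h, w = len(mask), len(mask[0])
    half = window // 2
    ffm = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            patch = [
                [mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                 for xx in range(x - half, x - half + window)]
                for yy in range(y - half, y - half + window)
            ]
            ffm[y][x] = box_count_dimension(patch)
    return ffm


filled = [[1] * 4 for _ in range(4)]
line = [[1, 1, 1, 1]] + [[0] * 4 for _ in range(3)]
ffm = fractal_feature_map(filled)
```

A filled patch yields a dimension of 2 and a straight line a dimension of 1, so thin tubular structures fall between these extremes depending on how they branch.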

[8] 2407.14759

Autonomous Nonlinear Passive Transmit-Receive Switch for Compact IoT Devices: A Three-Port Agile Network

Recent advancements in RF technologies, especially Internet of Things (IoT) devices, require compact and integrated RF circulators or transmit-receive (TR) switches for efficient resource use. Although conventional techniques are crucial in managing signal flow to prevent signal interference to sensitive receiver components, they have some drawbacks, such as limited isolation, low switching speed, complex circuitry, bulkiness, and high cost. This work presents a smart, miniaturized, nonlinear TR switch capable of operating over a wide frequency range (0.8 - 1.3 GHz), making it suitable for IoT frequency bands. The switch achieves high isolation, low insertion loss, and intelligent transitions between transmitter and receiver without requiring external control or bias pins.

[9] 2407.14760

Multidirectional Pixelated Cubic Antenna with Enhanced Isolation for Vehicular Applications

This paper presents a pixelated cubic antenna design with enhanced isolation and a diverse radiation pattern for vehicular applications. The design consists of four radiating patches to take advantage of a nearly omnidirectional radiation pattern with enhanced isolation and high gain. The antenna system with four patches has been pixelated and optimized simultaneously to achieve the desired performance and high isolation in the 5.4 GHz band. The antenna achieved measured isolation of more than -34 dB between antenna elements. The overall isolation improvement obtained by the antenna is about 18 dB compared to a configuration using standard patch antennas. Moreover, isolation improvement is achieved through patch pixelization without additional resonators or elements. The antenna achieved up to 6.9 dB realized gain in each direction. Additionally, the cubic antenna system is equipped with an E-shaped GPS antenna to facilitate connectivity with GPS satellites. Finally, the antenna performance has been investigated using a simulation model of the vehicle roof and roof rack. The reflection coefficient, isolation, and radiation patterns of the antenna remain unaffected. The antenna prototype has been fabricated on a Rogers substrate and measured to verify the simulation results. The measured results correlate well with the simulation results. The proposed antenna features a low-profile, simple design for ease of manufacture and good radiation characteristics with multidirectional coverage and high isolation, which are well-suited to vehicular applications in different environments.

[10] 2407.14763

Efficient Design of a Pixelated Rectenna for WPT Applications

This paper introduces a highly efficient rectenna (rectifying antenna) designed using a binary optimization algorithm. A novel pixelated receiving antenna has been developed to match the diode impedance of a rectifier, eliminating the need for a separate matching circuit in the rectenna's rectifier. The receiving antenna configuration is fine-tuned via the binary optimization algorithm. Using this algorithm, a rectenna is designed at 2.5 GHz that achieves 38% RF-DC conversion efficiency at 0 dBm incident power, with an output voltage of 815 mV. The proposed rectenna demonstrates versatility across various low-power WPT (wireless power transfer) applications.
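
The reported figures can be related with a short calculation. The sketch below converts 0 dBm to watts, applies the 38% efficiency, and derives the load resistance that would be implied if the 815 mV is a DC voltage across a purely resistive load at that same operating point. That last condition is an assumption; the abstract does not state the load.

```python
def dbm_to_watts(power_dbm):
    """Convert a power level in dBm to watts (0 dBm = 1 mW)."""
    return 10 ** (power_dbm / 10) / 1000.0


def implied_load_ohms(v_out, efficiency, p_in_watts):
    """Resistive load implied by P_dc = efficiency * P_in = V^2 / R."""
    p_dc = efficiency * p_in_watts
    return v_out ** 2 / p_dc


p_in = dbm_to_watts(0.0)          # incident RF power in watts
p_dc = 0.38 * p_in                # DC output power in watts
r_load = implied_load_ohms(0.815, 0.38, p_in)
```

Under these assumptions the DC output power is 0.38 mW and the implied load is roughly 1.75 kΩ.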

[11] 2407.14775

Phase Re-service in Reinforcement Learning Traffic Signal Control

This article proposes a novel approach to traffic signal control that combines phase re-service with reinforcement learning (RL). The RL agent directly determines the duration of the next phase in a pre-defined sequence. Before the RL agent's decision is executed, we use shock wave theory to estimate queue expansion at the designated movement allowed for re-service and decide if phase re-service is necessary. If necessary, a temporary phase re-service is inserted before the next regular phase. We formulate the RL problem as a semi-Markov decision process (SMDP) and solve it with proximal policy optimization (PPO). We conducted a series of experiments that showed significant improvements thanks to the introduction of phase re-service. Average vehicle delay is reduced by up to 29.95%, and its standard deviation by up to 59.21%. The number of stops is reduced by 26.05% on average, with a 45.77% lower standard deviation.
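
For orientation, queue growth under shock wave theory can be sketched with the standard shock speed relation w = (q2 - q1) / (k2 - k1) between two traffic states. The arrival flow, densities, and red duration below are made-up example values, and the paper's actual re-service test is not reproduced here.

```python
def shock_wave_speed(q_up, k_up, q_down, k_down):
    """Speed of the boundary between two traffic states (km/h).

    q_* are flows in veh/h and k_* are densities in veh/km.
    A negative value means the boundary travels upstream.
    """
    return (q_down - q_up) / (k_down - k_up)


def queue_growth_m(q_arrival, k_arrival, k_jam, red_seconds):
    """Queue length (metres) built up during a red interval.

    The stopped queue is modelled as the state (flow 0, density k_jam).
    """
    w = shock_wave_speed(q_arrival, k_arrival, 0.0, k_jam)
    return abs(w) * (red_seconds / 3600.0) * 1000.0


# Hypothetical inputs: 900 veh/h arriving at 30 veh/km, jam density 150 veh/km.
growth = queue_growth_m(q_arrival=900.0, k_arrival=30.0, k_jam=150.0, red_seconds=60.0)
```

A controller could compare such an estimate against the available storage length to decide whether a movement needs to be served again before the next regular phase.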

[12] 2407.14784

MedMAE: A Self-Supervised Backbone for Medical Imaging Tasks

Medical imaging tasks are very challenging due to the lack of publicly available labeled datasets. Hence, it is difficult to achieve high performance with existing deep-learning models, as they require a massive labeled dataset to be trained effectively. An alternative solution is to use pre-trained models and fine-tune them using the medical imaging dataset. However, all existing models are pre-trained using natural images, a completely different domain from medical imaging, and the resulting domain shift leads to poor performance. To overcome these problems, we propose a large-scale unlabeled dataset of medical images and a backbone pre-trained on the proposed dataset with a self-supervised learning technique called the masked autoencoder. This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn a visual representation of different types of medical images. To evaluate the performance of the proposed backbone, we used four different medical imaging tasks. The results are compared with existing pre-trained models. These experiments show the superiority of our proposed backbone in medical imaging tasks.

[13] 2407.14800

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making the synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features. Instead, an emotion evaluator is utilized to precisely quantify the speaker's emotional state. By employing an intensity mapper, intensity pseudo-labels are obtained to bridge the gap between emotional speech intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer is used to combine linguistic features smoothly with manipulated emotional features at the frame level. Furthermore, we employ a duration predictor to facilitate adaptive prediction of rhythm changes conditioned on the specified intensity value. Experimental results show EINet's superior performance in naturalness and diversity of emotional expression compared to state-of-the-art EVC methods.

[14] 2407.14806

Hybrid PHD-PMB Trajectory Smoothing Using Backward Simulation

The probability hypothesis density (PHD) and Poisson multi-Bernoulli (PMB) filters are two popular set-type multi-object filters. Motivated by the fact that the multi-object filtering density after each update step in the PHD filter is a PMB without approximation, in this paper we present a multi-object smoother involving PHD forward filtering and PMB backward smoothing. This is achieved by first running the PHD filtering recursion in the forward pass and extracting the PMB filtering densities after each update step before the Poisson Point Process approximation, which is inherent in the PHD filter update. Then, in the backward pass, we apply backward simulation for sets of trajectories to the extracted PMB filtering densities. We call the resulting multi-object smoother the hybrid PHD-PMB trajectory smoother. Notably, the hybrid PHD-PMB trajectory smoother can provide smoothed trajectory estimates for the PHD filter without labeling or tagging, which is not possible for existing PHD smoothers. Also, compared to the trajectory PHD filter, which can only estimate alive trajectories, the hybrid PHD-PMB trajectory smoother enables the estimation of the set of all trajectories. Simulation results demonstrate that the hybrid PHD-PMB trajectory smoother outperforms the PHD filter in terms of both state and cardinality estimates, and the trajectory PHD filter in terms of false detections.

[15] 2407.14820

Dreamer: Dual-RIS-aided Imager in Complementary Modes

Reconfigurable intelligent surfaces (RISs) have emerged as a promising auxiliary technology for radio frequency imaging. However, existing works face challenges of faint and intricate back-scattered waves and the restricted field-of-view (FoV), both resulting from complex target structures and a limited number of antennas. The synergistic benefits of multi-RIS-aided imaging hold promise for addressing these challenges. Here, we propose a dual-RIS-aided imaging system, Dreamer, which operates collaboratively in complementary modes (reflection-mode and transmission-mode). Dreamer significantly expands the FoV and enhances perception by deploying dual-RIS across various spatial and measurement patterns. Specifically, we perform a fine-grained analysis of how radio-frequency (RF) signals encode scene information in the scattered object modeling. Based on this modeling, we design illumination strategies to balance spatial resolution and observation scale, and implement a prototype system in a typical indoor environment. Moreover, we design a novel artificial neural network with a CNN-external-attention mechanism to translate RF signals into high-resolution images of human contours. Our approach achieves an impressive SSIM score exceeding 0.83, validating its effectiveness in broadening perception modes and enhancing imaging capabilities. The code to reproduce our results is available at

[16] 2407.14876

Preictal Period Optimization for Deep Learning-Based Epileptic Seizure Prediction

Accurate prediction of epileptic seizures could prove critical for improving patient safety and quality of life in drug-resistant epilepsy. Although deep learning-based approaches have shown promising seizure prediction performance using scalp electroencephalogram (EEG) signals, substantial limitations still impede their clinical adoption. Furthermore, identifying the optimal preictal period (OPP) for labeling EEG segments remains a challenge. Here, we not only develop a competitive deep learning model for seizure prediction but, more importantly, leverage it to demonstrate a methodology to comprehensively evaluate the predictive performance in the seizure prediction task. For this, we introduce a CNN-Transformer deep learning model to detect preictal spatiotemporal dynamics, alongside a novel Continuous Input-Output Performance Ratio (CIOPR) metric to determine the OPP. We trained and evaluated our model on 19 pediatric patients of the open-access CHB-MIT dataset in a subject-specific manner. Using the OPP of each patient, preictal and interictal segments were correctly identified with an average sensitivity of 99.31%, specificity of 95.34%, AUC of 99.35%, and F1-score of 97.46%, while prediction time averaged 76.8 minutes before onset. Notably, our novel CIOPR metric allowed outlining the impact of different preictal period definitions on prediction time, accuracy, output stability, and transition time between interictal and preictal states in a comprehensive and quantitative way and highlighted the importance of considering both inter- and intra-patient variability in seizure prediction.

[17] 2407.14883

Inferring Ingrained Remote Information in AC Power Flows Using Neuromorphic Modality Regime

In this paper, we infer ingrained remote information in AC power flows using spiking neural networks (SNNs) as edge processors for efficient coordination of power electronic converters. This work unifies power and information as a means of data normalization using a multi-modal regime in the form of spikes using energy-efficient neuromorphic processing and semantics theory. First, we organize the synchronous real-valued measurements at each edge and translate them into asynchronous spike-based events to collect sparse data for training the SNN at each edge. Instead of relying on error-dependent supervised data-driven learning theory, we exploit the latency-driven unsupervised Hebbian learning rule to obtain modulation pulses for switching of power electronic converters that can now communicate with each other. Not only does this philosophy block exogenous path arrival for cyber attackers by dismissing the cyber layer, it also entails converter adaptation to system reconfiguration and parameter mismatch issues. We conclude this work by validating its energy-efficient and effective online learning performance under various scenarios in a modified IEEE 14-bus system and under experimental conditions.

[18] 2407.14894

A Holistic Optimization Framework for Energy Efficient UAV-assisted Fog Computing: Attitude Control, Trajectory Planning and Task Assignment

Unmanned Aerial Vehicles (UAVs) have significantly enhanced fog computing by acting as both flexible computation platforms and mobile communication relays. In this paper, we propose a holistic framework that jointly optimizes the total latency and energy consumption for UAV-assisted fog computing in a three-dimensional spatial domain with varying terrain elevations and dynamic task generations. Our proposed framework considers three important and interdependent modules: attitude control, trajectory planning, and task assignment. We first establish a fuzzy proportional-integral-derivative control model to determine the UAV's attitude. Then, we propose an enhanced Ant Colony System (ACS) based algorithm that includes a safety value and a decoupling mechanism to overcome the convergence issue in classical ACS and to compute the optimal UAV trajectory. Finally, we design an algorithm based on the Particle Swarm Optimization technique to determine where each offloaded task should be executed. Under our proposed framework, the outcome of one module affects the decision-making in the others, providing a holistic perspective of the system and thus leading to improved solutions. We demonstrate by extensive simulation results that our proposed framework can significantly improve the overall performance, measured by latency and energy consumption, compared to existing baseline approaches.

[19] 2407.14904

Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi utilizes advanced prototypical cross-modal self-supervised contrastive learning to enhance the accuracy, efficiency, and generalizability of forensic analyses. It was pre-trained and evaluated on a comprehensive multi-center dataset, which includes over 16 million high-resolution image patches, 2,228 vision-language pairs of post-mortem whole slide images (WSIs), and corresponding gross key findings, along with 471 distinct diagnostic outcomes. Our findings indicate that SongCi surpasses existing multi-modal AI models in many forensic pathology tasks, performs comparably to experienced forensic pathologists and significantly better than less experienced ones, and provides detailed multi-modal explainability, offering critical assistance in forensic investigations. To the best of our knowledge, SongCi is the first VLM specifically developed for forensic pathological analysis and the first large-vocabulary computational pathology (CPath) model that directly processes gigapixel WSIs in forensic science.

[20] 2407.14994

Non-Reference Quality Assessment for Medical Imaging: Application to Synthetic Brain MRIs

Generating high-quality synthetic data is crucial for addressing challenges in medical imaging, such as domain adaptation, data scarcity, and privacy concerns. Existing image quality metrics often rely on reference images, are tailored for group comparisons, or are intended for 2D natural images, limiting their efficacy in complex domains like medical imaging. This study introduces a novel deep learning-based non-reference approach to assess brain MRI quality by training a 3D ResNet. The network is designed to estimate quality across six distinct artifacts commonly encountered in MRI scans. Additionally, a diffusion model is trained on diverse datasets to generate synthetic 3D images of high fidelity. The approach leverages several datasets for training and comprehensive quality assessment, benchmarking against state-of-the-art metrics for real and synthetic images. Results demonstrate superior performance in accurately estimating distortions and reflecting image quality from multiple perspectives. Notably, the method operates without reference images, indicating its applicability for evaluating deep generative models. Besides, the quality scores in the [0, 1] range provide an intuitive assessment of image quality across heterogeneous datasets. Evaluation of generated images offers detailed insights into specific artifacts, guiding strategies for improving generative models to produce high-quality synthetic images. This study presents the first comprehensive method for assessing the quality of real and synthetic 3D medical images in MRI contexts without reliance on reference images.

[21] 2407.15045

Efficient Sampling for Data-Driven Frequency Stability Constraint via Forward-Mode Automatic Differentiation

Encoding frequency stability constraints in the operation problem is challenging due to the complex dynamics involved. Recently, data-driven approaches have been proposed to learn the stability criteria offline, with the trained model embedded as a constraint of online optimization. However, random sampling of stationary operation points is inefficient at generating balanced stable and unstable samples. Meanwhile, the performance of such a model is strongly dependent on the quality of the training dataset. Observing this research gap, we propose a gradient-based data generation method via forward-mode automatic differentiation. In this method, the original dynamic system is augmented with new states that represent the dynamics of the sensitivities of the original states, which can be solved by invoking any ODE solver a single time. To resolve conflicts between the gradients of various frequency stability criteria, gradient surgery is proposed by projecting one gradient onto the normal plane of the other. In the end, we demonstrate the superior performance of the proposed sampling algorithm, compared with the unrolling differentiation and finite difference. All codes are available at
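
The two ideas in this abstract can be illustrated on a toy problem. The sketch below augments the scalar system dx/dt = -p·x with its sensitivity s = dx/dp, which obeys ds/dt = -p·s - x, and integrates both in one solver call; it then shows a gradient projection in the style of gradient surgery. The toy dynamics, the RK4 integrator, and the rule of projecting only when the gradients conflict are assumptions for illustration, not the paper's power system model.

```python
import math


def augmented_rhs(state, p):
    """Toy dynamics dx/dt = -p*x, augmented with the sensitivity s = dx/dp.

    Differentiating the dynamics with respect to p gives
    ds/dt = (df/dx)*s + df/dp = -p*s - x.
    """
    x, s = state
    return (-p * x, -p * s - x)


def rk4(rhs, state, p, t_end, steps):
    """Classical fixed-step Runge-Kutta integration of the augmented system."""
    h = t_end / steps
    for _ in range(steps):
        k1 = rhs(state, p)
        k2 = rhs(tuple(v + 0.5 * h * k for v, k in zip(state, k1)), p)
        k3 = rhs(tuple(v + 0.5 * h * k for v, k in zip(state, k2)), p)
        k4 = rhs(tuple(v + h * k for v, k in zip(state, k3)), p)
        state = tuple(
            v + h / 6.0 * (a + 2 * b + 2 * c + d)
            for v, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state


def project_if_conflicting(grad, other):
    """Remove from grad its component along other when the two conflict."""
    dot = sum(a * b for a, b in zip(grad, other))
    if dot >= 0:
        return list(grad)  # no conflict: leave the gradient unchanged
    norm_sq = sum(b * b for b in other)
    return [a - dot / norm_sq * b for a, b in zip(grad, other)]


# One solve yields both the trajectory end point and its parameter sensitivity.
x_end, s_end = rk4(augmented_rhs, (1.0, 0.0), p=2.0, t_end=1.0, steps=1000)
projected = project_if_conflicting([1.0, 0.0], [-1.0, 1.0])
```

For this toy system the exact values are x(1) = e^(-2) and dx/dp = -e^(-2), which the single augmented solve reproduces without finite differencing.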

[22] 2407.15054

Enhancing K-user Interference Alignment for Discrete Constellations via Learning

In this paper, we consider a K-user interference channel where interference among the users is neither too strong nor too weak, a scenario that is relatively underexplored in the literature. We propose a novel deep learning-based approach to design the encoder and decoder functions that aim to maximize the sum-rate of the interference channel for discrete constellations. We first consider the MaxSINR algorithm, a state-of-the-art linear scheme for Gaussian inputs, as the baseline and then propose a modified version of the algorithm for discrete inputs. We then propose a neural network-based approach that learns a constellation mapping with the objective of maximizing the sum-rate. We provide numerical results to show that the constellations learned by the neural network-based approach provide enhanced alignments, not just in beamforming directions but also in terms of the effective constellation at the receiver, thereby leading to improved sum-rate performance.

[23] 2407.15059

The statistical spread of transmission outages on a fast protection time scale based on utility data

When there is a fault, the protection system automatically removes one or more transmission lines on a fast time scale of less than one minute. The outaged lines form a pattern in the transmission network. We extract these patterns from utility outage data, determine some key statistics of these patterns, and then show how to generate new patterns consistent with these statistics. The generated patterns provide a new and easily feasible way to model the overall effect of the protection system at the scale of a large transmission system. This new generative modeling of protection is expected to contribute to simulations of disturbances in large grids so that they can better quantify the risk of blackouts. Analysis of the pattern sizes suggests an index that describes how much outages spread in the transmission network at the fast timescale.

[24] 2407.15113

Robust Secure ISAC: How RSMA and Active RIS Manage Eavesdropper's Spatial Uncertainty

Incorporating rate splitting multiple access (RSMA) into integrated sensing and communication (ISAC) presents a significant security challenge, particularly in scenarios where the location of a potential eavesdropper (Eve) is unidentified. Splitting users' messages into common and private streams exposes them to eavesdropping, with the common stream dedicated to sensing and accessible to multiple users. In response to this challenge, this paper proposes a novel approach that leverages active reconfigurable intelligent surface (RIS) aided beamforming and artificial noise (AN) to enhance the security of RSMA-enabled ISAC. Specifically, we first derive the ergodic private secrecy rate (EPSR) based on mathematical approximation of the average Eve channel gain. An optimization problem is then formulated to maximize the minimum EPSR, while satisfying the minimum required thresholds on ergodic common secrecy rate, radar sensing and RIS power budget. To address this non-convex problem, a novel optimization strategy is developed, whereby we alternately optimize the transmit beamforming matrix for the common and private streams, rate splitting, AN, RIS reflection coefficient matrix, and radar receive beamformer. Successive convex approximation (SCA) and Majorization-Minimization (MM) are employed to convexify the beamforming and RIS sub-problems. Simulations are conducted to showcase the effectiveness of the proposed framework against established benchmarks.

[25] 2407.15119

Diffusion Models for Unsupervised Anomaly Detection in Fetal Brain Ultrasound

Ultrasonography is an essential tool in mid-pregnancy for assessing fetal development, appreciated for its non-invasive and real-time imaging capabilities. Yet, the interpretation of ultrasound images is often complicated by acoustic shadows, speckle noise, and other artifacts that obscure crucial diagnostic details. To address these challenges, our study presents a novel unsupervised anomaly detection framework specifically designed for fetal ultrasound imaging. This framework incorporates gestational age filtering, precise identification of fetal standard planes, and targeted segmentation of brain regions to enhance diagnostic accuracy. Furthermore, we introduce the use of denoising diffusion probabilistic models in this context, marking a significant innovation in detecting previously unrecognized anomalies. We rigorously evaluated the framework using various diffusion-based anomaly detection methods, noise types, and noise levels. Notably, AutoDDPM emerged as the most effective, achieving an area under the precision-recall curve of 79.8% in detecting anomalies. This advancement holds promise for improving the tools available for nuanced and effective prenatal diagnostics.

[26] 2407.15122

UAV Active Perception and Motion Control for Improving Navigation Using Low-Cost Sensors

In this study, a model pipeline is proposed that combines computer vision with control-theoretic methods and utilizes low-cost sensors. The proposed work enables perception-aware motion control for a quadrotor UAV to detect and navigate to objects of interest such as wind turbines and electric towers. The distance to the object of interest was estimated utilizing RGB as the primary sensory input. For the needs of the study, the Microsoft AirSim simulator was used. As a first step, a YOLOv8 model was integrated, providing the basic position setpoints towards the detection. From the YOLOv8 inference, a target yaw angle was derived. The subsequent algorithms, combining computationally efficient computer vision methods and YOLOv8, actively drove the drone to measure the height of the detection. Based on the height, an estimate of the depth was retrieved. In addition to this step, a convolutional neural network, namely ActvePerceptionNet, was developed, aiming at active YOLOv8 inference. The latter was validated for wind turbines, where the rotational motion of the propeller was found to affect object confidence in a near-periodic fashion. The results of the simulation experiments conducted in this study showed efficient object height and distance estimation and effective localization.
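
The abstract does not spell out how the yaw angle and depth are computed. One common way to obtain both from a single RGB detection is the pinhole camera model, sketched below under that assumption; it is not necessarily the authors' procedure. The function names, the focal length in pixels, and the known physical object height are all hypothetical inputs introduced for illustration.

```python
import math


def yaw_from_bbox(center_x_px: float, image_width_px: int, focal_px: float) -> float:
    """Yaw offset in radians that would centre the detection horizontally.

    Assumes an ideal pinhole camera whose optical axis passes through the
    image centre; focal_px is the focal length expressed in pixels.
    """
    return math.atan2(center_x_px - image_width_px / 2.0, focal_px)


def depth_from_height(bbox_height_px: float, object_height_m: float, focal_px: float) -> float:
    """Pinhole depth estimate: depth = focal_px * real_height / pixel_height.

    Assumes the object's physical height is known (or was measured earlier)
    and that the object is roughly fronto-parallel to the image plane.
    """
    if bbox_height_px <= 0:
        raise ValueError("bounding-box height must be positive")
    return focal_px * object_height_m / bbox_height_px
```

With these assumptions, a detection that appears taller in the image is estimated to be closer, and a detection to the right of the image centre yields a positive yaw offset.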

[27] 2407.15139

An Interface Method for Co-simulation of EMT Model and Shifted Frequency EMT Model Based on Rotational Invariance Techniques

The shifted frequency-based electromagnetic transient (SFEMT) simulation has greatly improved the computational efficiency of traditional electromagnetic transient (EMT) simulation for the ac grid. This letter proposes a novel interface for the co-simulation of the SFEMT model and the traditional EMT model. The general form of SFEMT modeling and the principle of analytical signal construction are first derived. Then, an interface for the co-simulation of EMT and SFEMT simulation is proposed based on rotational invariance techniques. Theoretical analyses and test results demonstrate the effectiveness of the proposed method.

[28] 2407.15169

Back-in-Time Diffusion: Unsupervised Detection of Medical Deepfakes

Recent progress in generative models has made it easier for a wide audience to edit and create image content, raising concerns about the proliferation of deepfakes, especially in healthcare. Despite the availability of numerous techniques for detecting manipulated images captured by conventional cameras, their applicability to medical images is limited. This limitation stems from the distinctive forensic characteristics of medical images, a result of their imaging process. In this work, we propose a novel anomaly detector for medical imagery based on diffusion models. Normally, diffusion models are used to generate images. However, we show how a similar process can be used to detect synthetic content by making a model reverse the diffusion on a suspected image. We evaluate our method on the task of detecting fake tumors injected and removed from CT and MRI scans. Our method significantly outperforms other state-of-the-art unsupervised detectors with an increased AUC of 0.9 from 0.79 for injection and of 0.96 from 0.91 for removal on average.

[29] 2407.15177

Measurements of the Safety Function Response Time on a Private 5G and IO-Link Wireless Testbed

In the past few years, interactions between human workers and automated systems throughout the factory floor have grown in significance. Wherever static or mobile robots, such as automated guided vehicles, operate autonomously, a protected environment for personnel and machines must be provided by, e.g., safe, deterministic and low-latency technologies. Another trend in this area is the increased use of wireless communication, offering high flexibility, modularity, and reduced installation and maintenance efforts. This work presents a testbed implementation that integrates a wireless framework, employing IO-Link Wireless (IOLW) and a private 5G cellular network, to orchestrate a complete example process from sensors and actuators up into the edge, represented by a programmable logic controller (PLC). Latency assessments identify the system's cycle time as well as opportunities for improvement. A worst-case estimation shows the attainable safety function response time for practical applications in the context of functional safety.

[30] 2407.15188

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

Speaker individuality information is among the most critical elements within speech signals. When modeled thoroughly and accurately, this information can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this article, we aim to present, from a unique perspective, the developmental history, paradigm shifts, and application domains of speaker modeling technologies within the context of the deep representation learning framework. This review is designed to provide a clear reference for researchers in the speaker modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.

[31] 2407.15196

Channel Shaping Using Beyond Diagonal Reconfigurable Intelligent Surface: Analysis, Optimization, and Enhanced Flexibility

This paper investigates the capability of a passive Reconfigurable Intelligent Surface (RIS) to redistribute the singular values of point-to-point Multiple-Input Multiple-Output (MIMO) channels for achieving power and rate gains. We depart from the conventional Diagonal (D)-RIS with diagonal phase shift matrix and adopt a Beyond Diagonal (BD) architecture that offers greater wave manipulation flexibility through element-wise connections. Specifically, we first provide shaping insights by characterizing the channel singular value regions attainable by D-RIS and BD-RIS via a novel geodesic optimization. Analytical singular value bounds are then derived to explore their shaping limits in typical deployment scenarios. As a side product, we tackle BD-RIS-aided MIMO rate maximization problem by a local-optimal Alternating Optimization (AO) and a shaping-inspired low-complexity approach. Results show that compared to D-RIS, BD-RIS significantly improves the dynamic range of all channel singular values, the trade-off in manipulating them, and thus the channel power and achievable rate. Those observations become more pronounced when the number of RIS elements and MIMO dimensions increase. Of particular interest, BD-RIS is shown to activate multi-stream transmission at lower transmit power than D-RIS, hence achieving the asymptotic Degrees of Freedom (DoF) at low Signal-to-Noise Ratio (SNR) thanks to its higher flexibility of shaping the distribution of channel singular values.
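
For orientation, the channel model usually assumed in RIS-aided MIMO work can be written as follows. This is a sketch of the standard formulation, not a reproduction of the paper's notation; the symbols (the direct, backward, and forward channels, the scattering matrix, and the dimension names) are assumptions introduced here.

```latex
% Assumed notation: N_T transmit antennas, N_R receive antennas, N_S RIS elements.
% H_D: direct channel, H_F: transmitter-to-RIS channel, H_B: RIS-to-receiver channel.
\begin{align}
  \mathbf{H} &= \mathbf{H}_D + \mathbf{H}_B \boldsymbol{\Theta} \mathbf{H}_F,
  \qquad \mathbf{H} \in \mathbb{C}^{N_R \times N_T}, \\
  \text{D-RIS:}\quad \boldsymbol{\Theta} &= \operatorname{diag}\!\left(e^{j\theta_1}, \dots, e^{j\theta_{N_S}}\right), \\
  \text{BD-RIS (fully connected, lossless):}\quad \boldsymbol{\Theta}^{\mathsf{H}} \boldsymbol{\Theta} &= \mathbf{I}_{N_S}, \\
  \text{channel power:}\quad \lVert \mathbf{H} \rVert_F^2 &= \sum_{n} \sigma_n^2(\mathbf{H}).
\end{align}
```

Under this model, the diagonal case adjusts only one phase per element, whereas the unitary constraint admits a much larger feasible set, which is the source of the additional freedom in shaping the singular values of the overall channel.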

[32] 2407.15213

More-than-Moore Microacoustics: A Scalable Fabrication Process for Suspended Lamb Wave Resonators

Deep Ultraviolet (DUV) Photolithography is currently used to fabricate mass-scale integrated circuits (ICs). Its high throughput and resolution could benefit large-scale RF MEMS production for the telecommunication market. We present a process flow to fabricate suspended acoustic resonators using DUV Photolithography. This method allows for scalable production of resonators with critical dimensions of 250 nm and alignment accuracy of less than 100 nm. We show how photoresists and anti-reflective coatings integrate with the process, help with deposition quality and resolution, and how Ion Beam Etching allows for vertical sidewalls of the resonators. We measure resonance frequencies (fr) up to 7.5 GHz and electromechanical couplings up to 8%, and we investigate the uniformity of this process by analyzing the deviation of fs over the wafer surface for four main resonance modes. We show that the deviation of the S0 mode can be kept below 1%. These results indicate the suitability of this process for quick scale-up of Lamb wave resonator technology, bridging the gap from research to industry.

[33] 2407.15226

Variation Bayesian Interference for Multiple Extended Targets or Unresolved Group Targets Tracking

In this work, we propose a tracking method for multiple extended targets or unresolvable group targets based on the Variational Bayesian Inference (VBI). Firstly, based on the most commonly used Random Matrix Model (RMM), the joint states of a single target are modeled as a Gamma Gaussian Inverse Wishart (GGIW) distribution, and the multi-target joint association variables are involved in the estimation together as unknown information with a prior distribution. A shape evolution model and VBI are employed to address the shortcomings of the RMM. Through the VBI, we can derive the approximate variational posterior for the exact multi-target posterior. Furthermore, to demonstrate the applicability of the method in real-world tracking scenarios, we present two potential lightweight schemes. The first is based on clustering, which effectively prunes the joint association events. The second is a simplification of the variational posterior through marginal association probabilities. We demonstrate the effectiveness of the proposed method using simulation experiments, and the proposed method outperforms current state-of-the-art methods in terms of accuracy and adaptability. This manuscript is only a preprint version; a more complete and more official version will be uploaded as soon as possible.

[34] 2407.15270

MedEdit: Counterfactual Diffusion-based Image Editing on Brain MRI

Denoising diffusion probabilistic models enable high-fidelity image synthesis and editing. In biomedicine, these models facilitate counterfactual image editing, producing pairs of images where one is edited to simulate hypothetical conditions. For example, they can model the progression of specific diseases, such as stroke lesions. However, current image editing techniques often fail to generate realistic biomedical counterfactuals, either by inadequately modeling indirect pathological effects like brain atrophy or by excessively altering the scan, which disrupts correspondence to the original images. Here, we propose MedEdit, a conditional diffusion model for medical image editing. MedEdit induces pathology in specific areas while balancing the modeling of disease effects and preserving the integrity of the original scan. We evaluated MedEdit on the Atlas v2.0 stroke dataset using Frechet Inception Distance and Dice scores, outperforming state-of-the-art diffusion-based methods such as Palette (by 45%) and SDEdit (by 61%). Additionally, clinical evaluations by a board-certified neuroradiologist confirmed that MedEdit generated realistic stroke scans indistinguishable from real ones. We believe this work will enable counterfactual image editing research to further advance the development of realistic and clinically useful imaging tools.
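
Of the two metrics named above, the Dice score is simple enough to state in a few lines. The sketch below computes it for two binary masks given as flat sequences; how the masks are derived from the edited scans is not described in the abstract, so the inputs here are purely illustrative, and the convention of returning 1.0 for two empty masks is an assumption.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground count across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the two masks coincide exactly, while 0.0 means they share no foreground element.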

[35] 2407.15310

Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance?

This study investigates mask-based beamformers (BFs), which estimate filters for target sound extraction (TSE) using time-frequency masks. Although multiple mask-based BFs have been proposed, no consensus has been established on the best one for target-extracting performance. Previously, we found that maximum signal-to-noise ratio and minimum mean square error (MSE) BFs can achieve the same extraction performance as the theoretical upper-bound performance, with each BF containing a different optimal mask. However, these remarkable findings left two issues unsolved: only two BFs were covered, excluding the minimum variance distortionless response BF; and ideal scaling (IS) was employed to ideally adjust the output scale, which is not applicable to realistic scenarios. To address these coverage and scaling issues, this study proposes a unified framework for mask-based BFs comprising two processes: filter estimation that can cover all BFs and scaling applicable to realistic scenarios by employing a mask to generate a scaling reference. We also propose a methodology to enumerate all possible BFs and derive 12 variations. Optimal masks for both processes are obtained by minimizing the MSE between the target and BF output. The experimental results using the CHiME-4 dataset suggested that 1) all 12 variations can achieve the theoretical upper-bound performance, and 2) mask-based scaling can behave as IS. These results can be explained by considering the practical parameter count of the masks. These findings contribute to 1) designing a TSE system, 2) estimating the extraction performance of a BF, and 3) improving scaling accuracy combined with mask-based scaling. The contributions also apply to TSE methods based on independent component analysis, as the unified framework covers them too.

[36] 2407.15313

Should we use model-free or model-based control? A case study of battery management systems

Reinforcement learning (RL) and model predictive control (MPC) each offer distinct advantages and limitations when applied to control problems in power and energy systems. Despite various studies on these methods, benchmarks remain lacking and the preference for RL over traditional controls is not well understood. In this work, we put forth a comparative analysis using RL- and MPC-based controllers for optimizing a battery management system (BMS). The BMS problem aims to minimize costs while adhering to operational limits by adjusting the battery (dis)charging in response to fluctuating electricity prices over a time horizon. The MPC controller uses a learning-based forecast of future demand and price changes to formulate a multi-period linear program that can be solved using off-the-shelf solvers. Meanwhile, the RL controller requires no time-series modeling but instead is trained from the sample trajectories using the proximal policy optimization (PPO) algorithm. Numerical tests compare these controllers across optimality, training time, testing time, and robustness, providing a comprehensive evaluation of their efficacy. RL not only yields optimal solutions quickly but also ensures robustness to shifts in customer behavior, such as changes in demand distribution. However, as expected, training the RL agent is more time-consuming than MPC.
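
A multi-period linear program of the kind mentioned above might look as follows. This is a generic textbook-style sketch, not the formulation used in the paper; every symbol (forecast price, forecast demand, charge and discharge decisions, state of charge, efficiencies, and limits) is an assumption introduced for illustration.

```latex
% Assumed notation: horizon T, forecast price \hat{p}_t and demand \hat{d}_t,
% charging power c_t, discharging power u_t, state of charge s_t,
% charging/discharging efficiencies \eta_c, \eta_d, and given initial state s_1.
\begin{align}
  \min_{\{c_t,\,u_t\}_{t=1}^{T}} \quad & \sum_{t=1}^{T} \hat{p}_t \left( \hat{d}_t + c_t - u_t \right) \\
  \text{s.t.} \quad & s_{t+1} = s_t + \eta_c\, c_t - \frac{u_t}{\eta_d}, && t = 1, \dots, T, \\
  & s_{\min} \le s_{t+1} \le s_{\max}, && t = 1, \dots, T, \\
  & 0 \le c_t \le c_{\max}, \quad 0 \le u_t \le u_{\max}, && t = 1, \dots, T.
\end{align}
```

The objective and all constraints are linear in the decision variables, so an off-the-shelf LP solver applies. In an MPC loop, only the first-period decision would typically be applied before the problem is re-solved with updated forecasts.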

[37] 2407.15321

Hierarchical Homogeneity-Based Superpixel Segmentation: Application to Hyperspectral Image Analysis

Hyperspectral image (HI) analysis approaches have recently become increasingly complex and sophisticated. Recently, the combination of spectral-spatial information and superpixel techniques has addressed some hyperspectral data issues, such as the higher spatial variability of spectral signatures and dimensionality of the data. However, most existing superpixel approaches do not account for specific HI characteristics resulting from its high spectral dimension. In this work, we propose a multiscale superpixel method that is computationally efficient for processing hyperspectral data. The Simple Linear Iterative Clustering (SLIC) oversegmentation algorithm, on which the technique is based, has been extended hierarchically. Using a novel robust homogeneity testing, the proposed hierarchical approach leads to superpixels of variable sizes but with higher spectral homogeneity when compared to the classical SLIC segmentation. For validation, the proposed homogeneity-based hierarchical method was applied as a preprocessing step in the spectral unmixing and classification tasks carried out using, respectively, the Multiscale sparse Unmixing Algorithm (MUA) and the CNN-Enhanced Graph Convolutional Network (CEGCN) methods. Simulation results with both synthetic and real data show that the technique is competitive with state-of-the-art solutions.

[38] 2407.15324

Cooperative Salvo Guidance over Leader-Follower Network with Free-Will Arbitrary Time Convergence

A cooperative salvo strategy is proposed in this paper which achieves consensus among the interceptors within a pre-defined arbitrary settling time. Considering non-linear engagement kinematics and a system lag to capture the effect of interceptor autopilot as present in realistic interception scenarios, the guidance schemes use the time-to-go estimates of the interceptors in order to achieve simultaneous interception of a stationary target at a pre-determined impact time. The guidance scheme ensures that consensus among the time-to-go estimates of the interceptors is achieved within a settling time whose upper bound can be pre-specified arbitrarily independent of the initial conditions or design parameters. The efficacy of the proposed guidance strategy is demonstrated using numerical simulations with varied conditions of initial position, velocities and heading angle errors of the interceptors as well as different desired impact times.

[39] 2407.15329

Efficient Multi-disparity Transformer for Light Field Image Super-resolution

This paper presents the Multi-scale Disparity Transformer (MDT), a novel Transformer tailored for light field image super-resolution (LFSR) that addresses the issues of computational redundancy and disparity entanglement caused by the indiscriminate processing of sub-aperture images inherent in conventional methods. MDT features a multi-branch structure, with each branch utilising independent disparity self-attention (DSA) to target specific disparity ranges, effectively reducing computational complexity and disentangling disparities. Building on this architecture, we present LF-MDTNet, an efficient LFSR network. Experimental results demonstrate that LF-MDTNet outperforms existing state-of-the-art methods by 0.37 dB and 0.41 dB PSNR at the 2x and 4x scales, achieving superior performance with fewer parameters and higher speed.

[40] 2407.15330

A Methodology for Power Dispatch Based on Traction Station Clusters in the Flexible Traction Power Supply System

The flexible traction power supply system (FTPSS) eliminates the neutral zone but leads to increased complexity in power flow coordinated control and power mismatch. To address these challenges, a methodology for power dispatch (PD) based on traction station clusters (TSCs) in FTPSS is proposed, in which each TSC with a consistent structure performs independent local phase angle control. First, to simplify the PD problem of TSCs, the system is transformed into an equivalent model with constant topology, so that it can be solved by univariate numerical optimization with higher computational performance. Next, the calculation methods of the feasible phase angle domain under strict and relaxed power circulation constraints are described, respectively, which ensure that power circulation can be either eliminated or precisely controlled. Finally, the PD method with three unique modes for uncertain train loads is introduced to enhance power flow flexibility: specified power distribution coefficients between traction substations (TSs), constant output power of TSs, and maximum consumption of renewable resources within TSs. In the experimental section, the performance of the TSC methodology for PD is verified through detailed train operation scenarios.

[41] 2407.15335

Addressing Out-of-Distribution Challenges in Image Semantic Communication Systems with Multi-modal Large Language Models

Semantic communication is a promising technology for next-generation wireless networks. However, the out-of-distribution (OOD) problem, where a pre-trained machine learning (ML) model is applied to unseen tasks that are outside the distribution of its training data, may compromise the integrity of semantic compression. This paper explores the use of multi-modal large language models (MLLMs) to address the OOD issue in image semantic communication. We propose a novel "Plan A - Plan B" framework that leverages the broad knowledge and strong generalization ability of an MLLM to assist a conventional ML model when the latter encounters an OOD input in the semantic encoding process. Furthermore, we propose a Bayesian optimization scheme that reshapes the probability distribution of the MLLM's inference process based on the contextual information of the image. The optimization scheme significantly enhances the MLLM's performance in semantic compression by 1) filtering out irrelevant vocabulary in the original MLLM output; and 2) using contextual similarities between prospective answers of the MLLM and the background information as prior knowledge to modify the MLLM's probability distribution during inference. Further, at the receiver side of the communication system, we put forth a "generate-criticize" framework that utilizes the cooperation of multiple MLLMs to enhance the reliability of image reconstruction.

[42] 2407.15358

PRIME: Blind Multispectral Unmixing Using Virtual Quantum Prism and Convex Geometry

Multispectral unmixing (MU) is critical due to the inevitable mixed pixel phenomenon caused by the limited spatial resolution of typical multispectral images in remote sensing. However, MU mathematically corresponds to the underdetermined blind source separation problem and is thus highly challenging, which has prevented researchers from tackling it. Previous MU works all ignore the underdetermined issue, and merely consider scenarios with more bands than sources. This work attempts to resolve the underdetermined issue by further conducting the light-splitting task using a network-inspired virtual prism, and as this task is challenging, we achieve this by incorporating advanced quantum feature extraction techniques. We emphasize that the prism is virtual (allowing us to fix the spectral response as a simple deterministic matrix), so the virtual hyperspectral image (HSI) it generates has no need to correspond to some real hyperspectral sensor; in other words, it is good enough as long as the virtual HSI satisfies some fundamental properties of light splitting (e.g., non-negativity and continuity). With the above virtual quantum prism, we know that the virtual HSI is expected to possess some desired simplex structure. This allows us to adopt the convex geometry to unmix the spectra, followed by downsampling the pure spectra back to the multispectral domain, thereby achieving MU. Experimental evidence shows great potential of our MU algorithm, termed prism-inspired multispectral endmember extraction (PRIME).

[43] 2407.15380

Iterative approach to reconstructing neural disparity fields from light-field data

This study proposes a neural disparity field (NDF) that establishes an implicit, continuous representation of scene disparity based on a neural field and an iterative approach to address the inverse problem of NDF reconstruction from light-field data. NDF enables seamless and precise characterization of disparity variations in three-dimensional scenes and can discretize disparity at any arbitrary resolution, overcoming the limitations of traditional disparity maps that are prone to sampling errors and interpolation inaccuracies. The proposed NDF network architecture utilizes hash encoding combined with multilayer perceptrons to capture detailed disparities in texture levels, thereby enhancing its ability to represent the geometric information of complex scenes. By leveraging the spatial-angular consistency inherent in light-field data, a differentiable forward model to generate a central view image from the light-field data is developed. Based on the forward model, an optimization scheme for the inverse problem of NDF reconstruction using differentiable propagation operators is established. Furthermore, an iterative solution method is adopted to reconstruct the NDF in the optimization scheme, which does not require training datasets and applies to light-field data captured by various acquisition methods. Experimental results demonstrate that high-quality NDF can be reconstructed from light-field data using the proposed method. High-resolution disparity can be effectively recovered by NDF, demonstrating its capability for the implicit, continuous representation of scene disparities.

[44] 2407.15395

FAST-GSC: Fast and Adaptive Semantic Transmission for Generative Semantic Communication

The rapidly evolving field of generative artificial intelligence technology has introduced innovative approaches for developing semantic communication (SemCom) frameworks, leading to the emergence of a new paradigm: generative SemCom (GSC). However, the complex processes involved in semantic extraction and generative inference may result in considerable latency in resource-constrained scenarios. To tackle these issues, we introduce a new GSC framework that involves fast and adaptive semantic transmission (FAST-GSC). This framework incorporates one innovative communication mechanism and two enhancement strategies at the transmitter and receiver, respectively. Aiming to reduce task latency, our communication mechanism enables fast semantic transmission by parallelizing the processes of semantic extraction at the transmitter and inference at the receiver. Preliminary evaluations indicate that while this mechanism effectively reduces task latency, it could potentially compromise task performance. To address this issue, we propose two additional methods for enhancement. First, at the transmitter, we employ reinforcement learning to discern the intrinsic temporal dependencies among the semantic units and design their extraction and transmission sequence accordingly. Second, at the receiver, we design a semantic difference calculation module and propose a sequential conditional denoising approach to alleviate the stringent immediacy requirement for the reception of semantic features. Extensive experiments demonstrate that our proposed architecture achieves a performance score comparable to the conventional GSC architecture while realizing a 52% reduction in residual task latency that extends beyond the fixed inference duration.

[45] 2407.15416

Uplink Transmit Power Optimization for Distributed Massive MIMO Systems with 1-Bit ADCs

This paper addresses the problem of uplink transmit power optimization in distributed massive multiple-input multiple-output systems, where remote radio heads (RRHs) are equipped with 1-bit analog-to-digital converters (ADCs). First, in a scenario where a single RRH serves a single user equipment (UE), the signal-to-noise-and-distortion ratio (SNDR) is shown to be a non-monotonic and unimodal function of the UE transmit power due to the quantization distortion (QD). Upon the introduction of multiple RRHs, adding properly tuned dithering at each RRH is shown to render the SNDR at the output of the joint receiver unimodal. In a scenario with multiple RRHs and UEs, considering the non-monotonic nature of the signal-to-interference-plus-noise-and-distortion ratio (SINDR), both the UE transmit powers and the RRH dithering levels are jointly optimized subject to the min-power and max-min-SINDR criteria, while employing Bussgang-based maximum ratio combining (BMRC) and minimum mean squared error (BMMSE) receivers. To this end, gradient and block coordinate descent methods are introduced to tune the UE transmit powers, whereas a line search coupled with gradient updates is used to adjust the RRH dithering levels. Numerical results demonstrate that jointly optimizing the UE transmit power and the RRH dithering levels can significantly enhance the system performance, thus facilitating joint reception from multiple RRHs across a range of scenarios. Comparing the BMMSE and BMRC receivers, the former offers a better interference and QD alleviation while the latter has a lower computational complexity.

[46] 2407.15423

Integrating IP Broadcasting with Audio Tags- Workflow and Challenges

The broadcasting industry is increasingly adopting IP techniques, revolutionising both live and pre-recorded content production, from news gathering to live music events. IP broadcasting allows for the transport of audio and video signals in an easily configurable way, aligning with modern networking techniques. This shift towards an IP workflow allows for much greater flexibility, not only in routing signals but also in the integration of tools using standard web development techniques. One possible tool could include the use of live audio tagging, which has a number of uses in the production of content. These range from automated closed captioning to identifying unwanted sound events within a scene. In this paper, we describe the process of containerising an audio tagging model into a microservice, a small segregated code module that can be integrated into a multitude of different network setups. The goal is to develop a modular, accessible, and flexible tool capable of seamless deployment into broadcasting workflows of all sizes, from small productions to large corporations. Challenges surrounding latency of the selected audio tagging model and its effect on the usefulness of the end product are discussed.

[47] 2407.15433

Spatial-Division Augmented Occupancy Field for Bone Shape Reconstruction from Biplanar X-Rays

Retrieving 3D bone anatomy from biplanar X-ray images is crucial since it can significantly reduce radiation exposure compared to traditional CT-based methods. Although various deep learning models have been proposed to address this complex task, they suffer from two limitations: 1) They employ voxel representation for bone shape and exploit 3D convolutional layers to capture anatomy prior, which are memory-intensive and limit the reconstruction resolution. 2) They overlook the prevalent occlusion effect within X-ray images and directly extract features using a simple loss, which struggles to fully exploit complex X-ray information. To tackle these concerns, we present Spatial-division Augmented Occupancy Field (SdAOF). SdAOF adopts the continuous occupancy field for shape representation, reformulating the reconstruction problem as a per-point occupancy value prediction task. Its implicit and continuous nature enables memory-efficient training and fine-scale surface reconstruction at different resolutions during inference. Moreover, we propose a novel spatial-division augmented distillation strategy to provide feature-level guidance for capturing the occlusion relationship. Extensive experiments on the pelvis reconstruction dataset show that SdAOF outperforms state-of-the-art methods and reconstructs fine-scale bone surfaces. The code is available at

[48] 2407.15448

Movable Antenna-Enhanced Wireless Communications: General Architectures and Implementation Methods

Movable antennas (MAs), traditionally explored in antenna design, have recently garnered significant attention in wireless communications due to their ability to dynamically adjust the antenna positions to changes in the propagation environment. However, previous research has primarily focused on characterizing the performance limits of various MA-assisted wireless communication systems, with less emphasis on their practical implementation. To address this gap, in this article, we propose several general MA architectures that extend existing designs by varying several key aspects to cater to different application scenarios and tradeoffs between cost and performance. Additionally, we draw from fields such as antenna design and mechanical control to provide an overview of candidate implementation methods for the proposed MA architectures, utilizing either direct mechanical or equivalent electronic control. Simulation results are finally presented to support our discussion.

[49] 2407.15458

EMO-Codec: A Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations

The neural codec model reduces speech data transmission delay and serves as the foundational tokenizer for speech language models (speech LMs). Preserving emotional information in codecs is crucial for effective communication and context understanding. However, there is a lack of studies on emotion loss in existing codecs. This paper evaluates neural and legacy codecs using subjective and objective methods on emotion datasets like IEMOCAP. Our study identifies which codecs best preserve emotional information under various bitrate scenarios. We found that training codec models with both English and Chinese data had limited success in retaining emotional information in Chinese. Additionally, resynthesizing speech through these codecs degrades the performance of speech emotion recognition (SER), particularly for emotions like sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides future speech technology developments to ensure new codecs maintain the integrity of emotional information in speech.

[50] 2407.15473

PyJama: Differentiable Jamming and Anti-Jamming with NVIDIA Sionna

Despite extensive research on jamming attacks on wireless communication systems, the potential of machine learning for amplifying the threat of such attacks, or our ability to mitigate them, remains largely untapped. A key obstacle to such research has been the absence of a suitable framework. To resolve this obstacle, we release PyJama, a fully-differentiable open-source library that adds jamming and anti-jamming functionality to NVIDIA Sionna. We demonstrate the utility of PyJama (i) for realistic MIMO simulations by showing examples that involve forward error correction, OFDM waveforms in time and frequency, realistic channel models, and mobility; and (ii) for learning to jam. Specifically, we use stochastic gradient descent to optimize jamming power allocation over an OFDM resource grid. The learned strategies are non-trivial, intelligible, and effective.

[51] 2407.15485

Subthalamic Nucleus segmentation in high-field Magnetic Resonance data. Is space normalization by template co-registration necessary?

Deep Brain Stimulation (DBS) is one of the most successful methods to diminish late-stage Parkinson's Disease (PD) symptoms. It is a delicate surgical procedure which requires a detailed pre-surgical study of the patient. High-field Magnetic Resonance Imaging (MRI) has proven its improved capacity for capturing the Subthalamic Nucleus (STN), the main target of DBS in PD, in greater detail than lower-field images. Here, we present a comparison between the performance of two different Deep Learning (DL) automatic segmentation architectures, one based on registration to a brain template and the other performing the segmentation in the native space of the MRI acquisition. The study was based on publicly available high-field 7 Tesla (T) brain MRI datasets of T1-weighted and T2-weighted sequences. nnUNet was used in the segmentation step of both architectures, while the data pre- and post-processing pipelines diverged. The evaluation metrics showed that segmentation directly in the native space yielded better results for the STN, despite not showing any advantage over the template-based method for the two other analysed structures: the Red Nucleus (RN) and the Substantia Nigra (SN).

[52] 2407.15496

Securing V2I Backscattering from Eavesdropper

As our cities become more intelligent and more connected with new technologies like 6G, improving communication between vehicles and infrastructure is essential while reducing energy consumption. This study proposes a secure framework for vehicle-to-infrastructure (V2I) backscattering near an eavesdropping vehicle to maximize the sum secrecy rate of V2I backscatter communication over multiple coherence slots. This sustainable framework aims to jointly optimize the reflection coefficients at the backscattering vehicle, carrier emitter power, and artificial noise at the infrastructure, along with the target vehicle's linear trajectory in the presence of an eavesdropping vehicle in the parallel lane. To achieve this optimization, we separated the problem into three parts: backscattering coefficient, power allocation, and trajectory design problems. To solve these three problems, we adopted, respectively, parallel computing, fractional programming, and an enumeration of all candidates for the global optimum, obtaining the globally optimal solution of each subproblem. Our simulations verified the fast convergence of our alternating optimization algorithm and showed that our proposed secure V2I backscattering outperforms the existing benchmark by over 4.7 times in terms of secrecy rate for 50 slots. Overall, this fundamental research on V2I backscattering provides insights to improve the connectivity, efficiency, and security of vehicular communication.

[53] 2407.15530

Pulse Shaping for Random ISAC Signals: The Ambiguity Function Between Symbols Matters

Integrated sensing and communications (ISAC) has emerged as a pivotal enabling technology for next-generation wireless networks. Given the distinct signal design requirements of sensing and communication (S&C) systems, shifting the symbol-wise pulse shaping (SWiPS) framework from communication-only systems to ISAC poses significant challenges in signal design and processing. This paper addresses these challenges by examining the ambiguity function (AF) of the SWiPS ISAC signal and introducing a novel pulse shaping design for single-carrier ISAC transmission. We formulate optimization problems to minimize the average integrated sidelobe level (ISL) of the AF, as well as the weighted ISL (WISL), while satisfying inter-symbol interference (ISI), out-of-band emission (OOBE), and power constraints. Our contributions include establishing the relationship between the AFs of both the random data symbols and signaling pulses, analyzing the statistical characteristics of the AF, and developing algorithmic frameworks for pulse shaping optimization using successive convex approximation (SCA) and alternating direction method of multipliers (ADMM) approaches. Numerical results are provided to validate our theoretical analysis; they demonstrate significant performance improvements of the proposed SWiPS design over the root-raised cosine (RRC) pulse shaping used in conventional communication systems.

[54] 2407.15555

The Rlign Algorithm for Enhanced Electrocardiogram Analysis through R-Peak Alignment for Explainable Classification and Clustering

Electrocardiogram (ECG) recordings have long been vital in diagnosing different cardiac conditions. Recently, research in the field of automatic ECG processing using machine learning methods has gained importance, mainly by utilizing deep learning methods on raw ECG signals. A major advantage of models like convolutional neural networks (CNNs) is their ability to effectively process biomedical imaging or signal data. However, this strength is tempered by challenges related to their lack of explainability, the need for a large amount of training data, and the complexities involved in adapting them for unsupervised clustering tasks. In addressing these tasks, we aim to reintroduce shallow learning techniques, including support vector machines and principal components analysis, into ECG signal processing by leveraging their semi-structured, cyclic form. To this end, we developed and evaluated a transformation that effectively restructures ECG signals into a fully structured format, facilitating their subsequent analysis using shallow learning algorithms. In this study, we present this adaptive transformative approach that aligns R-peaks across all signals in a dataset and resamples the segments between R-peaks, both with and without heart rate dependencies. We illustrate the substantial benefit of this transformation for traditional analysis techniques in the areas of classification, clustering, and explainability, outperforming commercial software for median beat transformation and CNN approaches. Our approach demonstrates a significant advantage for shallow machine learning methods over CNNs, especially when dealing with limited training data. Additionally, we release a fully tested and publicly accessible code framework, providing a robust alignment pipeline to support future research, available at imi-ms/rlign.
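
The core idea of aligning R-peaks and resampling the segments between them can be illustrated with a minimal sketch. This is not the Rlign implementation or its API: it assumes R-peak indices have already been detected by some other tool, covers only the fixed-length case (every R-to-R segment is stretched to the same number of samples, so heart-rate differences are removed), and the names `align_beats` and `samples_per_beat` are hypothetical.

```python
import numpy as np

def align_beats(signal, r_peaks, samples_per_beat=100):
    """Resample every R-to-R segment to a fixed length (hypothetical helper).

    signal  : 1-D array holding one ECG lead
    r_peaks : increasing sample indices of detected R-peaks (assumed given)
    returns : array of shape (len(r_peaks) - 1, samples_per_beat)
    """
    signal = np.asarray(signal, dtype=float)
    grid = np.arange(signal.size)
    beats = []
    for start, stop in zip(r_peaks[:-1], r_peaks[1:]):
        # Evenly spaced positions covering [start, stop): column 0 is the R-peak.
        positions = np.linspace(start, stop, samples_per_beat, endpoint=False)
        beats.append(np.interp(positions, grid, signal))
    return np.vstack(beats)

# Toy signal: unit spikes at irregularly spaced "R-peaks".
demo_peaks = [10, 60, 140, 200]
demo_signal = np.zeros(220)
demo_signal[demo_peaks] = 1.0
demo_aligned = align_beats(demo_signal, demo_peaks)
```

After this step every row has the same length and the R-peak sits in the same column, so the rows can be fed directly to methods such as PCA or a support vector machine that expect fully structured input.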

[55] 2407.15615

Vehicle-to-Everything: Looking into the Future of Flexibility Services

The primary aim of this paper is to illuminate potential Vehicle-to-Everything (V2X) flexibility services that can be activated at the three levels of home, community, and grid. To do this, the potential practical services that can be provided by EVs, flexibility requesters, and the required exchange mechanisms at these three levels are identified. At the home level, the two main services that EVs can provide to households are explored. The first service focuses on cost reduction for homes by employing smart charging and discharging methods, and the second service underscores the capability of an EV equipped with bidirectional chargers to function as a backup resource during grid outages. At the community level, three flexibility services are introduced and outlined: community cost reduction, energy sharing, and backup resources. There is more than one flexibility requester, including the community manager, EV owners, and other end users. Accordingly, at this level, having a fair and transparent market mechanism can optimise the activation of different V2X flexibility services. At the grid level, flexibility providers can offer two main flexibility services, namely load profile adjustment and real-time voltage and frequency control, to a wide range of flexibility requesters, including distribution network/system operators (DNOs/DSOs) to overcome technical challenges of the distribution network, energy suppliers to manage their energy portfolios, and transmission system operators (TSOs) to support the transmission network. In addition, a review of trial V2X projects and V2X supporting regulations is presented to contextualize the feasibility and regulatory framework for implementing these services. This analysis offers insights into the challenges and opportunities that need to be addressed for the effective integration of V2X flexibility services across the three identified levels.

[56] 2407.15624

DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks

In this paper, we propose a dual-stage architecture for bandwidth extension (BWE) increasing the effective sampling rate of speech signals from 8 kHz to 48 kHz. Unlike existing end-to-end deep learning models, our proposed method explicitly models BWE using excitation and linear time-varying (LTV) filter stages. The excitation stage broadens the spectrum of the input, while the filtering stage properly shapes it based on outputs from an acoustic feature predictor. To this end, an acoustic feature loss term can implicitly promote the excitation subnetwork to produce white spectra in the upper frequency band to be synthesized. Experimental results demonstrate that the added inductive bias provided by our approach can improve upon BWE results using the generators from either SEANet or HiFi-GAN as exciters, and that our means of adapting processing with acoustic feature predictions is more effective than that used in HiFi-GAN-2. Secondary contributions include extensions of the SEANet model to accommodate local conditioning information, as well as the application of HiFi-GAN-2 to the BWE problem.

[57] 2407.15631

A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control

Virtual interventions enable the physics-based simulation of device deployment within coronary arteries. This framework allows for counterfactual reasoning by deploying the same device in different arterial anatomies. However, current methods to create such counterfactual arteries face a trade-off between controllability and realism. In this study, we investigate how Latent Diffusion Models (LDMs) can custom synthesize coronary anatomy for virtual intervention studies based on mid-level anatomic constraints such as topological validity, local morphological shape, and global skeletal structure. We also extend diffusion model guidance strategies to the context of morpho-skeletal conditioning and propose a novel guidance method for continuous attributes that adaptively updates the negative guiding condition throughout sampling. Our framework enables the generation and editing of coronary anatomy in a controllable manner, allowing device designers to derive mechanistic insights regarding anatomic variation and simulated device deployment.

[58] 2407.15636

On-the-fly spectral unmixing based on Kalman filtering

This work introduces an on-the-fly (i.e., online) linear unmixing method which is able to sequentially analyze spectral data acquired on a spectrum-by-spectrum basis. After deriving a sequential counterpart of the conventional linear mixing model, the proposed approach recasts the linear unmixing problem into a linear state-space estimation framework. Under Gaussian noise and state models, the estimation of the pure spectra can be efficiently conducted by resorting to Kalman filtering. Interestingly, it is shown that this Kalman filter can operate in a lower-dimensional subspace while ensuring the nonnegativity constraint inherent to pure spectra. This dimensionality reduction significantly lightens the computational burden, while leveraging recent advances related to the representation of essential spectral information. The proposed method is evaluated through extensive numerical experiments conducted on synthetic and real Raman data sets. The results show that this Kalman filter-based method offers a convenient trade-off between unmixing accuracy and computational efficiency, which is crucial for operating in an on-the-fly setting. To the best of the authors' knowledge, this is the first operational method which is able to solve the spectral unmixing problem efficiently in a dynamic fashion. It also constitutes a valuable building block for benefiting from acquisition and processing frameworks recently proposed in the microscopy literature, which are motivated by practical issues such as reducing acquisition time and avoiding potential damage to photosensitive samples.
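
To make the state-space view concrete, here is a minimal sketch of a Kalman update that refines an estimate of the pure spectra one measured spectrum at a time. It is a simplified illustration, not the method of the paper: it assumes the abundances of each incoming spectrum are known, uses a random-walk state model, and omits both the subspace reduction and the nonnegativity constraint. The names `kalman_unmix_step`, `q`, and `r` are hypothetical.

```python
import numpy as np

def kalman_unmix_step(S, P, y, a, r, q=0.0):
    """One Kalman update of the pure-spectra estimate (illustrative only).

    Model assumed here: y = S @ a + noise, noise variance r per channel.
    S : (L, R) current estimate of R pure spectra over L spectral channels
    P : (R, R) error covariance, shared by all channels because the noise
        is assumed i.i.d. across channels
    y : (L,) newly acquired spectrum
    a : (R,) abundances of that spectrum (assumed known in this sketch)
    """
    P = P + q * np.eye(P.shape[0])      # predict: random-walk state model
    innovation = y - S @ a              # per-channel residual
    gain = P @ a / (a @ P @ a + r)      # Kalman gain, shape (R,)
    S = S + np.outer(innovation, gain)  # correct every channel at once
    P = P - np.outer(gain, a) @ P       # covariance update
    return S, P

# Synthetic check: recover 3 pure spectra from 500 noisy mixtures.
rng = np.random.default_rng(0)
L, R, noise_std = 50, 3, 0.01
S_true = rng.random((L, R))
S_est, P_est = np.zeros((L, R)), 100.0 * np.eye(R)
for a_t in rng.dirichlet(np.ones(R), size=500):
    y_t = S_true @ a_t + noise_std * rng.standard_normal(L)
    S_est, P_est = kalman_unmix_step(S_est, P_est, y_t, a_t, r=noise_std**2)
max_abs_error = np.max(np.abs(S_est - S_true))
```

Each update costs on the order of L times R operations and never revisits earlier spectra, which is what makes this kind of recursion suitable for spectrum-by-spectrum acquisition.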

[59] 2407.15641

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.

[60] 2407.15653

Mobile-to-Mobile Uncorrelated Scatter Channels

In this paper, we present a complete analytic probability based description of mobile-to-mobile uncorrelated scatter channels. The correlation based description introduced by Bello and Matz is thus complemented by the presented probabilistic description leading to a common theoretical description of uncorrelated scatter channels. Furthermore, we introduce novel two-dimensional hybrid characteristic probability density functions, which remain a probability density in one of the variables and a characteristic function in the other variable. Such a probability based description allows us to derive a mathematical model, in which the attenuation of the scattering components is inherently included in these two-dimensional functions. Therefore, there is no need to determine the path loss exponent. Additionally, the Doppler probability density function with the inclusion of the path loss leads to a concave function of the Doppler spectrum, which is quite different from the Jakes and Doppler spectra and can be directly parameterized by the velocity vectors and geometry of the scattering plane. Thus, knowing those parameters permits the theoretical computation of the Doppler spectra and temporal characteristic functions. Finally, we present a comparison between the computed probability based theoretical results and measurement data for a generic mobile-to-mobile channel. The agreement between the two shows the usefulness of the probability based description and confirms new shapes of the Doppler power spectra.

[61] 2407.15689

YOLOv10 for Automated Fracture Detection in Pediatric Wrist Trauma X-rays

Wrist fractures are highly prevalent among children and can significantly impact their daily activities, such as attending school, participating in sports, and performing basic self-care tasks. If not treated properly, these fractures can result in chronic pain, reduced wrist functionality, and other long-term complications. Recently, advancements in object detection have shown promise in enhancing fracture detection, with systems achieving accuracy comparable to, or even surpassing, that of human radiologists. The YOLO series, in particular, has demonstrated notable success in this domain. This study is the first to provide a thorough evaluation of various YOLOv10 variants to assess their performance in detecting pediatric wrist fractures using the GRAZPEDWRI-DX dataset. It investigates how changes in model complexity, scaling the architecture, and implementing a dual-label assignment strategy can enhance detection performance. Experimental results indicate that our trained model achieved a mean average precision (mAP@50-95) of 51.9%, surpassing the current YOLOv9 benchmark of 43.3% on this dataset. This represents an improvement of 8.6 percentage points. The implementation code is publicly available at

[62] 2407.15691

Multi-Objective Distributed Beamforming Using High-Accuracy Synchronization and Localization

We present a multi-node, multi-objective open-loop microwave distributed beamforming system based on high-accuracy wireless synchronization and localization. Distributed beamforming requires accurate coordination of the spatial and electrical states of the individual elements within the array to achieve and maintain coherent beamforming at intended destinations. Of the basic coordination aspects, time synchronization and localization of the elements are among the most critical to support beamforming of modulated waveforms to destinations in both the near-field and far-field of the array. In this work, we demonstrate multi-objective distributed beamforming from a three-node distributed phased array consisting of software-defined radios that leverages high-accuracy wireless time coordination for both time synchronization and two-dimensional localization of the elements. We use a spectrally-sparse two-tone waveform for high-accuracy inter-node range estimation combined with a linear-frequency modulated waveform to mitigate multipath interference. Localization is performed in a centralized format, where one node is designated as the origin and the remaining nodes build the array geometry relative to the origin, from which we obtain localization accuracy of less than 1 cm. We implement a near-field multi-objective beamformer based on the location estimates, which enables the simultaneous steering of a beam and a null to two receiving antennas. Multi-objective beamforming of pulsed waveforms at a carrier frequency of 2.1 GHz is demonstrated in cases where one of the nodes in the distributed antenna array is moved, and where the targets (the two receiving antennas) are moved.
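
The simultaneous beam-and-null steering described above can be sketched with a simple zero-forcing construction. This is an illustration under stated assumptions, not the authors' beamformer: it uses an idealized free-space near-field model (phase and 1/d amplitude from node-to-target distance only), the geometry is invented for the example, and the function names are hypothetical.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def nearfield_channel(nodes_xy, target_xy, freq_hz):
    """Idealized free-space channel from each node to one target (assumption)."""
    d = np.linalg.norm(nodes_xy - target_xy, axis=1)
    wavelength = C / freq_hz
    return np.exp(-2j * np.pi * d / wavelength) / d

def beam_and_null_weights(g_beam, g_null):
    """Weights that add coherently at one target and cancel at the other.

    The received field at a target is g @ w. Start from the conjugate
    (matched) weights for the beam target, then remove the component
    that would leak into the null target.
    """
    w_matched = np.conj(g_beam)
    g_null_conj = np.conj(g_null)
    w = w_matched - g_null_conj * (g_null @ w_matched) / (g_null @ g_null_conj)
    return w / np.linalg.norm(w)

# Hypothetical geometry in metres: three nodes, two receiving antennas.
nodes = np.array([[0.0, 0.0], [0.5, 0.0], [0.2, 0.6]])
g_beam = nearfield_channel(nodes, np.array([3.0, 1.0]), 2.1e9)
g_null = nearfield_channel(nodes, np.array([2.5, -1.0]), 2.1e9)
w = beam_and_null_weights(g_beam, g_null)
field_at_beam = abs(g_beam @ w)
field_at_null = abs(g_null @ w)
```

Because the weights depend on the node-to-target distances through the carrier phase, position errors that are a sizeable fraction of the roughly 14 cm wavelength at 2.1 GHz would degrade both the beam and the null, which is why the localization accuracy matters.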

[63] 2407.15728

SAM2CLIP2SAM: Vision Language Model for Segmentation of 3D CT Scans for Covid-19 Detection

This paper presents a new approach for effective segmentation of images that can be integrated into any model and methodology; the paradigm that we choose is classification of medical images (3-D chest CT scans) for Covid-19 detection. Our approach includes a combination of vision-language models that segment the CT scans, which are then fed to a deep neural architecture, named RACNet, for Covid-19 detection. In particular, a novel framework, named SAM2CLIP2SAM, is introduced for segmentation that leverages the strengths of both Segment Anything Model (SAM) and Contrastive Language-Image Pre-Training (CLIP) to accurately segment the right and left lungs in CT scans, subsequently feeding these segmented outputs into RACNet for classification of COVID-19 and non-COVID-19 cases. At first, SAM produces multiple part-based segmentation masks for each slice in the CT scan; then CLIP selects only the masks that are associated with the regions of interest (ROIs), i.e., the right and left lungs; finally SAM is given these ROIs as prompts and generates the final segmentation mask for the lungs. Experiments are presented across two Covid-19 annotated databases which illustrate the improved performance obtained when our method has been used for segmentation of the CT scans.

[64] 2407.15730

Neural-based Video Compression on Solar Dynamics Observatory Images

NASA's Solar Dynamics Observatory (SDO) mission collects extensive data to monitor the Sun's daily activity. In the realm of space mission design, data compression plays a crucial role in addressing the challenges posed by limited telemetry rates. The primary objective of data compression is to facilitate efficient data management and transmission to work within the constrained bandwidth, thereby ensuring that essential information is captured while optimizing the utilization of available resources. This paper introduces a neural video compression technique that achieves a high compression ratio for the SDO's image data collection. The proposed approach focuses on leveraging both temporal and spatial redundancies in the data, leading to a more efficient compression. In this work, we introduce an architecture based on the Transformer model, which is specifically designed to capture both local and global information from input images in an effective and efficient manner. Additionally, our network is equipped with an entropy model that can accurately model the probability distribution of the latent representations and improves the speed of the entropy decoding step. The entropy model leverages a channel-dependent approach and utilizes checkerboard-shaped local and global spatial contexts. By combining the Transformer-based video compression network with our entropy model, the proposed compression algorithm demonstrates superior performance over traditional video codecs like H.264 and H.265, as confirmed by our experimental results.

[65] 2407.15749

Robustness of Speech Separation Models for Similar-pitch Speakers

Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal. Building on earlier findings by Ditter and Gerkmann, which identified a significant performance drop for the 2018 Chimera++ under similar-pitch conditions, our study extends the analysis to more recent and sophisticated Neural Network models. Our experiments reveal that modern models have substantially reduced the performance gap for matched training and testing conditions. However, a substantial performance gap persists under mismatched conditions, with models performing well for large pitch differences but showing worse performance if the speakers' pitches are similar. These findings motivate further research into the generalizability of speech separation models to similar-pitch speakers and unseen data.

[66] 2407.15784

Diffusion Model Based Resource Allocation Strategy in Ultra-Reliable Wireless Networked Control Systems

Diffusion models are widely used in generative AI, leveraging their capability to capture complex data distributions. However, their potential remains largely unexplored in the field of resource allocation in wireless networks. This paper introduces a novel diffusion model-based resource allocation strategy for Wireless Networked Control Systems (WNCSs) with the objective of minimizing total power consumption through the optimization of the sampling period in the control system, and blocklength and packet error probability in the finite blocklength regime of the communication system. The problem is first reduced to the optimization of blocklength only based on the derivation of the optimality conditions. Then, the optimization theory solution is used to collect a dataset of channel gains and corresponding optimal blocklengths. Finally, the Denoising Diffusion Probabilistic Model (DDPM) uses this collected dataset to train the resource allocation algorithm that generates optimal blocklength values conditioned on the channel state information (CSI). Via extensive simulations, the proposed approach is shown to outperform previously proposed Deep Reinforcement Learning (DRL) based approaches with close to optimal performance regarding total power consumption. Moreover, an improvement of up to eighteen-fold in the reduction of critical constraint violations is observed, further underscoring the accuracy of the solution.

[67] 2407.15799

Adaptive Extensions of Unbiased Risk Estimators for Unsupervised Magnetic Resonance Image Denoising

The application of Deep Neural Networks (DNNs) to image denoising has notably challenged traditional denoising methods, particularly within complex noise scenarios prevalent in medical imaging. Despite the effectiveness of traditional and some DNN-based methods, their reliance on high-quality, noiseless ground truth images limits their practical utility. In response to this, our work introduces and benchmarks innovative unsupervised learning strategies, notably Stein's Unbiased Risk Estimator (SURE), its extension (eSURE), and our novel implementation, the Extended Poisson Unbiased Risk Estimator (ePURE), within medical imaging frameworks. This paper presents a comprehensive evaluation of these methods on MRI data afflicted with Gaussian and Poisson noise types, a scenario typical in medical imaging but challenging for most denoising algorithms. Our main contribution lies in the effective adaptation and implementation of the SURE, eSURE, and particularly the ePURE frameworks for medical images, showcasing their robustness and efficacy in environments where traditional noiseless ground truth cannot be obtained.
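
For readers unfamiliar with SURE, the following sketch shows how the estimator scores a denoiser using only the noisy data and the noise level. It covers the classic Gaussian-noise SURE with a Monte Carlo divergence estimate, not the eSURE or ePURE variants discussed in the abstract. The soft-threshold denoiser stands in for a neural network, and the names `mc_sure` and `eps` are hypothetical.

```python
import numpy as np

def soft_threshold(y, t):
    """Simple stand-in denoiser (a DNN would take its place in practice)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def mc_sure(denoiser, y, sigma, rng, eps=1e-3):
    """Per-sample SURE for y = x + N(0, sigma^2) noise, without access to x.

    The divergence of the denoiser is estimated with one random probe b:
    div f(y) is approximated by b . (f(y + eps * b) - f(y)) / eps.
    """
    n = y.size
    fy = denoiser(y)
    b = rng.standard_normal(y.shape)
    divergence = np.sum(b * (denoiser(y + eps * b) - fy)) / eps
    return np.sum((fy - y) ** 2) / n - sigma**2 + 2.0 * sigma**2 * divergence / n

# Compare the estimate with the true MSE on a synthetic sparse signal.
rng = np.random.default_rng(0)
n, sigma = 100_000, 1.0
x_clean = np.where(rng.random(n) < 0.1, 5.0, 0.0)
y_noisy = x_clean + sigma * rng.standard_normal(n)
denoise = lambda v: soft_threshold(v, 1.5)
sure_value = mc_sure(denoise, y_noisy, sigma, rng)
true_mse = np.mean((denoise(y_noisy) - x_clean) ** 2)
```

In unsupervised training the same quantity would serve as the loss, which is what removes the dependence on noiseless ground-truth images.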

[68] 2407.15817

Enhancing Cell Instance Segmentation in Scanning Electron Microscopy Images via a Deep Contour Closing Operator

Accurately segmenting and individualizing cells in SEM images is a highly promising technique for elucidating tissue architecture in oncology. While current AI-based methods are effective, errors persist, necessitating time-consuming manual corrections, particularly in areas where the quality of cell contours in the image is poor and requires gap filling. This study presents a novel AI-driven approach for refining cell boundary delineation to improve instance-based cell segmentation in SEM images, also reducing the necessity for residual manual correction. A CNN, COp-Net, is introduced to address gaps in cell contours, effectively filling in regions with deficient or absent information. The network takes as input cell contour probability maps with potentially inadequate or missing information and outputs corrected cell contour delineations. The lack of training data was addressed by generating low integrity probability maps using a tailored PDE. We showcase the efficacy of our approach in augmenting cell boundary precision using both private SEM images from PDX hepatoblastoma tissues and publicly accessible image datasets. The proposed cell contour closing operator exhibits a notable improvement in tested datasets, achieving respectively close to 50% (private data) and 10% (public data) increase in the accurately-delineated cell proportion compared to state-of-the-art methods. Additionally, the need for manual corrections was significantly reduced, therefore facilitating the overall digitalization process. Our results demonstrate a notable enhancement in the accuracy of cell instance segmentation, particularly in highly challenging regions where image quality compromises the integrity of cell boundaries, necessitating gap filling. Therefore, our work should ultimately facilitate the study of tumour tissue bioarchitecture in the field of onconanotomy.

[69] 2407.14553

Machine Learning for Improved Current Density Reconstruction from 2D Vector Magnetic Images

The reconstruction of electrical current densities from magnetic field measurements is an important technique with applications in materials science, circuit design, quality control, plasma physics, and biology. Analytic reconstruction methods exist for planar currents, but break down in the presence of high spatial frequency noise or large standoff distance, restricting the types of systems that can be studied. Here, we demonstrate the use of a deep convolutional neural network for current density reconstruction from two-dimensional (2D) images of vector magnetic fields acquired by a quantum diamond microscope (QDM). Trained network performance significantly exceeds analytic reconstruction for data with high noise or large standoff distances. This machine learning technique can perform quality inversions on lower SNR data, reducing the data collection time by a factor of about 400 and permitting reconstructions of weaker and three-dimensional current sources.

[70] 2407.14638

Metasurface Energy Harvesters: State-of-the-Art Designs and Their Potential for Energy Sustainable Reconfigurable Intelligent Surfaces

Metasurface Energy Harvesters (MEHs) have emerged as a prominent enabler of highly efficient Radio Frequency (RF) energy harvesters. This survey delves into the fundamentals of the MEH technology, providing a comprehensive overview of their working principle, unit cell designs and prototypes over various frequency bands, as well as state-of-the-art modes of operation. Inspired by the recent academic and industrial interest in Reconfigurable Intelligent Surfaces (RISs) for the upcoming sixth-Generation (6G) of wireless networks, we study the interplay between this technology and MEHs, aiming for energy sustainable RISs powered by metasurface-based RF energy harvesting. We present a novel hybrid unit cell design capable of simultaneous energy harvesting and 1-bit tunable reflection whose dual-functional response is validated via full-wave simulations. Then, we conduct a comparative collection of real-world measurements for ambient RF power levels and power consumption budgets of reflective RISs to unveil the potential for a self-sustainable RIS via ambient RF energy harvesting. The paper concludes with a detailed discussion of open design challenges and future research directions for MEHs and energy sustainable hybrid RISs.

[71] 2407.14696

A Minibatch Alternating Projections Algorithm for Robust and Efficient Magnitude Least-Squares RF Pulse Design in MRI

A magnitude-least-squares radiofrequency pulse design algorithm is reported which uses interleaved exact and stochastically-generated inexact updates to escape local minima and find low-cost solutions. Inexact updates are performed using a small randomly selected minibatch of the available B1+ measurements to update RF pulse weights, which perturbs the sequence of alternating projections. Applications to RF shimming, parallel transmit spokes RF pulse design, and spectral-spatial RF pulse design are considered. Numerical and simulation studies characterized the optimal minibatch size, which was found to consistently produce lower power and lower RMSE solutions across subjects, coil geometries, B1+ resolutions and orientations. The method was validated in vivo at 7 Tesla and produced improvements in image quality in a slice-by-slice RF-shimmed imaging sequence. Compared to conventional methods, the pulse design method can more robustly design RF pulses that correct for B1+ inhomogeneities at ultra-high field strengths, and enables pulse designs to be completed with increased computational efficiency.
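
The interleaving of exact and minibatch updates can be sketched on a toy problem. The Python below is a minimal illustration, not the authors' implementation: it assumes a hypothetical two-channel system (so the least-squares step reduces to a 2x2 solve), alternates one full-data update with one minibatch update, and uses made-up names (`solve_ls_2col`, `ap_step`, `design`). The magnitude-least-squares cost is the sum over measurements of (|a_i . x| - d_i)^2.

```python
import cmath
import random

def solve_ls_2col(rows, targets):
    """Least-squares solve of A x = t for a complex A with two columns,
    via the normal equations (A^H A) x = A^H t and Cramer's rule."""
    g00 = sum(abs(a0) ** 2 for a0, _ in rows)
    g11 = sum(abs(a1) ** 2 for _, a1 in rows)
    g01 = sum(a0.conjugate() * a1 for a0, a1 in rows)
    r0 = sum(a0.conjugate() * t for (a0, _), t in zip(rows, targets))
    r1 = sum(a1.conjugate() * t for (_, a1), t in zip(rows, targets))
    det = g00 * g11 - abs(g01) ** 2
    return [(g11 * r0 - g01 * r1) / det,
            (g00 * r1 - g01.conjugate() * r0) / det]

def mls_cost(rows, d, x):
    """Magnitude-least-squares cost over all measurements."""
    return sum((abs(a0 * x[0] + a1 * x[1]) - di) ** 2
               for (a0, a1), di in zip(rows, d))

def ap_step(rows, d, x, idx):
    """One alternating-projections update using only the rows in idx."""
    targets = []
    for i in idx:
        a0, a1 = rows[i]
        # Projection 1: keep the current phase, impose the target magnitude.
        phase = cmath.phase(a0 * x[0] + a1 * x[1])
        targets.append(d[i] * cmath.exp(1j * phase))
    # Projection 2: least-squares fit of the weights to the phased target.
    return solve_ls_2col([rows[i] for i in idx], targets)

def design(rows, d, x, n_iter, batch=None, rng=None):
    """Run n_iter updates; if batch is set, every other update is inexact
    and uses a random minibatch (an assumed interleaving schedule)."""
    costs = [mls_cost(rows, d, x)]
    for k in range(n_iter):
        inexact = batch is not None and k % 2 == 1
        idx = rng.sample(range(len(rows)), batch) if inexact else list(range(len(rows)))
        x = ap_step(rows, d, x, idx)
        costs.append(mls_cost(rows, d, x))
    return x, costs
```

With `batch=None` every update is exact and the cost never increases, which is what lets plain alternating projections stall in a local minimum. The minibatch updates give up that guarantee in exchange for perturbing the iterate.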

[72] 2407.14700

Composer's Assistant 2: Interactive Multi-Track MIDI Infilling with Fine-Grained User Control

We introduce Composer's Assistant 2, a system for interactive human-computer composition in the REAPER digital audio workstation. Our work upgrades the Composer's Assistant system (which performs multi-track infilling of symbolic music at the track-measure level) with a wide range of new controls to give users fine-grained control over the system's outputs. Controls introduced in this work include two types of rhythmic conditioning controls, horizontal and vertical note onset density controls, several types of pitch controls, and a rhythmic interest control. We train a T5-like transformer model to implement these controls and to serve as the backbone of our system. With these controls, we achieve a dramatic improvement in objective metrics over the original system. We also study how well our model understands the meaning of our controls, and we conduct a listening study that does not find a significant difference between real music and music composed in a co-creative fashion with our system. We release our complete system, consisting of source code, pretrained models, and REAPER scripts.

[73] 2407.14746

Difflare: Removing Image Lens Flare with Latent Diffusion Model

The recovery of high-quality images from images corrupted by lens flare presents a significant challenge in low-level vision. Contemporary deep learning methods frequently entail training a lens flare removal model from scratch. However, these methods, despite their noticeable success, fail to utilize the generative prior learned by pre-trained models, resulting in unsatisfactory performance in lens flare removal. Furthermore, only a few works consider the physical priors relevant to flare removal. To address these issues, we introduce Difflare, a novel approach designed for lens flare removal. To leverage the generative prior learned by Pre-Trained Diffusion Models (PTDM), we introduce a trainable Structural Guidance Injection Module (SGIM) aimed at guiding the restoration process with PTDM. For more efficient training, we employ Difflare in the latent space. To address information loss resulting from latent compression and the stochastic sampling process of PTDM, we introduce an Adaptive Feature Fusion Module (AFFM), which incorporates the Luminance Gradient Prior (LGP) of lens flare to dynamically regulate feature extraction. Extensive experiments demonstrate that our proposed Difflare achieves state-of-the-art performance in real-world lens flare removal, restoring images corrupted by flare with improved fidelity and perceptual quality. The codes will be released soon.

[74] 2407.14793

QoS Aware Mixed-Criticality Task Scheduling in Vehicular Edge Cloud System

Modern-day cars are equipped with numerous cameras and sensors, typically integrated with advanced decision-control systems that enable the vehicle to perceive its surroundings and navigate autonomously. Efficient processing of data from sensors, lidars, radars and cameras is computationally intensive and cannot be done with good accuracy using the less capable onboard resources. To deal with this problem, some computation requirements (also referred to as tasks) are offloaded to the infrastructure or executed in parallel on both the autonomous vehicle (AV) and the infrastructure to enhance accuracy. The infrastructure comprises base stations (BSs), a centralized cloud, and a centralized scheduler (CS). BSs execute tasks in collaboration with a significantly more powerful centralized cloud, while the CS centrally schedules all the tasks. A base station receives tasks from multiple AVs, each with varying deadlines, criticality, and locations. Our main goal is to maximize the profit of the infrastructure by (a) minimizing the number of dropped tasks, (b) minimizing the distance cost for task offloading, and (c) minimizing the energy usage of BSs. In this work, we propose efficient approaches to schedule the collection of tasks to the BSs by employing a hybrid scheduling approach in which tasks from AVs are allocated to nearby BSs if those BSs are lightly loaded; otherwise, AVs send the tasks to the CS for allocation. The CS maximizes the profit through the following strategies: (a) selecting a BS considering distance and energy consumption, (b) when task load is moderate or low, running highly critical tasks at favourable utilisation, and (c) dropping low-criticality tasks to free up resources for executing high-criticality tasks. Based on our experiments, the proposed approaches improved the QoS by up to 25% compared to the state-of-the-art approach on real-life datasets.

[75] 2407.14815

Unified Far-Field and Near-Field in Holographic MIMO: A Wavenumber-Domain Perspective

This article conceives a unified representation for near-field and far-field holographic multiple-input multiple-output (HMIMO) channels, addressing a practical design dilemma: "Why does the angular-domain representation no longer function effectively?" To answer this question, we pivot from the angular domain to the wavenumber domain and present a succinct overview of its underlying philosophy. In re-examining the Fourier plane-wave series expansion that recasts spherical propagation waves into a series of plane waves represented by Fourier harmonics, we characterize the HMIMO channel employing these Fourier harmonics having different wavenumbers. This approach, referred to as the wavenumber-domain representation, facilitates a unified view across the far-field and the near-field. Furthermore, the limitations of the DFT basis are demonstrated when identifying the sparsity inherent to the HMIMO channel, motivating the development of a wavenumber-domain basis as an alternative. We then present some preliminary applications of the proposed wavenumber-domain basis in signal processing across both the far-field and near-field, along with several prospects for future HMIMO system designs based on the wavenumber domain.

[76] 2407.14819

A Convex-Nonconvex Framework for Enhancing Minimization Induced Penalties

This paper presents a novel framework for nonconvex enhancement of minimization induced (MI) penalties while preserving the overall convexity of associated regularization models. MI penalties enable the adaptation to certain signal structures via minimization, but often underestimate significant components owing to convexity. To overcome this shortcoming, we design a generalized Moreau enhanced minimization induced (GME-MI) penalty by subtracting from the MI penalty its generalized Moreau envelope. While the proposed GME-MI penalty is nonconvex in general, we derive an overall convexity condition for the GME-MI regularized least-squares model. Moreover, we present a proximal splitting algorithm with guaranteed convergence to a globally optimal solution of the GME-MI model under the overall convexity condition. Numerical examples illustrate the effectiveness of the proposed framework.
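
The "penalty minus its Moreau envelope" construction can be made concrete in one dimension. The Python below is a minimal sketch and not the GME-MI penalty itself: as an assumption for illustration, the absolute value stands in for the MI penalty, and the plain (not generalized) Moreau envelope with a hypothetical parameter `b` is used. In this special case the envelope is the Huber function, and the difference is the minimax concave penalty, which saturates at 1/(2b) instead of growing with |x|.

```python
def moreau_env_abs(x, b):
    """Moreau envelope of |.|: min over v of |v| + (b/2)*(x - v)**2.
    The closed form is the Huber function."""
    if abs(x) <= 1.0 / b:
        return 0.5 * b * x * x
    return abs(x) - 0.5 / b

def enhanced_penalty(x, b):
    """|x| minus its Moreau envelope: nonconvex, constant for |x| >= 1/b."""
    return abs(x) - moreau_env_abs(x, b)

def moreau_env_brute(x, b, lo=-5.0, hi=5.0, steps=100001):
    """Grid-search evaluation of the same minimum, used only as a check."""
    best = float("inf")
    for i in range(steps):
        v = lo + (hi - lo) * i / (steps - 1)
        best = min(best, abs(v) + 0.5 * b * (x - v) ** 2)
    return best
```

Because the penalty stops growing for large |x|, it does not shrink large components the way |x| does. In this scalar setting the denoising cost (1/2)(y - x)^2 + lam * enhanced_penalty(x, b) remains convex when lam * b <= 1, a simple instance of an overall convexity condition.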

[77] 2407.14823

CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation

In recent years, as computer vision tasks have increasingly relied on high-quality image inputs, the task of image dehazing has received significant attention. Previously, many methods based on priors and deep learning have been proposed to address the task of image dehazing. Ignoring the domain gap between different data, former dehazing methods usually adopt multiple datasets for explicit training, which often undermines the methods themselves. To address this problem, we propose a novel method of internal and external data augmentation to improve the existing dehazing methodology. By using a cross-data external augmentor, the dataset inherits samples from different domains that are firmly aligned, making the model learn more robust and generalizable features. By using the internal data augmentation method, the model can fully exploit local information within the images, thereby obtaining more image details. To demonstrate the effectiveness of our proposed method, we conduct training on both the Natural Image Dataset (NID) and the Remote Sensing Image Dataset (RSID). Experimental results show that our method clearly resolves the domain gap in different dehazing datasets and presents a new pipeline for joint training in the dehazing task. Our approach significantly outperforms other advanced methods in dehazing and produces dehazed images that are closest to real haze-free images. The code will be available at:

[78] 2407.14940

Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues

Interruption in a dialogue occurs when the listener begins their speech before the current speaker finishes speaking. Interruptions can be broadly divided into two groups: cooperative (when the listener wants to support the speaker), and competitive (when the listener tries to take control of the conversation against the speaker's will). A system that automatically classifies interruptions can be used in call centers, specifically in the tasks of customer satisfaction monitoring and agent monitoring. In this study, we developed a text-based interruption classification model by preparing an in-house dataset consisting of ASR-transcribed customer support telephone dialogues in Russian. We fine-tuned Conversational RuBERT on our dataset and optimized hyperparameters, and the model performed well. With further improvements, the proposed model can be applied to automatic monitoring systems.

[79] 2407.14947

A Distributionally Robust Optimization Framework for Stochastic Assessment of Power System Flexibility in Economic Dispatch

Given the complexity of power systems, particularly the high-dimensional variability of net loads, accurately depicting the entire operational range of net loads poses a challenge. To address this, recent methodologies have sought to gauge the maximum range of net load uncertainty across all buses. In this paper, we consider the stochastic nature of the net load and introduce a distributionally robust optimization framework that assesses system flexibility stochastically, accommodating a minimal extent of system violations. We verify the proposed method by solving the flexibility of the real-time economic dispatch problem on four IEEE standard test systems. Compared to traditional deterministic flexibility evaluations, our approach consistently yields less conservative flexibility outcomes.

[80] 2407.14983

Deep Learning CT Image Restoration using System Blur and Noise Models

The restoration of images affected by blur and noise has been widely studied and has broad potential for applications, including in medical imaging modalities like computed tomography (CT). Although the blur and noise in CT images can be attributed to a variety of system factors, these image properties can often be modeled and predicted accurately and used in classical restoration approaches for deconvolution and denoising. In classical approaches, simultaneous deconvolution and denoising can be challenging and often represent competing goals. Recently, deep learning approaches have demonstrated the potential to enhance image quality beyond classic limits; however, most deep learning models attempt a blind restoration problem and base their restoration on image inputs alone without direct knowledge of the image noise and blur properties. In this work, we present a method that leverages both degraded image inputs and a characterization of the system blur and noise to combine modeling and deep learning approaches. Different methods to integrate these auxiliary inputs are presented, namely an input-variant and a weight-variant approach wherein the auxiliary inputs are incorporated as a parameter vector before and after the convolutional block, respectively, allowing easy integration into any CNN architecture. The proposed model shows superior performance compared to baseline models lacking auxiliary inputs. Evaluations are based on the average Peak Signal-to-Noise Ratio (PSNR), selected examples of good and poor performance for varying approaches, and an input space analysis to assess the effect of different noise and blur on performance. Results demonstrate the efficacy of providing a deep learning model with auxiliary inputs, representing system blur and noise characteristics, to enhance the performance of the model in image restoration tasks.

[81] 2407.14984

Enhancing Microgrid Performance Prediction with Attention-based Deep Learning Models

In this research, an effort is made to address microgrid systems' operational challenges, characterized by power oscillations that eventually contribute to grid instability. An integrated strategy is proposed, leveraging the strengths of convolutional and Gated Recurrent Unit (GRU) layers. This approach is aimed at effectively extracting temporal data from energy datasets to improve the precision of microgrid behavior forecasts. Additionally, an attention layer is employed to underscore significant features within the time-series data, optimizing the forecasting process. The framework is anchored by a Multi-Layer Perceptron (MLP) model, which is tasked with comprehensive load forecasting and the identification of abnormal grid behaviors. Our methodology underwent rigorous evaluation using the Micro-grid Tariff Assessment Tool dataset, with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (r2-score) serving as the primary metrics. The approach demonstrated exemplary performance, evidenced by a MAE of 0.39, RMSE of 0.28, and an r2-score of 98.89% in load forecasting, along with near-perfect zero state prediction accuracy (approximately 99.9%). Significantly outperforming conventional machine learning models such as support vector regression and random forest regression, our model's streamlined architecture is particularly suitable for real-time applications, thereby facilitating more effective and reliable microgrid management.

[82] 2407.14986

Structured Input-Output Modeling and Robust Stability Analysis of Compressible Flows

The recently introduced structured input-output analysis is a powerful method for capturing nonlinear phenomena associated with incompressible flows, and this paper extends that method to the compressible regime. The proposed method relies upon a reformulation of the compressible Navier-Stokes equations, which allows for an exact quadratic formulation of the dynamics of perturbations about a steady base flow. To facilitate the structured input-output analysis, a pseudo-linear model for the quadratic nonlinearity is proposed and the structural information of the nonlinearity is embedded into a structured uncertainty comprising unknown 'perturbations'. The structured singular value framework is employed to compute the input-output gain, which provides an estimate of the robust stability margin of the flow perturbations, as well as the forcing and response modes that are consistent with the nonlinearity structure. The analysis is then carried out on a plane, laminar compressible Couette flow over a range of Mach numbers. The structured input-output gains identify an instability mechanism, characterized by a spanwise elongated structure in the streamwise-spanwise wavenumber space at a subsonic Mach number, that evolves into an oblique structure at sonic and supersonic Mach numbers. In addition, the structured input-output forcing and response modes provide insight into the thermodynamic and momentum characteristics associated with a source of instability. Comparisons with a resolvent/unstructured analysis reveal discrepancies in the distribution of input-output gains over the wavenumber space as well as in the modal behavior of an instability, thus highlighting the strong correlation between the structural information of the nonlinearity and the underlying flow physics.

[83] 2407.15060

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or be user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online,

[84] 2407.15165

Reinforcement Learning Optimizes Power Dispatch in Decentralized Power Grid

Effective frequency control in power grids has become increasingly important with the increasing demand for renewable energy sources. Here, we propose a novel strategy for resolving this challenge using graph convolutional proximal policy optimization (GC-PPO). The GC-PPO method can optimally determine how much power individual buses dispatch to reduce frequency fluctuations across a power grid. We demonstrate its efficacy in controlling disturbances by applying the GC-PPO to the power grid of the UK. The performance of GC-PPO is outstanding compared to the classical methods. This result highlights the promising role of GC-PPO in enhancing the stability and reliability of power systems by switching lines or decentralizing grid topology.

[85] 2407.15174

TADA: Temporal Adversarial Data Augmentation for Time Series Data

Domain generalization involves training machine learning models to perform robustly on unseen samples from out-of-distribution datasets. Adversarial Data Augmentation (ADA) is a commonly used approach that enhances model adaptability by incorporating synthetic samples, designed to simulate potential unseen samples. While ADA effectively addresses amplitude-related distribution shifts, it falls short in managing temporal shifts, which are essential for time series data. To address this limitation, we propose Temporal Adversarial Data Augmentation for Time Series Data (TADA), which incorporates a time warping technique specifically targeting temporal shifts. Recognizing the challenge of non-differentiability in traditional time warping, we make it differentiable by leveraging phase shifts in the frequency domain. Our evaluations across diverse domains demonstrate that TADA significantly outperforms existing ADA variants, enhancing model performance across time series datasets with varied distributions.
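
The abstract does not spell out the warping operator, so the Python below illustrates only the underlying Fourier shift theorem, under stated assumptions: a uniform circular delay of `tau` samples is applied by multiplying frequency bin k by exp(-2*pi*i*f_k*tau/N). Because that factor is smooth in `tau`, the delay may be fractional and could in principle be optimized by gradient methods. The function names are hypothetical, and a naive O(N^2) DFT is used to keep the sketch dependency-free.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def phase_shift(x, tau):
    """Circularly delay a real signal by tau samples (tau may be fractional)
    by applying a linear phase ramp in the frequency domain."""
    n = len(x)
    X = dft(x)
    shifted = []
    for k in range(n):
        f = k if k <= n // 2 else k - n  # signed frequency index
        shifted.append(X[k] * cmath.exp(-2j * cmath.pi * f * tau / n))
    return [v.real for v in idft(shifted)]
```

For an integer `tau` the result matches a circular roll of the samples; for a fractional `tau` it gives a band-limited interpolation between neighbouring shifts, which a hard index-based warp cannot provide.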

[86] 2407.15216

Explainability Paths for Sustained Artistic Practice with AI

The development of AI-driven generative audio mirrors broader AI trends, often prioritizing immediate accessibility at the expense of explainability. Consequently, integrating such tools into sustained artistic practice remains a significant challenge. In this paper, we explore several paths to improve explainability, drawing primarily from our research-creation practice in training and implementing generative audio models. As practical provisions for improved explainability, we highlight human agency over training materials, the viability of small-scale datasets, the facilitation of the iterative creative process, and the integration of interactive machine learning as a mapping tool. Importantly, these steps aim to enhance human agency over generative AI systems not only during model inference, but also when curating and preprocessing training data as well as during the training phase of models.

[87] 2407.15245

Weyl Calculus and Exactly Solvable Schrödinger Bridges with Quadratic State Cost

Schrödinger bridge--a stochastic dynamical generalization of optimal mass transport--exhibits a learning-control duality. Viewed as a stochastic control problem, the Schrödinger bridge finds an optimal control policy that steers a given joint state statistics to another while minimizing the total control effort subject to controlled diffusion and deadline constraints. Viewed as a stochastic learning problem, the Schrödinger bridge finds the most-likely distribution-valued trajectory connecting endpoint distributional observations, i.e., solves the two point boundary-constrained maximum likelihood problem over the manifold of probability distributions. Recent works have shown that solving the Schrödinger bridge problem with state cost requires finding the Markov kernel associated with a reaction-diffusion PDE where the state cost appears as a state-dependent reaction rate. We explain how ideas from Weyl calculus in quantum mechanics, specifically the Weyl operator and the Weyl symbol, can help determine such Markov kernels. We illustrate these ideas by explicitly finding the Markov kernel for the case of quadratic state cost via Weyl calculus, recovering our earlier results but avoiding tedious computation with Hermite polynomials.

[88] 2407.15284

Revisiting Neighborhood Aggregation in Graph Neural Networks for Node Classification using Statistical Signal Processing

We delve into the issue of node classification within graphs, specifically reevaluating the concept of neighborhood aggregation, which is a fundamental component in graph neural networks (GNNs). Our analysis reveals conceptual flaws within certain benchmark GNN models when operating under the assumption of edge-independent node labels, a condition commonly observed in benchmark graphs employed for node classification. Approaching neighborhood aggregation from a statistical signal processing perspective, our investigation provides novel insights which may be used to design more efficient GNN models.

[89] 2407.15300

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Speech Emotion Recognition (SER) has been traditionally formulated as a classification task. However, emotions are generally a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. The formulation breaks SER into predicting acoustic model features weighted by language model prediction. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over the state-of-the-art baselines, with 17% and 7% relative accuracy gains for RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance by Few-Shot Learning using a few annotated examples. The results highlight the effectiveness of our SER formulation, especially to improve performance in OOD scenarios.

[90] 2407.15463

Integrated Access and Backhaul (IAB) in Low Altitude Platforms

In this paper, we explore the problem of utilizing Integrated Access and Backhaul (IAB) technology in Non-Terrestrial Networks (NTN), with a particular focus on aerial access networks. We consider an Uncrewed Aerial Vehicle (UAV)-based wireless network comprised of two layers of UAVs: (a) a lower layer consisting of a number of flying users and a UAV Base Station (BS) that provides coverage for terrestrial users, and (b) an upper layer designated to provide both wireless access for flying users and backhaul connectivity for the UAV BS. By adopting IAB technology, the backhaul and access links collaboratively share their resources, enabling aerial backhauling and the utilization of the same infrastructure and frequency resources for access links. A sum-rate maximization problem is formulated by considering aerial backhaul constraints to optimally allocate the frequency spectrum between aerial and terrestrial networks. We decompose the resulting non-convex optimization problem into two sub-problems of beamforming and spectrum allocation and then propose efficient solutions for each. Numerical results in different scenarios yield insightful findings about the effectiveness of using the IAB technique in aerial networks.

[91] 2407.15570

Hybrid STAR-RIS Enabled Integrated Sensing and Communication

Integrated sensing and communication (ISAC) is recognized as one of the key enabling technologies for sixth-generation (6G) wireless communication networks, facilitating diverse emerging applications and services in an energy and cost-efficient manner. This paper proposes a multi-user multi-target ISAC system to enable full-space coverage for communication and sensing tasks. The proposed system employs a hybrid simultaneous transmission and reflection reconfigurable intelligent surface (STAR-RIS) comprising active transmissive and passive reflective elements. In the proposed scheme, the passive reflective elements support communication and sensing links for nearby communication users and sensing targets, while low-power active transmissive elements are deployed to improve sensing performance and overcome high path attenuation due to multi-hop transmission for remote targets. Moreover, to optimize the transmissive/reflective coefficients of the hybrid STAR-RIS, a semi-definite relaxation (SDR)-based algorithm is proposed. Furthermore, to evaluate sensing performance, signal-to-interference-noise ratio (SINR) and Cramer-Rao bound (CRB) metrics have been derived and investigated via conducting extensive computer simulations.

[92] 2407.15580

Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
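
The contrast between greedy Winner-takes-all and a temperature-controlled relaxation can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's training procedure: it uses scalar hypotheses, squared error, and a Boltzmann (softmin) weighting over hypotheses, which is one common way to relax a hard minimum. The function names are hypothetical.

```python
import math

def wta_loss(y, hyps):
    """Winner-takes-all: only the closest hypothesis contributes."""
    return min((y - h) ** 2 for h in hyps)

def annealed_loss(y, hyps, temperature):
    """Soft assignment: hypotheses are weighted by exp(-loss / temperature).
    High temperature spreads weight over all hypotheses; as the temperature
    goes to zero the weighting collapses onto the winner."""
    losses = [(y - h) ** 2 for h in hyps]
    m = min(losses)  # subtract the minimum for numerical stability
    weights = [math.exp(-(l - m) / temperature) for l in losses]
    z = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / z
```

Lowering `temperature` over the course of training moves the objective from the average loss over all hypotheses towards the hard minimum. Early on, every hypothesis receives signal from every target, so the outcome depends less on which hypothesis happened to start closest.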

[93] 2407.15614

Experimenting with Adaptive Bitrate Algorithms for Virtual Reality Streaming over Wi-Fi

Interactive Virtual Reality (VR) streaming over Wi-Fi networks encounters significant challenges due to bandwidth fluctuations caused by channel contention and user mobility. Adaptive BitRate (ABR) algorithms dynamically adjust the video encoding bitrate based on the available network capacity, aiming to maximize image quality while mitigating congestion and preserving the user's Quality of Experience (QoE). In this paper, we experiment with ABR algorithms for VR streaming using Air Light VR (ALVR), an open-source VR streaming solution. We extend ALVR with a comprehensive set of metrics that provide a robust characterization of the network's state, enabling more informed bitrate adjustments. To demonstrate the utility of these performance indicators, we develop and test the Network-aware Step-wise ABR algorithm for VR streaming (NeSt-VR). Results validate the accuracy of the newly implemented network performance metrics and demonstrate NeSt-VR's video bitrate adaptation capabilities.

[94] 2407.15662

How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?

We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution $q$ via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, linearity, etc.), permutation-equivalence naturally appears in machine learning problems, and benefiting from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establishes when the numbers of observations $(n_k)_{k\in \mathcal K}$ are large enough so that the output confidence sets improve over initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of the confidence sets shrinks at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in \mathcal K} n_{k})$, respectively for elements inside and outside the support of $q$, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation equivalence on a reinforcement learning task.
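
To see what the stated rates mean numerically, the short Python sketch below compares the individual and refined rates for some hypothetical sample counts. It evaluates only the asymptotic orders quoted in the abstract, with all constants and logarithmic factors dropped (an assumption made for illustration); it does not compute actual confidence sets.

```python
import math

def individual_rates(n_obs):
    """Per-distribution rates (inside support, outside support):
    1/sqrt(n_k) and 1/n_k, constants dropped."""
    return [(1 / math.sqrt(n), 1 / n) for n in n_obs]

def refined_rates(n_obs):
    """Refined rates shared by all distributions:
    1/sqrt(sum of n_k) inside the support, 1/max n_k outside."""
    return 1 / math.sqrt(sum(n_obs)), 1 / max(n_obs)
```

For sample counts such as 100, 400 and 2500, the refined inside-support rate is driven by the pooled 3000 observations, so the distribution with only 100 samples gains the most, while the outside-support rate matches that of the best-sampled distribution.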

[95] 2407.15672

Computer Audition: From Task-Specific Machine Learning to Foundation Models

Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community, and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models, and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.

[96] 2407.15707

Predicting the Best of N Visual Trackers

We observe that the performance of SOTA visual trackers varies surprisingly strongly across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects the predicted best-performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker, which keeps predicting the best performer at regular temporal intervals. The TP2N is based on the self-supervised learning architectures MoCo v2, SwAV, BT, and DINO; experiments show that DINO with ViT-S as a backbone performs the best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks - LaSOT, TrackingNet, GOT-10k, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker, which effectively handles variations in the tracking scenarios within long sequences. For instance, on GOT-10k, the BofN meta-tracker's average overlap (AO) is 88.7% and 91.1% with video-level and frame-level settings respectively, whereas the best-performing single tracker, RTS, achieves 85.20% AO. On VOT2022, the BofN expected average overlap is 67.88% and 70.98% with video-level and frame-level settings, compared to 64.12% for the best-performing tracker, ARTrack. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available on

[97] 2407.15729

Self-Sustainable Metasurface-Assisted mmWave Indoor Communication System

In the design of a metasurface-assisted system for indoor environments, it is essential to take into account not only the performance gains and coverage extension provided by the metasurface but also the operating costs brought by its reconfigurability, such as powering and cabling. These costs can present challenges, particularly in indoor dense spaces (IDSs). A self-sustainable metasurface (SSM), which retains reconfigurability unlike a static metasurface (SMS), achieves a lower operating cost than a reconfigurable intelligent surface (RIS) by being self-sustainable through power harvesting. In this paper, in order to find a better trade-off between metasurface gain, coverage, and operating cost, the design and performance of an SSM-assisted indoor mmWave communication system are investigated. We first simplify the design of the SSM-assisted system by considering the use of SSMs in a preset-based manner and the formation of coverage groups by associating SSMs with the closest user equipments (UEs). We propose a two-stage iterative algorithm to maximize the minimum data rate in the system by jointly deciding the association between the UEs and the SSMs, the phase shifts of the SSMs, and the time resources allocated to each UE. The non-convexities in the resulting optimization problem are tackled using the feasible point pursuit successive convex approximation method and the concave-convex procedure. To understand the best scenario for using SSMs, the resulting performance is compared with that achieved with RIS and SMS. Our numerical results indicate that SSMs are best utilized in a small environment, where self-sustainability is easier to achieve, when the budget for operating costs is tight.

[98] 2407.15743

Enhanced Achievable DoF Bounds for Cache-Aided MIMO Communication Systems

Integrating coded caching (CC) into multiple-input multiple-output (MIMO) communications may significantly enhance the achievable degrees of freedom (DoF) of wireless networks. In this paper, we consider a cache-aided MIMO configuration with a CC gain $t$, where a server with $L$ Tx antennas communicates with $K$ users, each with $G$ Rx antennas. In the proposed content-aware MIMO strategy, we carefully adjust the number of users served in each transmission, $\Omega$, and the number of parallel streams decoded by each user, $\beta$, to maximize the DoF. As a result, we achieve a DoF of ${\max_{\beta, \Omega }}{\Omega \beta}$, where ${\beta \le \min\big(G,\frac{L \binom{\Omega-1}{t}}{1 + (\Omega - t-1)\binom{\Omega-1}{t}}\big)}$. To prove the achievability of the proposed DoF bound, we provide a novel transmission strategy based on the simultaneous unicasting of multiple data streams. In this strategy, the missing data packets are scheduled such that the number of parallel streams per transmission is maximized while the decodability of all useful terms by each target user is guaranteed. Numerical simulations validate the findings, confirming the enhanced DoF and improved performance of the proposed design.
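
As a rough illustration of the DoF expression in this abstract, the following sketch evaluates $\max_{\beta,\Omega} \Omega\beta$ by brute force. It is not the authors' code. The search range $t+1 \le \Omega \le K$ and the restriction of $\beta$ to positive integers are assumptions made here for illustration, and the function name `max_dof` is hypothetical.

```python
from math import comb


def max_dof(L, G, K, t):
    """Brute-force the DoF expression max over (beta, Omega) of Omega * beta.

    Assumptions (not stated in the abstract): Omega ranges over the
    integers t+1..K, and beta is the largest integer satisfying
    beta <= min(G, L*C(Omega-1, t) / (1 + (Omega-t-1)*C(Omega-1, t))).
    Returns (dof, omega, beta); omega and beta are None if nothing is feasible.
    """
    best = (0, None, None)
    for omega in range(t + 1, K + 1):
        c = comb(omega - 1, t)
        # Integer floor of the fractional bound, using exact integer arithmetic.
        stream_bound = (L * c) // (1 + (omega - t - 1) * c)
        beta = min(G, stream_bound)
        if beta >= 1 and omega * beta > best[0]:
            best = (omega * beta, omega, beta)
    return best


if __name__ == "__main__":
    # Example values chosen only for illustration.
    print(max_dof(L=4, G=2, K=10, t=1))
```

Under these assumptions, with $L=4$, $G=2$, $K=10$, and $t=1$, the search selects $\Omega=3$ and $\beta=2$, for a DoF of 6.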

[99] 2407.15752

Broad and Spectral-Efficient Beamforming for the Uni-polarized Reconfigurable Intelligent Surfaces

A reconfigurable intelligent surface (RIS) is composed of low-cost elements that manipulate the propagation environment from a transmitter by intelligently applying phase shifts to incoming signals before they are reflected. This paper explores a uni-polarized RIS with linear shape aimed at transmitting a common signal to multiple user equipments (UEs) spread across a wide angular region. To achieve uniform coverage, the uni-polarized RIS is designed to emit a broad and spectral-efficient beam featuring a spatially flat-like array factor, diverging from the conventional narrow beam approach. To achieve this objective, we start by deriving probabilistic lower and upper bounds for the average spectral efficiency (SE) delivered to the UEs. Leveraging the insights from the lower bound, we focus on optimizing the minimum value of the power domain array factor (PDAF) across a range of azimuth angles from \(-\frac{\pi}{2}\) to \(\frac{\pi}{2}\). We employ the continuous genetic algorithm (CGA) for this optimization task, aiming to improve the SE delivered to the UEs while also creating a wide beam. Extensive simulation experiments are carried out to assess the performance of the proposed code, focusing on key metrics such as the minimum and average values of the PDAF and the SE delivered to the UEs. Our findings demonstrate that the proposed code enhances the minimum SE delivered to the UEs while maintaining the desired attribute of a broad beam. This performance is notably superior to that of established codes, including the Barker, Frank, and Chu codes.

[100] 2407.15782

Reconfigurable Intelligent Surface Empowered Full Duplex Systems: Opportunities and Challenges

Reconfigurable intelligent surfaces (RISs) have emerged as a promising technology in wireless communications. Simultaneously transmitting and reflecting RISs (STAR-RISs) in particular have garnered significant attention due to their dual capabilities of simultaneous transmission and reflection, underscoring their potential applications in critical scenarios within the forthcoming sixth-generation (6G) technology landscape. Moreover, full-duplex (FD) systems have emerged as a breakthrough research direction in wireless transmission technology due to their high spectral efficiency. This paper explores the application potential of STAR-RIS in FD systems for future wireless communications, presenting an innovative technology that provides robust self-interference cancellation (SIC) capabilities for FD systems. The refraction functionality of STAR-RIS enhances the transmission capacity of FD systems, while its reflection functionality is used to eliminate self-interference within the FD system. We delve into the applications of two different types of STAR-RIS in FD systems and compare their performance through simulations. Furthermore, we discuss the performance differences of STAR-RIS empowered FD systems under various configurations in a case study, and demonstrate the superiority of the proposed deep learning-based optimization algorithm. Finally, we discuss possible future research directions for STAR-RIS empowered FD systems.

[101] 2407.15828

J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure high-quality speech generation, the data must be spontaneous, like in-the-wild data, and acoustically clean, with noise removed. Despite the critical need, no open-source corpus meeting all these criteria has been available. This study addresses this gap by constructing and releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly accessible. Furthermore, this paper presents a language-independent method for corpus construction and describes experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the data collected from multiple domains by our method improve the naturalness and meaningfulness of dialogue generation.

[102] 2407.15835

dMel: Speech Tokenization made Simple

Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel) that performs better than other existing speech tokenization methods. Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR) and speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
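
The idea of discretizing mel-filterbank channels into intensity bins can be sketched as follows. This is a minimal illustration, not the dMel implementation: the uniform bin spacing, the clamping of out-of-range values, the bin count, and the function names `quantize_mel` and `dequantize_mel` are all assumptions made here.

```python
def quantize_mel(frames, num_bins, lo, hi):
    """Map each log-mel value to an integer bin index in [0, num_bins - 1].

    `frames` is a list of frames, each a list of per-channel log-mel values.
    Uniform bins over [lo, hi] are an assumption of this sketch; values
    outside the range are clamped to the first or last bin.
    """
    width = (hi - lo) / num_bins
    tokens = []
    for frame in frames:
        row = []
        for value in frame:
            index = int((value - lo) / width)
            row.append(min(max(index, 0), num_bins - 1))
        tokens.append(row)
    return tokens


def dequantize_mel(tokens, num_bins, lo, hi):
    """Reconstruct approximate log-mel values from bin indices (bin centers)."""
    width = (hi - lo) / num_bins
    return [[lo + (index + 0.5) * width for index in row] for row in tokens]


if __name__ == "__main__":
    # Toy input: two frames with two mel channels each (illustrative values).
    log_mel = [[0.0, 1.0], [0.5, 0.24]]
    tokens = quantize_mel(log_mel, num_bins=4, lo=0.0, hi=1.0)
    print(tokens)
    print(dequantize_mel(tokens, num_bins=4, lo=0.0, hi=1.0))
```

In this sketch every frame becomes one small integer per channel, and for in-range values the reconstruction error is at most half a bin width, so the bin count directly trades token vocabulary size against acoustic detail.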