New articles on Electrical Engineering and Systems Science


[1] 2410.14686

Achieving Generalization in Orchestrating GNSS Interference Monitoring Stations Through Pseudo-Labeling

The accuracy of global navigation satellite system (GNSS) receivers is significantly compromised by interference from jamming devices. Consequently, detecting these jammers is crucial to mitigating such interference signals. However, robust classification of interference using machine learning (ML) models is challenging due to the lack of labeled data in real-world environments. In this paper, we propose an ML approach that achieves high generalization in classifying interference through orchestrated monitoring stations deployed along highways. We present a semi-supervised approach coupled with an uncertainty-based voting mechanism by combining Monte Carlo and Deep Ensembles that effectively minimizes the requirement for labeled training samples to less than 5% of the dataset while improving adaptability across varying environments. Our method demonstrates strong performance when adapted from indoor environments to real-world scenarios.
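To make the uncertainty-based voting concrete, here is a minimal sketch (not the authors' code), assuming each ensemble member exposes a stochastic predict_proba with dropout active at inference so that repeated passes act as Monte Carlo samples; the entropy threshold is a hypothetical choice.

```python
import numpy as np

def predictive_probs(models, x, mc_passes=10):
    """Class probabilities from every ensemble member and every MC-dropout
    pass; shape (members * passes, n_samples, n_classes)."""
    return np.stack([m.predict_proba(x) for m in models for _ in range(mc_passes)])

def pseudo_label(models, x_unlabeled, max_entropy=0.2):
    """x_unlabeled: numpy array of feature vectors. Returns the confident
    subset together with its pseudo-labels."""
    probs = predictive_probs(models, x_unlabeled)          # (V, N, C) votes
    mean_p = probs.mean(axis=0)                            # average the votes
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)   # predictive uncertainty
    keep = entropy < max_entropy                           # keep confident samples only
    return x_unlabeled[keep], mean_p[keep].argmax(-1)
```

Samples whose averaged vote is confident enough would enter the training set with their pseudo-labels, which is how the labeled-data requirement can be pushed below 5% of the dataset.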


[2] 2410.14747

Continuous Wavelet Transformation and VGG16 Deep Neural Network for Stress Classification in PPG Signals

Our research introduces a groundbreaking approach to stress classification through Photoplethysmogram (PPG) signals. By combining Continuous Wavelet Transformation (CWT) with the proven VGG16 classifier, our method enhances stress assessment accuracy and reliability. Previous studies highlighted the importance of physiological signal analysis, yet precise stress classification remains a challenge. Our approach addresses this by incorporating robust data preprocessing with a Kalman filter and a sophisticated neural network architecture. Experimental results showcase exceptional performance, achieving a maximum training accuracy of 98% and maintaining an impressive average training accuracy of 96% across diverse stress scenarios. These results demonstrate the practicality and promise of our method in advancing stress monitoring systems and stress alarm sensors, contributing significantly to stress classification.
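A minimal sketch of the described CWT-to-VGG16 pipeline, assuming a 1-D PPG segment; the Morlet wavelet, the scale range, and the two-class head are illustrative assumptions rather than specifics from the paper (the Kalman-filter preprocessing step is omitted).

```python
import numpy as np
import pywt
import torch
import torch.nn as nn
from torchvision.models import vgg16

def ppg_to_scalogram(ppg, scales=np.arange(1, 65), wavelet="morl"):
    """Turn a 1-D PPG segment into a 3-channel 224x224 scalogram tensor."""
    coef, _ = pywt.cwt(ppg, scales, wavelet)          # (n_scales, n_samples)
    img = np.abs(coef)
    img = (img - img.min()) / (np.ptp(img) + 1e-9)    # normalize to [0, 1]
    t = torch.tensor(img, dtype=torch.float32)[None, None]
    return nn.functional.interpolate(t, size=(224, 224)).repeat(1, 3, 1, 1)

model = vgg16(weights="IMAGENET1K_V1")                # ImageNet-pretrained backbone
model.classifier[6] = nn.Linear(4096, 2)              # stress vs. no-stress head
logits = model(ppg_to_scalogram(np.random.randn(1024)))
```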


[3] 2410.14769

Medical AI for Early Detection of Lung Cancer: A Survey

Lung cancer remains one of the leading causes of morbidity and mortality worldwide, making early diagnosis critical for improving therapeutic outcomes and patient prognosis. Computer-aided diagnosis (CAD) systems, which analyze CT images, have proven effective in detecting and classifying pulmonary nodules, significantly enhancing the detection rate of early-stage lung cancer. Although traditional machine learning algorithms have been valuable, they exhibit limitations in handling complex sample data. The recent emergence of deep learning has revolutionized medical image analysis, driving substantial advancements in this field. This review focuses on recent progress in deep learning for pulmonary nodule detection, segmentation, and classification. Traditional machine learning methods, such as SVM and KNN, have shown limitations, paving the way for advanced approaches like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Generative Adversarial Networks (GAN). The integration of ensemble models and novel techniques is also discussed, emphasizing the latest developments in lung cancer diagnosis. Deep learning algorithms, combined with various analytical techniques, have markedly improved the accuracy and efficiency of pulmonary nodule analysis, surpassing traditional methods, particularly in nodule classification. Although challenges remain, continuous technological advancements are expected to further strengthen the role of deep learning in medical diagnostics, especially for early lung cancer detection and diagnosis. A comprehensive list of lung cancer detection models reviewed in this work is available at https://github.com/CaiGuoHui123/Awesome-Lung-Cancer-Detection


[4] 2410.14833

A novel approach towards the classification of Bone Fracture from Musculoskeletal Radiography images using Attention Based Transfer Learning

Computer-aided diagnosis (CAD) is today considered a vital tool in the field of biological image categorization, segmentation, and other related tasks. The current breakthrough in computer vision algorithms and deep learning approaches has substantially enhanced the effectiveness and precision of applications built to recognize and locate regions of interest inside medical images. Among the different disciplines of medical image analysis, bone fracture detection and classification have exhibited exceptional potential. Although numerous imaging modalities are applied in medical diagnostics, X-rays are particularly significant in this sector due to their broad availability, ease of use, and extensive information extraction capabilities. This research studies bone fracture categorization using the FracAtlas dataset, which comprises 4,083 musculoskeletal radiography images. Given the transformational development in transfer learning, particularly its efficacy in medical image processing, we deploy an attention-based transfer learning model to detect bone fractures in X-ray scans. Although the popular InceptionV3 and DenseNet121 deep learning models have been widely used, they still hold potential for critical tasks. In this research, alongside transfer learning, a separate attention mechanism is also applied to boost the capabilities of transfer learning techniques. Through rigorous optimization, our model achieves a state-of-the-art accuracy of more than 90% in fracture classification. This work contributes to the expanding corpus of research focused on the application of transfer learning to medical imaging, notably in the context of X-ray processing, and emphasizes the promise for additional exploration in this domain.


[5] 2410.14877

Coordinated Frequency Regulation in Grid-Forming Storage Network via Safety-Consensus

Inverter-based storage units are poised to play a prominent role in future power grids with massive renewable generation. Grid-forming inverters (GFMs) are emerging as a dominant technology with synchronous generator (SG)-like characteristics through primary control loops. Advanced secondary control schemes, e.g., consensus algorithms, allow GFM-interfaced storage units to participate in frequency regulation and restore nominal frequency following grid disturbances. However, it is imperative to ensure transient frequency excursions do not violate critical safety limits while the grid transitions from the pre- to the post-disturbance operating point. This paper presents a hierarchical safety-enforced consensus method -- combining a device-layer (decentralized) transient safety filter with a secondary-layer (distributed) consensus coordination -- to achieve three distinct objectives: limiting transient frequency excursions to safe limits, minimizing frequency deviations from nominal, and ensuring coordinated power sharing among GFM-storage units. The proposed hierarchical (two-layered) safety-consensus technique is illustrated using a GFM-interfaced storage network on an IEEE 68-bus system under multiple grid transient scenarios.


[6] 2410.14892

Frequency Control and Disturbance Containment Using Grid-Forming Embedded Storage Networks

The paper discusses fast frequency control in bulk power systems using embedded networks of grid-forming energy storage resources. Differing from their traditional role of regulating reserves, the storage resources in this work operate as fast-acting grid assets shaping transient dynamics. The storage resources in the network are autonomously controlled using local measurements for distributed frequency support during disturbance events. Further, the grid-forming inverter systems interfacing with the storage resources are augmented with fast-acting safety controls designed to contain frequency transients within a prescribed tolerance band. The control action, derived from the storage network, improves the frequency nadirs in the system and prevents a disturbance from propagating far from its source. The paper also presents sensitivity studies to evaluate the impacts of storage capacity and inverter controller parameters on the dynamic performance of frequency control and disturbance localization. The performance of the safety-constrained grid-forming control is also compared with the more common grid-following control. The results are illustrated through case studies on an IEEE test system.


[7] 2410.14907

Differential Predictive Control of Residential Building HVACs for Maximizing Renewable Local Consumption and Supporting Fast Voltage Control

High penetration of distributed energy resources in distribution systems, such as rooftop solar PVs, has caused voltage fluctuations that are much faster than typical voltage control devices can react to, leading to increased operation cost and reduced equipment life. Residential buildings consume about 35% of the electricity in the U.S. and are co-located with rooftop solar PV. Thus, they present an opportunity to mitigate these fluctuations locally, while benefiting both the grid and building owners. Previous works on DER-aware localized building energy management mostly focus on commercial buildings and analyze impacts on either the buildings or the grid. To fill these gaps, this paper proposes a distributed, differential predictive control scheme for residential HVAC systems that maximizes local renewable consumption. In addition, a detailed controller-building-grid co-simulation platform is developed and utilized for analyzing the potential impacts of the proposed control scheme on both the buildings and the distribution system. Our studies show that the proposed method can provide benefits to both the buildings' owners and the distribution system by reducing energy draw from the grid by 12%, voltage violations and fast fluctuations by 20%, and the number of tap changes in voltage regulators by 14%.


[8] 2410.14910

AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

Self-supervised learning (SSL) leverages large amounts of unlabelled data to learn rich speech representations, fostering improvements in automatic speech recognition (ASR), even when only a small amount of labelled data is available for fine-tuning. Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose AC-Mix (agnostic contrastive mixup), a new domain adaptation method for low-resource ASR based on contrastive mixup for joint-embedding architectures. In this approach, the SSL model is adapted through additional pre-training using mixed data views created by interpolating samples from the source and the target domains. Our proposed adaptation method consistently outperforms the baseline system, using approximately 11 hours of adaptation data and requiring only 1 hour of adaptation time on a single GPU with WavLM-Large.
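A sketch of the core mixup idea, under the assumption that views are built from raw waveforms cropped to equal length; the Beta concentration parameter is a hypothetical choice.

```python
import torch

def ac_mix_views(src_batch, tgt_batch, alpha=0.5):
    """src_batch, tgt_batch: (B, T) waveforms cropped to equal length.
    Returns mixed views interpolating source- and target-domain samples."""
    lam = torch.distributions.Beta(alpha, alpha).sample((src_batch.size(0), 1))
    return lam * src_batch + (1.0 - lam) * tgt_batch   # (B, T) mixed views
```

The SSL model would then continue pre-training on these mixed views alongside the usual contrastive objective, pulling source- and target-domain representations together.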


[9] 2410.14912

Grid-Forming Control of Modular Dynamic Virtual Power Plants

This article explores a flexible and coordinated control design for an aggregation of heterogeneous distributed energy resources (DERs) in a dynamic virtual power plant (DVPP). The control design aims to provide a desired aggregate grid-forming (GFM) response based on the coordination of power contributions between different DERs. Compared to existing DVPP designs with an AC-coupled AC-output configuration, a more generic modular DVPP design is proposed in this article, which comprises four types of basic DVPP modules, involving AC- or DC-coupling and AC- or DC-output, adequately accommodating diverse DER integration setups, such as AC, DC, AC/DC hybrid microgrids and renewable power plants. The control design is first developed for the four basic modules by the aggregation of DERs and the disaggregation of the control objectives, and then extended to modular DVPPs through a systematic top-down approach. The control performance is comprehensively validated through simulation. The modular DVPP design offers scalable and standardizable advanced grid interfaces (AGIs) for building and operating AC/DC hybrid power grids.


[10] 2410.14936

Optimizing Individualized Incentives from Grid Measurements and Limited Knowledge of Agent Behavior

As electrical generation becomes more distributed and volatile, and loads become more uncertain, controllability of distributed energy resources (DERs), regardless of their ownership status, will be necessary for grid reliability. Grid operators lack direct control over end-users' grid interactions, such as energy usage, but incentives can influence behavior -- for example, an end-user that receives a grid-driven incentive may adjust their consumption or expose relevant control variables in response. A key challenge in studying such incentives is the lack of data about human behavior, which usually motivates strong assumptions, such as distributional assumptions on compliance or rational utility-maximization. In this paper, we propose a general incentive mechanism in the form of a constrained optimization problem -- our approach is distinguished from prior work by modeling human behavior (e.g., reactions to an incentive) as an arbitrary unknown function. We propose feedback-based optimization algorithms to solve this problem that each leverage different amounts of information and/or measurements. We show that each converges to an asymptotically stable incentive with (near)-optimality guarantees given mild assumptions on the problem. Finally, we evaluate our proposed techniques in voltage regulation simulations on standard test beds. We test a variety of settings, including those that break assumptions required for theoretical convergence (e.g., convexity, smoothness) to capture realistic settings. In this evaluation, our proposed algorithms are able to find near-optimal incentives even when the reaction to an incentive is modeled by a theoretically difficult (yet realistic) function.


[11] 2410.14965

Non-Invasive to Invasive: Enhancing FFA Synthesis from CFP with a Benchmark Dataset and a Novel Network

Fundus imaging is a pivotal tool in ophthalmology, and different imaging modalities are characterized by their specific advantages. For example, Fundus Fluorescein Angiography (FFA) uniquely provides detailed insights into retinal vascular dynamics and pathology, surpassing Color Fundus Photographs (CFP) in detecting microvascular abnormalities and perfusion status. However, the conventional invasive FFA involves discomfort and risks due to fluorescein dye injection, and it is meaningful but challenging to synthesize FFA images from non-invasive CFP. Previous studies primarily focused on FFA synthesis in a single disease category. In this work, we explore FFA synthesis in multiple diseases by devising a Diffusion-guided generative adversarial network, which introduces an adaptive and dynamic diffusion forward process into the discriminator and adds a category-aware representation enhancer. Moreover, to facilitate this research, we collect the first multi-disease CFP and FFA paired dataset, named the Multi-disease Paired Ocular Synthesis (MPOS) dataset, with four different fundus diseases. Experimental results show that our FFA synthesis network can generate better FFA images compared to state-of-the-art methods. Furthermore, we introduce a paired-modal diagnostic network to validate the effectiveness of synthetic FFA images in the diagnosis of multiple fundus diseases, and the results show that our synthesized FFA images, combined with the real CFP images, achieve higher diagnostic accuracy than those of the compared FFA synthesis methods. Our research bridges the gap between non-invasive imaging and FFA, thereby offering promising prospects to enhance ophthalmic diagnosis and patient care, with a focus on reducing harm to patients through non-invasive procedures. Our dataset and code will be released to support further research in this field (https://github.com/whq-xxh/FFA-Synthesis).


[12] 2410.14994

Quanta Video Restoration

The proliferation of single-photon image sensors has opened the door to a plethora of high-speed and low-light imaging applications. However, data collected by these sensors are often 1-bit or few-bit, and corrupted by noise and strong motion. Conventional video restoration methods are not designed to handle this situation, while specialized quanta burst algorithms have limited performance when the number of input frames is low. In this paper, we introduce Quanta Video Restoration (QUIVER), an end-to-end trainable network built on the core ideas of classical quanta restoration methods, i.e., pre-filtering, flow estimation, fusion, and refinement. We also collect and publish I2-2000FPS, a high-speed video dataset with the highest temporal resolution of 2000 frames-per-second, for training and testing. On simulated and real data, QUIVER outperforms existing quanta restoration methods by a significant margin. Code and dataset available at https://github.com/chennuriprateek/Quanta_Video_Restoration-QUIVER-


[13] 2410.14996

EDRF: Enhanced Driving Risk Field Based on Multimodal Trajectory Prediction and Its Applications

Driving risk assessment is crucial for both autonomous vehicles and human-driven vehicles. The driving risk can be quantified as the product of the probability that an event (such as a collision) will occur and the consequence of that event. However, the probability of events occurring is often difficult to predict due to the uncertainty of drivers' or vehicles' behavior. Traditional methods generally employ kinematic-based approaches to predict the future trajectories of entities, which often yield unrealistic prediction results. In this paper, the Enhanced Driving Risk Field (EDRF) model is proposed, integrating deep learning-based multimodal trajectory prediction results with Gaussian distribution models to quantitatively capture the uncertainty of traffic entities' behavior. Applications of the EDRF are also proposed: the model is applied across various tasks (traffic risk monitoring, ego-vehicle risk analysis, and motion and trajectory planning) through the newly defined concept of Interaction Risk (IR). Representative example scenarios are provided for each application to illustrate the model's effectiveness.


[14] 2410.15012

Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer

The aggressiveness of prostate cancer, the most common cancer in men worldwide, is primarily assessed based on histopathological data using the Gleason scoring system. While artificial intelligence (AI) has shown promise in accurately predicting Gleason scores, these predictions often lack inherent explainability, potentially leading to distrust in human-machine interactions. To address this issue, we introduce a novel dataset of 1,015 tissue microarray core images, annotated by an international group of 54 pathologists. The annotations provide detailed localized pattern descriptions for Gleason grading in line with international guidelines. Utilizing this dataset, we develop an inherently explainable AI system based on a U-Net architecture that provides predictions leveraging pathologists' terminology. This approach circumvents post-hoc explainability methods while maintaining or exceeding the performance of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 ± 0.003 trained on explanations vs. 0.691 ± 0.010 trained on Gleason patterns). By employing soft labels during training, we capture the intrinsic uncertainty in the data, yielding strong results in Gleason pattern segmentation even in the context of high interobserver variability. With the release of this dataset, we aim to encourage further research into segmentation in medical tasks with high levels of subjectivity and to advance the understanding of pathologists' reasoning processes.
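A minimal sketch of training on soft labels with a Dice-style objective, assuming the target is a per-pixel probability map (e.g., the fraction of annotators assigning a pattern); the shapes and exact loss form are illustrative, not the authors' implementation.

```python
import torch

def soft_dice_loss(pred_probs, soft_target, eps=1e-6):
    """pred_probs: softmax output of the U-Net, (B, C, H, W) in [0, 1].
    soft_target: per-pixel annotator agreement, same shape and range."""
    dims = (0, 2, 3)                                   # reduce over batch and space
    inter = (pred_probs * soft_target).sum(dims)
    denom = pred_probs.sum(dims) + soft_target.sum(dims)
    dice_per_class = (2 * inter + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()                 # average over classes
```

Because the target is continuous, disagreement among pathologists shows up as intermediate values rather than conflicting hard masks, which is what lets the model absorb interobserver variability.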


[15] 2410.15036

EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices

With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms with limited resources, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it ideal for resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections, combining convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity.


[16] 2410.15058

Design and Implementation of Hedge Algebra Controller using Recursive Semantic Values for Cart-pole System

This paper presents a novel approach to designing a Hedge Algebra Controller, named Hedge Algebra Controller with Recursive Semantic Values (RS-HAC). This approach incorporates several newly introduced concepts, including Semantically Quantifying Simplified Mapping (SQSM) featuring a recursive algorithm, Infinite General Semantization (IGS), and Infinite General De-semantization (IGDS). These innovations aim to enhance the optimizability, scalability, and flexibility of hedge algebra theory, allowing the design of a hedge algebra-based controller to be carried out more efficiently and straightforwardly. An application of stabilizing an inverted pendulum on a cart is conducted to illustrate the superiority of the proposed approach. Comparisons are made between RS-HAC and a fuzzy controller of Takagi-Sugeno type (FC), as well as a linear quadratic regulator (LQR). The results indicate that the RS-HAC surpasses the FC by up to 400% in control efficiency and is marginally better than the LQR regarding transient time in balancing an inverted pendulum on a cart.


[17] 2410.15078

Independent Feature Enhanced Crossmodal Fusion for Match-Mismatch Classification of Speech Stimulus and EEG Response

Classifying matched and mismatched speech stimuli against the corresponding EEG responses by exploring their relationship is crucial for auditory attention decoding. However, existing methods often adopt two independent networks to encode the speech stimulus and the EEG response, which neglects the relationship between these signals from the two modalities. In this paper, we propose an independent feature enhanced crossmodal fusion model (IFE-CF) for match-mismatch classification, which leverages the fused features of the speech stimulus and the EEG response to achieve auditory EEG decoding. Specifically, our IFE-CF contains a crossmodal encoder that encodes the speech stimulus and the EEG response with a two-branch structure connected via a crossmodal attention mechanism; a multi-channel fusion module that fuses the features of the two modalities by aggregating the interaction feature obtained from the crossmodal encoder with the independent features obtained from the speech stimulus and the EEG response; and a predictor that gives the matching result. In addition, a causal mask is introduced to account for the time delay of the speech-EEG pair in the crossmodal encoder, further enhancing the feature representation for match-mismatch classification. Experiments demonstrate our method's effectiveness, with better classification accuracy compared with the baseline of the Auditory EEG Decoding Challenge 2023.


[18] 2410.15151

A Comparative Analysis of Nigeria's Power Sector with and without Grid-Scale Storage: Future Implications for Emission and Renewable Energy Integration

This research proposes a framework for modeling and comparing two electricity scenarios for Nigeria by 2050, focusing on the inclusion and exclusion of electricity storage technologies. A Central Composite Design (CCD) was used to generate a design matrix for data collection, with EnergyPLAN software used to create energy system simulations on the CCD data for four outputs: total annual cost, CO2 emissions, critical excess electricity production (CEEP), and electricity import. Three machine learning algorithms, support vector regression (SVR), extreme gradient boosting (XGBoost), and multi-layer perceptron (MLP), were tuned using Bayesian optimization to develop models mapping the inputs to outputs. A genetic algorithm was employed for multi-objective optimization to determine the optimal input capacities that minimize the outputs. Results indicated that incorporating electricity storage technologies (EST) leads to a 37% increase in renewable electricity sources (RES) share, resulting in a 19.14% reduction in CO2 emissions. EST such as battery energy storage systems (BESS), pumped hydro storage (PHS), and vehicle-to-grid (V2G) storage allow for the storage of the critical excess electricity that comes with increasing RES share. Integrating EST in Nigeria's 2050 energy landscape is crucial for incorporating more renewable electricity sources into the energy system, thereby reducing CO2 emissions and managing excess electricity production. This study outlines a plan for optimal electricity production to meet Nigeria's 2050 demand, highlighting the need for a balanced approach that combines fossil fuels, renewable energy, nuclear power, and advanced storage solutions to achieve a sustainable and efficient electricity system.
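The surrogate-plus-optimizer loop can be sketched as follows; synthetic stand-in data replaces the EnergyPLAN runs, a single MLP regressor stands in for the three tuned models, and a toy weighted-sum GA replaces the multi-objective genetic algorithm, so all names and numbers here are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Stand-ins for the CCD design matrix and the four EnergyPLAN outputs
# (total cost, CO2 emissions, CEEP, electricity import) -- synthetic here.
X = rng.uniform(0.0, 1.0, size=(60, 5))
Y = X @ rng.normal(size=(5, 4)) + 0.01 * rng.normal(size=(60, 4))

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000).fit(X, Y)

def fitness(pop, w=np.ones(4)):
    return surrogate.predict(pop) @ w           # weighted sum; lower is better

lo, hi = X.min(0), X.max(0)                     # search within the design ranges
pop = rng.uniform(lo, hi, size=(100, X.shape[1]))
for _ in range(200):                            # selection -> crossover -> mutation
    parents = pop[np.argsort(fitness(pop))[:50]]
    kids = 0.5 * (parents[rng.integers(0, 50, 100)] + parents[rng.integers(0, 50, 100)])
    kids += rng.normal(0.0, 0.02 * (hi - lo), kids.shape)
    pop = np.clip(kids, lo, hi)
best_inputs = pop[np.argmin(fitness(pop))]      # candidate optimal capacities
```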


[19] 2410.15158

Automated Segmentation and Analysis of Cone Photoreceptors in Multimodal Adaptive Optics Imaging

Accurate detection and segmentation of cone cells in the retina are essential for diagnosing and managing retinal diseases. In this study, we used advanced imaging techniques, including confocal and non-confocal split detector images from adaptive optics scanning light ophthalmoscopy (AOSLO), to analyze photoreceptors for improved accuracy. Precise segmentation is crucial for understanding each cone cell's shape, area, and distribution. It helps to estimate the surrounding areas occupied by rods, which allows the calculation of the density of cone photoreceptors in the area of interest. In turn, density is critical for evaluating overall retinal health and functionality. We explored two U-Net-based segmentation models: StarDist for confocal and Cellpose for calculated modalities. Analyzing cone cells in images from two modalities and achieving consistent results demonstrates the study's reliability and potential for clinical application.


[20] 2410.15237

Fusion of Time and Angle Measurements for Digital-Twin-Aided Probabilistic 3D Positioning

Previous studies explained how the 2D positioning problem in indoor non-line-of-sight environments can be addressed using ray tracing with noisy angle of arrival (AoA) measurements. In this work, we generalize these results in two respects. First, we outline how to adapt the proposed methods to address the 3D positioning problem. Second, we introduce efficient algorithms for data fusion, where propagation-time or relative propagation-time measurements (obtained via, e.g., the time difference of arrival) are used in addition to AoA measurements. Simulation results are provided to illustrate the advantages of the approach.
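One way to picture the fusion is a grid-based posterior over candidate 3D positions, scoring each point by Gaussian likelihoods of the AoA and TDOA residuals; the geometry, noise levels, and measurement values below are invented for illustration (the ray tracing that handles non-line-of-sight paths is omitted).

```python
import numpy as np

C = 3e8                                          # propagation speed (m/s)
anchors = np.array([[0, 0, 10.0], [50, 0, 10.0], [0, 50, 10.0]])

def aoa_loglik(p, anchor, az_meas, el_meas, sigma=np.deg2rad(2)):
    d = p - anchor                               # azimuth/elevation residuals
    az, el = np.arctan2(d[1], d[0]), np.arctan2(d[2], np.linalg.norm(d[:2]))
    return -((az - az_meas) ** 2 + (el - el_meas) ** 2) / (2 * sigma**2)

def tdoa_loglik(p, a0, a1, tdoa_meas, sigma=3e-9):
    pred = (np.linalg.norm(p - a1) - np.linalg.norm(p - a0)) / C
    return -((pred - tdoa_meas) ** 2) / (2 * sigma**2)

# Score every voxel of a coarse 3D grid, then normalize into a posterior.
grid = np.stack(np.meshgrid(*[np.linspace(0, 50, 26)] * 3), -1).reshape(-1, 3)
logp = np.array([aoa_loglik(p, anchors[0], 0.6, -0.1)
                 + tdoa_loglik(p, anchors[0], anchors[1], 2e-8) for p in grid])
post = np.exp(logp - logp.max()); post /= post.sum()
estimate = grid[np.argmax(post)]                 # MAP position on the grid
```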


[21] 2410.15244

Extensions on low-complexity DCT approximations for larger blocklengths based on minimal angle similarity

The discrete cosine transform (DCT) is a central tool for image and video coding because it can be related to the Karhunen-Loève transform (KLT), which is the optimal transform in terms of retained transform coefficients and data decorrelation. In this paper, we introduce 16-, 32-, and 64-point low-complexity DCT approximations by individually minimizing the angle between the rows of the exact DCT matrix and the matrix induced by the approximate transforms. According to classical figures of merit, the proposed transforms outperform the DCT approximations already known in the literature. Fast algorithms were also developed for the low-complexity transforms, ensuring a good balance between performance and computational cost. Practical applications in image encoding showed the relevance of the transforms in this context. In fact, the experiments showed that the proposed transforms achieved better results than the known approximations in the literature for blocklengths 16, 32, and 64.
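The figure of merit driving the design, the angle between corresponding rows of the exact DCT matrix and an approximation, can be computed as in the sketch below; the integer matrix T is a crude placeholder, not one of the proposed transforms.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: rows indexed by frequency k."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0] *= 1 / np.sqrt(2)
    return np.sqrt(2 / n) * c

def row_angles(dct, approx):
    """One angle (radians) per row between exact and approximate transforms."""
    u = dct / np.linalg.norm(dct, axis=1, keepdims=True)
    v = approx / np.linalg.norm(approx, axis=1, keepdims=True)
    cos = np.clip((u * v).sum(axis=1), -1.0, 1.0)
    return np.arccos(cos)

T = np.round(3 * dct_matrix(16))        # rough integer stand-in for an approximation
print(row_angles(dct_matrix(16), T))    # smaller angles = closer to the exact DCT
```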


[22] 2410.15260

Multi-class within-day dynamic traffic equilibrium with strategic travel time information

Most research on within-day dynamic traffic equilibrium with information provision implicitly considers travel time information, often assuming information to be perfect or imperfect based on travelers' perception error. However, the lack of an explicit formulation of information limits insightful analysis of its impact on dynamic traffic equilibrium and of the potential benefits of leveraging information provision to improve system-level performance. To address this gap, this paper proposes a within-day dynamic traffic equilibrium model that explicitly formulates strategic information provision as an endogenous element. In the proposed framework, two classes of travelers receive different types of travel time information: one class receives instantaneous travel time reflecting the prevailing traffic conditions, while the other class receives strategic forecasts of travel times, generated by accounting for travelers' reactions to instantaneous information based on strategic thinking from behavioral game theory. The resulting multi-class within-day dynamic equilibrium differs from existing models by explicitly modeling information provision and considering information consistency. The inherent dynamics of real-time updated traffic information, traffic conditions, and travelers' choice behaviors are analytically modeled, with the resulting dynamic equilibrium formulated as a fixed-point problem. The theoretical propositions and numerical findings offer rich insights into the impact of information on the traffic network, strategic forecast information penetration, the relationship between the proposed equilibrium and traditional dynamic traffic equilibria, and information accuracy. This research contributes to a deeper understanding of the interplay between information and traffic dynamics, paving the way for more effective traffic management strategies.


[23] 2410.15335

A Distributed Primal-Dual Method for Constrained Multi-agent Reinforcement Learning with General Parameterization

This paper proposes a novel distributed approach for solving a cooperative Constrained Multi-agent Reinforcement Learning (CMARL) problem, where agents seek to minimize a global objective function subject to shared constraints. Unlike existing methods that rely on centralized training or coordination, our approach enables fully decentralized online learning, with each agent maintaining local estimates of both primal and dual variables. Specifically, we develop a distributed primal-dual algorithm based on actor-critic methods, leveraging local information to estimate Lagrangian multipliers. We establish consensus among the Lagrangian multipliers across agents and prove the convergence of our algorithm to an equilibrium point, analyzing the sub-optimality of this equilibrium compared to the exact solution of the unparameterized problem. Furthermore, we introduce a constrained cooperative Cournot game with stochastic dynamics as a test environment to evaluate the algorithm's performance in complex, real-world scenarios.


[24] 2410.15358

A New Adaptive Balanced Augmented Lagrangian Method with Application to ISAC Beamforming Design

In this paper, we consider a class of convex programming problems with linear equality constraints, which finds broad applications in machine learning and signal processing. We propose a new adaptive balanced augmented Lagrangian (ABAL) method for solving these problems. The proposed ABAL method adaptively selects the stepsize parameter and enjoys a low per-iteration complexity, involving only the computation of a proximal mapping of the objective function and the solution of a linear equation. These features make the proposed method well-suited to large-scale problems. We then tailor the ABAL method to solve the ISAC beamforming design problem, which was formulated as a nonlinear semidefinite program in a previous work. This customized application requires careful exploitation of the problem's special structure, such as the property that all of its signal-to-interference-and-noise-ratio (SINR) constraints hold with equality at the solution, as well as an efficient computation of the proximal mapping of the objective function. Simulation results demonstrate the efficiency of the proposed ABAL method.
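For intuition, a generic augmented Lagrangian iteration for min f(x) s.t. Ax = b is sketched below with a quadratic f so the proximal step has a closed form; the ABAL method's adaptive stepsize selection is not reproduced (rho is fixed here).

```python
import numpy as np

def alm_quadratic(Q, c, A, b, rho=1.0, iters=200):
    """Minimize f(x) = 0.5 x'Qx + c'x subject to Ax = b via an augmented
    Lagrangian loop. The x-step solves (Q + rho A'A) x = -(c + A'lam) + rho A'b,
    playing the role of the proximal/linear-equation step named above."""
    lam = np.zeros(A.shape[0])
    x = np.zeros(Q.shape[0])
    K = Q + rho * A.T @ A
    for _ in range(iters):
        x = np.linalg.solve(K, -(c + A.T @ lam) + rho * A.T @ b)
        lam = lam + rho * (A @ x - b)       # dual ascent on the multipliers
    return x, lam
```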


[25] 2410.15360

Improving 3D Medical Image Segmentation at Boundary Regions using Local Self-attention and Global Volume Mixing

Volumetric medical image segmentation is a fundamental problem in medical image analysis where the objective is to accurately classify a given 3D volumetric medical image with voxel-level precision. In this work, we propose a novel hierarchical encoder-decoder-based framework that strives to explicitly capture the local and global dependencies for volumetric 3D medical image segmentation. The proposed framework exploits local volume-based self-attention to encode the local dependencies at high resolution and introduces a novel volumetric MLP-mixer to capture the global dependencies at low-resolution feature representations. The proposed volumetric MLP-mixer learns better associations among volumetric feature representations. These explicit local and global feature representations contribute to better learning of the shape-boundary characteristics of the organs. Extensive experiments on three different datasets reveal that the proposed method achieves favorable performance compared to state-of-the-art approaches. On the challenging Synapse Multi-organ dataset, the proposed method achieves an absolute 3.82% gain over the state-of-the-art approaches in terms of the HD95 evaluation metric, and a similar improvement pattern is exhibited on the MSD Liver and Pancreas tumor datasets. We also provide a detailed comparison between recent architectural design choices in the 2D computer vision literature by adapting them for the problem of 3D medical image segmentation. Finally, our experiments on the ZebraFish 3D cell membrane dataset, which has limited training data, demonstrate the superior transfer learning capabilities of the proposed vMixer model on the challenging 3D cell instance segmentation task, where accurate boundary prediction plays a vital role in distinguishing individual cell instances.


[26] 2410.15418

A Hybrid Noise Approach to Modelling of Free-Space Satellite Quantum Communication Channel for Continuous-Variable QKD

This paper significantly advances the application of Quantum Key Distribution (QKD) in Free-Space Optics (FSO) satellite-based quantum communication. We propose an innovative satellite quantum channel model and derive the secret quantum key distribution rate achievable through this channel. Unlike existing models that approximate the noise in quantum channels as merely Gaussian distributed, our model incorporates a hybrid noise analysis, accounting for both quantum Poissonian noise and classical additive white Gaussian noise (AWGN). This hybrid approach acknowledges the dual vulnerability of continuous-variable (CV) Gaussian quantum channels to both quantum and classical noise, thereby offering a more realistic assessment of the quantum Secret Key Rate (SKR). This paper delves into the variation of SKR with the Signal-to-Noise Ratio (SNR) under various influencing parameters. We identify and analyze critical factors such as reconciliation efficiency, transmission coefficient, transmission efficiency, the quantum Poissonian noise parameter, and the satellite altitude. These parameters are pivotal in determining the SKR in FSO satellite quantum channels, highlighting the challenges of satellite-based quantum communication. Our work provides a comprehensive framework for understanding and optimizing SKR in satellite-based QKD systems, paving the way for more efficient and secure quantum communication networks.


[27] 2410.15437

AttCDCNet: Attention-enhanced Chest Disease Classification using X-Ray Images

Chest X-rays have been proven to be effective for the diagnosis of chest diseases, including Pneumonia, Lung Opacity, and COVID-19. However, relying on traditional medical methods for diagnosis from X-ray images is prone to delays and inaccuracies because the medical personnel who evaluate the X-ray images may have preconceived biases. For this reason, researchers have proposed the use of deep learning-based techniques to facilitate the diagnosis process. The preeminent method is the use of sophisticated Convolutional Neural Networks (CNNs). In this paper, we propose a novel detection model named AttCDCNet for the task of X-ray image diagnosis, enhancing the popular DenseNet121 model by adding an attention block to help the model focus on the most relevant regions, using focal loss as a loss function to overcome the dataset imbalance problem, and utilizing depth-wise convolution to reduce the number of parameters and make the model lighter. Through extensive experimental evaluations, the proposed model demonstrates exceptional performance, showing better results than the original DenseNet121. The proposed model achieved an accuracy, precision and recall of 94.94%, 95.14% and 94.53%, respectively, on the COVID-19 Radiography Dataset.


[28] 2410.15441

A Global Coordinate-Free Approach to Invariant Contraction on Homogeneous Manifolds

In this work, we provide a global condition for contraction with respect to an invariant Riemannian metric on reductive homogeneous spaces. Using left-invariant frames, vector fields on the manifold are horizontally lifted to the ambient Lie group, where the Levi-Civita connection is globally characterized as a real matrix multiplication. By linearizing in these left-invariant frames, we characterize contraction using matrix measures on real square matrices, avoiding the use of local charts. Applying this global condition, we provide a necessary condition for a prescribed subset of the manifold to possibly admit a contracting system with respect to an invariant metric. Applied to the sphere, this condition implies that no closed hemisphere can be contained in a contraction region. Finally, we apply our results to compute reachable sets for an attitude control problem.


[29] 2410.15614

Topology-Aware Exploration of Circle of Willis for CTA and MRA: Segmentation, Detection, and Classification

The Circle of Willis (CoW) vessels are critical to connecting the major circulations of the brain. The topology of the vascular structure is of clinical significance for evaluating the risk and severity of neuro-vascular diseases. The CoW has two representative angiographic imaging modalities, computed tomography angiography (CTA) and magnetic resonance angiography (MRA). The TopCow24 challenge provided a dataset of 125 paired CTA-MRA cases for the analysis of the CoW. To explore both CTA and MRA images in a unified framework and learn the inherent topology of the CoW, we construct a universal dataset via independent intensity preprocessing, followed by joint resampling and normalization. Then, we utilize a topology-aware loss to enhance the topological completeness of the CoW and the discrimination between different classes. A complementary topology-aware refinement is further conducted to enhance the connectivity within the same class. Our method was evaluated on all three tasks and both modalities, achieving competitive results. In the final test phase of the TopCow24 Challenge, we achieved second place in the CTA-Seg-Task, third place in the CTA-Box-Task, first place in the CTA-Edg-Task, second place in the MRA-Seg-Task, third place in the MRA-Box-Task, and second place in the MRA-Edg-Task.


[30] 2410.15628

Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling

Given coarser-resolution projections from global climate models or satellite data, the downscaling problem aims to estimate finer-resolution regional climate data, capturing fine-scale spatial patterns and variability. Downscaling is any method to derive high-resolution data from low-resolution variables, often to provide more detailed and local predictions and analyses. This problem is societally crucial for effective adaptation, mitigation, and resilience against significant risks from climate change. The challenge arises from spatial heterogeneity and the need to recover finer-scale features while ensuring model generalization. Most downscaling methods [Li2020] fail to capture the spatial dependencies at finer scales and underperform on real-world climate datasets, such as sea-level rise. We propose a novel Kriging-informed Conditional Diffusion Probabilistic Model (Ki-CDPM) to capture spatial variability while preserving fine-scale features. Experimental results on climate data show that our proposed method is more accurate than state-of-the-art downscaling techniques.
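The Kriging component can be illustrated with a plain Gaussian-process interpolation from coarse to fine grid points, producing the kind of field that could then condition the diffusion model; the RBF kernel and its hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rbf(a, b, ell=5.0):
    """Squared-exponential covariance between two point sets (m, 2) and (n, 2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def krige(coarse_xy, coarse_z, fine_xy, noise=1e-4):
    """Interpolate coarse-grid values onto the fine grid (simple Kriging /
    zero-mean GP regression posterior mean)."""
    K = rbf(coarse_xy, coarse_xy) + noise * np.eye(len(coarse_xy))
    w = np.linalg.solve(K, coarse_z)             # regression weights
    return rbf(fine_xy, coarse_xy) @ w           # posterior mean on fine grid
```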


[31] 2410.15646

Low-Complexity Minimum BER Precoder Design for ISAC Systems: A Delay-Doppler Perspective

Orthogonal time frequency space (OTFS) modulation is anticipated to be a promising candidate for supporting integrated sensing and communications (ISAC) systems, which is considered as a pivotal technique for realizing next generation wireless networks. In this paper, we develop a minimum bit error rate (BER) precoder design for an OTFS-based ISAC system. In particular, the BER minimization problem takes into account the maximum available transmission power budget and the required sensing performance. Different from prior studies that considered ISAC in the time-frequency (TF) domain, we devise the precoder from the perspective of the delay-Doppler (DD) domain by exploiting the equivalent DD domain channel due to the fact that the DD domain channel generally tends to be sparse and quasi-static, which can facilitate a low-overhead ISAC system design. To address the non-convex optimization design problem, we resort to optimizing the lower bound of the derived average BER by adopting Jensen's inequality. Subsequently, the formulated problem is decoupled into two independent sub-problems via singular value decomposition (SVD) methodology. We then theoretically analyze the feasibility conditions of the proposed problem and present a low-complexity iterative solution via leveraging the Lagrangian duality approach. Simulation results verify the effectiveness of our proposed precoder compared to the benchmark schemes and reveal the interplay between sensing and communication for dual-functional precoder design, indicating a trade-off where transmission efficiency is sacrificed for increasing transmission reliability and sensing accuracy.


[32] 2410.15660

SPARC: Prediction-Based Safe Control for Coupled Controllable and Uncontrollable Agents with Conformal Predictions

We investigate the problem of safe control synthesis for systems operating in environments with uncontrollable agents whose dynamics are unknown but coupled with those of the controlled system. This scenario naturally arises in various applications, such as autonomous driving and human-robot collaboration, where the behavior of uncontrollable agents, like pedestrians, cannot be directly controlled but is influenced by the actions of the autonomous vehicle or robot. In this paper, we present SPARC (Safe Prediction-Based Robust Controller for Coupled Agents), a novel framework designed to ensure safe control in the presence of coupled uncontrollable agents. SPARC leverages conformal prediction to quantify uncertainty in data-driven prediction of agent behavior. Particularly, we introduce a joint distribution-based approach to account for the coupled dynamics of the controlled system and uncontrollable agents. By integrating the control barrier function (CBF) technique, SPARC provides provable safety guarantees at a high confidence level. We illustrate our framework with a case study involving an autonomous driving scenario with walking pedestrians.
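The conformal ingredient can be sketched in a few lines: calibration errors yield a finite-sample radius covering future prediction errors with probability 1 - alpha, and that radius inflates the clearance in a toy distance-based CBF check (the paper's joint-distribution treatment and full controller synthesis are not reproduced here).

```python
import numpy as np

def conformal_radius(calib_errors, alpha=0.05):
    """calib_errors: per-sample prediction-error norms on held-out data.
    Returns a radius covering a fresh error with prob. >= 1 - alpha."""
    n = len(calib_errors)
    q = np.ceil((n + 1) * (1 - alpha)) / n        # finite-sample correction
    return np.quantile(calib_errors, min(q, 1.0))

def cbf_is_safe(robot_pos, predicted_agent_pos, radius, d_safe=1.0):
    # Inflate the required clearance by the conformal prediction radius.
    h = np.linalg.norm(robot_pos - predicted_agent_pos) - (d_safe + radius)
    return h >= 0.0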


[33] 2410.15670

Transforming Blood Cell Detection and Classification with Advanced Deep Learning Models: A Comparative Study

Efficient detection and classification of blood cells are vital for accurate diagnosis and effective treatment of blood disorders. This study utilizes a YOLOv10 model trained on Roboflow data with images resized to 640x640 pixels across varying epochs. The results show that increased training epochs significantly enhance accuracy, precision, and recall, particularly in real-time blood cell detection and classification. The YOLOv10 model outperforms MobileNetV2, ShuffleNetV2, and DarkNet in real-time performance, though MobileNetV2 and ShuffleNetV2 are more computationally efficient, and DarkNet excels in feature extraction for blood cell classification. This research highlights the potential of integrating deep learning models like YOLOv10, MobileNetV2, ShuffleNetV2, and DarkNet into clinical workflows, promising improvements in diagnostic accuracy and efficiency. Additionally, a new, well-annotated blood cell dataset was created and will be open-sourced to support further advancements in automatic blood cell detection and classification. The findings demonstrate the transformative impact of these models in revolutionizing medical diagnostics and enhancing blood disorder management.


[34] 2410.15764

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both a low bitrate and speaker decoupling ability. LSCodec adopts a three-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. By reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. The 25 Hz version of LSCodec also achieves the lowest bitrate (0.25 kbps) of codecs so far with decent quality. Voice conversion evaluations prove the satisfactory speaker disentanglement of LSCodec, and an ablation study further verifies the effectiveness of the proposed training framework.


[35] 2410.15803

A Block Quantum Genetic Interference Mitigation Algorithm for Dynamic Metasurface Antennas and Field Trials

This paper proposes a quantum algorithm for Dynamic Metasurface Antenna (DMA) beamforming to suppress interference for an amplify-and-forward relay system in multi-base-station environments. This algorithm introduces an efficient dynamic block initialization and overarching block update strategy, which can enhance the Signal-to-Interference-plus-Noise Ratio (SINR) of the target base station (BS) signal without any channel information. Furthermore, we built a relay system with a DMA as the receiving antenna and conducted outdoor 5G BS interference suppression tests. To the best of our knowledge, this is the first paper to experiment with DMA in commercial 5G networks. The field trial results indicate an SINR improvement of over 10 dB for the signal of the desired BS.


[36] 2410.15812

FusionLungNet: Multi-scale Fusion Convolution with Refinement Network for Lung CT Image Segmentation

Early detection of lung cancer is crucial as it increases the chances of successful treatment. Automatic lung image segmentation assists doctors in identifying diseases such as lung cancer, COVID-19, and respiratory disorders. However, lung segmentation is challenging due to overlapping features like vascular and bronchial structures, along with pixel-level fusion of brightness, color, and texture. New lung segmentation methods face difficulties in identifying long-range relationships between image components, rely on convolution operations that may not capture all critical features, and must contend with the complex structures of the lungs. Furthermore, semantic gaps between feature maps can hinder the integration of relevant information, reducing model accuracy. Skip connections can also limit the decoder's access to complete information, resulting in partial information loss during encoding. To overcome these challenges, we propose a hybrid approach using the FusionLungNet network, which has a multi-level structure with key components, including the ResNet-50 encoder, Channel-wise Aggregation Attention (CAA) module, Multi-scale Feature Fusion (MFF) block, self-refinement (SR) module, and multiple decoders. The refinement sub-network uses convolutional neural networks for image post-processing to improve quality. Our method employs a combination of loss functions, including SSIM, IOU, and focal loss, to optimize image reconstruction quality. We created and publicly released a new dataset for lung segmentation called LungSegDB, including 1800 CT images from the LIDC-IDRI dataset (dataset version 1) and 700 images from the Chest CT Cancer Images from Kaggle dataset (dataset version 2). Our method achieved an IOU score of 98.04, outperforming existing methods and demonstrating significant improvements in segmentation accuracy. https://github.com/sadjadrz/FusionLungNet


[37] 2410.15832

Nonlinear Bayesian Filtering with Natural Gradient Gaussian Approximation

Practical Bayes filters often assume the state distribution of each time step to be Gaussian for computational tractability, resulting in the so-called Gaussian filters. When facing nonlinear systems, Gaussian filters such as the extended Kalman filter (EKF) or unscented Kalman filter (UKF) typically rely on certain linearization techniques, which can introduce large estimation errors. To address this issue, this paper reconstructs the prediction and update steps of Gaussian filtering as solutions to two distinct optimization problems, whose optimal conditions are found to have analytical forms from Stein's lemma. It is observed that the stationary point for the prediction step requires calculating the first two moments of the prior distribution, which is equivalent to that step in existing moment-matching filters. In the update step, instead of linearizing the model to approximate the stationary points, we propose an iterative approach to directly minimize the update step's objective to avoid linearization errors. For the purpose of performing the steepest descent on the Gaussian manifold, we derive its natural gradient, which leverages the Fisher information matrix to adjust the gradient direction, accounting for the curvature of the parameter space. Combining this update step with moment matching in the prediction step, we introduce a new iterative filter for nonlinear systems called the Natural Gradient Gaussian Approximation filter, or NANO filter for short. We prove that the NANO filter locally converges to the optimal Gaussian approximation at each time step. The estimation error is proven to be exponentially bounded for nearly linear measurement equations and low noise levels by constructing a supermartingale-like inequality across consecutive time steps.
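As a rough illustration of a natural-gradient Gaussian update (not the paper's exact NANO recursion), the sketch below fits a scalar Gaussian posterior for a measurement y = sin(x) + noise by Monte Carlo gradient estimates preconditioned with the Gaussian Fisher matrix; all model choices here are invented for the example.

```python
import numpy as np

def ng_update(m0, P0, y, R=0.1, steps=50, lr=0.1, n_mc=256, seed=0):
    """Variational Gaussian update for prior N(m0, P0) and likelihood
    y = sin(x) + N(0, R), via natural-gradient descent on the free energy."""
    rng = np.random.default_rng(seed)
    mu, var = m0, P0                                  # start from the prior
    dldx = lambda x: -np.cos(x) * (y - np.sin(x)) / R + (x - m0) / P0
    for _ in range(steps):
        eps = rng.standard_normal(n_mc)
        x = mu + np.sqrt(var) * eps                   # reparameterized samples
        g_mu = dldx(x).mean()                         # dJ/dmu
        g_var = (dldx(x) * eps).mean() / (2 * np.sqrt(var)) - 0.5 / var
        mu -= lr * var * g_mu                         # precondition by F^{-1}:
        var -= lr * 2 * var**2 * g_var                # F = diag(1/var, 1/(2 var^2))
        var = max(var, 1e-8)
    return mu, var

print(ng_update(m0=0.3, P0=1.0, y=0.8))
```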


[38] 2410.15836

Simultaneous Communications and Sensing with Hybrid Reconfigurable Intelligent Surfaces

Hybrid Reconfigurable Intelligent Surfaces (HRISs) constitute a new paradigm of truly smart metasurfaces with the additional features of signal reception and processing, which have been primarily considered for channel estimation and self-reconfiguration. In this paper, leveraging the simultaneous tunable reflection and signal absorption functionality of HRIS elements, we present a novel framework for the joint design of transmit beamforming and the HRIS parameters with the goal to maximize downlink communications, while simultaneously illuminating an area of interest for guaranteed localization coverage performance. Our simulation results verify the effectiveness of the proposed scheme and showcase the interplay of the various system parameters on the achievable Integrated Sensing and Communications (ISAC) performance.


[39] 2410.15843

A New Method For Flushing of Subsea Production Systems Prior to Decommissioning or Component Disconnection

This paper outlines a novel subsea flushing system that uses a subsea tool to improve the performance of the flushing operation. The new method outlined in this paper uses a small-diameter, high-pressure supply line and a subsea-deployed tool containing a pump which recirculates the cleaning fluid through the component or system to be retrieved. The main benefit of this method compared with conventional practices is that it achieves higher fluid speeds inside the subsea equipment being flushed while injecting smaller flow rates from the surface vessel. The high fluid speeds are achieved with the recirculation pump. The higher fluid speeds ensure efficient sweeping of hydrocarbons from complex paths. A reduced flow rate from the surface vessel also allows a small-diameter, high-pressure supply line to be used, which allows for reduced weight and storage. The study is a numerical simulation of the method applied to a subsea jumper geometry. The injection flow rates required to achieve an efficient flushing were determined from previous experimental work. Calculations were made to estimate the pressure and power requirements for performing the flushing operation as well as the design requirements for the supply line concerning dimensions, material properties and the storage space needed on the support vessel. The performance of the proposed novel system was compared to that of conventional flushing systems. As environmental concerns increase, the presented method has the potential to make the flushing process more efficient while reducing costs associated with support vessels and the materials needed. The novel system may also be deployed using a low-cost Inspection Maintenance and Repair (IMR) vessel. The subsea tool is connected to the subsea production system, either through dedicated connection ports or using pipe clamp connectors with pipe wall penetrators.


[40] 2410.15851

R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate

The COVID-19 pandemic has underscored the need for low-cost, scalable approaches to measuring contactless vital signs, either during initial triage at a healthcare facility or virtual telemedicine visits. Remote photoplethysmography (rPPG) can accurately estimate heart rate (HR) when applied to close-up videos of healthy volunteers in well-lit laboratory settings. However, results from such highly optimized laboratory studies may not be readily translated to healthcare settings. One significant barrier to the practical application of rPPG in health care is the accurate localization of the region of interest (ROI). Clinical or telemedicine visits may involve sub-optimal lighting, movement artifacts, variable camera angle, and subject distance. This paper presents an rPPG ROI selection method based on 3D facial landmarks and patient head yaw angle. We then demonstrate the robustness of this ROI selection method when coupled to the Plane-Orthogonal-to-Skin (POS) rPPG method when applied to videos of patients presenting to an Emergency Department for respiratory complaints. Our results demonstrate the effectiveness of our proposed approach in improving the accuracy and robustness of rPPG in a challenging clinical environment.
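For reference, the POS step that the selected ROI feeds into can be sketched on the per-frame mean RGB trace, following the standard POS formulation (Wang et al., 2017); the window length and the Welch-based heart-rate readout are conventional choices, not specifics of this paper.

```python
import numpy as np
from scipy.signal import welch

P = np.array([[0, 1, -1], [-2, 1, 1]], dtype=float)    # POS projection plane

def pos_pulse(rgb, fs, win_sec=1.6):
    """rgb: (n_frames, 3) mean R, G, B of the ROI per frame; fs: frame rate."""
    n, L = rgb.shape[0], int(win_sec * fs)
    h = np.zeros(n)
    for t in range(n - L + 1):                         # sliding, overlap-added windows
        c = rgb[t:t + L].T                             # (3, L) window
        cn = c / c.mean(axis=1, keepdims=True)         # temporal normalization
        s = P @ cn
        p = s[0] + (s[0].std() / (s[1].std() + 1e-9)) * s[1]
        h[t:t + L] += p - p.mean()
    return h

def heart_rate_bpm(pulse, fs):
    f, pxx = welch(pulse, fs, nperseg=min(len(pulse), 512))
    band = (f >= 0.7) & (f <= 4.0)                     # 42-240 bpm plausibility band
    return 60.0 * f[band][np.argmax(pxx[band])]
```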


[41] 2410.15873

Variable Rate Learned Wavelet Video Coding with Temporal Layer Adaptivity

Learned wavelet video coders provide an explainable framework by performing discrete wavelet transforms in temporal, horizontal, and vertical dimensions. With a temporal transform based on motion-compensated temporal filtering (MCTF), spatial and temporal scalability is obtained. In this paper, we introduce variable rate support and a mechanism for quality adaptation to different temporal layers for a higher coding efficiency. Moreover, we propose a multi-stage training strategy that allows training with multiple temporal layers. Our experiments demonstrate Bjøntegaard Delta bitrate savings of at least -17% compared to a learned MCTF model without these extensions. Our method also outperforms other learned video coders like DCVC-DC. Training and inference code is available at: https://github.com/FAU-LMS/Learned-pMCTF.


[42] 2410.15899

On the Design and Performance of Machine Learning Based Error Correcting Decoders

This paper analyzes the design and competitiveness of four neural network (NN) architectures recently proposed as decoders for forward error correction (FEC) codes. We first consider the so-called single-label neural network (SLNN) and the multi-label neural network (MLNN) decoders which have been reported to achieve near maximum likelihood (ML) performance. Here, we show analytically that SLNN and MLNN decoders can always achieve ML performance, regardless of the code dimensions -- although at the cost of computational complexity -- and no training is in fact required. We then turn our attention to two transformer-based decoders: the error correction code transformer (ECCT) and the cross-attention message passing transformer (CrossMPT). We compare their performance against traditional decoders, and show that ordered statistics decoding outperforms these transformer-based decoders. The results in this paper cast serious doubts on the application of NN-based FEC decoders in the short and medium block length regime.
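
The paper's analytical point, that ML performance requires no training, is easiest to see in the brute-force form of ML decoding: for a BPSK-modulated code over AWGN, the ML codeword is simply the nearest modulated codeword in Euclidean distance. A toy (7,4) Hamming-code sketch (illustrative, not the paper's setup):

```python
import numpy as np
from itertools import product

# Generator matrix of a (7,4) Hamming code (toy example)
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

codebook = np.array([np.array(m) @ G % 2 for m in product([0, 1], repeat=4)])
bpsk = 1.0 - 2.0 * codebook          # bit 0 -> +1, bit 1 -> -1

def ml_decode(y):
    # Over AWGN, ML decoding = minimum Euclidean distance to a codeword
    return codebook[np.argmin(((y - bpsk) ** 2).sum(axis=1))]
```

The exhaustive search scales as 2^k in the code dimension, which is exactly the computational-complexity cost the paper attributes to SLNN/MLNN-style decoders.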


[43] 2410.15901

Harnessing single polarization doppler weather radars for tracking Desert Locust Swarms

Desert locusts are notorious agricultural pests responsible for billions of dollars in losses and global food-scarcity concerns. With billions of these locusts invading agrarian lands, this is no longer a threat of the past. This study taps into existing Doppler weather radar (DWR) infrastructure, originally deployed for meteorological applications, and demonstrates a systematic approach to distinctly identify and track concentrations of desert locust swarms in near real time using single-polarization radars. Findings reveal the potential to establish early warning systems with lead times of around 7 hours and spatial coverage of approximately 100 kilometers. Embracing these technological advancements is crucial to safeguarding agricultural landscapes and upholding global food security.


[44] 2410.15947

AI-Driven Approaches for Glaucoma Detection -- A Comprehensive Review

The diagnosis of glaucoma plays a critical role in the management and treatment of this vision-threatening disease. Glaucoma is a group of eye diseases that cause blindness by damaging the optic nerve at the back of the eye. Often called the "silent thief of sight", it exhibits no symptoms during the early stages. Therefore, early detection is crucial to prevent vision loss. With the rise of Artificial Intelligence (AI), particularly Deep Learning (DL) techniques, Computer-Aided Diagnosis (CADx) systems have emerged as promising tools to assist clinicians in accurately diagnosing glaucoma early. This paper aims to provide a comprehensive overview of AI techniques utilized in CADx systems for glaucoma diagnosis. Through a detailed analysis of the current literature, we identify key gaps and challenges in these systems, emphasizing the need for improved safety, reliability, interpretability, and explainability. By identifying research gaps, we aim to advance the field of CADx systems, especially for the early diagnosis of glaucoma, in order to prevent potential loss of vision.


[45] 2410.15984

Lossless optimal transient control for rigid bodies in 3D space

In this letter, we propose a control scheme for rigid bodies designed to optimize transient behaviors. The search space for the optimal control input is parameterized to yield a passive, specifically lossless, nonlinear feedback controller. As a result, it can be combined with other stabilizing controllers without compromising the stability of the closed-loop system. The controller commands torques generating fictitious gyroscopic effects characteristic of 3D rotational rigid-body motions, and as such it neither injects nor extracts kinetic energy from the system. We validate the controller in simulation using a model predictive control (MPC) scheme, successfully combining stability and performance in a stabilization task with obstacle avoidance constraints.
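
The losslessness claim rests on a basic identity: a gyroscopic-type torque of the form tau = omega x v is always orthogonal to the angular velocity omega, so it transfers no kinetic energy. A quick numerical check of this identity (the matrix B, standing in for an arbitrary controller parameterization, is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    omega = rng.normal(size=3)            # angular velocity
    B = rng.normal(size=(3, 3))           # arbitrary parameter matrix
    tau = np.cross(omega, B @ omega)      # gyroscopic-type torque
    # Mechanical power omega . tau vanishes identically (up to round-off):
    print(f"omega . tau = {omega @ tau:+.2e}")
```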


[46] 2410.15995

Sum-Rate Maximization of RIS-Aided Digital and Holographic Beamformers in MU-MISO Systems

Reconfigurable holographic surfaces (RHS) are intrinsically amalgamated with reconfigurable intelligent surfaces (RIS), for beneficially ameliorating the signal propagation environment. This potent architecture significantly improves the system performance in non-line-of-sight scenarios at a low power consumption. Briefly, the RHS technology integrates ultra-thin, lightweight antennas onto the transceiver, for creating sharp, high-gain directional beams. We formulate a user sum-rate maximization problem for our RHS-RIS-based hybrid beamformer. Explicitly, we jointly design the digital, holographic, and passive beamformers for maximizing the sum-rate of all user equipment (UE). To tackle the resultant nonconvex optimization problem, we propose an alternating maximization (AM) framework for decoupling and iteratively solving the subproblems involved. Specifically, we employ the zero-forcing criterion for the digital beamformer, leverage fractional programming to determine the radiation amplitudes of the RHS and utilize the Riemannian conjugate gradient algorithm for optimizing the RIS phase shift matrix of the passive beamformer. Our simulation results demonstrate that the proposed RHS-RIS-based hybrid beamformer outperforms its conventional counterpart operating without an RIS in multi-UE scenarios. The sum-rate improvement attained ranges from 8 bps/Hz to 13 bps/Hz for various transmit powers at the base station (BS) and at the UEs, which is significant.
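
Of the three alternating-maximization subproblems, the digital stage is the most standard: the zero-forcing criterion inverts the effective multi-user channel so that inter-user interference is nulled. A minimal sketch, assuming `H` is the (users x TX chains) effective channel seen after the holographic and RIS stages and `P` a total power budget (both names are illustrative):

```python
import numpy as np

def zf_beamformer(H, P=1.0):
    """Zero-forcing precoder: H @ W is a scaled identity (no interference)."""
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)      # right pseudo-inverse
    W *= np.sqrt(P / np.trace(W @ W.conj().T).real)     # meet power budget
    return W
```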


[47] 2410.16008

Resilient Temporal GCN for Smart Grid State Estimation Under Topology Inaccuracies

State estimation is a crucial task in power systems. Graph Neural Networks have demonstrated significant potential for state estimation in power systems by effectively analyzing measurement data and capturing the complex interactions and interrelations among the measurements through the system's graph structure. However, information about the system's graph structure may be inaccurate due to noise, attacks, or a lack of accurate information about the system's topology. This paper studies these scenarios and evaluates the impact of topology uncertainties on the performance of a Temporal Graph Convolutional Network (TGCN) for state estimation in power systems. To make the model resilient to topology uncertainties, modifications to the TGCN model are proposed that incorporate a knowledge graph generated from the measurement data. This knowledge graph supports the assumed uncertain system graph. Two variations of the TGCN architecture are introduced to integrate the knowledge graph, and their performance is evaluated and compared to demonstrate improved resilience against topology uncertainties. The evaluation results indicate that while the two proposed architectures perform differently, both improve the performance of TGCN state estimation under topology uncertainties.


[48] 2410.16014

Differential Evolution-Based End-Fire Realized Gain Optimization of Active and Parasitic Arrays

We propose a novel approach for boosting the realized gain in enhanced directivity arrays with both active and parasitic dipoles as radiating elements. The optimization process involves two main objectives: maximizing the end-fire gain and minimizing the reflection coefficient to ensure high realized gain. In the first step, the current excitation vector of the fully driven array is selected to maximize the end-fire gain. Then, all but one of the dipoles are reactively loaded according to their input impedance. Following that, the optimization focuses on the inter-element distance, computing the one that offers a favorable balance between the gain and the total efficiency. This multi-objective optimization leverages the differential evolution (DE) algorithm and utilizes a simple wire dipole as the unit element. Full-wave simulations further confirm the accuracy of our theoretical results. Our two- and three-element parasitic arrays achieve realized gain comparable to state-of-the-art designs, without relying on intricate unit elements or resource-intensive simulations. Moreover, our four- and five-element parasitic arrays deliver the highest realized gain values reported in the literature. The simplicity of our approach is validated by significant time savings, with theoretical models completing optimizations much faster than full-wave simulations. Additionally, a sensitivity analysis confirms the robustness of the proposed optimization algorithm, demonstrating that the optimized design parameters remain effective even under small deviations in loads and element positions. Finally, the proposed parasitic arrays are well-suited for base station antennas due to their compact design, reduced power consumption, and simplified hardware requirements, making them ideal for modern communication systems.
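
Differential evolution itself is a compact population-based optimizer; the paper applies it to the gain/efficiency trade-off, but the core DE/rand/1/bin loop is generic. The sketch below uses an illustrative objective and hyperparameters, not the paper's antenna model:

```python
import numpy as np

def de_optimize(f, bounds, pop=20, gens=100, F=0.8, CR=0.9, seed=0):
    """DE/rand/1/bin minimization of f over box constraints `bounds` (d x 2)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    d = len(lo)
    X = rng.uniform(lo, hi, size=(pop, d))
    fX = np.array([f(x) for x in X])
    for _ in range(gens):
        for i in range(pop):
            idx = rng.choice([j for j in range(pop) if j != i], 3, replace=False)
            a, b, c = X[idx]
            mutant = np.clip(a + F * (b - c), lo, hi)   # differential mutation
            mask = rng.random(d) < CR
            mask[rng.integers(d)] = True                # keep >= 1 mutant gene
            trial = np.where(mask, mutant, X[i])        # binomial crossover
            if (ft := f(trial)) < fX[i]:                # greedy selection
                X[i], fX[i] = trial, ft
    return X[fX.argmin()], fX.min()

# Example: minimize a toy 2D objective over [-5, 5]^2
x_best, f_best = de_optimize(lambda x: (x ** 2).sum(), np.array([[-5, 5]] * 2))
```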


[49] 2410.16048

Continuous Speech Synthesis using per-token Latent Diffusion

The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.


[50] 2410.16059

Multi-Level Speaker Representation for Target Speaker Extraction

Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained on a large number of speakers, may suffer from speaker-identity confusion. In this work, we propose a multi-level speaker representation approach, spanning raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrates frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2mix test set over the baseline.


[51] 2410.16125

Blind Equalization using a Variational Autoencoder with Second Order Volterra Channel Model

Existing communication hardware is being pushed to its limits to accommodate ever-increasing global internet usage. This leads to non-linear distortion in the communication link that requires non-linear equalization techniques to operate the link at a reasonable bit error rate. This paper addresses the challenge of blind non-linear equalization using a variational autoencoder (VAE) with a second-order Volterra channel model. The VAE framework's cost function, the evidence lower bound (ELBO), is derived for real-valued constellations and can be evaluated analytically without resorting to sampling techniques. We demonstrate the effectiveness of our approach through simulations on a synthetic Wiener-Hammerstein channel and a simulated intensity modulated direct detection (IM/DD) optical link. The results show significant improvements in equalization performance compared to a VAE with linear channel assumptions, highlighting the importance of appropriate channel modeling in unsupervised VAE equalizer frameworks.
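
A second-order Volterra model augments the usual linear FIR channel with a quadratic kernel over pairs of delayed inputs, y[n] = sum_i h1[i] x[n-i] + sum_{i,j} h2[i,j] x[n-i] x[n-j]. A minimal forward model, with memory length and kernels as illustrative placeholders:

```python
import numpy as np

def volterra2(x, h1, h2):
    """Second-order Volterra channel: h1 is (M,), h2 is (M, M)."""
    M = len(h1)
    xp = np.concatenate([np.zeros(M - 1), x])     # zero-padded input history
    y = np.zeros(len(x))
    for n in range(len(x)):
        w = xp[n:n + M][::-1]                     # [x[n], x[n-1], ..., x[n-M+1]]
        y[n] = h1 @ w + w @ h2 @ w                # linear + quadratic terms
    return y
```

In the paper's setting, a channel model of this form plays the role of the VAE decoder's channel assumption, with its kernels learned jointly with the equalizer by maximizing the analytically evaluated ELBO.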


[52] 2410.16130

Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a multi-turn chain-of-thought approach, which demonstrates significantly improved model performance across the proposed tasks.


[53] 2410.16143

An Explainable Contrastive-based Dilated Convolutional Network with Transformer for Pediatric Pneumonia Detection

Pediatric pneumonia remains a significant global threat, posing a larger mortality risk than any other communicable disease. According to UNICEF, it is a leading cause of mortality in children under five and requires prompt diagnosis. Early diagnosis using chest radiographs is the prevalent standard, but limitations include low radiation levels in unprocessed images and data imbalance issues. This necessitates the development of efficient, computer-aided diagnosis techniques. To this end, we propose a novel EXplainable Contrastive-based Dilated Convolutional Network with Transformer (XCCNet) for pediatric pneumonia detection. XCCNet harnesses the spatial power of dilated convolutions and the global insights from contrastive-based transformers for effective feature refinement. A robust chest X-ray processing module tackles low-intensity radiographs, while adversarial-based data augmentation mitigates the skewed distribution of chest X-rays in the dataset. Furthermore, we actively integrate an explainability approach through feature visualization, directly aligning it with the attention region that pinpoints the presence of pneumonia or normality in radiographs. The efficacy of XCCNet is comprehensively assessed on four publicly available datasets. Extensive performance evaluation demonstrates the superiority of XCCNet compared to state-of-the-art methods.


[54] 2410.16173

Fast Physics-Informed Model Predictive Control Approximation for Lyapunov Stability

At the forefront of control techniques is Model Predictive Control (MPC). While MPC is effective, the need to recompute an optimal control for each new state leads to sparse responses to the system and may make implementation infeasible in small systems with low computational resources. To address these limitations in stability control, this research presents a small, deterministic Physics-Informed MPC Surrogate model (PI-MPCS). PI-MPCS was developed to approximate the control of an MPC while encouraging stability and robustness through the integration of the system dynamics and the formation of a Lyapunov stability profile. Empirical results are presented on the task of 2D quadcopter landing. They demonstrate a rapid and precise MPC approximation on a non-linear system, along with an estimated two-fold speed-up in computational requirements compared with an MPC. PI-MPCS additionally displays a level of stable control for in- and out-of-distribution states, as encouraged by the discrete dynamics residual and Lyapunov stability loss functions. PI-MPCS is meant to serve as a surrogate for MPC in situations where computational resources are limited.


[55] 2410.16219

PuLsE: Accurate and Robust Ultrasound-based Continuous Heart-Rate Monitoring on a Wrist-Worn IoT Device

This work explores the feasibility of employing ultrasound (US) technology in a wrist-worn IoT device for low-power, high-fidelity heart-rate (HR) extraction. US offers deep tissue penetration and can monitor pulsatile arterial blood flow in large vessels and the surrounding tissue, potentially improving robustness and accuracy compared to PPG. We present an IoT wearable system prototype utilizing a commercial microcontroller (MCU), employing the onboard ADC to capture high-frequency US signals, and an innovative low-power US pulser. An envelope filter lowers the bandwidth of the US signal by a factor of >5x, reducing the system's acquisition requirements without compromising accuracy (correlation coefficient between HR extracted from enveloped and raw signals, r(92)=0.99, p<0.001). The full signal processing pipeline is ported to fixed-point arithmetic for increased energy efficiency and runs entirely onboard. The system has an average power consumption of 5.8mW, competitive with PPG-based systems, and the HR extraction algorithm requires only 68kB of RAM and 71ms of processing time on an ARM Cortex-M4 MCU. The system is estimated to run continuously for more than 7 days on a smartwatch battery. To accurately evaluate the proposed circuit and algorithm and to identify the anatomical location on the wrist with the highest accuracy for HR extraction, we collected a dataset from 10 healthy adults at three different wrist positions. The dataset comprises roughly 5 hours of HR data with an average of 80.6+-16.3 bpm. During recording, we synchronized the established ECG gold standard with our US-based method. The comparison yields a Pearson correlation coefficient of r(92)=0.99, p<0.001 and a mean error of 0.69+-1.99 bpm in the lateral wrist position near the radial artery. The dataset and code have been open-sourced at https://github.com/mgiordy/Ultrasound-Heart-Rate
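
The bandwidth-reduction idea is standard envelope detection: demodulate the US RF burst to its envelope, then decimate, so the ADC and memory only have to keep up with the pulsation dynamics rather than the carrier. A floating-point sketch (the paper's onboard version is fixed-point; the decimation factor, here 5, stands in for the abstract's >5x reduction):

```python
import numpy as np
from scipy.signal import hilbert, decimate

def envelope_downsample(rf, q=5):
    """Envelope-detect an ultrasound RF trace, then decimate by q."""
    env = np.abs(hilbert(rf))   # magnitude of the analytic signal
    return decimate(env, q)     # anti-aliased downsampling by factor q
```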


[56] 2410.16238

Deep Radiomics Detection of Clinically Significant Prostate Cancer on Multicenter MRI: Initial Comparison to PI-RADS Assessment

Objective: To develop and evaluate a deep radiomics model for clinically significant prostate cancer (csPCa, grade group >= 2) detection and compare its performance to Prostate Imaging Reporting and Data System (PI-RADS) assessment in a multicenter cohort. Materials and Methods: This retrospective study analyzed biparametric (T2W and DW) prostate MRI sequences of 615 patients (mean age, 63.1 +/- 7 years) from four datasets acquired between 2010 and 2020: PROSTATEx challenge, Prostate158 challenge, PCaMAP trial, and an in-house (NTNU/St. Olavs Hospital) dataset. With expert annotations as ground truth, a deep radiomics model was trained, including nnU-Net segmentation of the prostate gland, voxel-wise radiomic feature extraction, extreme gradient boost classification, and post-processing of tumor probability maps into csPCa detection maps. Training involved 5-fold cross-validation using the PROSTATEx (n=199), Prostate158 (n=138), and PCaMAP (n=78) datasets, and testing on the in-house (n=200) dataset. Patient- and lesion-level performance were compared to PI-RADS using area under ROC curve (AUROC [95% CI]), sensitivity, and specificity analysis. Results: On the test data, the radiologist achieved a patient-level AUROC of 0.94 [0.91-0.98] with 94% (75/80) sensitivity and 77% (92/120) specificity at PI-RADS >= 3. The deep radiomics model at a tumor probability cut-off >= 0.76 achieved 0.91 [0.86-0.95] AUROC with 90% (72/80) sensitivity and 73% (87/120) specificity, not significantly different (p = 0.068) from PI-RADS. On the lesion level, PI-RADS cut-off >= 3 had 84% (91/108) sensitivity at 0.2 (40/200) false positives per patient, while deep radiomics attained 68% (73/108) sensitivity at the same false positive rate. Conclusion: Deep radiomics machine learning model achieved comparable performance to PI-RADS assessment in csPCa detection at the patient-level but not at the lesion-level.


[57] 2410.16240

Nonlinear Magnetics Model for Permanent Magnet Synchronous Machines Capturing Saturation and Temperature Effects

This paper proposes a nonlinear magnetics model for Permanent Magnet Synchronous Machines (PMSMs) that accurately captures the effects of magnetic saturation in the machine iron and variations in rotor temperature on the permanent magnet excitation. The proposed model considers the permanent magnet as a current source rather than the more commonly used flux-linkage source. A comparison of the two modelling approaches is conducted using Finite Element Analysis (FEA) for different machine designs as well as experimental validation, where it is shown that the proposed model has substantially better accuracy. The proposed model decouples magnetic saturation and rotor temperature effects in the current/flux-linkage relationship, allowing for adaptive estimation of the PM excitation.


[58] 2410.16262

Characterizing the Effect of Electrode Shift & Sensor Reapplication on Common sEMG Features in Lower Limb Muscles

This study investigates the impact of electrode shift and sensor reapplication on common surface electromyography (sEMG) features in lower limb muscles, factors which have thus far precluded clinicians from attributing inter-session changes in sEMG signal properties to physiological changes in patients in the context of stroke recovery monitoring. To explore these inter-session errors, we recruited 12 healthy participants to perform a selection of isometric and dynamic exercises seen within stroke assessment sessions while instrumented with high-density sEMG (HDsEMG) arrays on the gastrocnemius medialis, tibialis anterior, semitendinosus, and tensor fasciae latae. Between exercise sets, the electrode arrays were intentionally shifted and reapplied to quantify errors in signal features, using 3D scanning equipment to extract the ground-truth shift performed. Results revealed that while frequency-domain features (mean, median, and peak frequency) demonstrated high resilience to the inter-session changes, the time-domain features (integrated EMG and maximum envelope amplitude) showed greater, yet predictable, variability. In all, these findings suggest that if placement shift can be quantified, direct inter-session feature comparisons become possible, improving the reliability of sEMG-based stroke recovery assessments and offering insights for improving remote stroke rehabilitation technologies.
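
The frequency-domain features found to be shift-resilient are all simple functionals of the power spectral density. A minimal sketch using a Welch PSD (the sampling rate and segment length are illustrative):

```python
import numpy as np
from scipy.signal import welch

def emg_freq_features(emg, fs=2000):
    """Mean, median, and peak frequency of one sEMG channel."""
    f, p = welch(emg, fs=fs, nperseg=1024)
    mean_f = np.sum(f * p) / np.sum(p)                  # spectral centroid
    median_f = f[np.searchsorted(np.cumsum(p), 0.5 * p.sum())]
    peak_f = f[np.argmax(p)]                            # mode of the PSD
    return mean_f, median_f, peak_f
```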


[59] 2410.14683

Brain-Aware Readout Layers in GNNs: Advancing Alzheimer's early Detection and Neuroimaging

Alzheimer's disease (AD) is a neurodegenerative disorder characterized by progressive memory and cognitive decline, affecting millions worldwide. Diagnosing AD is challenging due to its heterogeneous nature and variable progression. This study introduces a novel brain-aware readout layer (BA readout layer) for Graph Neural Networks (GNNs), designed to improve interpretability and predictive accuracy in neuroimaging for early AD diagnosis. By clustering brain regions based on functional connectivity and node embedding, this layer improves the GNN's capability to capture complex brain network characteristics. We analyzed neuroimaging data from 383 participants, including both cognitively normal and preclinical AD individuals, using T1-weighted MRI, resting-state fMRI, and FBB-PET to construct brain graphs. Our results show that GNNs with the BA readout layer significantly outperform traditional models in predicting the Preclinical Alzheimer's Cognitive Composite (PACC) score, demonstrating higher robustness and stability. The adaptive BA readout layer also offers enhanced interpretability by highlighting task-specific brain regions critical to cognitive functions impacted by AD. These findings suggest that our approach provides a valuable tool for the early diagnosis and analysis of Alzheimer's disease.


[60] 2410.14693

Deep Domain Isolation and Sample Clustered Federated Learning for Semantic Segmentation

Empirical studies show that federated learning exhibits convergence issues in non-Independent and Identically Distributed (non-IID) setups. However, these studies only focus on label distribution shifts or concept shifts (e.g., ambiguous tasks). In this paper, we explore for the first time the effect of covariate shifts between participants' data in 2D segmentation tasks, showing an impact on convergence that is far less severe than that of label shifts, but still present. Moreover, current Personalized (PFL) and Clustered (CFL) Federated Learning methods intrinsically assume the homogeneity of each participant's dataset and its consistency with future test samples by operating at the client level. We introduce a more general and realistic framework where each participant owns a mixture of multiple underlying feature domain distributions. To diagnose such pathological feature distributions affecting a model being trained in a federated fashion, we develop Deep Domain Isolation (DDI) to isolate image domains directly in the gradient space of the model. A federated Gaussian Mixture Model is fit to the sample gradients of each class, and the results are combined with spectral clustering on the server side to isolate decentralized sample-level domains. We leverage this clustering algorithm through a Sample Clustered Federated Learning (SCFL) framework, performing standard federated learning of several independent models, one for each decentralized image domain. Finally, we train a classifier that associates a test sample with its corresponding domain cluster at inference time, offering a final set of models that is agnostic to any assumptions on the test distribution of each participant. We validate our approach on a toy segmentation dataset as well as different partitionings of a combination of the Cityscapes and GTA5 datasets using an EfficientVIT-B0 model, showing a significant performance gain compared to other approaches. Our code is available at https://github.com/MatthisManthe/DDI_SCFL .


[61] 2410.14697

Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios

The cortico-spinal neural pathway is fundamental for motor control and movement execution, and in humans it is typically studied using concurrent electroencephalography (EEG) and electromyography (EMG) recordings. However, current approaches for capturing high-level and contextual connectivity between these recordings have important limitations. Here, we present a novel application of statistical dependence estimators based on orthonormal decomposition of density ratios to model the relationship between cortical and muscle oscillations. Our method extends from traditional scalar-valued measures by learning eigenvalues, eigenfunctions, and projection spaces of density ratios from realizations of the signal, addressing the interpretability, scalability, and local temporal dependence of cortico-muscular connectivity. We experimentally demonstrate that eigenfunctions learned from cortico-muscular connectivity can accurately classify movements and subjects. Moreover, they reveal channel and temporal dependencies that confirm the activation of specific EEG channels during movement.


[62] 2410.14707

FACMIC: Federated Adaptative CLIP Model for Medical Image Classification

Federated learning (FL) has emerged as a promising approach to medical image analysis that allows deep model training using decentralized data while ensuring data privacy. However, in the field of FL, communication cost plays a critical role in evaluating the performance of the model. Thus, transferring vision foundation models can be particularly challenging due to the significant resource costs involved. In this paper, we introduce a federated adaptive Contrastive Language-Image Pretraining (CLIP) model designed for classification tasks. We employ a lightweight and efficient feature attention module for CLIP that selects suitable features for each client's data. Additionally, we propose a domain adaptation technique to reduce differences in data distribution between clients. Experimental results on four publicly available datasets demonstrate the superior performance of FACMIC in dealing with real-world and multisource medical imaging data. Our codes are available at https://github.com/AIPMLab/FACMIC.


[63] 2410.14709

A two-stage transliteration approach to improve performance of a multilingual ASR

End-to-end Automatic Speech Recognition (ASR) systems are rapidly emerging as the state of the art among modeling methods. Several techniques have been introduced to improve their ability to handle multiple languages. However, due to variation in the writing scripts of different languages, acoustically similar units do not always map to an appropriate grapheme in the target language during decoding. This restricts the scalability and adaptability of the model when dealing with multiple languages in code-mixing scenarios. This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set obtained by projecting the multilingual grapheme data onto the script of a more generic target language. This approach saves the acoustic model from retraining to span a larger space and can easily be extended to multiple languages. A two-stage transliteration process realizes this approach and proves to minimize speech-class confusion. We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages, namely Nepali and Telugu. The original grapheme space of these languages is projected onto the Devanagari script. We achieved a relative reduction of 20% in the Word Error Rate (WER) and 24% in the Character Error Rate (CER) in the transliterated space, compared with other language-dependent modeling methods.


[64] 2410.14724

A Phenomenological AI Foundation Model for Physical Signals

The objective of this work is to develop an AI foundation model for physical signals that can generalize across diverse phenomena, domains, applications, and sensing apparatuses. We propose a phenomenological approach and framework for creating and validating such AI foundation models. Based on this framework, we developed and trained a model on 0.59 billion samples of cross-modal sensor measurements, ranging from electrical current to fluid flow to optical sensors. Notably, no prior knowledge of physical laws or inductive biases were introduced into the model. Through several real-world experiments, we demonstrate that a single foundation model could effectively encode and predict physical behaviors, such as mechanical motion and thermodynamics, including phenomena not seen in training. The model also scales across physical processes of varying complexity, from tracking the trajectory of a simple spring-mass system to forecasting large electrical grid dynamics. This work highlights the potential of building a unified AI foundation model for diverse physical world processes.


[65] 2410.14800

Unlocking the Full Potential of High-Density Surface EMG: Novel Non-Invasive High-Yield Motor Unit Decomposition

The decomposition of high-density surface electromyography (HD-sEMG) signals into motor unit discharge patterns has become a powerful tool for investigating the neural control of movement, providing insights into motor neuron recruitment and discharge behavior. However, current algorithms, while very effective under certain conditions, face significant challenges in complex scenarios, as their accuracy and motor unit yield are highly dependent on anatomical differences among individuals. This can limit the number of decomposed motor units, particularly in challenging conditions. To address this issue, we recently introduced Swarm-Contrastive Decomposition (SCD), which dynamically adjusts the separation function based on the distribution of the data and prevents convergence to the same source. Initially applied to intramuscular EMG signals, SCD is here adapted for HD-sEMG signals. We demonstrated its ability to address key challenges faced by existing methods, particularly in identifying low-amplitude motor unit action potentials and effectively handling complex decomposition scenarios, like high-interference signals. We extensively validated SCD using simulated and experimental HD-sEMG recordings and compared it with current state-of-the-art decomposition methods under varying conditions, including different excitation levels, noise intensities, force profiles, sexes, and muscle groups. The proposed method consistently outperformed existing techniques in both the quantity of decoded motor units and the precision of their firing time identification. For instance, under certain experimental conditions, SCD detected more than three times as many motor units compared to previous methods, while also significantly improving accuracy. These advancements represent a major step forward in non-invasive EMG technology for studying motor unit activity in complex scenarios.


[66] 2410.14803

DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

On-device control agents, especially on mobile devices, are responsible for operating the device to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.


[67] 2410.14882

Multi-diseases detection with memristive system on chip

This study presents the first implementation of multilayer neural networks on a memristor/CMOS integrated system on chip (SoC) to simultaneously detect multiple diseases. To overcome limitations in medical data, generative AI techniques are used to enhance the dataset, improving the classifier's robustness and diversity. The system achieves notable performance with low latency, high accuracy (91.82%), and energy efficiency, facilitated by end-to-end execution on a memristor-based SoC with ten 256x256 crossbar arrays and an integrated on-chip processor. This research showcases the transformative potential of memristive in-memory computing hardware in accelerating machine learning applications for medical diagnostics.


[68] 2410.14928

A Novel Approach to Grasping Control of Soft Robotic Grippers based on Digital Twin

This paper proposes a Digital Twin (DT) framework for real-time motion and pose control of soft robotic grippers. The developed DT is based on an industrial robot workstation, integrated with our newly proposed approach for soft gripper control, primarily based on computer vision, for setting the driving pressure required to reach the desired gripper status in real time. Given the gripper motion, the gripper parameters (e.g., curvatures and bending angles) are simulated by kinematic modelling in Unity 3D, based on four-piecewise constant-curvature kinematics. The mapping between the driving pressure and the gripper parameters is achieved by implementing OpenCV-based image processing algorithms and data fitting. Results show that our DT-based approach achieves satisfactory performance in real-time control of soft gripper manipulation, which can satisfy a wide range of industrial applications.


[69] 2410.14934

Development of a Simple and Novel Digital Twin Framework for Industrial Robots in Intelligent robotics manufacturing

This paper proposes an easily replicable and novel approach for developing a Digital Twin (DT) system for industrial robots in intelligent manufacturing applications. Our framework enables effective communication via Robot Web Service (RWS), while real-time simulation is implemented in Unity 3D and a web-based platform without any third-party tools. The framework provides real-time visualization and control of the entire work process, as well as real-time path planning based on algorithms executed in MATLAB. Results verify the high communication efficiency, with a refresh rate of only 17 ms. Furthermore, our web-based platform and Graphical User Interface (GUI) enable easy accessibility and user-friendliness in real-time control.


[70] 2410.14945

ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components, a latent diffusion model trained on various user input types, namely text prompts, spatial, temporal, and environmental acoustic parameters, and, optionally, a spatial audio and text encoder trained in a Contrastive Language and Audio Pretraining (CLAP) style. We propose metrics to evaluate the quality and spatial adherence of the generated spatial audio. Finally, we assess the model performance in terms of generation quality and spatial conformance, comparing the two proposed modes: ``descriptive'', which uses spatial text prompts, and ``parametric'', which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.


[71] 2410.14971

BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation

Recent advances in decoding language from brain signals (EEG and MEG) have been significantly driven by pre-trained language models, leading to remarkable progress on publicly available non-invasive EEG/MEG datasets. However, previous works predominantly utilize teacher forcing during text generation, leading to significant performance drops without its use. A fundamental issue is the inability to establish a unified feature space correlating textual data with the corresponding evoked brain signals. Although some recent studies attempt to mitigate this gap using an audio-text pre-trained model, Whisper, which is favored for its signal input modality, they still largely overlook the inherent differences between audio signals and brain signals in directly applying Whisper to decode brain signals. To address these limitations, we propose a new multi-stage strategy for semantic brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn, termed BrainECHO. Specifically, BrainECHO successively conducts: 1) Discrete autoencoding of the audio spectrogram; 2) Brain-audio latent space alignment; and 3) Semantic text generation via Whisper finetuning. Through this autoencoding--alignment--finetuning process, BrainECHO outperforms state-of-the-art methods under the same data split settings on two widely accepted resources: the EEG dataset (Brennan) and the MEG dataset (GWilliams). The innovation of BrainECHO, coupled with its robustness and superiority at the sentence, session, and subject-independent levels across public datasets, underscores its significance for language-based brain-computer interfaces.


[72] 2410.14990

Audio Processing using Pattern Recognition for Music Genre Classification

This project explores the application of machine learning techniques for music genre classification using the GTZAN dataset, which contains 100 audio files per genre. Motivated by the growing demand for personalized music recommendations, we focused on classifying five genres (Blues, Classical, Jazz, Hip Hop, and Country) using a variety of algorithms, including Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and Artificial Neural Networks (ANN) implemented via Keras. The ANN model demonstrated the best performance, achieving a validation accuracy of 92.44%. We also analyzed key audio features such as spectral roll-off, spectral centroid, and MFCCs, which helped enhance the model's accuracy. Future work will expand the model to cover all ten genres, investigate advanced methods like Long Short-Term Memory (LSTM) networks and ensemble approaches, and develop a web application for real-time genre classification and playlist generation. This research aims to contribute to improving music recommendation systems and content curation.
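
The named audio features are one-line calls in standard audio libraries; a minimal per-clip extraction sketch follows (the librosa calls are standard, but the exact feature set and the KNN classifier shown are illustrative, not the project's configuration):

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def clip_features(path):
    """13 MFCC means plus spectral centroid and roll-off for one audio clip."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    return np.concatenate([mfcc, [centroid, rolloff]])

# X = np.stack([clip_features(p) for p in paths]); then fit any classifier:
# KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
```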


[73] 2410.14997

Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers. By providing the non-native audio and the corresponding transcript, we generate ideal ground-truth audio with native-like pronunciation while preserving the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity, but also improves pronunciation, as demonstrated by evaluation results.


[74] 2410.15017

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.


[75] 2410.15062

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification

Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification performance of CLAP-like ALMs. To achieve this, we propose to improve the cross-modal interaction between audio and language modalities by enhancing the representations for both modalities using mutual feedback. Precisely, to enhance textual representations, we propose a prompt ensemble algorithm that automatically selects and combines the most relevant prompts from a datastore with a large pool of handcrafted prompts and weighs them according to their relevance to the audio. On the other hand, to enhance audio representations, we reweigh the frame-level audio features based on the enhanced textual information. Our proposed method does not require any additional modules or parameters and can be used with any existing CLAP-like ALM to improve zero-shot audio classification performance. We experiment across 18 diverse benchmark datasets and 6 ALMs and show that the PAT outperforms vanilla zero-shot evaluation with significant margins of 0.42%-27.0%. Additionally, we demonstrate that PAT maintains robust performance even when input audio is degraded by varying levels of noise. Our code will be open-sourced upon acceptance.


[76] 2410.15067

A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends

Image restoration (IR) refers to the process of improving the visual quality of images while removing degradation, such as noise, blur, weather effects, and so on. Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions. In response to this challenge, the all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types. These innovative models enhance both convenience and versatility by adaptively learning degradation-specific features while simultaneously leveraging shared knowledge across diverse corruptions. In this review, we delve into AiOIR methodologies, emphasizing their architectural innovations and learning paradigms. We systematically categorize prevalent approaches and critically assess the challenges these models encounter, proposing future research directions to advance this dynamic field. Our paper begins with an introduction to the foundational concepts of AiOIR models, followed by a categorization of cutting-edge designs based on factors such as prior knowledge and generalization capability. Next, we highlight key advancements in AiOIR, aiming to inspire further inquiry and innovation within the community. To facilitate a robust evaluation of existing methods, we collate and summarize commonly used datasets, implementation details, and evaluation metrics. Additionally, we present an objective comparison of open-sourced methods, providing valuable insights for researchers and practitioners alike. This paper stands as the first comprehensive and insightful review of AiOIR. A related repository is available at https://github.com/Harbinzzy/All-in-One-Image-Restoration-Survey.


[77] 2410.15083

Numerical optimal control for distributed delay differential equations: A simultaneous approach based on linearization of the delayed variables

Time delays are ubiquitous in industrial processes, and they must be accounted for when designing control algorithms because they have a significant effect on the process dynamics. Therefore, in this work, we propose a simultaneous approach for numerical optimal control of delay differential equations with distributed time delays. Specifically, we linearize the delayed variables around the current time, and we discretize the resulting implicit differential equations using Euler's implicit method. Furthermore, we transcribe the infinite-dimensional optimal control problem into a finite-dimensional nonlinear program, which we solve using MATLAB's fmincon. Finally, we demonstrate the efficacy of the approach using a numerical example involving a molten salt nuclear fission reactor.
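
The approach chains three standard steps: linearize the delayed state as x(t - tau) ~ x(t) - tau*xdot(t), so a distributed-delay integral reduces to the delay kernel's zeroth and first moments; discretize with implicit Euler; and solve the resulting nonlinear program. The sketch below applies this recipe to an illustrative scalar system xdot(t) = -int alpha(tau) x(t - tau) dtau + u(t), with assumed kernel moments and SciPy's SLSQP standing in for fmincon:

```python
import numpy as np
from scipy.optimize import minimize

N, dt = 50, 0.1          # grid: N steps of length dt
m0, m1 = 1.0, 0.3        # 0th/1st moments of the delay kernel (assumed)

def dynamics_residual(z):
    """Implicit-Euler residuals with the delayed state linearized in time."""
    x, u = z[:N + 1], z[N + 1:]
    r = np.empty(N)
    for k in range(N):
        dx = (x[k + 1] - x[k]) / dt               # implicit Euler slope
        delayed = m0 * x[k + 1] - m1 * dx         # int alpha(tau) x(t-tau) dtau
        r[k] = x[k + 1] - x[k] - dt * (-delayed + u[k])
    return r

cost = lambda z: dt * np.sum(z[:N + 1] ** 2) + dt * np.sum(z[N + 1:] ** 2)
cons = [{"type": "eq", "fun": dynamics_residual},
        {"type": "eq", "fun": lambda z: z[0] - 1.0}]   # initial state x(0) = 1
sol = minimize(cost, np.zeros(2 * N + 1), constraints=cons, method="SLSQP")
```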


[78] 2410.15108

The shape of the brain's connections is predictive of cognitive performance: an explainable machine learning study

The shape of the brain's white matter connections is relatively unexplored in diffusion MRI tractography analysis. While it is known that tract shape varies across populations and the human lifespan, it is unknown whether the variability in dMRI tractography-derived shape may relate to the brain's functional variability across individuals. This work explores the potential of leveraging tractography fiber cluster shape measures to predict subject-specific cognitive performance. We implement machine learning models to predict individual cognitive performance scores. We study a large-scale database from the HCP-YA study. We apply an atlas-based fiber cluster parcellation to the dMRI tractography of each individual. We compute 15 shape, microstructure, and connectivity features for each fiber cluster. Using these features as input, we train a total of 210 models to predict 7 different NIH Toolbox cognitive performance assessments. We apply an explainable AI technique, SHAP, to assess the importance of each fiber cluster for prediction. Our results demonstrate that shape measures are predictive of individual cognitive performance. The studied shape measures, such as irregularity, diameter, total surface area, volume, and branch volume, are as effective for prediction as microstructure and connectivity measures. The overall best-performing feature is a shape feature, irregularity, which describes how different a cluster's shape is from an idealized cylinder. Further interpretation using SHAP values suggests that fiber clusters with features highly predictive of cognitive ability are widespread throughout the brain, including fiber clusters from the superficial association, deep association, cerebellar, striatal, and projection pathways. This study demonstrates the strong potential of shape descriptors to enhance the study of the brain's white matter and its relationship to cognitive function.


[79] 2410.15156

Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost

This paper proposes an agent-based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi-agent Markov Decision Processes (MDPs), in which agents incur a Kullback-Leibler (KL) divergence cost for their control efforts and an additional cost for the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step. We use the separable structure of the instantaneous cost to show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities. This allows agents to compute the improved joint policy independently. We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme with finite policy evaluation rollout converge to the optimal value function and an optimal joint policy asymptotically. Simulation results on a KL-control-cost variant of the Stag-Hare game validate our scheme's performance in terms of minimizing the cost return.
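
For KL-control MDPs, the Boltzmann form of the improvement step is explicit: the improved policy reweights the uncontrolled transition kernel by exp(-V) of the successor state and normalizes, which is why each agent can compute it locally from its own value estimate. A tabular sketch (the shapes and the uncontrolled kernel are illustrative, not the paper's exact formulation):

```python
import numpy as np

def boltzmann_improvement(p0, V):
    """Improved policy for a KL-control MDP.

    p0: (S, S) uncontrolled transition matrix (rows sum to 1)
    V:  (S,) current value function estimate
    """
    w = p0 * np.exp(-V)[None, :]             # reweight successor states
    return w / w.sum(axis=1, keepdims=True)  # normalize each row
```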


[80] 2410.15175

Implicit neural representation for free-breathing MR fingerprinting (INR-MRF): co-registered 3D whole-liver water T1, water T2, proton density fat fraction, and R2* mapping

Purpose: To develop an MRI technique for free-breathing 3D whole-liver quantification of water T1, water T2, proton density fat fraction (PDFF), and R2*. Methods: An eight-echo spoiled gradient echo pulse sequence with spiral readout was developed by interleaving inversion recovery and T2 magnetization preparation. We propose a neural network based on a 4D and a 3D implicit neural representation (INR), which learn the motion deformation fields and the static reference-frame MRI subspace images, respectively. Water and fat singular images were separated during network training, with no need to perform retrospective water-fat separation. T1, T2, R2*, and PDFF values produced by the proposed method were validated in vivo on 10 healthy subjects, using quantitative maps generated from conventional scans as reference. Results: Our results showed minimal bias and narrow 95% limits of agreement on T1, T2, R2*, and PDFF values in the liver compared to conventional breath-holding scans. Conclusions: INR-MRF enabled co-registered 3D whole-liver T1, T2, R2*, and PDFF mapping in a single free-breathing scan.


[81] 2410.15178

Enhancing Robot Navigation Policies with Task-Specific Uncertainty Management

Robots performing navigation tasks in complex environments face significant challenges due to uncertainty in state estimation. Effectively managing this uncertainty is crucial, but the optimal approach varies depending on the specific details of the task: different tasks require varying levels of precision in different regions of the environment. For instance, a robot navigating a crowded space might need precise localization near obstacles but can operate effectively with less precise state estimates in open areas. This varying need for certainty in different parts of the environment, depending on the task, calls for policies that can adapt their uncertainty management strategies based on task-specific requirements. In this paper, we present a framework for integrating task-specific uncertainty requirements directly into navigation policies. We introduce Task-Specific Uncertainty Map (TSUM), which represents acceptable levels of state estimation uncertainty across different regions of the operating environment for a given task. Using TSUM, we propose Generalized Uncertainty Integration for Decision-Making and Execution (GUIDE), a policy conditioning framework that incorporates these uncertainty requirements into the robot's decision-making process. We find that conditioning policies on TSUMs provides an effective way to express task-specific uncertainty requirements and enables the robot to reason about the context-dependent value of certainty. We show how integrating GUIDE into reinforcement learning frameworks allows the agent to learn navigation policies without the need for explicit reward engineering to balance task completion and uncertainty management. We evaluate GUIDE on a variety of real-world navigation tasks and find that it demonstrates significant improvements in task completion rates compared to baselines. Evaluation videos can be found at https://guided-agents.github.io.


[82] 2410.15214

Relay Incentive Mechanisms Using Wireless Power Transfer in Non-Cooperative Networks

This paper studies the use of a multi-attribute auction in a communication system to bring about efficient relaying in a non-cooperative setting. We consider a system where a source seeks to offload data to an access point (AP) while balancing both the timeliness and energy-efficiency of the transmission. A deep fade in the communication channel (due to, e.g., a line-of-sight blockage) makes direct communication costly, and the source may alternatively rely on non-cooperative UEs to act as relays. We propose a multi-attribute auction to select a UE and to determine the duration and power of the transmission, with payments to the UE taking the form of energy sent via wireless power transfer (WPT). The quality of the channel from a UE to the AP constitutes private information, and bids consist of a transmission time and transmission power. We show that under a second-preferred-offer auction, truthful bidding by all candidate UEs forms a Nash Equilibrium. However, this auction is not incentive compatible, and we present a modified auction in which truthful bidding is in fact a dominant strategy. Extensive numerical experimentation illustrates the efficacy of our approach, which we compare to a cooperative baseline. We demonstrate that with as few as two candidates, our improved mechanism leads to as much as a 76% reduction in energy consumption, and that with as few as three candidates, the transmission time decreases by as much as 55%. Further, we see that as the number of candidates increases, the performance of our mechanism approaches that of the cooperative baseline. Overall, our findings highlight the potential of multi-attribute auctions to enhance the efficiency of data transfer in non-cooperative settings.


[83] 2410.15221

IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning

Despite the popularity of multi-agent reinforcement learning (RL) in simulated and two-player applications, its success in messy real-world applications has been limited. A key challenge lies in its generalizability across problem variations, a common necessity for many real-world problems. Contextual reinforcement learning (CRL) formalizes learning policies that generalize across problem variations. However, the lack of standardized benchmarks for multi-agent CRL has hindered progress in the field. Ideally, such benchmarks should be grounded in real-world applications so that they naturally capture the many open challenges of real-world problems that affect generalization. To bridge this gap, we propose IntersectionZoo, a comprehensive benchmark suite for multi-agent CRL built around the real-world application of cooperative eco-driving in urban road networks. The task of cooperative eco-driving is to control a fleet of vehicles to reduce fleet-level vehicular emissions. By grounding IntersectionZoo in a real-world application, we naturally capture real-world problem characteristics such as partial observability and multiple competing objectives. IntersectionZoo is built on data-informed simulations of 16,334 signalized intersections derived from 10 major US cities, modeled in an open-source, industry-grade microscopic traffic simulator. By modeling factors affecting vehicular exhaust emissions (e.g., temperature, road conditions, travel demand), IntersectionZoo provides one million data-driven traffic scenarios. Using these traffic scenarios, we benchmark popular multi-agent RL and human-like driving algorithms and demonstrate that popular multi-agent RL algorithms struggle to generalize in CRL settings.


[84] 2410.15224

Robust Low-rank Tensor Train Recovery

Tensor train (TT) decomposition represents an $N$-order tensor using $O(N)$ matrices (i.e., factors) of small dimensions, achieved through products among these factors. Due to its compact representation, TT decomposition has found wide applications, including various tensor recovery problems in signal processing and quantum information. In this paper, we study the problem of reconstructing a TT format tensor from measurements that are contaminated by outliers with arbitrary values. Given the vulnerability of smooth formulations to corruptions, we use an $\ell_1$ loss function to enhance robustness against outliers. We first establish the $\ell_1/\ell_2$-restricted isometry property (RIP) for Gaussian measurement operators, demonstrating that the information in the TT format tensor can be preserved using a number of measurements that grows linearly with $N$. We also prove the sharpness property for the $\ell_1$ loss function optimized over TT format tensors. Building on the $\ell_1/\ell_2$-RIP and sharpness property, we then propose two complementary methods to recover the TT format tensor from the corrupted measurements: the projected subgradient method (PSubGM), which optimizes over the entire tensor, and the factorized Riemannian subgradient method (FRSubGM), which optimizes directly over the factors. Compared to PSubGM, the factorized approach FRSubGM significantly reduces the memory cost at the expense of a slightly slower convergence rate. Nevertheless, we show that both methods, with diminishing step sizes, converge linearly to the ground-truth tensor given an appropriate initialization, which can be obtained by a truncated spectral method.
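
As a concrete illustration of the projected subgradient method with an l1 loss and geometrically diminishing steps, here is a matrix-case sketch in which truncated SVD stands in for the TT-SVD rank projection (a deliberate simplification; all names are illustrative):

    import numpy as np

    def l1_subgradient_step(X, A, y, step):
        """One subgradient step on f(X) = ||A @ vec(X) - y||_1."""
        r = A @ X.ravel() - y
        g = (A.T @ np.sign(r)).reshape(X.shape)
        return X - step * g

    def project_low_rank(X, rank):
        """Rank projection via truncated SVD (stand-in for TT-SVD
        in this matrix-case illustration)."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank]

    def psubgm(X0, A, y, rank, step0=1.0, q=0.95, iters=200):
        """Projected subgradient with geometrically diminishing steps,
        mirroring the linear-convergence regime described above."""
        X = X0
        for k in range(iters):
            X = l1_subgradient_step(X, A, y, step0 * q**k)
            X = project_low_rank(X, rank)
        return X

The geometric step schedule step0 * q**k is what the sharpness property licenses: with a sufficiently good initialization, the iterates contract toward the ground truth at a linear rate.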


[85] 2410.15283

TRIZ Method for Urban Building Energy Optimization: GWO-SARIMA-LSTM Forecasting model

As global climate change advances and sustainable development goals gain urgency, optimizing urban building energy consumption and reducing carbon emissions have become research priorities. Traditional energy consumption prediction methods often lack accuracy and adaptability because they cannot fully capture complex consumption patterns, especially seasonal fluctuations and dynamic changes. This study proposes a hybrid deep learning model that combines TRIZ innovation theory with the Grey Wolf Optimizer (GWO), SARIMA, and LSTM to improve the accuracy of building energy consumption prediction. TRIZ plays a key role in model design, providing innovative solutions that achieve an effective balance between energy efficiency, cost, and comfort by systematically analyzing the contradictions in energy consumption optimization. GWO is used to optimize the parameters of the model to ensure that it maintains high accuracy under different conditions. The SARIMA model captures seasonal trends in the data, while the LSTM model handles short-term and long-term dependencies, further improving prediction accuracy. The main contribution of this research is the development of a robust model that leverages the strengths of TRIZ and advanced deep learning techniques to improve the accuracy of energy consumption predictions. Our experiments demonstrate a significant 15% reduction in prediction error compared to existing models. This approach not only enhances urban energy management but also provides a new framework for optimizing energy use and reducing carbon emissions, contributing to sustainable development.
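
A minimal sketch of the SARIMA-plus-residual-model decomposition underlying such hybrids; the GWO hyperparameter search is omitted and an MLP on lagged residuals stands in for the LSTM (both simplifications, with illustrative parameter values):

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    from sklearn.neural_network import MLPRegressor

    def fit_hybrid(y, order=(1, 0, 1), seasonal=(1, 1, 1, 24), lags=24):
        """SARIMA captures the seasonal trend; a second model (an LSTM in
        the paper, an MLP stand-in here) learns the residuals from their
        own lagged values. `y` is a 1-D numpy array of consumption data."""
        sarima = SARIMAX(y, order=order, seasonal_order=seasonal).fit(disp=False)
        resid = np.asarray(sarima.resid)
        # Rows of X are sliding windows resid[j : j+lags], targets resid[j+lags].
        X = np.column_stack([resid[i:len(resid) - lags + i] for i in range(lags)])
        t = resid[lags:]
        resid_model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X, t)
        return sarima, resid_model

The final forecast is the SARIMA prediction plus the residual model's correction; an outer optimizer such as GWO would tune (order, seasonal, lags) against validation error.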


[86] 2410.15316

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.


[87] 2410.15321

Integrated Design and Control of a Robotic Arm on a Quadcopter for Enhanced Package Delivery

This paper presents a comprehensive design process for the integration of a robotic arm into a quadcopter, emphasizing physical modeling, system integration, and controller development. Utilizing SolidWorks for mechanical design and MATLAB Simscape for simulation and control, this study addresses the challenges encountered in integrating the robotic arm with the drone, encompassing both mechanical and control aspects. Two types of controllers are developed and analyzed: a Proportional-Integral-Derivative (PID) controller and a Model Reference Adaptive Controller (MRAC). The design and tuning of these controllers are key components of this research, with a focus on their application in package delivery tasks. Extensive simulations demonstrate the performance of each controller, with the PID controller exhibiting superior trajectory tracking and lower Root Mean Square (RMS) errors under various payload conditions. The results underscore the efficacy of PID control for stable flight and precise maneuvering, while highlighting the adaptability of MRAC to changing dynamics.
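
For reference, the PID half of the comparison reduces to the textbook control law u = Kp*e + Ki*integral(e) + Kd*de/dt; a minimal discrete-time sketch (gains and sample time are placeholders, not the paper's tuned values):

    class PID:
        """Textbook discrete-time PID controller."""
        def __init__(self, kp, ki, kd, dt):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, setpoint, measurement):
            error = setpoint - measurement
            self.integral += error * self.dt
            derivative = (error - self.prev_error) / self.dt
            self.prev_error = error
            return (self.kp * error
                    + self.ki * self.integral
                    + self.kd * derivative)

    # e.g., an altitude loop running at 100 Hz with hypothetical gains:
    altitude_pid = PID(kp=2.0, ki=0.1, kd=0.5, dt=0.01)

MRAC, by contrast, adjusts its gains online from the tracking error against a reference model, which is what buys the adaptability to payload changes noted above.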


[88] 2410.15342

ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Singing voice synthesis (SVS) systems are expected to generate high-fidelity singing voice from given music scores (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, sacrificing inference speed in exchange for high-quality sample generation limits their application scenarios. To obtain high-quality synthetic singing voice more efficiently, we propose a singing voice synthesis method based on the consistency model, ConSinger, which achieves high-fidelity singing voice synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of only a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.


[89] 2410.15499

Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.


[90] 2410.15500

Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example

Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.


[91] 2410.15532

Construction and Analysis of Impression Caption Dataset for Environmental Sounds

Several datasets describing the content and order of occurrence of sounds have been released for conversion between environmental sound and text. However, very few texts include information on the impressions humans feel, such as "sharp" and "gorgeous," when they hear environmental sounds. In this study, we constructed a dataset of impression captions for environmental sounds that describe the impressions humans have when hearing these sounds. We used ChatGPT to generate impression captions and had humans select the most appropriate captions for each sound. Our dataset consists of 3,600 impression captions for environmental sounds. To evaluate the appropriateness of the impression captions, we conducted subjective and objective evaluations. The evaluation results indicate that appropriate impression captions for environmental sounds can be generated.


[92] 2410.15543

Distributed Thompson sampling under constrained communication

In Bayesian optimization, a black-box function is maximized via the use of a surrogate model. We apply distributed Thompson sampling, using a Gaussian process as a surrogate model, to approach the multi-agent Bayesian optimization problem. In our distributed Thompson sampling implementation, each agent receives sampled points from neighbors, where the communication network is encoded in a graph; each agent utilizes a Gaussian process to model the objective function. We demonstrate a theoretical bound on Bayesian Simple Regret, where the bound depends on the size of the largest complete subgraph of the communication graph. Unlike in batch Bayesian optimization, this bound is applicable in cases where the communication graph amongst agents is constrained. When compared to sequential Thompson sampling, our bound guarantees faster convergence with respect to time as long as there is a fully connected subgraph of at least two agents. We confirm the efficacy of our algorithm with numerical simulations on traditional optimization test functions, illustrating the significance of graph connectivity on improving regret convergence.
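
A compact sketch of one agent's round of distributed Thompson sampling with a GP surrogate, assuming neighbors' sampled points arrive as lists of arrays (function and argument names are illustrative):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def agent_step(own_X, own_y, neighbor_X, neighbor_y, candidates, seed=0):
        """One round of distributed Thompson sampling: fit a local GP on
        own and neighbors' evaluations, draw one posterior sample over the
        candidate set, and query its argmax."""
        X = np.vstack([own_X] + list(neighbor_X))
        y = np.concatenate([own_y] + list(neighbor_y))
        gp = GaussianProcessRegressor().fit(X, y)
        sample = gp.sample_y(candidates, n_samples=1, random_state=seed).ravel()
        return candidates[np.argmax(sample)]

Each agent repeats this step and broadcasts its new evaluation to its graph neighbors; the regret bound above then tightens with the size of the largest complete subgraph sharing data.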


[93] 2410.15573

OpenMU: Your Swiss Army Knife for Music Understanding

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.


[94] 2410.15577

ALDAS: Audio-Linguistic Data Augmentation for Spoofed Audio Detection

Spoofed audio, i.e., manipulated or AI-generated deepfake audio, is difficult to detect using acoustic features alone. Some recent innovative work, in which AI-based spoofed audio detection models were augmented with phonetic and phonological features of spoken English manually annotated by experts, led to improved model performance. While this augmented model produced substantial improvements over traditional acoustic-feature-based models, a scalability challenge motivates inquiry into auto labeling of features. In this paper we propose an AI framework, Audio-Linguistic Data Augmentation for Spoofed audio detection (ALDAS), for auto labeling linguistic features. ALDAS is trained on linguistic features selected and extracted by sociolinguistics experts; these expert-annotated features are used to evaluate the quality of ALDAS predictions. Findings indicate that while the detection enhancement is not as substantial as when using the pure ground-truth linguistic features, performance improves while achieving auto labeling. Labels generated by ALDAS are also validated by the sociolinguistics experts.


[95] 2410.15608

Moonshine: Speech Recognition for Live Transcription and Voice Commands

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny.en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.
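
For context, rotary position embedding applies a position-dependent rotation to channel pairs so that attention scores depend on relative rather than absolute position; a NumPy sketch of the common rotate-half formulation (a generic illustration, not Moonshine's exact code):

    import numpy as np

    def rope(x, base=10000.0):
        """Rotary position embedding. x has shape (seq_len, dim), dim even.
        Each channel pair is rotated by a position-dependent angle."""
        seq, dim = x.shape
        half = dim // 2
        freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
        angles = np.outer(np.arange(seq), freqs)       # (seq, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

Because the rotation is applied per position rather than added as a learned table, variable-length segments need no zero-padding to a fixed context, which is the efficiency lever the abstract describes.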


[96] 2410.15609

Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding

Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.


[97] 2410.15620

Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation

Due to rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model on complete data as before. For example, the data may be owned by different curators, and sharing it with others is not permitted. In this paper, we propose a novel paradigm to solve salient problems plaguing the ASR field. In the first stage, multiple acoustic models are trained on different subsets of the complete speech data, while in the second stage, two novel algorithms are used to generate a high-quality acoustic model from those trained on the data subsets. We first propose the Genetic Merge Algorithm (GMA), a highly specialized algorithm for optimizing acoustic models that suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior model accuracy. Extensive experiments on public data show that the proposed methods can significantly outperform the state of the art. Furthermore, we introduce the Shapley Value to estimate the contribution score of each trained model, which is useful for evaluating the effectiveness of the data and providing fair incentives to their curators.
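
A generic sketch of the permutation-sampling estimator for Shapley values, with `utility` standing for, e.g., merged-model accuracy on a held-out set (illustrative, not the paper's exact valuation pipeline):

    import random

    def shapley_estimate(models, utility, rounds=200, seed=0):
        """Monte Carlo Shapley values: each model's value is its average
        marginal gain in `utility` over random join orders. `models` is a
        list of identifiers; `utility` maps a list of them to a score."""
        rng = random.Random(seed)
        phi = {m: 0.0 for m in models}
        for _ in range(rounds):
            order = models[:]
            rng.shuffle(order)
            coalition, prev = [], utility([])
            for m in order:
                coalition.append(m)
                cur = utility(coalition)
                phi[m] += (cur - prev) / rounds
                prev = cur
        return phi

Exact Shapley computation is exponential in the number of models, so this kind of sampling is the standard practical route when valuing more than a handful of curators.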


[98] 2410.15654

Design and Optimization of a Metamaterial Absorber for Solar Energy Harvesting in the THz Frequency Range

This paper introduces the design and comprehensive characterization of a novel three-layer metamaterial absorber engineered to exploit the unique optical properties of gold, vanadium dioxide, and silicon dioxide. At the core of this design, silicon dioxide serves as a robust substrate that supports an intricately structured layer of gold and a top layer of vanadium dioxide. This configuration is optimized to harness and enhance absorption effectively across a broadband terahertz (THz) spectrum. The absorber demonstrates an extensive absorption bandwidth of 3.00 THz, spanning frequencies from 2.414 THz to 5.417 THz. Remarkably, throughout this range, the device maintains a consistently high absorption efficiency exceeding 90%. This efficiency is characterized by two sharp absorption peaks located at 2.638 THz and 5.158 THz, which signify the precise tuning of the metamaterial structure to interact optimally with specific THz frequencies. The peak absorbance of the proposed model is nearly 99%, and the absorber is polarization-insensitive. The development of this absorber involved a series of theoretical simulations backed by experimental validations, which helped refine the metamaterial's geometry and material composition. This process illuminated the critical role of the dielectric properties of silicon dioxide and the plasmonic effects induced by the gold and vanadium dioxide layers, which collectively contribute to the high performance metrics observed.


[99] 2410.15659

Decentralized Hybrid Precoding for Massive MU-MIMO ISAC

Integrated sensing and communication (ISAC) is a promising technology designed to provide both high-rate communication and sensing capabilities. However, in Massive Multi-User Multiple-Input Multiple-Output (Massive MU-MIMO) ISAC systems, dense user access creates a serious multi-user interference (MUI) problem, degrading communication performance. To alleviate this problem, we propose a decentralized baseband processing (DBP) precoding method. We first model the MUI of dense-user scenarios, taking minimization of the Cramér-Rao bound (CRB) as the objective function. Hybrid precoding is an attractive ISAC technique, and hybrid precoding using Partially Connected Structures (PCS) can effectively reduce hardware cost and power consumption. We mitigate the MUI between dense users based on Tomlinson-Harashima Precoding (THP). We demonstrate the effectiveness of the proposed method through simulation experiments. Compared with existing methods, it can effectively improve communication data rates and energy efficiency in dense user access scenarios, and reduce the hardware complexity of Massive MU-MIMO ISAC systems. The experimental results demonstrate the usefulness of our method in mitigating the MUI problem in ISAC systems for dense user access scenarios.


[100] 2410.15742

DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network

Growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety. Quantifying the reliability of emerging ML models, including Deep Neural Networks (DNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day DNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. In order to speed up FI for large DNNs, statistical FI has been proposed. However, the run-time for large DNN models is still considerably long. In this work, we introduce DeepVigor+, a scalable, fast, and accurate semi-analytical method as an efficient alternative for reliability measurement in DNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for DNN models with an error of less than 1% while requiring 14.9 to 26.9 times fewer simulations than the best-known state-of-the-art statistical FI, enabling accurate reliability analysis for emerging DNNs within a few minutes.
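
For scale, statistical FI commonly sizes its fault sample with the finite-population formula of Leveugle et al.; a small helper, assuming the worst-case failure probability p = 0.5:

    import math

    def fi_sample_size(N, e=0.01, t=2.58, p=0.5):
        """Number of injections needed from a fault space of size N for
        error margin e at the confidence implied by t (2.58 ~ 99%)."""
        return math.ceil(N / (1 + e**2 * (N - 1) / (t**2 * p * (1 - p))))

    # Even a fault space of 10**9 bit locations needs ~16,600 injections
    # per configuration, which is what makes pure FI so slow on large DNNs:
    print(fi_sample_size(10**9))

This is the simulation budget that semi-analytical methods like the one above aim to undercut by propagating vulnerability analytically instead of injecting fault by fault.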


[101] 2410.15749

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into multiple layers of discrete codes with uniform time scales. However, this strategy overlooks the differences in information density across various speech features, leading to redundant encoding of sparse information, which limits the performance of these methods at low bitrates. This paper proposes MsCodec, a novel multi-scale neural speech codec that encodes speech into multiple layers of discrete codes, each corresponding to a different time scale. This encourages the model to decouple speech features according to their diverse information densities, consequently enhancing the performance of speech compression. Furthermore, we incorporate a mutual information loss to augment the diversity among speech codes across different layers. Experimental results indicate that our proposed method significantly improves codec performance at low bitrates.
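
A minimal sketch of plain (uniform-time-scale) residual vector quantization, the baseline that multi-scale layers generalize; the codebooks here are illustrative NumPy arrays:

    import numpy as np

    def rvq_encode(x, codebooks):
        """Residual vector quantization: each codebook quantizes what the
        previous layers left over. `codebooks` is a list of (K, d) arrays;
        returns one code index per layer plus the final residual."""
        residual, codes = x.copy(), []
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            codes.append(idx)
            residual = residual - cb[idx]
        return codes, residual

    def rvq_decode(codes, codebooks):
        """Reconstruction is the sum of the selected codewords."""
        return sum(cb[i] for cb, i in zip(codebooks, codes))

In the multi-scale variant described above, different layers would additionally operate at different frame rates, so slowly varying features are not re-encoded at every fine-grained frame.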


[102] 2410.15767

Improving Instance Optimization in Deformable Image Registration with Gradient Projection

Deformable image registration is inherently a multi-objective optimization (MOO) problem, requiring a delicate balance between image similarity and deformation regularity. These conflicting objectives often lead to poor optimization outcomes, such as being trapped in unsatisfactory local minima or experiencing slow convergence. Deep learning methods have recently gained popularity in this domain due to their efficiency in processing large datasets and achieving high accuracy. However, they often underperform during test time compared to traditional optimization techniques, which further explore iterative, instance-specific gradient-based optimization. This performance gap is more pronounced when a distribution shift between training and test data exists. To address this issue, we focus on the instance optimization (IO) paradigm, which involves additional optimization for test-time instances based on a pre-trained model. IO effectively combines the generalization capabilities of deep learning with the fine-tuning advantages of instance-specific optimization. Within this framework, we emphasize the use of gradient projection to mitigate conflicting updates in MOO. This technique projects conflicting gradients into a common space, better aligning the dual objectives and enhancing optimization stability. We validate our method using a state-of-the-art foundation model on the 3D Brain inter-subject registration task (LUMIR) from the Learn2Reg 2024 Challenge. Our results show significant improvements over standard gradient descent, leading to more accurate and reliable registration results.
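
A compact sketch of PCGrad-style gradient projection for the two objectives (similarity and regularity); the paper's exact projection may differ in detail:

    import numpy as np

    def project_conflicting(g_sim, g_reg):
        """If the two objective gradients conflict (negative inner product),
        project each onto the normal plane of the other before combining,
        removing the destructive component of the update."""
        g1, g2 = g_sim.copy(), g_reg.copy()
        if g1 @ g2 < 0:
            g1 = g1 - (g1 @ g2) / (g2 @ g2) * g2          # strip g2-opposing part
            g2 = g2 - (g_sim @ g_reg) / (g_sim @ g_sim) * g_sim
        return g1 + g2

When the gradients already agree, the update reduces to their plain sum, so the projection only intervenes in the conflicting regime that causes the poor local minima described above.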


[103] 2410.15797

Design of a Flexible Robot Arm for Safe Aerial Physical Interaction

This paper introduces a novel compliant mechanism combining light weight and energy dissipation for aerial physical interaction. Weighing 400 g at take-off, the mechanism is actuated in the forward body direction, enabling precise position control for force interaction and various other aerial manipulation tasks. The robotic arm, structured as a closed-loop kinematic chain, employs two deported servomotors. Each joint is actuated with a single tendon for active motion control in compression of the arm at the end-effector. Its elasto-mechanical design reduces weight and provides flexibility, allowing passive-compliant interactions without impacting the motors' integrity. Notably, the arm's damping can be adjusted based on the proposed inner frictional bulges. Experimental applications showcase the aerial system's performance in both free flight and physical interaction. The presented work may open safer applications for micro aerial vehicles (MAVs) in real environments subject to perturbations during interaction.


[104] 2410.15802

Assisted Physical Interaction: Autonomous Aerial Robots with Neural Network Detection, Navigation, and Safety Layers

The paper introduces a novel framework for safe and autonomous aerial physical interaction in industrial settings. It comprises two main components: a neural network-based target detection system enhanced with edge computing for reduced onboard computational load, and a control barrier function (CBF)-based controller for safe and precise maneuvering. The target detection system is trained on a dataset under challenging visual conditions and evaluated for accuracy across various unseen data with changing lighting conditions. Depth features are utilized for target pose estimation, with the entire detection framework offloaded into low-latency edge computing. The CBF-based controller enables the UAV to converge safely to the target for precise contact. Simulated evaluations of both the controller and target detection are presented, alongside an analysis of real-world detection performance.
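
To illustrate the CBF mechanism, here is the closed-form solution of the one-constraint safety QP for a single-integrator stand-in; the paper's UAV dynamics and constraint set are more involved, and all names here are illustrative:

    import numpy as np

    def cbf_safe_input(u_des, x, x_obs, d_safe, alpha=1.0):
        """Minimally modify u_des so the single-integrator system x' = u
        keeps h(x) = ||x - x_obs||^2 - d_safe^2 >= 0. The CBF condition
        grad_h . u + alpha * h >= 0 is a single affine constraint, so the
        QP min ||u - u_des||^2 has a closed-form solution."""
        h = float(np.sum((x - x_obs) ** 2) - d_safe ** 2)
        grad_h = 2.0 * (x - x_obs)
        slack = grad_h @ u_des + alpha * h
        if slack >= 0:
            return u_des                      # nominal input already safe
        return u_des - slack / (grad_h @ grad_h) * grad_h

The filter leaves the nominal (e.g., target-tracking) command untouched whenever it already satisfies the barrier condition, which is why CBF layers compose cleanly with upstream detection and navigation modules.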


[105] 2410.15862

Integration of Cobalt Ferromagnetic Control Gates for Electrical and Magnetic Manipulation of Semiconductor Quantum Dots

The rise of electron spin qubit architectures for quantum computing processors has led to a strong interest in designing and integrating ferromagnets to induce stray magnetic fields for electron dipole spin resonance (EDSR). The integration of nanomagnets, however, imposes strict layout and processing constraints, challenging the arrangement of different gating layers and the control of neighboring qubit frequencies. This work reports the successful integration of nano-sized cobalt control gates into a multi-gate FD-SOI nanowire with nanometer-scale dot-to-magnet pitch, simultaneously exploiting the electrical and ferromagnetic properties of the gate stack at the nanoscale. The electrical characterization of the multi-gate nanowire exhibits full field-effect functionality of all ferromagnetic gates from room temperature to 10 mK, proving quantum dot formation when the ferromagnets are operated as barrier gates. The front-end-of-line (FEOL) compatible gate-first integration of cobalt is examined by energy-dispersive X-ray spectroscopy and high/low-frequency capacitance characterization, confirming the quality of interfaces and control over material diffusion. Insights into the magnetic properties of thin films and patterned control gates are provided by vibrating sample magnetometry and electron holography measurements. Micromagnetic simulations anticipate that this structure fulfills the requirements for EDSR driving at magnetic fields higher than 1 T, where a homogeneous magnetization along the hard magnetic axis of the Co gates is expected. The FD-SOI architecture showcased in this study provides a scalable alternative to micromagnets deposited in the back-end-of-line (BEOL) and middle-of-line (MOL) processes, while bringing technological insights for the FEOL-compatible integration of Co nanostructures in spin qubit devices.


[106] 2410.15869

Robust Loop Closure by Textual Cues in Challenging Environments

Loop closure is an important task in robot navigation. However, existing methods mostly rely on implicit or heuristic features of the environment, which can still fail in common environments such as corridors, tunnels, and warehouses. Indeed, navigating such featureless, degenerative, and repetitive (FDR) environments would pose a significant challenge even for humans, but explicit text cues in the surroundings often provide the best assistance. This inspires us to propose a multi-modal loop closure method based on explicit human-readable textual cues in FDR environments. Specifically, our approach first extracts scene text entities based on Optical Character Recognition (OCR), then creates a local map of text cues based on accurate LiDAR odometry, and finally identifies loop closure events by a graph-theoretic scheme. Experimental results demonstrate that this approach has superior performance over existing methods that rely solely on visual and LiDAR sensors. To benefit the community, we release the source code and datasets at https://github.com/TongxingJin/TXTLCD.


[107] 2410.15895

Cryogenic Control and Readout Integrated Circuits for Solid-State Quantum Computing

In the pursuit of quantum computing, solid-state quantum systems, particularly superconducting ones, have made remarkable advancements over the past two decades. However, achieving fault-tolerant quantum computing for next-generation applications necessitates the integration of several million qubits, which presents significant challenges in terms of interconnection complexity and latency that are currently unsolvable with state-of-the-art room-temperature control and readout electronics. Recently, cryogenic integrated circuits (ICs), including CMOS radio-frequency ICs and rapid-single-flux-quantum-logic ICs, have emerged as potential alternatives to room-temperature electronics. Unlike their room-temperature counterparts, these ICs are deployed within cryostats to enhance scalability by reducing the number and length of transmission lines. Additionally, operating at cryogenic temperatures can suppress electronic noise and improve qubit control fidelity. However, for CMOS ICs specifically, circuit design uncertainties arise due to a lack of reliable models for cryogenic field-effect transistors, as well as issues related to severe, erratic noise and power dissipation at cryogenic temperatures. This paper provides a comprehensive review of recent research on both types of cryogenic control and readout ICs, but primarily focuses on the more mature CMOS technology. The discussion encompasses the principles underlying control and readout techniques employed in cryogenic CMOS ICs along with their architectural designs; characterization and modeling approaches for field-effect transistors under cryogenic conditions; as well as fundamental concepts pertaining to rapid single-flux-quantum circuits.


[108] 2410.15921

Fully distributed and resilient source seeking for robot swarms

We propose a self-contained, resilient and fully distributed solution for locating the maximum of an unknown 3D scalar field using a swarm of robots that travel at constant speeds. Unlike conventional reactive methods relying on gradient information, our methodology enables the swarm to determine an ascending direction so that it approaches the source with arbitrary precision. Our source-seeking solution consists of three algorithms. The first two algorithms run sequentially and distributively at a high frequency providing barycentric coordinates and the ascending direction respectively to the individual robots. The third algorithm is the individual control law for a robot to track the estimated ascending direction. We show that the two algorithms with higher frequency have an exponential convergence to their eventual values since they are based on the standard consensus protocol for first-order dynamical systems; their high frequency depends on how fast the robots travel through the scalar field. The robots are not constrained to any particular geometric formation, and we study both discrete and continuous distributions of robots within swarm shapes. The shape analysis reveals the resiliency of our approach as expected in robot swarms, i.e., by amassing robots we ensure the source-seeking functionality in the event of missing or misplaced individuals or even if the robot network splits into two or more disconnected subnetworks. In addition, we also enhance the robustness of the algorithm by presenting conditions for optimal swarm shapes, in the sense that the ascending directions can be closely parallel to the field's gradient. We exploit such an analysis so that the swarm can adapt to unknown environments by morphing its shape and maneuvering while still following an ascending direction.
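
A centralized stand-in for the ascending-direction computation (the paper obtains the centroid and direction distributively via consensus): weighting each robot's offset from the swarm centroid by its centered scalar-field reading yields a direction correlated with the field's gradient, with no gradient estimation per robot.

    import numpy as np

    def ascending_direction(positions, readings):
        """Gradient-free ascending direction for a swarm.
        positions: (n, 3) robot positions; readings: (n,) field samples."""
        p_c = positions.mean(axis=0)          # swarm centroid
        z_c = readings.mean()                 # mean field reading
        d = ((readings - z_c)[:, None] * (positions - p_c)).sum(axis=0)
        return d / np.linalg.norm(d)

Robots with above-average readings pull the direction toward themselves, so losing individuals degrades the estimate gracefully rather than breaking it, which is the resiliency property emphasized above.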


[109] 2410.15929

Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.


[110] 2410.15943

Molecular Signal Reception in Complex Vessel Networks: The Role of the Network Topology

The notion of synthetic molecular communication (MC) refers to the transmission of information via molecules and is largely foreseen for use within the human body, where traditional electromagnetic wave (EM)-based communication is impractical. MC is anticipated to enable innovative medical applications, such as early-stage tumor detection, targeted drug delivery, and holistic approaches like the Internet of Bio-Nano Things (IoBNT). Many of these applications involve parts of the human cardiovascular system (CVS), here referred to as networks, posing challenges for MC due to their complex, highly branched vessel structures. To gain a better understanding of how the topology of such branched vessel networks affects the reception of a molecular signal at a target location, e.g., the network outlet, we present a generic analytical end-to-end model that characterizes molecule propagation and reception in linear branched vessel networks (LBVNs). We specialize this generic model to any MC system employing superparamagnetic iron-oxide nanoparticles (SPIONs) as signaling molecules and a planar coil as receiver (RX). By considering components that have been previously established in testbeds, we effectively isolate the impact of the network topology and validate our theoretical model with testbed data. Additionally, we propose two metrics, namely the molecule delay and the multi-path spread, that relate the LBVN topology to the molecule dispersion induced by the network, thereby linking the network structure to the signal-to-noise ratio (SNR) at the target location. This allows the characterization of the SNR at any point in the network solely based on the network topology. Consequently, our framework can, e.g., be exploited for optimal sensor placement in the CVS or identification of suitable testbed topologies for given SNR requirements.


[111] 2410.15946

Neural Predictor for Flight Control with Payload

Aerial robots that transport suspended payloads, acting as freely-floating manipulators, have attracted great interest in recent years. However, prior information about the payload, such as its mass, is often hard to obtain accurately in practice. The force/torque caused by the payload and residual dynamics introduce unmodeled perturbations to the system, which negatively affect closed-loop performance. Unlike estimation-based methods, this paper proposes the Neural Predictor, a learning-based approach that models the force/torque caused by the payload and residual dynamics as a dynamical system. The result is a hybrid model combining first-principles dynamics with learned dynamics. This hybrid model is then integrated into an MPC framework to improve closed-loop performance. The effectiveness of the proposed framework is verified extensively in both numerical simulations and real-world flight experiments. The results indicate that our approach captures the force/torque caused by the payload and residual dynamics accurately, responds quickly to changes in them, and improves closed-loop performance significantly. In particular, the Neural Predictor outperforms a state-of-the-art learning-based estimator, reducing force and torque estimation errors by up to 66.15% and 33.33%, respectively, while using fewer samples.


[112] 2410.16025

Continuum Robot Shape Estimation Using Magnetic Ball Chains

Shape sensing of medical continuum robots is important both for closed-loop control and for enabling the clinician to visualize the robot inside the body. There is a need for inexpensive but accurate shape-sensing technologies. This paper proposes the use of magnetic ball chains as a means of generating shape-specific magnetic fields that can be detected by an external array of Hall effect sensors. Such a ball chain, encased in a flexible polymer sleeve, could be inserted inside the lumen of any continuum robot to provide real-time shape feedback. The sleeve could be removed, as needed, during the procedure to enable use of the entire lumen. To investigate this approach, a shape-sensing model for a steerable catheter tip is derived, and an observability and sensitivity analysis is presented. Experiments show a maximum tip-position estimation error of 7.1% and a mean error of 2.9%, relative to total length.


[113] 2410.16089

Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data

The unique combination of cost, flexibility, speed, and efficiency offered by modern UAVs makes them an attractive choice for many applications in contemporary society. This, however, has caused an ever-increasing number of reported malicious or accidental incidents, making the development of UAV detection and classification mechanisms essential. We propose a methodology for developing a system that fuses already-processed multi-sensor data in a new Deep Neural Network to increase its classification accuracy towards UAV detection. The DNN model fuses high-level features extracted from individual object detection and classification models associated with thermal, optronic, and radar data. Additionally, emphasis is given to the model's Convolutional Neural Network (CNN) based architecture, which combines the features of the three sensor modalities by stacking the extracted image features of the thermal and optronic sensors, achieving higher classification accuracy than each sensor alone.


[114] 2410.16093

Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security

Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across multiple information sources, and potentially the cloud. Unfortunately, the obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends upon the efficient composition and execution of workflow tasks, i.e., combinations of numerical solvers, data analytics, and machine learning. Pacific Northwest National Laboratory's LDRD "Cloud, High-Performance Computing (HPC), and Edge for Science and Security" (CHESS) has developed a set of interrelated capabilities for enabling distributed scientific workflows and curating datasets. This report describes the results and successes of CHESS from the perspective of open science.


[115] 2410.16104

A Deep Unfolding-Based Scalarization Approach for Power Control in D2D Networks

Optimizing network utility in device-to-device networks is typically formulated as a non-convex optimization problem. This paper addresses the scenario where the optimization variables are from a bounded but continuous set, allowing each device to perform power control. The power at each link is optimized to maximize a desired network utility; specifically, we consider the weighted sum rate. The state-of-the-art benchmark for this problem is fractional programming with the quadratic transform, known as FPLinQ. We propose a scalarization approach to transform the weighted sum rate, developing an iterative algorithm that depends on step sizes, a reference, and a direction vector. By employing the deep unfolding approach, we optimize these parameters by presenting the iterative algorithm as a finite sequence of steps, enabling it to be trained as a deep neural network. Numerical experiments demonstrate that the unfolded algorithm performs comparably to the benchmark in most cases while exhibiting lower complexity. Furthermore, the unfolded algorithm generalizes well across varying numbers of users, signal-to-noise ratios, and arbitrary weights. The weighted-sum-rate maximizer can be integrated into a low-complexity fairness scheduler that updates priority weights via virtual queues and Lyapunov Drift-Plus-Penalty, as demonstrated through experiments using proportional and max-min fairness.
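
A minimal PyTorch sketch of the deep-unfolding idea: a fixed number of projected-gradient-ascent iterations on the weighted sum rate become network layers with trainable step sizes (the paper's reference and direction-vector parameters are omitted, and the update rule here is a generic stand-in):

    import torch
    import torch.nn as nn

    class UnfoldedPowerControl(nn.Module):
        """K unrolled projected-gradient-ascent steps on the weighted sum
        rate; the per-iteration step sizes are learned end-to-end."""
        def __init__(self, K=10, p_max=1.0):
            super().__init__()
            self.steps = nn.Parameter(torch.full((K,), 0.1))
            self.p_max = p_max

        def sum_rate(self, p, G, w, noise=1e-2):
            # G[..., i, j]: gain from transmitter j to receiver i.
            sig = p * G.diagonal(dim1=-2, dim2=-1)
            intf = (G * p.unsqueeze(-2)).sum(-1) - sig
            return (w * torch.log2(1 + sig / (intf + noise))).sum(-1)

        def forward(self, G, w):
            p = torch.full(G.shape[:-1], self.p_max,
                           device=G.device, requires_grad=True)
            for step in self.steps:
                rate = self.sum_rate(p, G, w).sum()
                grad, = torch.autograd.grad(rate, p, create_graph=True)
                p = (p + step * grad).clamp(0.0, self.p_max)  # project to box
            return p

Training minimizes the negative sum rate over sampled channel matrices G, backpropagating through all K unrolled iterations, which is what lets a short, fixed-depth unfolding match a longer hand-tuned iterative scheme.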


[116] 2410.16140

Cooperative Multistatic Target Detection in Cell-Free Communication Networks

In this work, we consider the target detection problem in a multistatic integrated sensing and communication (ISAC) scenario characterized by the cell-free MIMO communication network deployment, where multiple radio units (RUs) in the network cooperate with each other for the sensing task. By exploiting the angle resolution from multiple arrays deployed in the network and the delay resolution from the communication signals, i.e., orthogonal frequency division multiplexing (OFDM) signals, we formulate a cooperative sensing problem with coherent data fusion of multiple RUs' observations and propose a sparse Bayesian learning (SBL)-based method, where the global coordinates of target locations are directly detected. Intensive numerical results indicate promising target detection performance of the proposed SBL-based method. Additionally, a theoretical analysis of the considered cooperative multistatic sensing task is provided using the pairwise error probability (PEP) analysis, which can be used to provide design insights, e.g., illumination and beam patterns, for the considered problem.


[117] 2410.16175

Spiking Neural Networks as a Controller for Emergent Swarm Agents

Drones which can swarm and loiter in a certain area cost hundreds of dollars, but mosquitos can do the same and are essentially worthless. To control swarms of low-cost robots, researchers may end up spending countless hours brainstorming robot configurations and policies to "organically" create behaviors which do not need expensive sensors and perception. Existing research explores the possible emergent behaviors in swarms of robots with only a binary sensor and a simple but hand-picked controller structure. Even agents in this highly limited sensing, actuation, and computational capability class can exhibit relatively complex global behaviors such as aggregation, milling, and dispersal, but finding the local interaction rules that enable more collective behaviors remains a significant challenge. This paper investigates the feasibility of training spiking neural networks to find those local interaction rules that result in particular emergent behaviors. We focus on simulating a specific milling behavior already known to be producible using very simple binary sensing and acting agents. To do this, we use evolutionary algorithms to evolve not only the parameters (the weights, biases, and delays) of a spiking neural network, but also its structure. To create a baseline, we also show an evolutionary search strategy over the parameters for the incumbent hand-picked binary controller structure. Our simulations show that spiking neural networks can be evolved in binary sensing agents to form a mill.


[118] 2410.16227

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car.