New articles on Electrical Engineering and Systems Science


[1] 2406.12931

Automatic Speech Recognition for Biomedical Data in Bengali Language

This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.


[2] 2406.12937

Self-Train Before You Transcribe

When there is a mismatch between the training and test domains, current speech recognition systems show significant performance degradation. Self-training methods, such as noisy student teacher training, can help address this and enable the adaptation of models under such domain shifts. However, self-training typically requires a collection of unlabelled target domain data. For settings where this is not practical, we investigate the benefit of performing noisy student teacher training on recordings in the test set as a test-time adaptation approach. Similar to the dynamic evaluation approach in language modelling, this enables the transfer of information across utterance boundaries and functions as a method of domain adaptation. A range of in-domain and out-of-domain datasets are used for experiments demonstrating large relative gains of up to 32.2%. Interestingly, our method shows larger gains than the typical self-training setup that utilises separate adaptation data.


[3] 2406.12943

A square cross-section FOV rotational CL (SC-CL) and its analytical reconstruction method

Rotational computed laminography (CL) has broad application potential in three-dimensional imaging of plate-like objects, as it only requires x-rays to pass through the tested object in the thickness direction during the imaging process. In this study, a square cross-section FOV rotational CL (SC-CL) was proposed. Then, the FDK-type analytical reconstruction algorithm applicable to the SC-CL was derived. On this basis, the proposed method was validated through numerical experiments.


[4] 2406.12946

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.


[5] 2406.12998

Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.


[6] 2406.13059

Learned Compression of Encoding Distributions

The entropy bottleneck introduced by Ballé et al. is a common component used in many learned compression models. It encodes a transformed latent representation using a static distribution whose parameters are learned during training. However, the actual distribution of the latent data may vary wildly across different inputs. The static distribution attempts to encompass all possible input distributions, thus fitting none of them particularly well. This unfortunate phenomenon, sometimes known as the amortization gap, results in suboptimal compression. To address this issue, we propose a method that dynamically adapts the encoding distribution to match the latent data distribution for a specific input. First, our model estimates a better encoding distribution for a given input. This distribution is then compressed and transmitted as an additional side-information bitstream. Finally, the decoder reconstructs the encoding distribution and uses it to decompress the corresponding latent data. Our method achieves a Bjøntegaard-Delta (BD)-rate gain of -7.10% on the Kodak test dataset when applied to the standard fully-factorized architecture. Furthermore, considering computational complexity, the transform used by our method is an order of magnitude cheaper in terms of Multiply-Accumulate (MAC) operations compared to related side-information methods such as the scale hyperprior.
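The amortization gap described above can be made concrete with a toy calculation (hypothetical distributions, not the paper's model): coding symbols drawn from an input's actual distribution p with a static training-time distribution q costs the cross-entropy H(p, q) bits per symbol, while an input-adapted distribution costs only the entropy H(p), and the difference equals KL(p || q).

```python
import numpy as np

def bits_per_symbol(p, q):
    """Expected code length when data distributed as p is coded with q."""
    return float(-np.sum(p * np.log2(q)))

p = np.array([0.7, 0.2, 0.1])   # actual latent distribution for one input
q = np.array([1/3, 1/3, 1/3])   # static, amortized encoding distribution

static_cost = bits_per_symbol(p, q)   # cross-entropy H(p, q)
adapted_cost = bits_per_symbol(p, p)  # entropy H(p): the adapted lower bound

gap = static_cost - adapted_cost      # always >= 0; this is the amortization gap
```

Transmitting the adapted distribution as side information pays off whenever the gap exceeds the side-information rate.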


[7] 2406.13139

Audio Fingerprinting with Holographic Reduced Representations

This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convolution and summation, resulting in fewer fingerprints with the same dimensional space as the original. Our search method efficiently finds a combined fingerprint in which a query fingerprint exists. Using HRR's inverse operation, it can recover the relative position within a combined fingerprint, retaining the original time resolution. Experiments show that our method can reduce the number of fingerprints with modest accuracy degradation while maintaining the time resolution, outperforming simple decimation and summation-based aggregation methods.
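The binding-and-recovery mechanism described above can be sketched in a few lines (toy random vectors of assumed dimension 256 stand in for the neural fingerprints; the position keys are an illustrative choice): binding is circular convolution, computed via the FFT, and unbinding uses circular correlation with the same key.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

def bind(a, b):
    """Circular convolution of two vectors via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=d)

def unbind(c, key):
    """Circular correlation with key: approximate inverse of bind."""
    return np.fft.irfft(np.fft.rfft(c) * np.conj(np.fft.rfft(key)), n=d)

# Four unit-norm stand-in fingerprints and one random key per time position.
fps = [v / np.linalg.norm(v) for v in rng.standard_normal((4, d))]
keys = [v / np.linalg.norm(v) for v in rng.standard_normal((4, d))]

# A single composite fingerprint stores all four position-bound fingerprints.
composite = np.sum([bind(k, f) for k, f in zip(keys, fps)], axis=0)

# Unbinding with key 2 recovers a noisy copy of fingerprint 2; a cosine
# similarity check against the stored fingerprints identifies the position.
recovered = unbind(composite, keys[2])
sims = [np.dot(recovered, f) / np.linalg.norm(recovered) for f in fps]
best = int(np.argmax(sims))   # expected: 2
```

The recovered vector is only approximate, but with high-dimensional random keys the correct fingerprint dominates the similarity ranking, which is what allows the method to retain time resolution with fewer stored vectors.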


[8] 2406.13145

Constructing and Evaluating Digital Twins: An Intelligent Framework for DT Development

The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, specifically designed to enhance the accuracy and utility of DTs in testing algorithmic performance. We propose a novel construction methodology that integrates deep learning-based policy gradient techniques to dynamically tune the DT parameters, ensuring high fidelity in the digital replication of physical systems. Moreover, the Mean STate Error (MSTE) is proposed as a robust metric for evaluating the performance of algorithms within these digital spaces. The efficacy of our framework is demonstrated through extensive simulations that show our DT not only accurately mirrors the physical reality but also provides a reliable platform for algorithm evaluation. This work lays a foundation for future research into DT technologies, highlighting pathways for both theoretical enhancements and practical implementations in various industries.


[9] 2406.13150

MCAD: Multi-modal Conditioned Adversarial Diffusion Model for High-Quality PET Image Reconstruction

Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g., patients' clinical tabular data, resulting in compromised reconstruction with limited diagnostic utility. Moreover, they often overlook the semantic consistency between real SPET and reconstructed images, leading to distorted semantic contexts. To tackle these problems, we propose a novel Multi-modal Conditioned Adversarial Diffusion model (MCAD) to reconstruct SPET images from multi-modal inputs, including LPET images and clinical tabular data. Specifically, our MCAD incorporates a Multi-modal conditional Encoder (Mc-Encoder) to extract multi-modal features, followed by a conditional diffusion process to blend noise with multi-modal features and gradually map blended features to the target SPET images. To balance multi-modal inputs, the Mc-Encoder embeds Optimal Multi-modal Transport co-Attention (OMTA) to narrow the heterogeneity gap between image and tabular data while capturing their interactions, providing sufficient guidance for reconstruction. In addition, to mitigate semantic distortions, we introduce the Multi-Modal Masked Text Reconstruction (M3TRec), which leverages semantic knowledge extracted from denoised PET images to restore the masked clinical tabular data, thereby compelling the network to maintain accurate semantics during reconstruction. To expedite the diffusion process, we further introduce an adversarial diffusive network with a reduced number of diffusion steps. Experiments show that our method achieves state-of-the-art performance both qualitatively and quantitatively.


[10] 2406.13165

Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World Model

Echocardiography is the only technique capable of real-time imaging of the heart and is vital for diagnosing the majority of cardiac diseases. However, there is a severe shortage of experienced cardiac sonographers, due to the heart's complex structure and significant operational challenges. To mitigate this situation, we present a Cardiac Copilot system capable of providing real-time probe movement guidance to assist less experienced sonographers in conducting freehand echocardiography. This system can enable non-experts, especially in primary departments and medically underserved areas, to perform cardiac ultrasound examinations, potentially improving global healthcare delivery. The core innovation lies in proposing a data-driven world model, named Cardiac Dreamer, for representing cardiac spatial structures. This world model can provide structure features of any cardiac planes around the current probe position in the latent space, serving as a precise navigation map for autonomous plane localization. We train our model with real-world ultrasound data and corresponding probe motion from 110 routine clinical scans with 151K sample pairs by three certified sonographers. Evaluations on three standard planes with 37K sample pairs demonstrate that the world model can reduce navigation errors by up to 33% and exhibit more stable performance.


[11] 2406.13191

GPU-Accelerated DCOPF using Gradient-Based Optimization

DC Optimal Power Flow (DCOPF) is a key operational tool for power system operators, and it is embedded as a subproblem in many challenging optimization problems (e.g., line switching). However, traditional CPU-based solve routines (e.g., simplex) have saturated in speed and are hard to parallelize. This paper focuses on solving DCOPF problems using gradient-based routines on Graphics Processing Units (GPUs), which have massive parallelization capability. To formulate these problems, we pose a Lagrange dual associated with DCOPF (linear and quadratic cost curves), and then we explicitly solve the inner (primal) minimization problem with a dual norm. The resulting dual problem can be efficiently iterated using projected gradient ascent. After solving the dual problem on both CPUs and GPUs to find tight lower bounds, we benchmark against Gurobi and MOSEK, comparing convergence speed and tightness on the IEEE 2000 and 10000 bus systems. We provide reliable and tight lower bounds for these problems with, at best, 5.4x speedup over a conventional solver.
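The dual strategy above can be illustrated on a tiny quadratic program rather than an actual DCOPF instance (a generic sketch, not the paper's formulation): min 0.5 x'Qx + c'x subject to Ax <= b. The inner primal minimizer is closed-form, and the dual is maximized by projected gradient ascent, with projection onto the nonnegative orthant; every dual value along the way is a valid lower bound on the primal optimum.

```python
import numpy as np

Q = np.eye(2)
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])   # single constraint: x1 + x2 <= 1
b = np.array([1.0])
Qinv = np.linalg.inv(Q)

def inner_min(lam):
    """x minimizing the Lagrangian for fixed multipliers lam (closed form)."""
    return -Qinv @ (c + A.T @ lam)

def dual_value(lam):
    x = inner_min(lam)
    return 0.5 * x @ Q @ x + c @ x + lam @ (A @ x - b)

lam = np.zeros(1)
for _ in range(500):
    x = inner_min(lam)
    lam = np.maximum(0.0, lam + 0.1 * (A @ x - b))  # projected ascent step

x = inner_min(lam)             # approx [0.5, 0.5]
lower_bound = dual_value(lam)  # approx -0.75, tight by strong duality here
```

Each ascent step only needs matrix-vector products and an elementwise projection, which is the structure that maps well onto GPUs.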


[12] 2406.13194

A Hybrid Intelligent System for Protection of Transmission Lines Connected to PV Farms based on Linear Trends

Conventional relays face challenges for transmission lines connected to inverter-based resources (IBRs). In this article, a single-ended intelligent protection of the transmission line in the zone between the grid and the PV farm is suggested. The method employs a fuzzy logic and random forest (RF)-based hybrid system to detect faults based on combined linear trend attributes of the 3-phase currents. The fault location is determined and the faulty phase is detected. RF feature selection is used to obtain the optimal linear trend feature. The performance of the methodology is examined for abnormal events such as faults, capacitor and load-switching operations simulated in PSCAD/EMTDC on the IEEE 9-bus system, obtained by varying various fault and switching parameters. Additionally, when validating the suggested strategy, consideration is given to the effects of conditions such as the presence of double circuit lines, PV capacity, sampling rate, data window length, noise, high impedance faults, CT saturation, compensation devices, evolving and cross-country faults, and far-end and near-end faults. The findings indicate that the suggested strategy can be used to deal with a variety of system configurations and situations while still safeguarding such complex power transmission networks.


[13] 2406.13205

Application of Computer Deep Learning Model in Diagnosis of Pulmonary Nodules

The 3D simulation model of the lung was established by using the reconstruction method. A computer aided pulmonary nodule detection model was constructed. The process iterates over the images to refine the lung nodule recognition model based on neural networks. It is integrated with 3D virtual modeling technology to improve the interactivity of the system, so as to achieve intelligent recognition of lung nodules. A 3D RCNN (Region-based Convolutional Neural Network) was utilized for feature extraction and nodule identification. The LUNA16 large sample database was used as the research dataset. FROC (Free-response Receiver Operating Characteristic) analysis was applied to evaluate the model, calculating sensitivity at various false positive rates to derive the average FROC. Compared with conventional diagnostic methods, the recognition rate was significantly improved. This technique facilitates the detection of pulmonary abnormalities at an initial phase, which holds immense value for the prompt diagnosis of lung malignancies.
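The averaged FROC evaluation mentioned above can be sketched as follows (the candidate scores and labels are made up; on LUNA16 the average is conventionally taken over fixed false-positive rates per scan): candidates are ranked by confidence, sensitivity is read off at each allowed false-positive budget, and the values are averaged.

```python
import numpy as np

def froc_average(scores, is_true_nodule, n_nodules, n_scans,
                 fp_rates=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity at the given false-positive rates per scan."""
    order = np.argsort(scores)[::-1]           # rank candidates by confidence
    hits = np.cumsum(is_true_nodule[order])    # true detections so far
    fps = np.cumsum(~is_true_nodule[order])    # false positives so far
    sens = []
    for r in fp_rates:
        # best operating point with at most r false positives per scan
        ok = fps <= r * n_scans
        sens.append(hits[ok].max() / n_nodules if ok.any() else 0.0)
    return float(np.mean(sens))

# Hypothetical ranked candidates from one scan: 3 true nodules, 3 FPs.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([True, False, True, True, False, False])
avg_sens = froc_average(scores, labels, n_nodules=3, n_scans=1)
```

Averaging over several operating points rewards detectors that stay sensitive even under tight false-positive budgets, rather than only at a single threshold.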


[14] 2406.13209

Diffusion Model-based FOD Restoration from High Distortion in dMRI

Fiber orientation distributions (FODs) are a popular model to represent the diffusion MRI (dMRI) data. However, imaging artifacts such as susceptibility-induced distortion in dMRI can cause signal loss and lead to the corrupted reconstruction of FODs, which prohibits successful fiber tracking and connectivity analysis in affected brain regions such as the brain stem. Generative models, such as the diffusion models, have been successfully applied in various image restoration tasks. However, their application on FOD images poses unique challenges since FODs are 4-dimensional data represented by spherical harmonics (SPHARM) with the fourth dimension exhibiting order-related dependency. In this paper, we propose a novel diffusion model for FOD restoration that can recover the signal loss caused by distortion artifacts. We use volume-order encoding to enhance the ability of the diffusion model to generate individual FOD volumes at all SPHARM orders. Moreover, we add cross-attention features extracted across all SPHARM orders in generating every individual FOD volume to capture the order-related dependency across FOD volumes. We also condition the diffusion model with low-distortion FODs surrounding high-distortion areas to maintain the geometric coherence of the generated FODs. We trained and tested our model using data from the UK Biobank (n = 1315). On a test set with ground truth (n = 43), we demonstrate the high accuracy of the generated FODs in terms of root mean square errors of FOD volumes and angular errors of FOD peaks. We also apply our method to a test set with large distortion in the brain stem area (n = 1172) and demonstrate the efficacy of our method in restoring the FOD integrity and, hence, greatly improving tractography performance in affected brain regions.


[15] 2406.13266

Advancements in Orthopaedic Arm Segmentation: A Comprehensive Review

Recent advances in medical imaging, particularly in the interpretation of X-ray images, have transformed diagnosis and are actively applied across the healthcare sector. The advent of digital image processing technology and the implementation of deep learning models such as Convolutional Neural Networks (CNNs) have made the analysis of X-rays much more accurate and efficient. In this article, essential techniques such as edge detection, region growing, and thresholding are reviewed alongside deep learning models such as variants of YOLOv8, a state-of-the-art object detection and segmentation framework. We further show that traditional image processing techniques such as segmentation are comparatively simple and provide an alternative to the more advanced methods. Our review offers practical knowledge on the use of both innovative and traditional approaches to X-ray interpretation. This information will help professionals and researchers gain a deeper understanding of digital interpretation techniques in medical imaging.


[16] 2406.13268

CEC: A Noisy Label Detection Method for Speaker Recognition

Noisy labels are inevitable, even in well-annotated datasets. The detection of noisy labels is of significant importance to enhance the robustness of speaker recognition models. In this paper, we propose a novel noisy label detection approach based on two new statistical metrics: Continuous Inconsistent Counting (CIC) and Total Inconsistent Counting (TIC). These metrics are calculated through Cross-Epoch Counting (CEC) and correspond to the early and late stages of training, respectively. Additionally, we categorize samples based on their prediction results into three categories: inconsistent samples, hard samples, and easy samples. During training, we gradually increase the difficulty of hard samples to update model parameters, preventing noisy labels from being overfitted. Compared to contrastive schemes, our approach not only achieves the best performance in speaker verification but also excels in noisy label detection.


[17] 2406.13312

Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic Convolution

Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it involves a substantial increase in model size due to multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates static conventional 2D convolution branch output and dynamic FDY conv branch output in order to minimize model size increase while maintaining the performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution module, achieving a 3.2% improvement in polyphonic sound detection score (PSDS) over FDY conv. Proposed methods with extensive ablation studies further enhance understanding and usability of FDY conv variants.


[18] 2406.13337

Medical Spoken Named Entity Recognition

Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorize them into types such as person, location, and organization. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that the pre-trained multilingual XLM-R models outperformed all monolingual models on both reference text and ASR output. In general, encoders also perform better than sequence-to-sequence models for the NER task. By simply translating the transcripts, the dataset can be extended beyond Vietnamese to other languages. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed


[19] 2406.13374

State Anti-windup: A New Methodology for Tackling State Constraints at Both Synthesis and Implementation Levels

The anti-windup compensation typically addresses strict control limitations in control systems. However, there is a clear need for an equivalent solution for the states/outputs of the system. This paper introduces a novel methodology for the state anti-windup compensator. Unlike state-constrained control methods, which often focus on incorporating soft constraints into the design or fail to react adequately to constraint violations in practical settings, the proposed methodology treats state constraints as implementation-oriented soft-hard constraints. This is achieved by integrating a saturation block within the structure of the safety compensator, referred to as the state anti-windup (SANTW) compensator. Similar to input anti-windup schemes, the SANTW design is separated from the nominal controller design. The problem is formulated as a disturbance rejection problem to directly minimize the saturation. The paper develops two Hinf optimization frameworks using frequency-domain solutions and linear matrix inequalities. It then addresses constraints on both inputs and states, resulting in a unified Input-State Anti-windup (IS-ANTW) compensator synthesized using non-smooth Hinf optimization. This method also offers the flexibility of having a fixed-order compensator, crucial in many practical applications. Additionally, the study evaluates the proposed compensator's performance in managing current fluctuations from renewable energy sources during grid faults, demonstrating its effectiveness through detailed Electromagnetic Transient (EMT) simulations of grid-connected DC-AC converters.


[20] 2406.13385

Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing

Audio segmentation is a key task for many speech technologies, most of which are based on neural networks, usually considered as black boxes, with high-level performances. However, in many domains, among which health or forensics, there is not only a need for good performance but also for explanations about the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF) which is a good candidate for the design of interpretable representations. This paper shows that our model reaches good segmentation performances, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
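The NMF building block at the heart of the approach above can be sketched with the standard multiplicative updates (this is generic NMF on a stand-in matrix, not the paper's full segmentation model): a non-negative feature matrix V is factorized as V ≈ W H, where the columns of W act as interpretable components and H gives their activations over time.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 50))      # stand-in for a non-negative time-feature matrix
k = 4                         # number of latent components
W = rng.random((20, k)) + 1e-3
H = rng.random((k, 50)) + 1e-3
eps = 1e-9                    # guards against division by zero

# Lee-Seung multiplicative updates for the Frobenius-norm loss; both
# factors stay non-negative by construction, which is what makes the
# learned components candidates for interpretation.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)   # relative error
```

The non-negativity constraint is the key design choice: components can only add energy, never cancel it, so each one tends to correspond to a recognizable part of the signal.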


[21] 2406.13386

Online Domain-Incremental Learning Approach to Classify Acoustic Scenes in All Locations

In this paper, we propose a method for online domain-incremental learning of acoustic scene classification from a sequence of different locations. Simply training a deep learning model on a sequence of different locations leads to forgetting of previously learned knowledge. In this work, we only correct the statistics of the Batch Normalization layers of a model using a few samples to learn the acoustic scenes from a new location without any excessive training. Experiments are performed on acoustic scenes from 11 different locations, with an initial task containing acoustic scenes from 6 locations and the remaining 5 incremental tasks each representing the acoustic scenes from a different location. The proposed approach outperforms fine-tuning based methods and achieves an average accuracy of 48.8% after learning the last task in sequence without forgetting acoustic scenes from the previously learned locations.
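The statistics-correction idea above can be sketched in plain numpy with hypothetical shapes (a real model would update the running-statistic buffers of its Batch Normalization layers, e.g. in PyTorch): only the normalization statistics are re-estimated from a few samples of the new location, and no weights are retrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Running statistics accumulated on the initial training locations.
running_mean, running_var = 0.0, 1.0

# Features from a new location are shifted and scaled (domain shift).
target_features = rng.normal(loc=3.0, scale=2.0, size=(16, 64))

# The old statistics normalize the new domain poorly...
bad = (target_features - running_mean) / np.sqrt(running_var)

# ...so re-estimate mean and variance from the few available target samples.
new_mean = target_features.mean()
new_var = target_features.var()
good = (target_features - new_mean) / np.sqrt(new_var)

# After correction the features are approximately zero-mean, unit-variance,
# matching what the downstream layers saw during training.
```

Because no gradients are computed, this adaptation is cheap and leaves the weights, and hence the knowledge of previous locations, untouched.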


[22] 2406.13396

Safe and Non-Conservative Trajectory Planning for Autonomous Driving Handling Unanticipated Behaviors of Traffic Participants

Trajectory planning for autonomous driving is challenging because the unknown future motion of traffic participants must be accounted for, yielding large uncertainty. Stochastic Model Predictive Control (SMPC)-based planners provide non-conservative planning, but do not rule out a (small) probability of collision. We propose a control scheme that yields an efficient trajectory based on SMPC when the traffic scenario allows, still avoiding that the vehicle causes collisions with traffic participants if the latter move according to the prediction assumptions. If some traffic participant does not behave as anticipated, no safety guarantee can be given. Then, our approach yields a trajectory which minimizes the probability of collision, using Constraint Violation Probability Minimization techniques. Our algorithm can also be adapted to minimize the anticipated harm caused by a collision. We provide a thorough discussion of the benefits of our novel control scheme and compare it to a previous approach through numerical simulations from the CommonRoad database.


[23] 2406.13413

Recurrent Inference Machine for Medical Image Registration

Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods have become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input. We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only 5% of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.


[24] 2406.13420

The effect of control barrier functions on energy transfers in controlled physical systems

Using a port-Hamiltonian formalism, we show the qualitative and quantitative effect of safety-critical control implemented with control barrier functions (CBFs) on the power balance of controlled physical systems. The presented results will provide novel tools to design CBFs inducing desired energetic behaviors of the closed-loop system, including nontrivial damping injection effects and non-passive control actions, effectively injecting energy in the system in a controlled manner. Simulations validate the stated results.


[25] 2406.13441

Robust Melanoma Thickness Prediction via Deep Transfer Learning enhanced by XAI Techniques

This study focuses on analyzing dermoscopy images to determine the depth of melanomas, which is a critical factor in diagnosing and treating skin cancer. The Breslow depth, measured from the top of the granular layer to the deepest point of tumor invasion, serves as a crucial parameter for staging melanoma and guiding treatment decisions. This research aims to improve the prediction of the depth of melanoma through the use of machine learning models, specifically deep learning, while also analyzing the possible existence of a gradation in image characteristics that correlates with the depth of the melanomas. Various datasets, including ISIC and private collections, were used, comprising a total of 1162 images. The datasets were combined and balanced to ensure robust model training. The study utilized pre-trained Convolutional Neural Networks (CNNs). Results indicated that the models achieved significant improvements over previous methods. Additionally, the study conducted a correlation analysis between the model's predictions and actual melanoma thickness, revealing a moderate correlation that improves with higher thickness values. Explainability methods such as feature visualization through Principal Component Analysis (PCA) demonstrated the capability of deep features to distinguish between different depths of melanoma, providing insight into the data distribution and model behavior. In summary, this research presents a dual contribution: enhancing the state-of-the-art classification results through advanced training techniques and offering a detailed analysis of the data and model behavior to better understand the relationship between dermoscopy images and melanoma thickness.


[26] 2406.13462

Design of Phase Locked Loop in 180 nm Technology

The presented paper introduces a design for a phase-locked loop (PLL) that is utilized in frequency synthesis and modulation-demodulation within communication systems and in VLSI applications. The CMOS PLL is designed using 180 nm Fabrication Technology on Cadence Virtuoso Tool with a supply voltage of 1.8 V. The performance is evaluated through simulations and measurements, which demonstrate its ability to track and lock onto the input frequency. The PLL is a frequency synthesizer implemented to generate a 2.4 GHz frequency. The input reference clock from a crystal oscillator is a 150 MHz square wave. Negative feedback is given by a divide-by-16 frequency divider, ensuring the phase and frequency synchronization between the divided signal and the reference signal. The design has essential components such as a phase frequency detector, charge pump, loop filter, current-starved voltage-controlled oscillator (CSVCO), and frequency divider. Through their collaborative operation, the system generates an output frequency that is 16 times the input frequency. The centre frequency of the 3-stage CSVCO is 3.208 GHz at 900 mV input voltage. With an input voltage ranging from 0.4 V to 1.8 V, the VCO offers a tuning range that spans from 1.066 GHz to 3.731 GHz. The PLL demonstrates a lock-in range spanning from 70.4 MHz to 173 MHz, with an output frequency range of 1.12 GHz to 2.78 GHz. It achieves a lock time of 260.03 ns and consumes a maximum power of 5.15 mW at 2.4 GHz.


[27] 2406.13464

An Efficient yet High-Performance Method for Precise Radar-Based Imaging of Human Hand Poses

Contactless hand pose estimation requires sensors that provide precise spatial information and low computational complexity for real-time processing. Unlike vision-based systems, radar offers lighting independence and direct motion assessments. Yet, there is limited research balancing real-time constraints, suitable frame rates for motion evaluations, and the need for precise 3D data. To address this, we extend the ultra-efficient two-tone hand imaging method from our prior work to a three-tone approach. Maintaining high frame rates and real-time constraints, this approach significantly enhances reconstruction accuracy and precision. We assess these measures by evaluating reconstruction results for different hand poses obtained by an imaging radar. Accuracy is assessed against ground truth from a spatially calibrated photogrammetry setup, while precision is measured using 3D-printed hand poses. The results emphasize the method's great potential for future radar-based hand sensing.


[28] 2406.13470

Automatic Voice Classification Of Autistic Subjects

Autism Spectrum Disorders (ASD) describe a heterogeneous set of conditions classified as neurodevelopmental disorders. Although the mechanisms underlying ASD are not yet fully understood, more recent literature has focused on multiple genetic and/or environmental risk factors. The heterogeneity of symptoms, especially in milder forms of this condition, can be a challenge for the clinician. In this work, an automatic speech classification algorithm is proposed to characterize the prosodic elements that best distinguish autism, to support the traditional diagnosis. The performance of the proposed algorithm is evaluated by testing the classification algorithms on a dataset composed of recorded speech, collected from both autistic and non-autistic subjects.


[29] 2406.13471

Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement (SE) research, as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass with the discriminative model for multiple RDP steps, thus greatly reducing computations. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps with the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which has no significant performance degradation compared to offline models.


[30] 2406.13522

Measured-state conditioned recursive feasibility for stochastic model predictive control

In this paper, we address the problem of designing stochastic model predictive control (MPC) schemes for linear systems affected by unbounded disturbances. The contribution of the paper is twofold. First, motivated by the difficulty of guaranteeing recursive feasibility in this framework, due to the nonzero probability of violating chance constraints in the case of unbounded noise, we introduce the novel definition of measured-state conditioned recursive feasibility in expectation. Second, we construct a stochastic MPC scheme, based on the introduction of ellipsoidal probabilistic reachable sets, which implements a closed-loop initialization strategy, i.e., the current measured state is employed to initialize the optimization problem. This new scheme is proven to satisfy the novel definition of recursive feasibility, and its superiority over open-loop initialization schemes, which follows from never neglecting the information carried by the current measurement, is shown through numerical examples.


[31] 2406.13526

Using Geometrical Information to Measure the Vibration of a Swaying Millimeter-wave Radar

This paper presents two new, simple yet effective approaches to measure the vibration of a swaying millimeter-wave radar (mmRadar) utilizing geometrical information. Specifically, for planar vibrations, we first establish an equation based on the area difference between the swaying mmRadar and the reference objects at different moments, which enables the quantification of planar displacement. Second, volume differences are utilized with the same idea, achieving self-vibration measurement of a swaying mmRadar for spatial vibrations. Experimental results confirm the effectiveness of our methods, demonstrating their capability to estimate both the amplitude and a coarse direction of the mmRadar's self-vibration.


[32] 2406.13645

Advancing UWF-SLO Vessel Segmentation with Source-Free Active Domain Adaptation and a Novel Multi-Center Dataset

Accurate vessel segmentation in Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images is crucial for diagnosing retinal diseases. Although recent techniques have shown encouraging outcomes in vessel segmentation, models trained on one medical dataset often underperform on others due to domain shifts. Meanwhile, manually labeling high-resolution UWF-SLO images is an extremely challenging, time-consuming and expensive task. In response, this study introduces a pioneering framework that leverages a patch-based active domain adaptation approach. By actively recommending a few valuable image patches through the devised Cascade Uncertainty-Predominance (CUP) selection strategy for labeling and model fine-tuning, our method significantly improves the accuracy of UWF-SLO vessel segmentation across diverse medical centers. In addition, we annotate and construct the first Multi-center UWF-SLO Vessel Segmentation (MU-VS) dataset to promote research on this topic; it comprises data from multiple institutions. This dataset serves as a valuable resource for cross-center evaluation, verifying the effectiveness and robustness of our approach. Experimental results demonstrate that our approach surpasses existing domain adaptation and active learning methods, considerably reducing the gap between the upper and lower bounds with minimal annotations, highlighting our method's practical clinical value. We will release our dataset and code to facilitate relevant research: https://github.com/whq-xxh/SFADA-UWF-SLO.


[33] 2406.13650

Advanced Maximum Adhesion Tracking Strategies in Railway Traction Drives

Modern railway traction systems are often equipped with anti-slip control strategies to comply with performance and safety requirements. A certain amount of slip is needed to increase the torque transferred by the traction motors onto the rail. Commonly, constant slip control is used to limit the slip velocity between wheel and rail, avoiding excessive slippage and vehicle derailment. This comes at the price of not fully utilizing the train's traction and braking capabilities. Finding the slip at which the maximum traction force occurs is challenging due to the non-linear relationship between slip and the wheel-rail adhesion coefficient, as well as its dependence on rail and wheel conditions. Perturb and observe (P\&O) and steepest gradient (SG) methods have been reported for the Maximum Adhesion Tracking (MAT) search. However, both methods exhibit weaknesses. Two new MAT strategies are proposed in this paper which overcome the limitations of existing methods, using a Fuzzy Logic Controller (FLC) and Particle Swarm Optimization (PSO), respectively. Existing and proposed methods are first simulated and then validated experimentally using a scaled roller rig under identical conditions. The results show that the proposed methods improve the traction capability with lower search time and fewer oscillations compared to existing solutions. Tuning complexity and computational requirements are also shown to be favorable to the proposed methods.


[34] 2406.13651

CLAMP: Majorized Plug-and-Play for Coherent 3D LIDAR Imaging

Coherent LIDAR uses a chirped laser pulse for 3D imaging of distant targets. However, existing coherent LIDAR image reconstruction methods do not account for the system's aperture, resulting in sub-optimal resolution. Moreover, these methods use majorization-minimization for computational efficiency, but do so without a theoretical treatment of convergence. In this paper, we present Coherent LIDAR Aperture Modeled Plug-and-Play (CLAMP) for multi-look coherent LIDAR image reconstruction. CLAMP uses multi-agent consensus equilibrium (a form of PnP) to combine a neural network denoiser with an accurate physics-based forward model. CLAMP introduces an FFT-based method to account for the effects of the aperture and uses majorization of the forward model for computational efficiency. We also formalize the use of majorization-minimization in consensus optimization problems and prove convergence to the exact consensus equilibrium solution. Finally, we apply CLAMP to synthetic and measured data to demonstrate its effectiveness in producing high-resolution, speckle-free, 3D imagery.


[35] 2406.13674

Rethinking Abdominal Organ Segmentation (RAOS) in the clinical scenario: A robustness evaluation benchmark with challenging cases

Deep learning has enabled great strides in abdominal multi-organ segmentation, even surpassing junior oncologists on common cases or organs. However, robustness on corner cases and complex organs remains a challenging open problem for clinical adoption. To investigate model robustness, we collected and annotated the RAOS dataset comprising 413 CT scans ($\sim$80k 2D images, $\sim$8k 3D organ annotations) from 413 patients, each with 17 (female) or 19 (male) labelled organs manually delineated by oncologists. We grouped scans based on clinical information into 1) diagnosis/radiotherapy (317 volumes), 2) partial excision without the whole organ missing (22 volumes), and 3) excision with the whole organ missing (74 volumes). RAOS provides a potential benchmark for evaluating model robustness, including organ hallucination. It also includes some organs that are very hard to find in public datasets, such as the rectum, colon, intestine, prostate and seminal vesicles. We benchmarked several state-of-the-art methods in these three clinical groups to evaluate performance and robustness. We also assessed cross-generalization between RAOS and three public datasets. This dataset and comprehensive analysis establish a potential baseline for future robustness research: \url{https://github.com/Luoxd1996/RAOS}.


[36] 2406.13705

EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module guides the model to adapt to different exposure levels and perform targeted image enhancement, while the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules further boost the concurrent representation learning between the prompt parameters and features. In addition, the U-shaped restoration DFT model captures the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.


[37] 2406.13707

Safety-Critical Formation Control of Non-Holonomic Multi-Robot Systems in Communication-Limited Environments

This paper presents a robust estimator-based safety-critical controller for formation control of non-holonomic mobile robots in communication-limited environments. The proposed decentralized framework integrates a robust state estimator with a formation tracking control law that guarantees inter-agent collision avoidance using control barrier functions. String stability is incorporated into the control design to maintain stability against noise from predecessors in leader-follower formations. Rigorous stability analysis using Lyapunov functions ensures the stability of estimation errors and the convergence of the formation to desired configurations. The effectiveness and robustness of the proposed approach are validated through numerical simulations of various maneuvers and realistic Gazebo experiments involving formations in a warehouse environment. The results demonstrate the controller's ability to maintain safety, achieve precise formation control, and mitigate disturbances in scenarios without inter-robot communication.


[38] 2406.13708

Low-rank based motion correction followed by automatic frame selection in DT-CMR

Motivation: Post-processing of in-vivo diffusion tensor CMR (DT-CMR) is challenging due to the low SNR and the variation in contrast between frames, which make image registration difficult, as well as the need to manually reject frames corrupted by motion. Goals: To develop a semi-automatic post-processing pipeline for robust DT-CMR registration and automatic frame selection. Approach: We used low intrinsic rank averaged frames as the reference to register other low-ranked frames. A myocardium-guided frame selection rejected the frames with signal loss, through-plane motion and poor registration. Results: The proposed method outperformed our previous noise-robust rigid registration on helix angle data quality and reduced negative eigenvalues in healthy volunteers.


[39] 2406.13709

A Study on the Effect of Color Spaces in Learned Image Compression

In this work, we present a comparison between color spaces, namely YUV, LAB and RGB, and their effect on learned image compression. For this we use the structure and color based learned image codec (SLIC) from our prior work, which consists of two branches - one for the luminance component (Y or L) and another for the chrominance components (UV or AB). For the RGB variant, however, we input all 3 channels in a single branch, similar to most learned image codecs operating in RGB. The models are trained for multiple bitrate configurations in each color space. We report the findings from our experiments by evaluating them on various datasets and compare the results to state-of-the-art image codecs. The YUV model performs better than the LAB variant in terms of MS-SSIM, with a Bj{\o}ntegaard delta bitrate (BD-BR) gain of 7.5\% using VTM intra-coding mode as the baseline, whereas the LAB variant outperforms the YUV model in terms of CIEDE2000, with a BD-BR gain of 8\%. Overall, the RGB variant of SLIC achieves the best performance, with a BD-BR gain of 13.14\% in terms of MS-SSIM and a gain of 17.96\% in CIEDE2000, at the cost of higher model complexity.


[40] 2406.13750

Empowering Tuberculosis Screening with Explainable Self-Supervised Deep Neural Networks

Tuberculosis persists as a global health crisis, especially in resource-limited populations and remote regions, with more than 10 million individuals newly infected annually. It stands as a stark symbol of inequity in public health. Tuberculosis impacts roughly a quarter of the global populace, with the majority of cases concentrated in eight countries, accounting for two-thirds of all tuberculosis infections. Although a severe ailment, tuberculosis is both curable and manageable. However, early detection and screening of at-risk populations are imperative. Chest x-ray stands as the predominant imaging technique utilized in tuberculosis screening efforts. However, x-ray screening necessitates skilled radiologists, a resource often scarce, particularly in remote regions with limited resources. Consequently, there is a pressing need for artificial intelligence (AI)-powered systems to support clinicians and healthcare providers in swift screening. However, training a reliable AI model necessitates large-scale high-quality data, which can be difficult and costly to acquire. Inspired by these challenges, in this work, we introduce an explainable self-supervised self-train learning network tailored for tuberculosis case screening. The network achieves an outstanding overall accuracy of 98.14% and demonstrates high recall and precision rates of 95.72% and 99.44%, respectively, in identifying tuberculosis cases, effectively capturing clinically significant features.


[41] 2406.13752

COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN Processing

To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer's spatial (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This paper, therefore, presents COAC, a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC does not only provide a systematical analysis of the architectural overhead in function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay-product) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.


[42] 2406.13788

Groupwise Deformable Registration of Diffusion Tensor Cardiovascular Magnetic Resonance: Disentangling Diffusion Contrast, Respiratory and Cardiac Motions

Diffusion tensor based cardiovascular magnetic resonance (DT-CMR) offers a non-invasive method to visualize the myocardial microstructure. With the assumption that the heart is stationary, frames are acquired with multiple repetitions for different diffusion encoding directions. However, motion from poor breath-holding and imprecise cardiac triggering complicates DT-CMR analysis, further challenged by its inherently low SNR, varied contrasts, and diffusion-induced textures. Our solution is a novel framework employing groupwise registration with an implicit template to isolate respiratory and cardiac motions, while a tensor-embedded branch preserves diffusion contrast textures. We've devised a loss refinement tailored for non-linear least squares fitting and low SNR conditions. Additionally, we introduce new physics-based and clinical metrics for performance evaluation. Access code and supplementary materials at: https://github.com/Mobbyjj/DTCMRRegistration


[43] 2406.13794

Adaptive Curves for Optimally Efficient Market Making

Automated Market Makers (AMMs) are essential in Decentralized Finance (DeFi) as they match liquidity supply with demand. They function through liquidity providers (LPs) who deposit assets into liquidity pools. However, the asset trading prices in these pools often trail behind those in more dynamic, centralized exchanges, leading to potential arbitrage losses for LPs. This issue is tackled by adapting market maker bonding curves to trader behavior, based on the classical market microstructure model of Glosten and Milgrom. Our approach ensures a zero-profit condition for the market maker's prices. We derive the differential equation that an optimal adaptive curve should follow to minimize arbitrage losses while remaining competitive. Solutions to this optimality equation are obtained for standard Gaussian and Lognormal price models using Kalman filtering. A key feature of our method is its ability to estimate the external market price without relying on price or loss oracles. We also provide an equivalent differential equation for the implied dynamics of canonical static bonding curves and establish conditions for their optimality. Our algorithms demonstrate robustness to changing market conditions and adversarial perturbations, and we offer an on-chain implementation using Uniswap v4 alongside off-chain AI co-processors.
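The abstract above mentions Kalman filtering under Gaussian price models; purely as a generic illustration of that building block (a textbook scalar random-walk filter with made-up noise variances, not the paper's derivation or bonding-curve equation):

```python
# Minimal scalar Kalman filter tracking a latent price observed through
# noisy trades -- a textbook sketch only. The random-walk model and the
# variances q, r are illustrative assumptions, not taken from the paper.

def kalman_step(x, p, z, q=1e-4, r=1e-2):
    """One predict/update step.
    x, p : prior state estimate and variance
    z    : noisy price observation
    q, r : process and observation noise variances (assumed)
    """
    p_pred = p + q                  # predict: random walk inflates variance
    k = p_pred / (p_pred + r)       # Kalman gain
    x_new = x + k * (z - x)         # update toward the observation
    p_new = (1.0 - k) * p_pred      # posterior variance shrinks
    return x_new, p_new

# Track a latent price near 100 from noisy quotes, starting far away.
x, p = 90.0, 1.0
for z in [100.3, 99.8, 100.1, 99.9, 100.2]:
    x, p = kalman_step(x, p, z)
```

After a handful of observations the estimate converges to the neighborhood of the true price, which is the kind of external-price estimate the paper feeds into its adaptive curves (there, inferred from trader behavior rather than from direct quotes).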


[44] 2406.13815

IG-CFAT: An Improved GAN-Based Framework for Effectively Exploiting Transformers in Real-World Image Super-Resolution

In the field of single image super-resolution (SISR), transformer-based models have demonstrated significant advancements. However, the potential and efficiency of these models in applied fields such as real-world image super-resolution have received less attention, and there are substantial opportunities for improvement. Recently, the composite fusion attention transformer (CFAT) outperformed previous state-of-the-art (SOTA) models in classic image super-resolution. This paper extends the CFAT model to an improved GAN-based model called IG-CFAT to effectively exploit the performance of transformers in real-world image super-resolution. IG-CFAT incorporates a semantic-aware discriminator to reconstruct image details more accurately, significantly improving perceptual quality. Moreover, our model utilizes an adaptive degradation model to better simulate real-world degradations. Our methodology adds wavelet losses to the conventional loss functions of GAN-based super-resolution models to reconstruct high-frequency details more efficiently. Empirical results demonstrate that IG-CFAT sets new benchmarks in real-world image super-resolution, outperforming SOTA models in both quantitative and qualitative metrics.


[45] 2406.13817

SkyGrid: Energy-Flow Optimization at Harmonized Aerial Intersections

The rapid evolution of urban air mobility (UAM) is reshaping the future of transportation by integrating aerial vehicles into urban transit systems. The design of aerial intersections plays a critical role in the phased development of UAM systems to ensure safe and efficient operations in air corridors. This work adapts the concept of rhythmic control of connected and automated vehicles (CAVs) at unsignalized intersections to address complex traffic control problems. This control framework assigns UAM vehicles to different movement groups and significantly reduces the computation of routing strategies to avoid conflicts. In contrast to ground traffic, the objective is to balance three measures: minimizing energy utilization, maximizing intersection flow (throughput), and maintaining safety distances. This optimization method dynamically directs traffic with various demands, considering path assignment distributions and segment-level trajectory coefficients for straight and curved paths as control variables. To the best of our knowledge, this is the first work to consider a multi-objective optimization approach for unsignalized intersection control in the air and to propose such optimization in a rhythmic control setting with time arrival and UAM operational constraints. A sensitivity analysis with respect to inter-platoon safety and straight/left demand balance demonstrates the effectiveness of our method in handling traffic under various scenarios.


[46] 2406.13895

INFusion: Diffusion Regularized Implicit Neural Representations for 2D and 3D accelerated MRI reconstruction

Implicit Neural Representations (INRs) are a learning-based approach to accelerate Magnetic Resonance Imaging (MRI) acquisitions, particularly in scan-specific settings when only data from the under-sampled scan itself are available. Previous work demonstrates that INRs improve rapid MRI through the inherent regularization imposed by neural network architectures. Typically parameterized by fully-connected neural networks, INRs support continuous image representations by taking a physical coordinate location as input and outputting the intensity at that coordinate. Previous work has applied unlearned regularization priors during INR training and has been limited to 2D or low-resolution 3D acquisitions. Meanwhile, diffusion-based generative models have received recent attention as they learn powerful image priors decoupled from the measurement model. This work proposes INFusion, a technique that regularizes the optimization of INRs from under-sampled MR measurements with pre-trained diffusion models for improved image reconstruction. In addition, we propose a hybrid 3D approach with our diffusion regularization that enables INR application on large-scale 3D MR datasets. 2D experiments demonstrate improved INR training with our proposed diffusion regularization, and 3D experiments demonstrate the feasibility of INR training with diffusion regularization on 3D matrix sizes of 256 by 256 by 80.
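As the abstract above describes, an INR in its simplest form is just a fully-connected network mapping a physical coordinate to an intensity; a generic numpy sketch with arbitrary layer sizes and random placeholder weights (unrelated to the paper's architecture or training):

```python
import numpy as np

# A coordinate MLP in the spirit of an implicit neural representation:
# map an (x, y) location to an image intensity. In practice the weights
# would be fit to under-sampled k-space measurements; here they are
# random placeholders just to show the input/output contract.

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random fully-connected network: list of (W, b) pairs."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def inr_forward(params, coords):
    """coords: (N, 2) array of (x, y) in [0, 1]^2 -> (N,) intensities."""
    h = coords
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)   # smooth activations give a continuous image
    W, b = params[-1]
    return (h @ W + b).ravel()

params = make_mlp([2, 64, 64, 1])
# The representation can be queried at arbitrary, even off-grid, locations.
vals = inr_forward(params, np.array([[0.25, 0.5], [0.251, 0.5]]))
```

Because the network is a smooth function of the coordinates, nearby query points yield nearby intensities, which is the continuity property that makes INRs attractive for off-grid MRI reconstruction.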


[47] 2406.13935

CONMOD: Controllable Neural Frame-based Modulation Effects

Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects.


[48] 2406.13977

Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation Learning

Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional diffusion models commonly generate images with the guidance of segmentation labels for medical modality transformation. Limited access to authentic guidance and its low cardinality can pose challenges to the practical clinical application of conditional diffusion models. To achieve an equilibrium between generative quality and clinical practice, we propose a novel syncretic generative model based on the latent diffusion model for medical image translation (S$^2$LDM), which can realize high-fidelity reconstruction without the need for additional conditions during inference. S$^2$LDM enhances the similarity of distinct modal images via syncretic encoding and diffusing, promoting amalgamated information in the latent space and generating medical images with more details in contrast-enhanced regions. However, syncretic latent spaces in the frequency domain tend to favor lower frequencies, commonly located in identical anatomical structures. Thus, S$^2$LDM applies an adaptive similarity loss and dynamic similarity to guide the generation and supplement the shortfall in high-frequency details throughout the training process. Quantitative experiments confirm the effectiveness of our approach in medical image translation. Our code will be released soon.


[49] 2406.13979

Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning

Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited in their ability to model the complex correlations between genomics and histology data and to address the intrinsic complexity of the tumour ecosystem, where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretable and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspaces of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at https://github.com/helenypzhang/Subspace-Multimodal-Learning.


[50] 2406.14028

Reliable State Estimation in a Truck-Semitrailer Combination using an Artificial Neural Network-Aided Extended Kalman Filter

Advanced driver assistance systems are critically dependent on reliable and accurate information regarding a vehicle's driving state. For the estimation of unknown quantities, model-based and learning-based methods exist, but both suffer from individual limitations. On the one hand, model-based estimation performance is often limited by the model's accuracy. On the other hand, learning-based estimators usually do not perform well in "unknown" conditions (poor generalization), which is particularly critical for semitrailers as their payload changes significantly in operation. To the best of the authors' knowledge, this work is the first to analyze the capability of state-of-the-art estimators for semitrailers to generalize across "unknown" loading states. Moreover, a novel hybrid Extended Kalman Filter (H-EKF) is presented that takes advantage of accurate Artificial Neural Network (ANN) estimates while preserving reliable generalization capability. It estimates the articulation angle between truck and semitrailer, lateral tire forces and the truck steering angle utilizing sensor data of a standard semitrailer only. An experimental comparison based on a full-scale truck-semitrailer combination indicates the superiority of the H-EKF compared to a state-of-the-art extended Kalman filter and an ANN estimator.


[51] 2406.14052

Perspective+ Unet: Enhancing Segmentation with Bi-Path Fusion and Efficient Non-Local Attention for Superior Receptive Fields

Precise segmentation of medical images is fundamental for extracting critical clinical information, which plays a pivotal role in enhancing the accuracy of diagnoses, formulating effective treatment plans, and improving patient outcomes. Although Convolutional Neural Networks (CNNs) and non-local attention methods have achieved notable success in medical image segmentation, they either struggle to capture long-range spatial dependencies due to their reliance on local features, or face significant computational and feature integration challenges when attempting to address this issue with global attention mechanisms. To overcome existing limitations in medical image segmentation, we propose a novel architecture, Perspective+ Unet. This framework is characterized by three major innovations: (i) It introduces a dual-pathway strategy at the encoder stage that combines the outcomes of traditional and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better comprehension of the global structure of images while retaining detail sensitivity. (ii) The framework incorporates an efficient non-local transformer block, named ENLTB, which utilizes kernel function approximation for effective long-range dependency capture with linear computational and spatial complexity. (iii) A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, meticulously refining features from various levels to harmonize global and local information. Experimental results on the ACDC and Synapse datasets demonstrate the effectiveness of our proposed Perspective+ Unet. The code is available in the supplementary material.


[52] 2406.14069

Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound

Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classification using multi-modality TRUS. The proposed framework employs two separate 3D ResNet-50 networks to extract distinctive features from B-mode and shear wave elastography (SWE) images. Additionally, an attention module is incorporated to effectively refine the B-mode features and aggregate the extracted features from both modalities. Furthermore, we utilize a few-shot segmentation task to enhance the capacity of the classification encoder. Due to the limited availability of csPCa masks, a prototype correction module is employed to extract representative prototypes of csPCa. The performance of the framework is assessed on a large-scale dataset consisting of 512 TRUS videos with biopsy-proven prostate cancer. The results demonstrate a strong capability to accurately identify csPCa, achieving an area under the curve (AUC) of 0.86. Moreover, the framework generates visual class activation maps (CAMs), which can serve as valuable assistance for localizing csPCa. These CAM images may offer valuable guidance during TRUS-guided targeted biopsies, enhancing the efficacy of the biopsy procedure. The code is available at https://github.com/2313595986/SmileCode.


[53] 2406.14107

Efficient Transmission Scheme for LEO Satellite-Based NB-IoT: A Data-Driven Perspective

This study analyses the medium access control (MAC) layer aspects of a low-Earth-orbit (LEO) satellite-based Internet of Things (IoT) network. A transmission scheme based on change detection is proposed to accommodate more users within the network and improve energy efficiency. Machine learning (ML) algorithms are also proposed to reduce the payload size by leveraging the correlation among the sensed parameters. Real-world data from an IoT testbed deployed for a smart city application is utilised to analyse the performance in terms of collision probability, effective data received, and average battery lifetime. The findings reveal that the traffic pattern after implementation of the proposed scheme differs from the commonly assumed Poisson traffic, demonstrating the value of using IoT data from an actual deployment. It is shown that the transmission scheme accommodates more devices while targeting a specific collision probability. Considering the link budget for a direct-access NB-IoT scenario, more data is effectively offloaded to the server within the limited visibility of LEO satellites. The average battery lifetime is also shown to increase severalfold with the proposed access schemes and ML algorithms.
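The change-detection principle can be sketched as a send-on-delta rule: a node transmits only when the sensed value deviates from the last transmitted one by more than a threshold. This is an illustrative sketch under assumed readings and an assumed threshold, not the paper's exact scheme.

```python
def change_detection_tx(samples, delta):
    """Transmit a reading only when it deviates from the last transmitted
    value by more than `delta` (send-on-delta), reducing channel load."""
    sent = []
    last = None
    for t, x in enumerate(samples):
        if last is None or abs(x - last) > delta:
            sent.append((t, x))
            last = x
    return sent

# Example: slowly varying temperature readings with one jump at t = 4
readings = [20.0, 20.1, 20.1, 20.2, 23.0, 23.1, 23.0]
tx = change_detection_tx(readings, delta=0.5)
# only the first sample and the jump are transmitted: [(0, 20.0), (4, 23.0)]
```

Only 2 of 7 readings reach the channel here, which is the mechanism by which such a scheme admits more devices for a target collision probability.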


[54] 2406.14116

Efficient Design and Implementation of Fast-Convolution-Based Variable-Bandwidth Filters

This paper introduces an efficient design approach for a fast-convolution-based variable-bandwidth (VBW) filter. The proposed approach is based on a hybrid of frequency sampling and optimization (HFSO), which offers a significant computational complexity reduction compared to existing solutions for a given performance. The paper provides a design procedure based on minimax optimization to obtain the minimum complexity of the overall filter. A design example includes a comparison of the proposed VBW filter with time-domain-designed VBW filters implemented in the time domain and in the frequency domain. It is shown that not only the implementation complexity but also the design complexity can be reduced, since no computations are required when the bandwidth of the filter is adjusted. Moreover, memory requirements are also decreased compared to existing frequency-domain implementations.
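The core appeal of fast-convolution filtering is that the bandwidth is changed simply by swapping the frequency response applied between an FFT and an inverse FFT, with no time-domain redesign. The snippet below shows this generic principle on a single block with an ideal brick-wall response; it is not the paper's HFSO design, which additionally shapes the response via frequency sampling and minimax optimization.

```python
import numpy as np

def fft_filter(x, h_freq):
    """Filter one block in the frequency domain: FFT, pointwise multiply
    with the (adjustable) frequency response, inverse FFT. Adjusting the
    bandwidth only means changing h_freq."""
    return np.real(np.fft.ifft(np.fft.fft(x) * h_freq))

n = 64
freqs = np.fft.fftfreq(n)                       # normalized frequencies
lowpass = (np.abs(freqs) < 0.1).astype(float)   # ideal brick-wall response

t = np.arange(n)
# two tones placed exactly on FFT bins (0.0625 = bin 4, 0.3125 = bin 20)
x = np.sin(2 * np.pi * 0.0625 * t) + np.sin(2 * np.pi * 0.3125 * t)
y = fft_filter(x, lowpass)                      # high-frequency tone removed
```

A practical streaming implementation would partition the input with overlap-save or overlap-add to avoid circular-convolution artifacts; this block-level sketch omits that.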


[55] 2406.14118

Prediction and Reference Quality Adaptation for Learned Video Compression

Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs, which adaptively decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and to reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination of spatial and channel-wise prediction quality differences. With this module, low-quality predictions are suppressed and high-quality ones are enhanced, so the codec can adaptively decide which spatial or channel locations of the predictions to use. We then propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With these filters, it is easier for our codec to achieve the target reconstruction quality for a given reference quality, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and previous state-of-the-art learned video codecs in both the RGB and YUV420 colorspaces.
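The effect of prediction quality adaptation can be illustrated as a per-location blend: where confidence in the temporal prediction is low, the codec falls back on other context. The sketch below is a hand-crafted stand-in for the learned confidence-based PQA module; the arrays are hypothetical and the real module operates on learned features, not pixel values.

```python
import numpy as np

def confidence_blend(prediction, context, confidence):
    """Suppress low-quality temporal prediction and fall back on other
    context, per location, using a confidence map in [0, 1]."""
    return confidence * prediction + (1.0 - confidence) * context

pred = np.array([[1.0, 5.0], [2.0, 2.0]])   # temporal prediction
ctx  = np.array([[1.0, 1.0], [2.0, 2.0]])   # intra-frame fallback
conf = np.array([[1.0, 0.0], [0.5, 1.0]])   # low confidence where pred is bad
out = confidence_blend(pred, ctx, conf)     # bad prediction at (0,1) suppressed
```

The same idea extends channel-wise by giving each feature channel its own confidence map, which is the spatial/channel discrimination the abstract describes.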


[56] 2406.14126

Joint Optimization of Switching Point and Power Control in Dynamic TDD Cell-Free Massive MIMO

We consider a cell-free massive multiple-input multiple-output (CFmMIMO) network operating in dynamic time division duplex (DTDD). The switching point between the uplink (UL) and downlink (DL) data transmission phases can be adapted dynamically to the instantaneous quality-of-service (QoS) requirements in order to improve energy efficiency (EE). To this end, we formulate a problem of optimizing the DTDD switching point jointly with the UL and DL power control coefficients, and the large-scale fading decoding (LSFD) weights for EE maximization. Then, we propose an iterative algorithm to solve the formulated challenging problem using successive convex approximation with an approximate stationary solution. Simulation results show that optimizing switching points remarkably improves EE compared with baseline schemes that adjust switching points heuristically.


[57] 2406.14141

Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto Scaling

Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by formulating the problem as weakly coupled Markov decision processes (MDPs), solvable via a linear program (LP). However, since the number of control variables of such an LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.


[58] 2406.14179

Single Channel-based Motor Imagery Classification using Fisher's Ratio and Pearson Correlation

Motor imagery-based BCI systems have shown promise and are gaining popularity in rehabilitation and activities of daily living (ADL). Despite this, the technology is still emerging and has not yet moved beyond laboratory constraints. Channel reduction is one avenue for making these systems part of ADL. Although motor imagery classification heavily depends on spatial factors, single-channel classification remains an avenue to be explored thoroughly. Since Fisher's ratio and Pearson correlation are powerful measures actively used in the domain, we propose an integrated framework (the FRPC integrated framework) that uses Fisher's ratio to select the best channel and Pearson correlation to select optimal filter banks, extracting spectral and temporal features respectively. The framework is tested on a two-class motor imagery classification task on two open-source datasets and one collected dataset, and compared with state-of-the-art work. Apart from implementing the framework, this study also identifies the most optimal channel across subjects and explores the classes for which the single-channel framework is efficient.
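Fisher's ratio scores how well each channel separates the two classes, and the channel with the highest score is kept. The sketch below is one plausible implementation on synthetic trials (the discriminative channel is planted by construction); the paper's full framework additionally uses Pearson correlation for filter-bank selection, which is omitted here.

```python
import numpy as np

def fishers_ratio(class_a, class_b):
    """Per-channel Fisher's ratio: (mean difference)^2 / (sum of variances)."""
    ma, mb = class_a.mean(axis=0), class_b.mean(axis=0)
    va, vb = class_a.var(axis=0), class_b.var(axis=0)
    return (ma - mb) ** 2 / (va + vb + 1e-12)

rng = np.random.default_rng(0)
# 3 channels x 50 trials per class; channel 1 is made discriminative
left  = rng.normal(0.0, 1.0, size=(50, 3))
right = rng.normal(0.0, 1.0, size=(50, 3))
right[:, 1] += 2.0                      # class-dependent shift on channel 1

scores = fishers_ratio(left, right)
best_channel = int(np.argmax(scores))   # the planted channel is selected
```

In practice the trial features would be band-power values per channel rather than raw samples, but the selection rule is the same.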


[59] 2406.14186

CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate Segmentation

Recently, Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the image features and the diffusion model features poses a great challenge to prostate segmentation. In this paper, we propose CriDiff, a two-stage feature-injecting framework with a Crisscross Injection Strategy (CIS) and a Generative Pre-train (GP) approach for prostate segmentation. The CIS maximizes the use of multi-level features by efficiently harnessing the complementarity of high- and low-level features. To effectively learn multi-level edge and non-edge features, we propose two parallel conditioners in the CIS: the Boundary Enhance Conditioner (BEC) and the Core Enhance Conditioner (CEC), which discriminatively model the image edge regions and non-edge regions, respectively. Moreover, the GP approach eases the inconsistency between the image features and the diffusion model without adding additional parameters. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on four evaluation metrics.


[60] 2406.14210

Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging Dataset

Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentations that highlight brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitively normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction, and a multi-head task combining all three. Feature extractors trained on the LDM100K synthetic dataset achieve performance similar to that of the same models trained on real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All training and testing splits are performed at the subject level to prevent data leakage. Alongside simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.


[61] 2406.14251

Enhanced Optimal Power Flow Based Droop Control in MMC-MTDC Systems

Optimizing operational set points for modular multilevel converters (MMCs) in Multi-Terminal Direct Current (MTDC) transmission systems is crucial for ensuring efficient power distribution and control. This paper presents an enhanced Optimal Power Flow (OPF) model for MMC-MTDC systems, integrating a novel adaptive voltage droop control strategy. The strategy aims to minimize generation costs and DC voltage deviations while ensuring the stable operation of the MTDC grid by dynamically adjusting the system operation points. The modified Nordic 32 test system with an embedded 4-terminal DC grid is modeled in Julia and the proposed control strategy is applied to the power model. The results demonstrate the feasibility and effectiveness of the proposed droop control strategy, affirming its potential value in enhancing the performance and reliability of hybrid AC-DC power systems.


[62] 2406.14264

Zero-Shot Image Denoising for High-Resolution Electron Microscopy

High-resolution electron microscopy (HREM) is a powerful imaging technique for directly visualizing a broad range of materials in real space. However, it faces challenges in denoising due to an ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy incorporating a Random Sub-sampler module. The Random Sub-sampler is designed to generate approximately infinite noisy pairs from a single noisy image, serving as an effective data augmentation for zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, generated via the SR strategy. The SR-based training allows the network to use more pixels for supervision, and the random sub-sampling compels the network to learn the continuous signal, enhancing robustness. Meanwhile, we mitigate the uncertainty caused by random sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With this distinctive integration of training strategy and proposed designs, Noise2SR achieves superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR on both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves denoising performance comparable to supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.
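The Random Sub-sampler idea can be sketched as picking one random pixel from each cell of a grid, so that repeated draws from the same noisy image yield lower-resolution views whose sampled noise realisations differ. This is a simplified 2D sketch of the sub-sampling step only, not the full Noise2SR training loop or its SR pairing of different resolutions.

```python
import numpy as np

def random_subsample(img, factor, rng):
    """Pick one random pixel from each factor x factor cell, producing a
    lower-resolution view of the same underlying noisy image."""
    h, w = img.shape
    out = np.empty((h // factor, w // factor), dtype=img.dtype)
    for i in range(h // factor):
        for j in range(w // factor):
            di, dj = rng.integers(0, factor, size=2)
            out[i, j] = img[i * factor + di, j * factor + dj]
    return out

rng = np.random.default_rng(0)
noisy = rng.normal(0.0, 1.0, size=(8, 8))   # stand-in for one noisy HREM image
# two sub-sampled views of the same image can serve as a noisy training pair
view_a = random_subsample(noisy, 2, rng)
view_b = random_subsample(noisy, 2, rng)
```

Because each draw selects different pixels within a cell, the pair disagrees only in noise, which is what makes it usable as Noise2Noise-style supervision from a single image.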


[63] 2406.14287

Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens Distortion

Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we propose a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier, followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average Dice similarity coefficient of 0.91. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.


[64] 2406.14301

Resource Optimization for Tail-Based Control in Wireless Networked Control Systems

Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.


[65] 2406.14308

FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image Segmentation

Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using data from only one source domain to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. This paper proposes a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs epistemic uncertainty to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the applicability of the model in medical imaging modalities.
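Amplitude manipulation in the frequency domain with phase preserved is the classic mechanism behind Fourier-based augmentation: the FFT amplitude carries style and intensity statistics while the phase carries structure. The sketch below shows plain amplitude mixing between two images; FIESTA's actual transformer additionally modulates amplitude at meaningful angular points and is guided by epistemic uncertainty, which this sketch omits.

```python
import numpy as np

def fourier_amplitude_mix(img, ref, alpha=0.3):
    """Mix the FFT amplitude of `img` toward that of `ref` while keeping
    the phase of `img`, so structure is preserved but appearance varies."""
    f_img, f_ref = np.fft.fft2(img), np.fft.fft2(ref)
    amp = (1 - alpha) * np.abs(f_img) + alpha * np.abs(f_ref)
    mixed = amp * np.exp(1j * np.angle(f_img))   # original phase retained
    return np.real(np.fft.ifft2(mixed))

rng = np.random.default_rng(0)
src = rng.random((16, 16))   # stand-in for a source-domain slice
ref = rng.random((16, 16))   # stand-in for a style reference
aug = fourier_amplitude_mix(src, ref, alpha=0.3)
```

Note the identity check below: mixing an image's amplitude with itself leaves it unchanged, confirming that only the amplitude, never the structural phase, is altered.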


[66] 2406.14351

Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep Learning

The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations in training DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist; however, automatic labels can be noisy, and it is not completely clear when they can be adopted to train DL models. This paper investigates under which circumstances automatic labels can be adopted to train a DL model for the classification of Whole Slide Images (WSIs). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), and over 10,000 WSIs collected from three use cases: celiac disease, lung cancer, and colon cancer, which involve binary, multiclass, and multilabel data, respectively. The results identify 10% as the maximum percentage of noisy labels that still allows training competitive models for the classification of WSIs; an algorithm generating automatic labels therefore needs to meet this criterion to be adopted. Applying the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to that obtained with manual labels, since it generates between 2% and 5% noisy labels. Automatic labels are thus as effective as manual ones, reaching solid performance comparable to that obtained by training models with manual labels.


[67] 2406.14355

A tensor model for calibration and imaging with air-coupled ultrasonic sensor arrays

Arrays of ultrasonic sensors are capable of 3D imaging in air and are an affordable supplement to other sensing modalities such as radar, lidar, and cameras, e.g., in heterogeneous sensing systems. However, manufacturing tolerances of air-coupled ultrasonic sensors may lead to amplitude and phase deviations. Together with artifacts from imperfect knowledge of the array geometry, there are numerous factors that can impair the imaging performance of an array. We propose a reference-based calibration method to overcome these limitations. First, we introduce a novel tensor signal model to capture the characteristics of piezoelectric ultrasonic transducers (PUTs) and the underlying multidimensional nature of a multiple-input multiple-output (MIMO) sensor array. Second, we formulate an optimization problem based on the proposed tensor model to obtain the calibrated parameters of the array, and solve it using a modified block coordinate descent (BCD) method. Third, we assess both our model and the commonly used analytical model using real data from a 3D imaging experiment. The experiment reveals that the array response model learned from calibration data yields an imaging performance similar to that of the analytical array model, which requires perfect knowledge of the array geometry.


[68] 2406.14372

Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growth

In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method.


[69] 2406.14379

Decoding Vocal Articulations from Acoustic Latent Representations

We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer, which reveals articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters. The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on training the projector network. This approach aimed to explore the potential of these existing models in the context of acoustic-to-articulatory inversion. By reusing the pretrained models, we significantly simplified the data processing pipeline, increasing efficiency and reducing computational overhead. The primary goal of our project was to demonstrate that these neural architectures can effectively encapsulate both acoustic and articulatory features. This prediction-based approach is much faster than traditional methods focused on acoustic feature-based parameter optimization. We validated our models by predicting six different parameters and evaluating them with objective metrics and the ViSQOL subjective-equivalent metric, using both synthesizer- and human-generated sounds. The results show that the predicted parameters can generate human-like vowel sounds when input into the synthesizer. We provide the dataset, code, and detailed findings to support future research in this field.


[70] 2406.14421

Learning Binary Color Filter Arrays with Trainable Hard Thresholding

Color Filter Arrays (CFAs) are optical filters in digital cameras that capture specific color channels. Current commercial CFAs are hand-crafted patterns designed under different physical and application-specific considerations. This study proposes a binary CFA learning module, based on hard thresholding, jointly trained with a deep learning-based demosaicing network. Unlike most existing learnable CFAs, which learn a channel from the whole color spectrum or linearly combine available digital colors, this method learns a binary channel selection, resulting in CFAs that are practical and physically implementable in digital cameras. The binary selection is obtained by adapting the hard thresholding operation to neural networks via a straight-through estimator, and the method is therefore named HardMax. This paper includes background on the CFA design problem, a description of the HardMax method, and performance evaluation results. The evaluation covers different demosaicing models, color configurations, and filter sizes, and compares against existing methods on various reconstruction metrics. The proposed approach is tested on the Kodak and BSDS500 datasets and provides higher reconstruction performance than hand-crafted or alternative learned binary filters.
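A binary channel selection with a straight-through estimator can be sketched as a hard one-hot forward pass whose backward pass pretends the selection was the identity, so gradients still reach the underlying logits. This is a conceptual numpy sketch (a real module would live inside an autograd framework), and the logits are hypothetical; it illustrates the straight-through idea rather than the paper's exact HardMax layer.

```python
import numpy as np

def hard_select(logits):
    """Forward pass: one-hot over colour channels per CFA cell, i.e. a
    binary, physically realisable channel selection."""
    onehot = np.zeros_like(logits)
    onehot[np.arange(len(logits)), logits.argmax(axis=1)] = 1.0
    return onehot

def ste_backward(grad_out):
    """Straight-through estimator: the non-differentiable hard selection
    is treated as the identity in the backward pass."""
    return grad_out

# 4 CFA cells, 3 colour channels (R, G, B) each
logits = np.array([[0.2, 0.9, 0.1],
                   [0.8, 0.1, 0.3],
                   [0.1, 0.2, 0.7],
                   [0.5, 0.4, 0.6]])
pattern = hard_select(logits)               # binary pattern used in forward
grad = ste_backward(np.ones_like(logits))   # gradients still flow to logits
```

The forward output is strictly binary (exactly one channel per cell), while the backward path lets the demosaicing loss keep shaping the logits, which is what makes the learned pattern both trainable and implementable.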


[71] 2406.14430

Adaptive Deep Neural Network-Based Control Barrier Functions

Safety constraints of nonlinear control systems are commonly enforced through the use of control barrier functions (CBFs). Uncertainties in the dynamic model can disrupt forward invariance guarantees or cause the state to be restricted to an overly conservative subset of the safe set. In this paper, adaptive deep neural networks (DNNs) are combined with CBFs to produce a family of controllers that ensure safety while learning the system's dynamics in real-time without the requirement for pre-training. By basing the least squares adaptation law on a state derivative estimator-based identification error, the DNN parameter estimation error is shown to be uniformly ultimately bounded. The convergent bound on the parameter estimation error is then used to formulate CBF-constraints in an optimization-based controller to guarantee safety despite model uncertainty. Furthermore, the developed method is applicable for use under intermittent loss of state-feedback. Comparative simulation results demonstrate the ability of the developed method to ensure safety in an adaptive cruise control problem and when feedback is lost, unlike baseline methods.
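For a one-dimensional single integrator the CBF-constrained QP has a closed-form solution, which makes the safety-filter idea easy to see: the desired control passes through unchanged unless it would violate the barrier condition, in which case it is clipped. This toy example fixes the dynamics and barrier by assumption and omits the paper's adaptive-DNN machinery entirely.

```python
def cbf_safety_filter(x, u_des, alpha=1.0):
    """Closed-form CBF-QP for the 1D single integrator x' = u with safe set
    h(x) = x >= 0: enforce h' + alpha*h >= 0, i.e. u >= -alpha*x, while
    staying as close as possible to the desired control u_des."""
    u_min = -alpha * x
    return max(u_des, u_min)

# Near the boundary, a desired control pushing toward the unsafe region
# is clipped; far from it, the desired control is left untouched.
x = 0.5
u_des = -2.0
u_safe = cbf_safety_filter(x, u_des, alpha=1.0)
```

In higher dimensions the same minimum-deviation projection is solved as a QP at every control step, with the model-uncertainty bound from the DNN adaptation tightening the constraint.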


[72] 2406.14440

LLM4CP: Adapting Large Language Models for Channel Prediction

Channel prediction is an effective approach for reducing the feedback or estimation overhead in massive multi-input multi-output (m-MIMO) systems. However, existing channel prediction methods lack precision due to model mismatch errors or network generalization issues. Large language models (LLMs) have demonstrated powerful modeling and generalization abilities and have been successfully applied to cross-modal tasks, including time series analysis. Leveraging the expressive power of LLMs, we propose a pre-trained LLM-empowered channel prediction method (LLM4CP) to predict the future downlink channel state information (CSI) sequence based on the historical uplink CSI sequence. We fine-tune the network while freezing most of the parameters of the pre-trained LLM for better cross-modality knowledge transfer. To bridge the gap between the channel data and the feature space of the LLM, preprocessor, embedding, and output modules are specifically tailored to the unique channel characteristics. Simulations validate that the proposed method achieves state-of-the-art prediction performance on full-sample, few-shot, and generalization tests with low training and inference costs.


[73] 2406.14474

Spatio-temporal Patterns between ENSO and Weather-related Power Outages in the Continental United States

El Niño-Southern Oscillation (ENSO) exhibits significant impacts on the frequency of extreme weather events, and its socio-economic implications prevail on a global scale. However, a fundamental gap still exists in understanding the relationship between ENSO and weather-related power outages in the continental United States. Through a 24-year (2000-2023) composite and statistical analysis, our study reveals that higher power outage numbers (PONs) are observed from the developing winter to the decaying summer of La Niña phases. In particular, during the decaying spring, high La Niña intensity favors the occurrence of power outages over the west coast and the east of the United States by modulating the frequency of extreme precipitation and heatwaves. Furthermore, projected increases in heatwaves from the Coupled Model Intercomparison Project Phase 6 (CMIP6) indicate that spring-time PONs over the eastern United States occur about 11 times more often in the mid-term future (2041-2060) and almost 26 times more often in the long-term future (2081-2100), compared with 2000-2023. Our study provides a strong recommendation for building a more climate-resilient power system.


[74] 2406.14486

Rule-based outlier detection of AI-generated anatomy segmentations

There is a dire need for medical imaging datasets with accompanying annotations to perform downstream patient analysis. However, it is difficult to generate these annotations manually, due to the time-consuming nature of the task and the variability in clinical conventions. Artificial intelligence has been adopted in the field as a potential method to annotate these large datasets; however, a lack of expert annotations or ground truth can inhibit the adoption of these annotations. We recently made a dataset publicly available including annotations and extracted features of up to 104 organs for the National Lung Screening Trial using the TotalSegmentator method. However, the released dataset does not include expert-derived annotations or an assessment of the accuracy of the segmentations, limiting its usefulness. We propose the development of heuristics to assess the quality of the segmentations, providing methods to measure the consistency of the annotations and a comparison of results to the literature. We make our code and related materials publicly available at https://github.com/ImagingDataCommons/CloudSegmentatorResults and interactive tools at https://huggingface.co/spaces/ImagingDataCommons/CloudSegmentatorResults.


[75] 2406.14534

Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration

A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of intraoperative 2D images and a preoperative 3D volume based on ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between the 2D frames and 3D volumes to be registered, making real-time, accurate cardiac ultrasound frame-to-volume registration a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume and to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.


[76] 2405.03262

End-to-End Reinforcement Learning of Curative Curtailment with Partial Measurement Availability

In the course of the energy transition, generation and consumption will expand and change, and many of the technologies involved, such as PV systems, electric cars and heat pumps, will influence the power flow, especially in the distribution grids. Scalable methods that can make decisions for each grid connection are needed to enable congestion-free grid operation in the distribution grids. This paper presents a novel end-to-end approach to resolving congestion in distribution grids with deep reinforcement learning. Our architecture learns to curtail power and set appropriate reactive power to determine a non-congested and, thus, feasible grid state. State-of-the-art methods such as the optimal power flow (OPF) demand high computational costs and detailed measurements of every bus in a grid. In contrast, the presented method enables decisions under sparse information, with just some buses observable in the grid. Distribution grids are generally not yet fully digitized and observable, so this method can be used for decision-making on the majority of low-voltage grids. On a real low-voltage grid, the approach resolves 100\% of violations in the voltage band and 98.8\% of asset overloads. The results show that decisions can also be made on real grids that guarantee sufficient quality for congestion-free grid operation.


[77] 2406.13006

Weighted Sum of Segmented Correlation: An Efficient Method for Spectra Matching in Hyperspectral Images

Matching a target spectrum with known spectra in a spectral library is a common method for material identification in hyperspectral imaging research. Hyperspectral spectra exhibit precise absorption features across different wavelength segments, and the unique shapes and positions of these absorptions create distinct spectral signatures for each material, aiding in their identification. Therefore, only these specific positions need to be considered for material identification. This study introduces the Weighted Sum of Segmented Correlation method, which calculates correlation indices between corresponding segments of a library spectrum and a test spectrum, and derives a matching index that favors positive correlations and penalizes negative correlations using assigned weights. The effectiveness of this approach is evaluated for mineral identification in hyperspectral images from both Earth and Martian surfaces.
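A minimal sketch of such a segmented-correlation score, assuming equal-width segments, Pearson correlation per segment, and illustrative weights (the paper's exact segmentation and weighting scheme may differ):

```python
import numpy as np

def wssc(library_spec, test_spec, n_segments=4, w_pos=1.0, w_neg=2.0):
    """Weighted Sum of Segmented Correlation (illustrative sketch).

    Splits both spectra into equal wavelength segments, computes the
    Pearson correlation per segment, then combines the per-segment
    values: positive correlations are rewarded with weight w_pos,
    negative ones penalized with weight w_neg.
    """
    lib_segs = np.array_split(np.asarray(library_spec), n_segments)
    test_segs = np.array_split(np.asarray(test_spec), n_segments)
    score = 0.0
    for ls, ts in zip(lib_segs, test_segs):
        r = np.corrcoef(ls, ts)[0, 1]          # per-segment Pearson r
        score += w_pos * r if r >= 0 else w_neg * r
    return score / n_segments
```

A perfectly matching spectrum scores +1 under these weights, while an anti-correlated one is penalized more heavily than it would be rewarded, which is the asymmetry the matching index exploits.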


[78] 2406.13025

ABNet: Attention BarrierNet for Safe and Scalable Robot Learning

Safe learning is central to AI-enabled robots, where a single failure may lead to catastrophic results. Barrier-based methods are among the dominant approaches to safe robot learning. However, these methods are not scalable, are hard to train, and tend to generate unstable signals under noisy inputs, making them challenging to deploy on robots. To address these challenges, we propose a novel Attention BarrierNet (ABNet) that scales to build larger foundational safe models in an incremental manner. Each head of BarrierNet in the ABNet can learn safe robot control policies from different features and focus on a specific part of the observation. In this way, we do not need to construct a large model for complex tasks in one shot, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of the ABNet. We demonstrate the strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models.


[79] 2406.13038

Traffic Prediction considering Multiple Levels of Spatial-temporal Information: A Multi-scale Graph Wavelet-based Approach

Although traffic prediction has been receiving considerable attention, with a number of successes in the context of intelligent transportation systems, the prediction of traffic states over a complex transportation network that contains different road types has remained a challenge. This study proposes a multi-scale graph wavelet temporal convolution network (MSGWTCN) to predict the traffic states in complex transportation networks. Specifically, a multi-scale spatial block is designed to simultaneously capture the spatial information at different levels, and a gated temporal convolution network is employed to extract the temporal dependencies of the data. The model jointly learns to capture multiple levels of spatial interaction by stacking graph wavelets with different scales. Two real-world datasets are used in this study to investigate the model performance, including a highway network in Seattle and a dense road network of Manhattan in New York City. Experiment results show that the proposed model outperforms other baseline models. Furthermore, different scales of graph wavelets are found to be effective in extracting local, intermediate and global information at the same time, thus enabling the model to learn a complex transportation network topology with various types of road segments. By carefully customizing the scales of the wavelets, the model is able to improve the prediction performance and better adapt to different network configurations.


[80] 2406.13118

Thruster-Assisted Incline Walking

In this study, our aim is to evaluate the effectiveness of thruster-assisted steep-slope walking for the Husky Carbon, a quadrupedal robot equipped with custom-designed actuators and multiple electric ducted fans, through simulation prior to conducting experimental trials. Thruster-assisted steep-slope walking draws inspiration from wing-assisted incline running (WAIR) observed in birds, and intriguingly incorporates posture manipulation and thrust vectoring, a locomotion technique not previously explored in the animal kingdom. Our approach involves developing a reduced-order model of the Husky robot, followed by the application of an optimization-based controller utilizing collocation methods and dynamics interpolation to determine control actions. Through simulation testing, we demonstrate the feasibility of hardware implementation of our controller.


[81] 2406.13179

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters.


[82] 2406.13196

Quantum Generative Learning for High-Resolution Medical Image Generation

Integration of quantum computing in generative machine learning models has the potential to offer benefits such as training speed-up and superior feature extraction. However, existing quantum generative adversarial networks (QGANs) fail to generate high-quality images due to their patch-based, pixel-wise learning approaches. These methods capture only local details, ignoring the global structure and semantic information of images. In this work, we address these challenges by proposing a quantum image generative learning (QIGL) approach for high-quality medical image generation. Our proposed quantum generator leverages a variational quantum circuit approach, addressing scalability issues by extracting principal components from the images instead of dividing them into patches. Additionally, we integrate the Wasserstein distance within the QIGL framework to generate a diverse set of medical samples. Through a systematic set of simulations on X-ray images from knee osteoarthritis and medical MNIST datasets, our model demonstrates superior performance, achieving the lowest Fr\'echet Inception Distance (FID) scores compared to its classical counterpart and advanced QGAN models reported in the literature.


[83] 2406.13248

Overlay Space-Air-Ground Integrated Networks with SWIPT-Empowered Aerial Communications

In this article, we consider overlay space-air-ground integrated networks (OSAGINs) where a low earth orbit (LEO) satellite communicates with ground users (GUs) with the assistance of an energy-constrained coexisting air-to-air (A2A) network. Particularly, a non-linear energy harvester with hybrid SWIPT, utilizing both power-splitting and time-switching energy harvesting (EH) techniques, is employed at the aerial transmitter. Specifically, we take the random locations of the satellite, ground and aerial receivers into account to investigate the outage performance of both the satellite-to-ground and aerial networks, leveraging stochastic tools. Taking into account Shadowed-Rician fading for the satellite link, Nakagami-\emph{m} fading for the ground link, and Rician fading for the aerial link, we derive analytical expressions for the outage probability of these networks. For a comprehensive analysis of the aerial network, we consider both perfect and imperfect successive interference cancellation (SIC) scenarios. Through our analysis, we illustrate that, unlike linear EH, the implementation of non-linear EH provides accurate figures for any target rate, underscoring the significance of using non-linear EH models. Additionally, the influence of key parameters is emphasized, providing guidelines for the practical design of energy-efficient as well as spectrum-efficient future non-terrestrial networks. Monte Carlo simulations validate the accuracy of our theoretical developments.


[84] 2406.13251

Freq-Mip-AA : Frequency Mip Representation for Anti-Aliasing Neural Radiance Fields

Neural Radiance Fields (NeRF) have shown remarkable success in representing 3D scenes and generating novel views. However, they often struggle with aliasing artifacts, especially when rendering images at camera distances different from those of the training views. To address the issue, Mip-NeRF proposed using volumetric frustums to render a pixel and suggested integrated positional encoding (IPE). While effective, this approach requires long training times due to its reliance on an MLP architecture. In this work, we propose a novel anti-aliasing technique that utilizes grid-based representations, which usually show significantly faster training times. In addition, inspired by the sampling theorem, we exploit a frequency-domain representation to handle the aliasing problem. The proposed method, FreqMipAA, utilizes scale-specific low-pass filtering (LPF) and learnable frequency masks: the scale-specific low-pass filters prevent aliasing and prioritize important image details, while the learnable masks effectively remove problematic high-frequency elements while retaining essential information. We validated the proposed technique by incorporating it into a widely used grid-based method. The experimental results show that FreqMipAA effectively resolves the aliasing issues and achieves state-of-the-art results on the multi-scale Blender dataset. Our code is available at https://github.com/yi0109/FreqMipAA .


[85] 2406.13269

Investigating Low-Cost LLM Annotation for Spoken Dialogue Understanding Datasets

In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into the automatic enhancement of spoken dialogue datasets' semantic representations. Our contributions are threefold: (1) assess the relevance of Large Language Model fine-tuning, (2) evaluate the knowledge captured by the produced annotations and (3) highlight semi-automatic annotation implications.


[86] 2406.13275

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, advancements in large language models (LLMs), together with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
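Low-rank adaptation of the kind mentioned above can be sketched as follows; the sizes, the NumPy formulation, and the zero-initialisation of one factor are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                     # hidden size and LoRA rank (illustrative)
W = rng.normal(size=(d, d))     # frozen pre-trained weight, never updated

# LoRA trains only a low-rank update B @ A on top of the frozen W.
A = rng.normal(size=(r, d))     # down-projection, randomly initialised
B = np.zeros((d, r))            # zero-initialised so training starts at W

def forward(x):
    # Adapted layer: the effective weight is W + B @ A.
    return x @ (W + B @ A).T

x = rng.normal(size=(3, d))
# With B == 0 the adapted layer behaves exactly like the base layer.
assert np.allclose(forward(x), x @ W.T)
```

Only A and B (2 * r * d values) are trained instead of the full d * d weight, which is what makes fine-tuning both the audio encoder and the text decoder affordable.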


[87] 2406.13292

An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease

Alzheimer's disease (AD) is the most prevalent form of dementia, with a progressive decline in cognitive abilities. The AD continuum encompasses a prodromal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework in which a generative module employing Cycle GANs was adopted to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract input feature relevance, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model reached the state of the art in the classification of CN/AD, with an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation, clearance and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shedding light on important biological insights.


[88] 2406.13335

AI-Empowered Multiple Access for 6G: A Survey of Spectrum Sensing, Protocol Designs, and Optimizations

With the rapidly increasing number of bandwidth-intensive terminals capable of intelligent computing and communication, such as smart devices equipped with shallow neural network models, the complexity of multiple access for these intelligent terminals is increasing due to the dynamic network environment and ubiquitous connectivity in 6G systems. Traditional multiple access (MA) design and optimization methods are gradually losing ground to artificial intelligence (AI) techniques that have proven their superiority in handling complexity. AI-empowered MA and its optimization strategies aimed at achieving high Quality-of-Service (QoS) are attracting more attention, especially in the area of latency-sensitive applications in 6G systems. In this work, we aim to: 1) present the development and comparative evaluation of AI-enabled MA; 2) provide a timely survey focusing on spectrum sensing, protocol design, and optimization for AI-empowered MA; and 3) explore the potential use cases of AI-empowered MA in the typical application scenarios within 6G systems. Specifically, we first present a unified framework of AI-empowered MA for 6G systems by incorporating various promising machine learning techniques in spectrum sensing, resource allocation, MA protocol design, and optimization. We then introduce AI-empowered MA spectrum sensing related to spectrum sharing and spectrum interference management. Next, we discuss the AI-empowered MA protocol designs and implementation methods by reviewing and comparing the state-of-the-art, and we further explore the optimization algorithms related to dynamic resource management, parameter adjustment, and access scheme switching. Finally, we discuss the current challenges, point out open issues, and outline potential future research directions in this field.


[89] 2406.13340

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.


[90] 2406.13345

Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs

Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). As the calculation of OF is a resource-demanding task in terms of computational load and memory footprint, which needs to be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where the deployment of such powerful, power-hungry processors is unfeasible due to constraints related to cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low latency VIO even on resource-constrained devices such as nano drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4.


[91] 2406.13357

Transferable speech-to-text large language model alignment module

By leveraging the power of Large Language Models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal works can achieve challenging tasks like spoken translation (ST) and question answering (SQA) altogether with much simpler architectures. In this paper, we utilize the capability of the Whisper encoder and the pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with a one-layer module and a hundred hours of speech-text multitask corpus. We further swap the Yi-6B with the human-preference-aligned version, Yi-6B-Chat, during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition (SVD) implies that the linear alignment subspace is sparse, which leaves open the possibility of concatenating other features, such as voice-print or video, to expand the modality.
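The kind of SVD probe mentioned above can be sketched as follows; the synthetic low-rank-plus-noise weight stands in for a learned alignment matrix and is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical alignment-module weight: a rank-4 matrix plus small noise,
# standing in for a learned speech-to-text projection (illustrative only).
d = 64
signal = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))
W = signal + 0.01 * rng.normal(size=(d, d))

# SVD shows how many directions carry almost all of the mapping's energy:
# a sharp drop in the spectrum indicates a sparse (low-rank) alignment subspace.
s = np.linalg.svd(W, compute_uv=False)       # singular values, descending
energy = np.cumsum(s**2) / np.sum(s**2)
rank_90 = int(np.searchsorted(energy, 0.90)) + 1  # dims holding 90% energy
```

Here `rank_90` comes out far below the full dimension `d`, which is the kind of evidence a sparse alignment subspace would leave in a real trained module.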


[92] 2406.13358

Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention Network

Due to factors such as thick cloud cover and sensor limitations, remote sensing images often suffer from significant missing data, resulting in incomplete time-series information. Existing methods for imputing missing values in remote sensing images do not fully exploit spatio-temporal auxiliary information, leading to limited accuracy in restoration. Therefore, this paper proposes a novel deep learning-based approach called MS2TAN (Multi-scale Masked Spatial-Temporal Attention Network), for reconstructing time-series remote sensing images. Firstly, we introduce an efficient spatio-temporal feature extractor based on Masked Spatial-Temporal Attention (MSTA), to obtain high-quality representations of the spatio-temporal neighborhood features in the missing regions. Secondly, a Multi-scale Restoration Network consisting of the MSTA-based Feature Extractors, is employed to progressively refine the missing values by exploring spatio-temporal neighborhood features at different scales. Thirdly, we propose a ``Pixel-Structure-Perception'' Multi-Objective Joint Optimization method to enhance the visual effects of the reconstruction results from multiple perspectives and preserve more texture structures. Furthermore, the proposed method reconstructs missing values in all input temporal phases in parallel (i.e., Multi-In Multi-Out), achieving higher processing efficiency. Finally, experimental evaluations on two typical missing data restoration tasks across multiple research areas demonstrate that the proposed method outperforms state-of-the-art methods with an improvement of 0.40dB/1.17dB in mean peak signal-to-noise ratio (mPSNR) and 3.77/9.41 thousandths in mean structural similarity (mSSIM), while exhibiting stronger texture and structural consistency.


[93] 2406.13384

Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to searching for multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Initially, crucial features were efficiently identified from backbone networks, whereas within the cell structure, a weighted fusion operation integrated information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value of 94.4\% achieved with minimal model parameters.
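The straight-through Gumbel-Softmax sampling at the heart of the framework can be sketched as follows (forward pass only; the temperature value and the NumPy formulation are illustrative, and in an autodiff framework the hard one-hot would be combined with the soft sample so that gradients flow through the latter):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_st(logits, temperature=1.0):
    """Draw a straight-through Gumbel-Softmax sample.

    Adds Gumbel(0, 1) noise to the logits, relaxes the argmax with a
    temperature-controlled softmax, and returns both the hard one-hot
    choice (used in the forward pass) and the soft sample (through
    which gradients would be routed during training).
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    z = (logits + g) / temperature
    soft = np.exp(z - z.max())          # numerically stable softmax
    soft /= soft.sum()
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return hard, soft

# Example: choosing among three candidate fusion operations.
hard, soft = gumbel_softmax_st(np.array([2.0, 0.5, -1.0]))
```

Lowering the temperature makes the soft sample approach the hard one-hot, which is why temperature is one of the parameters varied in the architecture search.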


[94] 2406.13431

Children's Speech Recognition through Discrete Token Enhancement

Children's speech recognition is considered a low-resource task, mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution to the privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading ASR performance. Additionally, we explore single-view and multi-view strategies for creating these discrete labels. Furthermore, we test the models' generalization capabilities on an unseen-domain and nativity dataset. Results reveal that the discrete-token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.


[95] 2406.13501

Assessing the 3D resolution of refocused correlation plenoptic images using a general-purpose image quality estimator

Correlation plenoptic imaging (CPI) is emerging as a promising approach to light-field imaging (LFI), a technique enabling simultaneous measurement of light intensity distribution and propagation direction from a scene. LFI allows single-shot 3D sampling, offering fast 3D reconstruction for a wide range of applications. However, the array of micro-lenses typically used in LFI to obtain 3D information limits image resolution, which rapidly declines with enhanced volumetric reconstruction capabilities. CPI addresses this limitation by decoupling light-field information measurement using two photodetectors with spatial resolution, eliminating the need for micro-lenses. 3D information is encoded in a four-dimensional correlation function, which is decoded in post-processing to reconstruct images without the resolution loss seen in conventional LFI. This paper evaluates the tomographic performance of CPI, demonstrating that the refocusing reconstruction method provides axial sectioning capabilities comparable to conventional imaging systems. A general-purpose analytical approach based on image fidelity is proposed to quantitatively study axial and lateral resolution. This analysis fully characterizes the volumetric resolution of any CPI architecture, offering a comprehensive evaluation of its imaging performance.


[96] 2406.13502

ManWav: The First Manchu ASR Model

This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high-resource and extremely low-resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model, ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR are promising, especially when the model is trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and a 0.13 drop in WER compared to the same base model fine-tuned with the original data.


[97] 2406.13579

Automated Bioacoustic Monitoring for South African Bird Species on Unlabeled Data

Analyses for biodiversity monitoring based on passive acoustic monitoring (PAM) recordings are time-consuming and challenged by the presence of background noise in recordings. Existing models for sound event detection (SED) worked only on certain avian species, and the development of further models required labeled data. The developed framework automatically extracted labeled data from available platforms for selected avian species. The labeled data were embedded into recordings, including environmental sounds and noise, and were used to train convolutional recurrent neural network (CRNN) models. The models were evaluated on unprocessed real-world data recorded in urban KwaZulu-Natal habitats. The Adapted SED-CRNN model reached an F1 score of 0.73, demonstrating its efficiency under noisy, real-world conditions. The proposed approach to automatically extracting labeled data for chosen avian species enables easy adaptation of PAM to other species and habitats for future conservation projects.


[98] 2406.13602

Parameter Training Efficiency Aware Resource Allocation for AIGC in Space-Air-Ground Integrated Networks

With the evolution of artificial intelligence-generated content (AIGC) techniques and the development of space-air-ground integrated networks (SAGIN), there will be growing opportunities to enhance the mobile experience of more users with customized AIGC applications. This is made possible through the use of parameter-efficient fine-tuning (PEFT) training alongside mobile edge computing. In this paper, we formulate the optimization problem of maximizing the parameter training efficiency of the SAGIN system over wireless networks under limited resource constraints. We propose the Parameter training efficiency Aware Resource Allocation (PARA) technique to jointly optimize user association, data offloading, and communication and computational resource allocation. We present rigorous proofs for solving this difficult sum-of-ratios problem, building on quadratically constrained quadratic programming (QCQP), semidefinite programming (SDP), graph theory, and fractional programming (FP) techniques. Our proposed PARA technique is effective in finding a stationary point of this non-convex problem. The simulation results demonstrate that the proposed PARA method outperforms other baselines.


[99] 2406.13612

On Computation of Approximate Solutions to Large-Scale Backstepping Kernel Equations via Continuum Approximation

We provide two methods for computation of continuum backstepping kernels that arise in control of continua (ensembles) of linear hyperbolic PDEs and which can approximate backstepping kernels arising in control of a large-scale, PDE system counterpart (with computational complexity that does not grow with the number of state components of the large-scale system). In the first method, we identify a class of systems for which the solution to the continuum (and hence, also an approximate solution to the respective large-scale) kernel equations can be constructed in closed form. In the second method, we provide explicit formulae for the solution to the continuum kernel PDEs, employing a (triple) power series representation of the continuum kernel and establishing its convergence properties. In this case, we also provide means for reducing computational complexity by properly truncating the power series (in the powers of the ensemble variable). We also present numerical examples to illustrate the computational efficiency and accuracy of the approaches, as well as to validate the stabilization properties of the approximate control kernels constructed based on the continuum.


[100] 2406.13712

Convex-hull Estimation using XPSNR for Versatile Video Coding

As adaptive streaming becomes crucial for delivering high-quality video content across diverse network conditions, accurate metrics to assess perceptual quality are essential. This paper explores using the eXtended Peak Signal-to-Noise Ratio (XPSNR) metric as an alternative to the popular Video Multimethod Assessment Fusion (VMAF) metric for determining optimized bitrate-resolution pairs in the context of Versatile Video Coding (VVC). Our study is rooted in the observation that XPSNR shows a superior correlation with subjective quality scores for VVC-coded Ultra-High Definition (UHD) content compared to VMAF. We predict the average XPSNR of VVC-coded bitstreams using spatiotemporal complexity features of the video and the target encoding configuration and then determine the convex-hull online. On average, the proposed convex-hull using XPSNR (VEXUS) achieves an overall quality improvement of 5.84 dB PSNR and 0.62 dB XPSNR while maintaining the same bitrate, compared to the default UHD encoding using the VVenC encoder, accompanied by an encoding time reduction of 44.43% and a decoding time reduction of 65.46%. This shift towards XPSNR as a guiding metric should enhance the effectiveness of adaptive streaming algorithms, ensuring an optimal balance between bitrate efficiency and perceptual fidelity with advanced video coding standards.
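The convex-hull step described above, picking for each bitrate rung the resolution with the highest predicted quality and keeping only rungs where quality still improves, can be sketched as follows. The bitrates, quality values, and resolution labels here are illustrative placeholders, not figures from the paper:

```python
def convex_hull_selection(points):
    """Per bitrate rung, keep the resolution with the highest predicted
    quality, then discard rungs where quality does not increase.
    points: iterable of (bitrate, predicted_quality, resolution_label)."""
    best = {}
    for bitrate, quality, res in points:
        if bitrate not in best or quality > best[bitrate][0]:
            best[bitrate] = (quality, res)
    ladder, prev_quality = [], float("-inf")
    for bitrate in sorted(best):
        quality, res = best[bitrate]
        if quality > prev_quality:  # enforce monotone rate-quality front
            ladder.append((bitrate, res, quality))
            prev_quality = quality
    return ladder
```

With hypothetical predicted-XPSNR values, the low bitrate favors the lower resolution while the higher bitrates switch to the larger one, which is the qualitative behavior per-title encoding exploits.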


[101] 2406.13722

Channel Charting in Real-World Coordinates with Distributed MIMO

Channel charting is an emerging self-supervised method that maps channel-state information (CSI) to a low-dimensional latent space (the channel chart) that represents pseudo-positions of user equipments (UEs). While channel charts preserve local geometry, i.e., nearby UEs are nearby in the channel chart (and vice versa), the pseudo-positions are in arbitrary coordinates and global geometry is typically not preserved. In order to embed channel charts in real-world coordinates, we first propose a bilateration loss for distributed multiple-input multiple-output (D-MIMO) wireless systems in which only the access point (AP) positions are known. The idea behind this loss is to compare the received power at pairs of APs to determine whether a UE should be placed closer to one AP or the other in the channel chart. Second, we propose a line-of-sight (LoS) bounding-box loss that places the UE in a predefined LoS area of each AP that is estimated to have a LoS path to the UE. We demonstrate the efficacy of combining both of these loss functions with neural-network-based channel charting using ray-tracing-based and measurement-based channel vectors. Our approach outperforms several baselines and maintains the self-supervised nature of channel charting as it does not rely on geometrical propagation models or require ground-truth UE position information.
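As a rough illustration of the bilateration idea, received power at a pair of APs deciding which AP the UE should be placed closer to, here is a hinge-style pairwise loss. The exact formulation and the margin are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def bilateration_loss(pos, ap_pos, rx_power, margin=0.0):
    """Pairwise hinge loss (sketch): if AP i receives more power than AP j,
    the estimated UE position `pos` should be closer to AP i than to AP j.
    pos: (2,) estimated position; ap_pos: (n, 2) AP positions;
    rx_power: length-n received powers."""
    loss = 0.0
    n = len(ap_pos)
    for i in range(n):
        for j in range(n):
            if rx_power[i] > rx_power[j]:
                d_i = np.linalg.norm(pos - ap_pos[i])
                d_j = np.linalg.norm(pos - ap_pos[j])
                loss += max(0.0, d_i - d_j + margin)  # penalize wrong ordering
    return loss
```

A position consistent with the power ordering incurs zero loss; only relative AP positions and powers are needed, which is why no ground-truth UE positions are required.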


[102] 2406.13842

Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control

Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization, specifically speaker role detection (SRD), to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences influence all architectures and show how to overcome them for our joint architecture.


[103] 2406.13982

Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio

RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained in full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environments, the dataset inevitably suffers data imbalance in some acoustic properties, leading to subpar performance for the underrepresented data. The signal-to-noise ratio (SNR), inherently balanced in supervised learning, is a prime example. In this paper, we provide empirical evidence that the SNR of pseudo data has a significant impact on model performance using the dataset of the CHiME-7 UDASE task, highlighting the importance of balanced SNR in DASE. Furthermore, we propose adopting curriculum learning to encompass a broad range of SNRs to boost performance for underrepresented data.
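The remixing step that makes SNR balance matter can be illustrated with a small helper that mixes speech and noise at a prescribed SNR; this is a generic sketch, not the RemixIT/Remixed2Remixed pipeline itself:

```python
import numpy as np

def remix_at_snr(speech, noise, target_snr_db):
    """Rescale `noise` so that speech + noise attains the requested SNR (dB).
    Both inputs are 1-D arrays of the same length."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_speech / (gain^2 * p_noise) = 10^(SNR/10) for the noise gain
    gain = np.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    return speech + gain * noise
```

Sampling `target_snr_db` from a schedule that widens over training is one simple way to realize the curriculum over underrepresented SNRs that the paper advocates.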


[104] 2406.13992

Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game Perspective

In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.


[105] 2406.14000

Robust nonlinear state-feedback control of second-order systems

This note proposes a novel nonlinear state feedback controller for perturbed second-order systems. In analogy to a linear proportional-derivative (PD) output feedback control, the proposed nonlinear scheme uses the output state of interest and its time derivative for a robust finite-time regulation. The control has only one free design parameter, and the closed-loop system is shown to be uniformly asymptotically stable in the presence of matched disturbances. We derive a strict Lyapunov function for the closed control loop with a bounded exogenous perturbation, and use it for both the control tuning and analysis of the finite-time convergence. Apart from the numerical results, a revealing experimental example is also shown in favor of the proposed control and in comparison with PD and sub-optimal nonlinear damping regulators.


[106] 2406.14011

Primal-Dual Strategy (PDS) for Composite Optimization Over Directed Graphs

We investigate the distributed multi-agent sharing optimization problem in a directed graph, with a composite objective function consisting of a smooth function plus a convex (possibly non-smooth) function shared by all agents. While adhering to the network connectivity structure, the goal is to minimize the sum of smooth local functions plus a non-smooth function. The proposed Primal-Dual algorithm (PD) is similar to a previous algorithm \cite{b27}, but it has additional benefits. To begin, we investigate the problem in directed graphs, where agents can only communicate in one direction and the combination matrix is not symmetric. Furthermore, the combination matrix is changing over time, and the coefficient weights are produced using an adaptive approach. Linear convergence is established under the strong convexity assumption, with adaptive, time-varying coefficient weights and newly derived upper bounds on the step-sizes that hold in the presence of both smooth and non-smooth terms. Simulation results show the efficacy of the proposed algorithm compared to some other algorithms.


[107] 2406.14064

PAPR Reduction with Pre-chirp Selection for Affine Frequency Division Multiplexing

Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on the discrete affine Fourier transform (DAFT). By properly tuning the pre-chirp and post-chirp parameters in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths and thus constitutes a full representation of the delay-Doppler profile, which significantly improves system performance in high-mobility scenarios. However, AFDM suffers from the crucial problem of a high peak-to-average power ratio (PAPR) caused by the phase randomness of the modulated symbols. In this letter, an algorithm named grouped pre-chirp selection (GPS) is proposed to reduce the PAPR by changing the value of the pre-chirp parameter on sub-carriers group by group. Specifically, we first demonstrate that the important properties of the AFDM system are maintained when implementing GPS. Secondly, we elaborate on the operational steps of the GPS algorithm, illustrating its effect on PAPR reduction and its advantage in computational complexity over the ungrouped approach. Finally, simulation results on PAPR reduction in the form of the complementary cumulative distribution function (CCDF) show the effectiveness of the proposed GPS algorithm.
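For reference, the PAPR and the CCDF used to report such reductions are standard quantities; a minimal generic sketch (not the GPS algorithm itself) is:

```python
import numpy as np

def papr_db(x):
    """Peak-to-average power ratio of a (complex) baseband block, in dB."""
    power = np.abs(x) ** 2
    return 10 * np.log10(power.max() / power.mean())

def ccdf(papr_values, thresholds_db):
    """Empirical CCDF: Pr(PAPR > threshold) for each threshold in dB."""
    papr_values = np.asarray(papr_values)
    return [np.mean(papr_values > t) for t in thresholds_db]
```

A scheme like GPS lowers the curve returned by `ccdf` over many random symbol blocks, i.e., high-PAPR blocks become rarer.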


[108] 2406.14067

A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidth

In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested, which is capable of achieving a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, delivering clear imaging of multiple small targets, and maintaining a frequency measurement error of less than $\pm$ 7 MHz and a frequency resolution of better than 20 MHz.


[109] 2406.14082

FLoCoRA: Federated learning compression with low-rank adaptation

Low-Rank Adaptation (LoRA) methods have gained popularity for efficient parameter fine-tuning of models containing hundreds of billions of parameters. In this work, instead, we demonstrate the application of LoRA methods to train small vision models in Federated Learning (FL) from scratch. We first propose an aggregation-agnostic method to integrate LoRA within FL, named FLoCoRA, showing that the method is capable of reducing communication costs by 4.8 times, while incurring less than 1% accuracy degradation, for a CIFAR-10 classification task with a ResNet-8. Next, we show that the same method can be extended with an affine quantization scheme, dividing the communication cost by 18.6 times compared to the standard method, still with less than 1% accuracy loss, tested on a ResNet-18 model. Our formulation represents a strong baseline for message size reduction, even when compared to conventional model compression works, while also reducing the training memory requirements due to the low-rank adaptation.
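The communication saving comes from exchanging only low-rank factors instead of full weight matrices. A generic LoRA-style layer (our reading of the idea, not the FLoCoRA code) makes the parameter count explicit:

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer with a rank-r additive update: y = (W + B A) x.
    In a FLoCoRA-style FL round, only the small factors A and B would be
    communicated and aggregated; W stays fixed on every client."""

    def __init__(self, weight, rank, rng):
        self.weight = weight                      # frozen, shape (m, n)
        m, n = weight.shape
        self.A = rng.standard_normal((rank, n)) * 0.01
        self.B = np.zeros((m, rank))              # zero init: update starts at 0

    def __call__(self, x):
        return (self.weight + self.B @ self.A) @ x

    def trainable_parameters(self):
        return self.A.size + self.B.size
```

For a hypothetical 256x256 layer with rank 4, the communicated parameters shrink from 65,536 to 2,048 (32x); the paper's overall 4.8x figure will depend on the chosen rank and on which layers are adapted.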


[110] 2406.14092

Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for a limited set of languages and may encounter new languages in the real world. Developing an SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existing SSL models to a new language without impairing their original abilities. We propose adaptation methods that integrate LoRA into existing SSL models to extend them to new languages. We also develop preservation strategies, including data combination and re-clustering, to retain abilities on existing languages. Applied to mHuBERT, we investigate their effectiveness on a speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin), increasing the MOS by about 1.6 and reducing the relative WER by up to 61.72%. Our preservation strategies also ensure that performance on both existing and new languages remains intact.


[111] 2406.14176

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake videos (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.


[112] 2406.14177

SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation

This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->{German, Japanese, Chinese}, and Czech->English), achieving acceptable or even better results compared to last year's submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: https://github.com/hlt-mt/FBK-fairseq/.


[113] 2406.14234

Zero field active shielding

Ambient field suppression is critical for accurate magnetic field measurements, and a requirement for certain low-field sensors to operate. The difference in magnitude between noise and signal (up to 10$^9$) makes the problem challenging, and solutions such as passive shielding, post-hoc processing, and most active shielding designs do not address it completely. Zero field active shielding (ZFS) achieves accurate field suppression with a feed-forward structure in which correction coils are fed by reference sensors via a matrix found using data-driven methods. Requirements are a sufficient number of correction coils and reference sensors to span the ambient field at the sensors, and to zero out the coil-to-reference sensor coupling. The solution assumes instantaneous propagation and mixing, but it can be extended to handle convolutional effects. Precise calculations based on sensor and coil geometries are not necessary, other than to improve efficiency and usability. The solution is simulated here but not implemented in hardware.
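The data-driven matrix estimation can be sketched as an ordinary least-squares fit: learn the reference-to-sensor coupling from calibration data, then invert a separately measured coil-to-sensor coupling. This is a simplified instantaneous-mixing model, consistent with the abstract's stated assumption (convolutional effects are an extension):

```python
import numpy as np

def zfs_matrix(refs, sensors, coil_coupling):
    """Feed-forward matrix sketch for zero field active shielding.
    refs: (n_refs, T) reference-sensor recordings (coils off)
    sensors: (n_sensors, T) field-sensor recordings from the same run
    coil_coupling: (n_sensors, n_coils) measured coil-to-sensor coupling
    Returns M such that driving the coils with u = M @ r cancels the
    ambient field predicted from the reference readings r."""
    # Least-squares estimate of G in sensors ≈ G @ refs
    G, *_ = np.linalg.lstsq(refs.T, sensors.T, rcond=None)
    G = G.T                                   # (n_sensors, n_refs)
    # Drive coils to produce -G @ r at the sensors
    return -np.linalg.pinv(coil_coupling) @ G
```

With enough coils and reference sensors to span the ambient field, the residual `sensors + C @ M @ refs` is driven to zero, mirroring the spanning requirement stated above.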


[114] 2406.14294

DASB -- Discrete Audio and Speech Benchmark

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.


[115] 2406.14329

Adaptive Adversarial Cross-Entropy Loss for Sharpness-Aware Minimization

Recent advancements in learning algorithms have demonstrated that the sharpness of the loss surface is an effective measure for improving the generalization gap. Building upon this concept, Sharpness-Aware Minimization (SAM) was proposed to enhance model generalization and achieved state-of-the-art performance. SAM consists of two main steps: the weight perturbation step and the weight updating step. However, the perturbation in SAM is determined by only the gradient of the training loss, i.e., the cross-entropy loss. As the model approaches a stationary point, this gradient becomes small and oscillates, leading to inconsistent perturbation directions and a risk of the gradient diminishing. Our research introduces an innovative approach to further enhance model generalization. We propose the Adaptive Adversarial Cross-Entropy (AACE) loss function to replace the standard cross-entropy loss for SAM's perturbation. The AACE loss and its gradient uniquely increase as the model nears convergence, ensuring a consistent perturbation direction and addressing the gradient diminishing issue. Additionally, a novel perturbation-generating function utilizing AACE loss without normalization is proposed, enhancing the model's exploratory capabilities in near-optimum stages. Empirical testing confirms the effectiveness of AACE, with experiments demonstrating improved performance in image classification tasks using Wide ResNet and PyramidNet across various datasets. The reproduction code is available online.


[116] 2406.14333

LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation

As online music consumption increasingly shifts towards playlist-based listening, the task of playlist continuation, in which an algorithm suggests songs to extend a playlist in a personalized and musically cohesive manner, has become vital to the success of music streaming. Currently, many existing playlist continuation approaches rely on collaborative filtering methods to perform recommendation. However, such methods will struggle to recommend songs that lack interaction data, an issue known as the cold-start problem. Current approaches to this challenge design complex mechanisms for extracting relational signals from sparse collaborative data and integrating them into content representations. However, these approaches leave content representation learning out of scope and utilize frozen, pre-trained content models that may not be aligned with the distribution or format of a specific musical setting. Furthermore, even the musical state-of-the-art content modules are either (1) incompatible with the cold-start setting or (2) unable to effectively integrate cross-modal and relational signals. In this paper, we introduce LARP, a multi-modal cold-start playlist continuation model, to effectively overcome these limitations. LARP is a three-stage contrastive learning framework that integrates both multi-modal and relational signals into its learned representations. Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss. Experimental results on two publicly available datasets demonstrate the efficacy of LARP over uni-modal and multi-modal models for playlist continuation in a cold-start setting. Code and dataset are released at: https://github.com/Rsalganik1123/LARP.


[117] 2406.14338

Adaptive Robust Controller for handling Unknown Uncertainty of Robotic Manipulators

The ability to achieve precise and smooth trajectory tracking is crucial for ensuring the successful execution of various tasks involving robotic manipulators. State-of-the-art techniques require accurate mathematical models of the robot dynamics, and robustness to model uncertainties is achieved by relying on precise bounds on the model mismatch. In this paper, we propose a novel adaptive robust feedback linearization scheme able to compensate for model uncertainties without any a priori knowledge of them, and we provide a theoretical proof of convergence under mild assumptions. We evaluate the method on a simulated RR robot. First, we consider a nominal model with known model mismatch, which allows us to compare our strategy with state-of-the-art uncertainty-aware methods. Second, we implement the proposed control law in combination with a learned model, for which uncertainty bounds are not available. Results show that our method leads to performance comparable to uncertainty-aware methods while requiring less prior knowledge.


[118] 2406.14361

Robustness Analysis of AI Models in Critical Energy Systems

This paper analyzes the robustness of state-of-the-art AI-based models for power grid operations under the $N-1$ security criterion. While these models perform well in regular grid settings, our results highlight a significant loss in accuracy following the disconnection of a line. Using graph theory-based analysis, we demonstrate the impact of node connectivity on this loss. Our findings emphasize the need for practical scenario considerations in developing AI methodologies for critical infrastructure.


[119] 2406.14458

Centimeter Positioning Accuracy using AI/ML for 6G Applications

This research looks at using AI/ML to achieve centimeter-level user positioning in 6G applications such as the Industrial Internet of Things (IIoT). Initial results show that our AI/ML-based method can estimate user positions with an accuracy of 17 cm in an indoor factory environment. In this proposal, we highlight our approaches and future directions.


[120] 2406.14464

A Review of Common Online Speaker Diarization Methods

Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First, the history of online speaker diarization is briefly presented. Next, a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.


[121] 2406.14485

Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)

This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.


[122] 2406.14559

Disentangled Representation Learning for Environment-agnostic Speaker Recognition

This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation - used as the refined embedding - condenses only the speaker characteristics. We show the versatility of our framework through its compatibility with any existing speaker embedding extractor, requiring no structural modifications or adaptations for integration. We validate the effectiveness of our framework by incorporating it into two popular embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. Our code is available at https://github.com/kaistmm/voxceleb-disentangler