New articles on Electrical Engineering and Systems Science

[1] 2402.18575

DiffuseRAW: End-to-End Generative RAW Image Processing for Low-Light Images

Imaging under extremely low-light conditions presents a significant challenge and is an ill-posed problem due to the low signal-to-noise ratio (SNR) caused by minimal photon capture. Previously, diffusion models have been used for multiple kinds of generative tasks and image-to-image tasks, however, these models work as a post-processing step. These diffusion models are trained on processed images and learn on processed images. However, such approaches are often not well-suited for extremely low-light tasks. Unlike the task of low-light image enhancement or image-to-image enhancement, we tackle the task of learning the entire image-processing pipeline, from the RAW image to a processed image. For this task, a traditional image processing pipeline often consists of multiple specialized parts that are overly reliant on the downstream tasks. Unlike these, we develop a new generative ISP that relies on fine-tuning latent diffusion models on RAW images and generating processed long-exposure images which allows for the apt use of the priors from large text-to-image generation models. We evaluate our approach on popular end-to-end low-light datasets for which we see promising results and set a new SoTA on the See-in-Dark (SID) dataset. Furthermore, with this work, we hope to pave the way for more generative and diffusion-based image processing and other problems on RAW data.

[2] 2402.18600

Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina

Diabetes mellitus (DM) predisposes patients to vascular complications. Retinal images and vasculature reflect the body's micro- and macrovascular health. They can be used to diagnose DM complications, including diabetic retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular disease, as well as forecast the risk of cardiovascular events. Artificial intelligence (AI)-enabled systems developed for high-throughput detection of DR using digitized retinal images have become clinically adopted. Beyond DR screening, AI integration also holds immense potential to address challenges associated with the holistic care of the patient with DM. In this work, we aim to comprehensively review the literature for studies on AI applications based on retinal images related to DM diagnosis, prognostication, and management. We will describe the findings of holistic AI-assisted diabetes care, including but not limited to DR screening, and discuss barriers to implementing such systems, including issues concerning ethics, data privacy, equitable access, and explainability. With the ability to evaluate the patient's health status vis a vis DM complication as well as risk prognostication of future cardiovascular complications, AI-assisted retinal image analysis has the potential to become a central tool for modern personalized medicine in patients with DM.

[3] 2402.18615

Unsupervised Airway Tree Clustering with Deep Learning: The Multi-Ethnic Study of Atherosclerosis (MESA) Lung Study

High-resolution full lung CT scans now enable the detailed segmentation of airway trees up to the 6th branching generation. The airway binary masks display very complex tree structures that may encode biological information relevant to disease risk and yet remain challenging to exploit via traditional methods such as meshing or skeletonization. Recent clinical studies suggest that some variations in shape patterns and caliber of the human airway tree are highly associated with adverse health outcomes, including all-cause mortality and incident COPD. However, quantitative characterization of variations observed on CT segmented airway tree remain incomplete, as does our understanding of the clinical and developmental implications of such. In this work, we present an unsupervised deep-learning pipeline for feature extraction and clustering of human airway trees, learned directly from projections of 3D airway segmentations. We identify four reproducible and clinically distinct airway sub-types in the MESA Lung CT cohort.

[4] 2402.18709

Nonlinear identification algorithm for online and offline study of pulmonary mechanical ventilation

This work presents an algorithm for determining the parameters of a nonlinear dynamic model of the respiratory system in patients undergoing assisted ventilation. Using the pressure and flow signals measured at the mouth, the model's quadratic pressure-volume characteristic is fit to this data in each respiratory cycle by appropriate estimates of the model parameters. Parameter changes during ventilation can thus also be detected. The algorithm is first refined and assessed using data derived from simulated patients represented through a sigmoidal pressure-volume characteristic with hysteresis. As satisfactory results are achieved with the simulated data, the algorithm is evaluated with real data obtained from actual patients undergoing assisted ventilation. The proposed nonlinear dynamic model and associated parameter estimation algorithm yield closer fits than the static linear models computed by respiratory machines, with only a minor increase in computation. They also provide more information to the physician, such as the pressure-volume (P-V) curvature and the condition of the lung (whether normal, under-inflated, or over-inflated). This information can be used to provide safer ventilation for patients, for instance by ventilating them in the linear region of the respiratory system.

[5] 2402.18719

MaxCUCL: Max-Consensus with Deterministic Convergence in Networks with Unreliable Communication

In this paper, we present a novel distributed algorithm (herein called MaxCUCL) designed to guarantee that max-consensus is reached in networks characterized by unreliable communication links (i.e., links suffering from packet drops). Our proposed algorithm is the first algorithm that achieves max-consensus in a deterministic manner (i.e., nodes always calculate the maximum of their states regardless of the nature of the probability distribution of the packet drops). Furthermore, it allows nodes to determine whether convergence has been achieved (enabling them to transition to subsequent tasks). The operation of MaxCUCL relies on the deployment of narrowband error-free feedback channels used for acknowledging whether a packet transmission between nodes was successful. We analyze the operation of our algorithm and show that it converges after a finite number of time steps. Finally, we demonstrate our algorithm's effectiveness and practical applicability by applying it to a sensor network deployed for environmental monitoring.

[6] 2402.18744

Timer-Based Coverage Control for Mobile Sensors

This work studies the coverage control problem over a static, bounded, and convex workspace and develops a hybrid extension of the continuous-time Lloyd algorithm. Each agent in a multi-agent system (MAS) is equipped with a timer that generates intermittent sampling events, which may occur asynchronously between agents. At each sampling event, the corresponding agents update their controllers, which are otherwise held constant. These controllers are shown to drive the MAS into a neighborhood of the configurations corresponding to a centroidal Voronoi tessellation, that is, a local minimizer of the standard locational cost. The result is a distributed control strategy that leverages intermittent and asynchronous position measurements to disperse the agents within the workspace. The combination of continuous-time dynamics with intermittently updated control inputs is modeled as a hybrid system. The coverage control objective is posed as a set stabilization problem for hybrid systems, where an invariance based convergence analysis yields sufficient conditions that ensure all maximal solutions of the hybrid system asymptotically converge to a desired set. A brief simulation example is included to showcase the result.

[7] 2402.18758

Analog Isolated Multilevel Quantizer for Voltage Sensing while Maintaining Galvanic Isolation

A low-power, compact device for performing measurements in electrical systems with isolated voltage domains is proposed. Isolated measurements are required in numerous applications. For instance, a measurement of the bus voltage for a system with a high supply voltage and lower isolated local voltage level may be needed for system health monitoring and control. Such a requirement may necessitate the use of isolation amplifiers to provide voltage telemetry for the local system. Isolation amplifiers require dual galvanically isolated supplies and use magnetic, capacitive, or optical barriers between primary and secondary sides. Producing this supplemental voltage requires an extra voltage converter, which consumes power and generates electromagnetic interference which must, in turn, be filtered. Complex designs incorporating feedback are needed to achieve linear response. The proposed Analog Isolated Multilevel Quantizer (AIMQ) addresses these issues by monitoring the primary-side signal and communicating the results to the secondary side using a novel scheme involving Zener diodes, optocouplers, transistors, one-hot coding, and discrete outputs. The result is a low power isolated transducer that can in principle be extended to an arbitrary bit depth.

[8] 2402.18761

Exploration of Learned Lifting-Based Transform Structures for Fully Scalable and Accessible Wavelet-Like Image Compression

This paper provides a comprehensive study on features and performance of different ways to incorporate neural networks into lifting-based wavelet-like transforms, within the context of fully scalable and accessible image compression. Specifically, we explore different arrangements of lifting steps, as well as various network architectures for learned lifting operators. Moreover, we examine the impact of the number of learned lifting steps, the number of channels, the number of layers and the support of kernels in each learned lifting operator. To facilitate the study, we investigate two generic training methodologies that are simultaneously appropriate to a wide variety of lifting structures considered. Experimental results ultimately suggest that retaining fixed lifting steps from the base wavelet transform is highly beneficial. Moreover, we demonstrate that employing more learned lifting steps and more layers in each learned lifting operator do not contribute strongly to the compression performance. However, benefits can be obtained by utilizing more channels in each learned lifting operator. Ultimately, the learned wavelet-like transform proposed in this paper achieves over 25% bit-rate savings compared to JPEG 2000 with compact spatial support.

[9] 2402.18777

GDCNet: Calibrationless geometric distortion correction of echo planar imaging data using deep learning

Functional magnetic resonance imaging techniques benefit from echo-planar imaging's fast image acquisition but are susceptible to inhomogeneities in the main magnetic field, resulting in geometric distortion and signal loss artifacts in the images. Traditional methods leverage a field map or voxel displacement map for distortion correction. However, voxel displacement map estimation requires additional sequence acquisitions, and the accuracy of the estimation influences correction performance. This work implements a novel approach called GDCNet, which estimates a geometric distortion map by non-linear registration to T1-weighted anatomical images and applies it for distortion correction. GDCNet demonstrated fast distortion correction of functional images in retrospectively and prospectively acquired datasets. Among the compared models, the 2D self-supervised configuration resulted in a statistically significant improvement to normalized mutual information between distortion-corrected functional and T1-weighted images compared to the benchmark methods FUGUE and TOPUP. Furthermore, GDCNet models achieved processing speeds 14 times faster than TOPUP in the prospective dataset.

[10] 2402.18856

Anatomy-guided fiber trajectory distribution estimation for cranial nerves tractography

Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods of CNs identification are prone to producing erroneous trajectories and missing true positive connections. To overcome the above challenge, we propose a novel CNs identification framework with anatomy-guided fiber trajectory distribution, which incorporates anatomical shape prior knowledge during the process of CNs tracing to build diffusion tensor vector fields. We introduce higher-order streamline differential equations for continuous flow field representations to directly characterize the fiber trajectory distribution of CNs from the tract-based level. The experimental results on the vivo HCP dataset and the clinical MDM dataset demonstrate that the proposed method reduces false-positive fiber production compared to competing methods and produces reconstructed CNs (i.e. CN II, CN III, CN V, and CN VII/VIII) that are judged to better correspond to the known anatomy.

[11] 2402.18862

Towards Backward-Compatible Continual Learning of Image Compression

This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at

[12] 2402.18864

Privacy-Preserving Autoencoder for Collaborative Object Detection

Privacy is a crucial concern in collaborative machine vision where a part of a Deep Neural Network (DNN) model runs on the edge, and the rest is executed on the cloud. In such applications, the machine vision model does not need the exact visual content to perform its task. Taking advantage of this potential, private information could be removed from the data insofar as it does not significantly impair the accuracy of the machine vision system. In this paper, we present an autoencoder-style network integrated within an object detection pipeline, which generates a latent representation of the input image that preserves task-relevant information while removing private information. Our approach employs an adversarial training strategy that not only removes private information from the bottleneck of the autoencoder but also promotes improved compression efficiency for feature channels coded by conventional codecs like VVC-Intra. We assess the proposed system using a realistic evaluation framework for privacy, directly measuring face and license plate recognition accuracy. Experimental results show that our proposed method is able to reduce the bitrate significantly at the same object detection accuracy compared to coding the input images directly, while keeping the face and license plate recognition accuracy on the images recovered from the bottleneck features low, implying strong privacy protection.

[13] 2402.18867

Message-Enhanced DeGroot Model

Understanding the impact of messages on agents' opinions over social networks is important. However, to our best knowledge, there has been limited quantitative investigation into this phenomenon in the prior works. To address this gap, this paper proposes the Message-Enhanced DeGroot model. The Bounded Brownian Message model provides a quantitative description of the message evolution, jointly considering temporal continuity, randomness, and polarization from mass media theory. The Message-Enhanced DeGroot model, combining the Bounded Brownian Message model with the traditional DeGroot model, quantitatively describes the evolution of agents' opinions under the influence of messages. We theoretically study the probability distribution and statistics of the messages and agents' opinions and quantitatively analyze the impact of messages on opinions. We also conduct simulations to validate our analyses.

[14] 2402.18871

LoLiSRFlow: Joint Single Image Low-light Enhancement and Super-resolution via Cross-scale Transformer-based Conditional Flow

The visibility of real-world images is often limited by both low-light and low-resolution, however, these issues are only addressed in the literature through Low-Light Enhancement (LLE) and Super- Resolution (SR) methods. Admittedly, a simple cascade of these approaches cannot work harmoniously to cope well with the highly ill-posed problem for simultaneously enhancing visibility and resolution. In this paper, we propose a normalizing flow network, dubbed LoLiSRFLow, specifically designed to consider the degradation mechanism inherent in joint LLE and SR. To break the bonds of the one-to-many mapping for low-light low-resolution images to normal-light high-resolution images, LoLiSRFLow directly learns the conditional probability distribution over a variety of feasible solutions for high-resolution well-exposed images. Specifically, a multi-resolution parallel transformer acts as a conditional encoder that extracts the Retinex-induced resolution-and-illumination invariant map as the previous one. And the invertible network maps the distribution of usually exposed high-resolution images to a latent distribution. The backward inference is equivalent to introducing an additional constrained loss for the normal training route, thus enabling the manifold of the natural exposure of the high-resolution image to be immaculately depicted. We also propose a synthetic dataset modeling the realistic low-light low-resolution degradation, named DFSR-LLE, containing 7100 low-resolution dark-light/high-resolution normal sharp pairs. Quantitative and qualitative experimental results demonstrate the effectiveness of our method on both the proposed synthetic and real datasets.

[15] 2402.18930

Variable-Rate Learned Image Compression with Multi-Objective Optimization and Quantization-Reconstruction Offsets

Achieving successful variable bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor or uniformly quantizing all elements of the latent tensor. This paper follows the traditional approach to vary a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable rate compression performance. First, multi objective optimization is used for (post) training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable rate quantization is used also for the hyper latent. All these modifications can be made on a pre-trained single-rate compression model by performing post training. The algorithms are implemented into three well-known image compression models and the achieved variable rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at

[16] 2402.18932

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.

[17] 2402.18963

Quantification of Tracer Dilution Dynamics: An Exploration into the Mathematical Modeling of Medical Imaging

Convolution and deconvolution are essential techniques in various fields, notably in medical imaging, where they play a crucial role in analyzing dynamic processes such as blood flow. This paper explores the convolution and deconvolution of arterial and microvascular signals for determining impulse and residue functions from in vivo or simulated data and the derivation of the relationship between the residue function and perfusion metrics such as the Cerebral Blood Flow (CBF), Mean Transit Time (MTT) and Transit Time to Heterogeneity (TTH). The paper presents the spectral derivatives as a technique for recovering the impulse response function from the residue function, detailing the computational procedures involved and strategies for mitigating noise effects.

[18] 2402.18968

Ambisonics Networks -- The Effect Of Radial Functions Regularization

Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, with the downside of introducing errors to the Ambisonics encoding. This paper aims to investigate the impact of different ways of regularization on Deep Neural Network (DNN) training and performance. Ideally, these networks should be robust to the way of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example algorithm of speaker localization based on the direct-path dominance (DPD) test. Results show that performance may be sensitive to the way of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information.

[19] 2402.19013

Ultraviolet Positioning via TDOA: Error Analysis and System Prototype

This work performs the design, real-time hardware realization, and experimental evaluation of a positioning system by ultra-violet (UV) communication under photon-level signal detection. The positioning is based on time-difference of arrival (TDOA) principle. Time division-based transmission of synchronization sequence from three transmitters with known positions is applied. We investigate the positioning error via decomposing it into two parts, the transmitter-side timing error and the receiver-side synchronization error. The theoretical average error matches well with the simulation results, which indicates that theoretical fitting can provide reliable guidance and prediction for hardware experiments. We also conduct real-time hardware realization of the TDOA-based positioning system using Field Programmable Gate Array (FPGA), which is experimentally evaluated via outdoor experiments. Experimental results match well with the theoretical and simulation results.

[20] 2402.19020

Unsupervised Learning of High-resolution Light Field Imaging via Beam Splitter-based Hybrid Lenses

In this paper, we design a beam splitter-based hybrid light field imaging prototype to record 4D light field image and high-resolution 2D image simultaneously, and make a hybrid light field dataset. The 2D image could be considered as the high-resolution ground truth corresponding to the low-resolution central sub-aperture image of 4D light field image. Subsequently, we propose an unsupervised learning-based super-resolution framework with the hybrid light field dataset, which adaptively settles the light field spatial super-resolution problem with a complex degradation model. Specifically, we design two loss functions based on pre-trained models that enable the super-resolution network to learn the detailed features and light field parallax structure with only one ground truth. Extensive experiments demonstrate the same superiority of our approach with supervised learning-based state-of-the-art ones. To our knowledge, it is the first end-to-end unsupervised learning-based spatial super-resolution approach in light field imaging research, whose input is available from our beam splitter-based hybrid light field system. The hardware and software together may help promote the application of light field super-resolution to a great extent.

[21] 2402.19023

Jointly Learning Selection Matrices For Transmitters, Receivers And Fourier Coefficients In Multichannel Imaging

Strategic subsampling has become a focal point due to its effectiveness in compressing data, particularly in the Full Matrix Capture (FMC) approach in ultrasonic imaging. This paper introduces the Joint Deep Probabilistic Subsampling (J-DPS) method, which aims to learn optimal selection matrices simultaneously for transmitters, receivers, and Fourier coefficients. This task-based algorithm is realized by introducing a specialized measurement model and integrating a customized Complex Learned FISTA (CL-FISTA) network. We propose a parallel network architecture, partitioned into three segments corresponding to the three matrices, all working toward a shared optimization objective with adjustable loss allocation. A synthetic dataset is designed to reflect practical scenarios, and we provide quantitative comparisons with a traditional CRB-based algorithm, standard DPS, and J-DPS.

[22] 2402.19043

WDM: 3D Wavelet Diffusion Models for High-Resolution Medical Image Synthesis

Due to the three-dimensional nature of CT- or MR-scans, generative modeling of medical images is a particularly challenging task. Existing approaches mostly apply patch-wise, slice-wise, or cascaded generation techniques to fit the high-dimensional data into the limited GPU memory. However, these approaches may introduce artifacts and potentially restrict the model's applicability for certain downstream tasks. This work presents WDM, a wavelet-based medical image synthesis framework that applies a diffusion model on wavelet decomposed images. The presented approach is a simple yet effective way of scaling diffusion models to high resolutions and can be trained on a single 40 GB GPU. Experimental results on BraTS and LIDC-IDRI unconditional image generation at a resolution of $128 \times 128 \times 128$ show state-of-the-art image fidelity (FID) and sample diversity (MS-SSIM) scores compared to GANs, Diffusion Models, and Latent Diffusion Models. Our proposed method is the only one capable of generating high-quality images at a resolution of $256 \times 256 \times 256$.

[23] 2402.19062

Graph Convolutional Neural Networks for Automated Echocardiography View Recognition: A Holistic Approach

To facilitate diagnosis on cardiac ultrasound (US), clinical practice has established several standard views of the heart, which serve as reference points for diagnostic measurements and define viewports from which images are acquired. Automatic view recognition involves grouping those images into classes of standard views. Although deep learning techniques have been successful in achieving this, they still struggle with fully verifying the suitability of an image for specific measurements due to factors like the correct location, pose, and potential occlusions of cardiac structures. Our approach goes beyond view classification and incorporates a 3D mesh reconstruction of the heart that enables several more downstream tasks, like segmentation and pose estimation. In this work, we explore learning 3D heart meshes via graph convolutions, using similar techniques to learn 3D meshes in natural images, such as human pose estimation. As the availability of fully annotated 3D images is limited, we generate synthetic US images from 3D meshes by training an adversarial denoising diffusion model. Experiments were conducted on synthetic and clinical cases for view recognition and structure detection. The approach yielded good performance on synthetic images and, despite being exclusively trained on synthetic data, it already showed potential when applied to clinical images. With this proof-of-concept, we aim to demonstrate the benefits of graphs to improve cardiac view recognition that can ultimately lead to better efficiency in cardiac diagnosis.

[24] 2402.19073

Automatic Radar Signal Detection and FFT Estimation using Deep Learning

This paper addresses a critical preliminary step in radar signal processing: detecting the presence of a radar signal and robustly estimating its bandwidth. Existing methods which are largely statistical feature-based approaches face challenges in electronic warfare (EW) settings where prior information about signals is lacking. While alternate deep learning based methods focus on more challenging environments, they primarily formulate this as a binary classification problem. In this research, we propose a novel methodology that not only detects the presence of a signal, but also localises it in the time domain and estimates its operating frequency band at that point in time. To achieve robust estimation, we introduce a compound loss function that leverages complementary information from both time-domain and frequency-domain representations. By integrating these approaches, we aim to improve the efficiency and accuracy of radar signal detection and parameter estimation, reducing both unnecessary resource consumption and human effort in downstream tasks.

[25] 2402.19106

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.

[26] 2402.19110

Temporal-Aware Deep Reinforcement Learning for Energy Storage Bidding in Energy and Contingency Reserve Markets

The battery energy storage system (BESS) has immense potential for enhancing grid reliability and security through its participation in the electricity market. BESS often seeks various revenue streams by taking part in multiple markets to unlock its full potential, but effective algorithms for joint-market participation under price uncertainties are insufficiently explored in the existing research. To bridge this gap, we develop a novel BESS joint bidding strategy that utilizes deep reinforcement learning (DRL) to bid in the spot and contingency frequency control ancillary services (FCAS) markets. Our approach leverages a transformer-based temporal feature extractor to effectively respond to price fluctuations in seven markets simultaneously and helps DRL learn the best BESS bidding strategy in joint-market participation. Additionally, unlike conventional "black-box" DRL model, our approach is more interpretable and provides valuable insights into the temporal bidding behavior of BESS in the dynamic electricity market. We validate our method using realistic market prices from the Australian National Electricity Market. The results show that our strategy outperforms benchmarks, including both optimization-based and other DRL-based strategies, by substantial margins. Our findings further suggest that effective temporal-aware bidding can significantly increase profits in the spot and contingency FCAS markets compared to individual market participation.

[27] 2402.19111

Deep Network for Image Compressed Sensing Coding Using Local Structural Sampling

Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still exist the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally maintain a much higher computational complexity. In this paper, we propose a new CNN based image CS coding framework using local structural sampling (dubbed CSCNet) that includes three functional modules: local structural sampling, measurement coding and Laplacian pyramid reconstruction. In the proposed framework, instead of GRM, a new local structural sampling matrix is first developed, which is able to enhance the correlation between the measurements through a local perceptual sampling strategy. Besides, the designed local structural sampling matrix can be jointly optimized with the other functional modules during training process. After sampling, the measurements with high correlations are produced, which are then coded into final bitstreams by the third-party image codec. At last, a Laplacian pyramid reconstruction network is proposed to efficiently recover the target image from the measurement domain to the image domain. Extensive experimental results demonstrate that the proposed scheme outperforms the existing state-of-the-art CS coding methods, while maintaining fast computational speed.

[28] 2402.19124

Analysis of Processing Pipelines for Indoor Human Tracking using FMCW radar

In this paper, the problem of formulating effective processing pipelines for indoor human tracking is investigated, with the usage of a Multiple Input Multiple Output (MIMO) Frequency Modulated Continuous Wave (FMCW) radar. Specifically, two processing pipelines starting with detections on the Range-Azimuth (RA) maps and the Range-Doppler (RD) maps are formulated and compared, together with subsequent clustering and tracking algorithms and their relevant parameters. Experimental results are presented to validate and assess both pipelines, using a 24 GHz commercial radar platform with 250 MHz bandwidth and 15 virtual channels. Scenarios where 1 and 2 people move in an indoor environment are considered, and the influence of the number of virtual channels and detectors' parameters is discussed. The characteristics and limitations of both pipelines are presented, with the approach based on detections on RA maps showing in general more robust results.

[29] 2402.19168

Disturbance Decoupling Problem for $n$-link chain pendulum on a cart system

A disturbance decoupling problem for a $n$-link chain pendulum on a cart is considered. A model of the cart developed in a coordinate-free framework and the linearized equations of this system are considered from [1]. It is shown that it is possible to design a suitable state feedback such that the angular position or velocity of the $n^{th}$-link can always be decoupled from the disturbance coming at the cart.

[30] 2402.19172

Point Processes and spatial statistics in time-frequency analysis

A finite-energy signal is represented by a square-integrable, complex-valued function $t\mapsto s(t)$ of a real variable $t$, interpreted as time. Similarly, a noisy signal is represented by a random process. Time-frequency analysis, a subfield of signal processing, amounts to describing the temporal evolution of the frequency content of a signal. Loosely speaking, if $s$ is the audio recording of a musical piece, time-frequency analysis somehow consists in writing the musical score of the piece. Mathematically, the operation is performed through a transform $\mathcal{V}$, mapping $s \in L^2(\mathbb{R})$ onto a complex-valued function $\mathcal{V}s \in L^2(\mathbb{R}^2)$ of time $t$ and angular frequency $\omega$. The squared modulus $(t, \omega) \mapsto \vert\mathcal{V}s(t,\omega)\vert^2$ of the time-frequency representation is known as the spectrogram of $s$; in the musical score analogy, a peaked spectrogram at $(t_0,\omega_0)$ corresponds to a musical note at angular frequency $\omega_0$ localized at time $t_0$. More generally, the intuition is that upper level sets of the spectrogram contain relevant information about in the original signal. Hence, many signal processing algorithms revolve around identifying maxima of the spectrogram. In contrast, zeros of the spectrogram indicate perfect silence, that is, a time at which a particular frequency is absent. Assimilating $\mathbb{R}^2$ to $\mathbb{C}$ through $z = \omega + \mathrm{i}t$, this chapter focuses on time-frequency transforms $\mathcal{V}$ that map signals to analytic functions. The zeros of the spectrogram of a noisy signal are then the zeros of a random analytic function, hence forming a Point Process in $\mathbb{C}$. This chapter is devoted to the study of these Point Processes, to their links with zeros of Gaussian Analytic Functions, and to designing signal detection and denoising algorithms using spatial statistics.

[31] 2402.19188

KGAMC: A Novel Knowledge Graph Driven Automatic Modulation Classification Scheme

Automatic modulation classification (AMC) is a promising technology to realize intelligent wireless communications in the sixth generation (6G) wireless communication networks. Recently, many data-and-knowledge dual-driven AMC schemes have achieved high accuracy. However, most of these schemes focus on generating additional prior knowledge or features of blind signals, which consumes longer computation time and ignores the interpretability of the model learning process. To solve these problems, we propose a novel knowledge graph (KG) driven AMC (KGAMC) scheme by training the networks under the guidance of domain knowledge. A modulation knowledge graph (MKG) with the knowledge of modulation technical characteristics and application scenarios is constructed and a relation-graph convolution network (RGCN) is designed to extract knowledge of the MKG. This knowledge is utilized to facilitate the signal features separation of the data-oriented model by implementing a specialized feature aggregation method. Simulation results demonstrate that KGAMC achieves superior classification performance compared to other benchmark schemes, especially in the low signal-to-noise ratio (SNR) range. Furthermore, the signal features of the high-order modulation are more discriminative, thus reducing the confusion between similar signals.

[32] 2402.19205

DeepEMC-T2 Mapping: Deep Learning-Enabled T2 Mapping Based on Echo Modulation Curve Modeling

Purpose: Echo modulation curve (EMC) modeling can provide accurate and reproducible quantification of T2 relaxation times. The standard EMC-T2 mapping framework, however, requires sufficient echoes and cumbersome pixel-wise dictionary-matching steps. This work proposes a deep learning version of EMC-T2 mapping, called DeepEMC-T2 mapping, to efficiently estimate accurate T2 maps from fewer echoes without a dictionary. Methods: DeepEMC-T2 mapping was developed using a modified U-Net to estimate both T2 and Proton Density (PD) maps directly from multi-echo spin-echo (MESE) images. The modified U-Net employs several new features to improve the accuracy of T2/PD estimation. MESE datasets from 68 subjects were used for training and evaluation of the DeepEMC-T2 mapping technique. Multiple experiments were conducted to evaluate the impact of the proposed new features on DeepEMC-T2 mapping. Results: DeepEMC-T2 mapping achieved T2 estimation errors ranging from 3%-12% in different T2 ranges and 0.8%-1.7% for PD estimation with 10/7/5/3 echoes, which yielded more accurate parameter estimation than standard EMC-T2 mapping. The new features proposed in DeepEMC-T2 mapping enabled improved parameter estimation. The use of a larger echo spacing with fewer echoes can maintain the accuracy of T2 and PD estimations while reducing the number of 180-degree refocusing pulses. Conclusions: DeepEMC-T2 mapping enables simplified, efficient, and accurate T2 quantification directly from MESE images without a time-consuming dictionary-matching step and requires fewer echoes. This allows for increased volumetric coverage and/or decreased SAR by reducing the number of 180-degree refocusing pulses.

[33] 2402.19215

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

Super-resolution (SR) is an ill-posed inverse problem, where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strike a balance between fidelity and perceptual quality. Unfortunately, all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts, this is a very challenging problem and a satisfactory solution is yet to be found. This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before, they have not been used in the context of the SR task. More specifically, we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet subbands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves better perception-distortion trade-off according to multiple objective measures and visual evaluations.

[34] 2402.19275

Adaptive Testing Environment Generation for Connected and Automated Vehicles with Dense Reinforcement Learning

The assessment of safety performance plays a pivotal role in the development and deployment of connected and automated vehicles (CAVs). A common approach involves designing testing scenarios based on prior knowledge of CAVs (e.g., surrogate models), conducting tests in these scenarios, and subsequently evaluating CAVs' safety performances. However, substantial differences between CAVs and the prior knowledge can significantly diminish the evaluation efficiency. In response to this issue, existing studies predominantly concentrate on the adaptive design of testing scenarios during the CAV testing process. Yet, these methods have limitations in their applicability to high-dimensional scenarios. To overcome this challenge, we develop an adaptive testing environment that bolsters evaluation robustness by incorporating multiple surrogate models and optimizing the combination coefficients of these surrogate models to enhance evaluation efficiency. We formulate the optimization problem as a regression task utilizing quadratic programming. To efficiently obtain the regression target via reinforcement learning, we propose the dense reinforcement learning method and devise a new adaptive policy with high sample efficiency. Essentially, our approach centers on learning the values of critical scenes displaying substantial surrogate-to-real gaps. The effectiveness of our method is validated in high-dimensional overtaking scenarios, demonstrating that our approach achieves notable evaluation efficiency.

[35] 2402.19276

Modular Blind Video Quality Assessment

Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services. Contemporary deep learning-based models primarily analyze the video content in its aggressively downsampled format, while being blind to the impact of actual spatial resolution and frame rate on video quality. In this paper, we propose a modular BVQA model, and a method of training it to improve its modularity. Specifically, our model comprises a base quality predictor, a spatial rectifier, and a temporal rectifier, responding to the visual content and distortion, spatial resolution, and frame rate changes on video quality, respectively. During training, spatial and temporal rectifiers are dropped out with some probabilities so as to make the base quality predictor a standalone BVQA model, which should work better with the rectifiers. Extensive experiments on both professionally-generated content and user generated content video databases show that our quality model achieves superior or comparable performance to current methods. Furthermore, the modularity of our model offers a great opportunity to analyze existing video quality databases in terms of their spatial and temporal complexities. Last, our BVQA model is cost-effective to add other quality-relevant video attributes such as dynamic range and color gamut as additional rectifiers.

[36] 2402.19286

PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation

Understanding the anatomy of renal pathology is crucial for advancing disease diagnostics, treatment evaluation, and clinical research. The complex kidney system comprises various components across multiple levels, including regions (cortex, medulla), functional units (glomeruli, tubules), and cells (podocytes, mesangial cells in glomerulus). Prior studies have predominantly overlooked the intricate spatial interrelations among objects from clinical knowledge. In this research, we introduce a novel universal proposition learning approach, called panoramic renal pathology segmentation (PrPSeg), designed to segment comprehensively panoramic structures within kidney by integrating extensive knowledge of kidney anatomy. In this paper, we propose (1) the design of a comprehensive universal proposition matrix for renal pathology, facilitating the incorporation of classification and spatial relationships into the segmentation process; (2) a token-based dynamic head single network architecture, with the improvement of the partial label image segmentation and capability for future data enlargement; and (3) an anatomy loss function, quantifying the inter-object relationships across the kidney.

[37] 2402.19289

CAMixerSR: Only Details Need More "Attention"

To satisfy the rapidly increasing demands on the large image (2K-8K) super-resolution (SR), prevailing methods follow two independent tracks: 1) accelerate existing networks by content-aware routing, and 2) design better super-resolution networks via token mixer refining. Despite directness, they encounter unavoidable defects (e.g., inflexible route or non-discriminative processing) limiting further improvements of quality-complexity trade-off. To erase the drawbacks, we integrate these schemes by proposing a content-aware mixer (CAMixer), which assigns convolution for simple contexts and additional deformable window-attention for sparse textures. Specifically, the CAMixer uses a learnable predictor to generate multiple bootstraps, including offsets for windows warping, a mask for classifying windows, and convolutional attentions for endowing convolution with the dynamic property, which modulates attention to include more useful textures self-adaptively and improves the representation capability of convolution. We further introduce a global classification loss to improve the accuracy of predictors. By simply stacking CAMixers, we obtain CAMixerSR which achieves superior performance on large-image SR, lightweight SR, and omnidirectional-image SR.

[38] 2402.19309

Closed-loop training of static output feedback neural network controllers for large systems: A distillation case study

The online implementation of model predictive control for constrained multivariate systems has two main disadvantages: it requires an estimate of the entire model state and an optimisation problem must be solved online. These issues have typically been treated separately. This work proposes an integrated approach for the offline training of an output feedback neural network controller in closed loop. Online this neural network controller computers the plant inputs cheaply using noisy measurements. In addition, the controller can be trained to only make use of certain predefined measurements. Further, a heuristic approach is proposed to perform the automatic selection of important measurements. The proposed method is demonstrated by extensive simulations using a non-linear distillation column model of 50 states.

[39] 2402.19387

SeD: Semantic-Aware Discriminator for Image Super-Resolution

Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish the real-fake images individually and adaptively, which guides the SR network to learn the more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) with extensive datasets, and then incorporate its semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowered the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR have demonstrated the effectiveness of our proposed methods.

[40] 2402.19470

Towards Generalizable Tumor Synthesis

Tumor synthesis enables the creation of artificial tumors in medical images, facilitating the training of AI models for tumor detection and segmentation. However, success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and, furthermore, the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g., hospitals). This paper made a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2cm) tend to have similar imaging characteristics in computed tomography (CT), whether they originate in the liver, pancreas, or kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, can create realistic tumors generalized to a range of organs even when trained on a limited number of tumor examples from only one organ. Moreover, we have shown that AI models trained on these synthetic tumors can be generalized to detect and segment real tumors from CT volumes, encompassing a broad spectrum of patient demographics, imaging protocols, and healthcare facilities.

[41] 2402.18579

Wilcoxon Nonparametric CFAR Scheme for Ship Detection in SAR Image

The parametric constant false alarm rate (CFAR) detection algorithms which are based on various statistical distributions, such as Gaussian, Gamma, Weibull, log-normal, G0 distribution, alpha-stable distribution, etc, are most widely used to detect the ship targets in SAR image at present. However, the clutter background in SAR images is complicated and variable. When the actual clutter background deviates from the assumed statistical distribution, the performance of the parametric CFAR detector will deteriorate. In addition to the parametric CFAR schemes, there is another class of nonparametric CFAR detectors which can maintain a constant false alarm rate for the target detection without the assumption of a known clutter distribution. In this work, the Wilcoxon nonparametric CFAR scheme for ship detection in SAR image is proposed and analyzed, and a closed form of the false alarm rate for the Wilcoxon nonparametric detector to determine the decision threshold is presented. By comparison with several typical parametric CFAR schemes on Radarsat-2, ICEYE-X6 and Gaofen-3 SAR images, the robustness of the Wilcoxon nonparametric detector to maintain a good false alarm performance in different detection backgrounds is revealed, and its detection performance for the weak ship in rough sea surface is improved to some extent. Moreover, the Wilcoxon nonparametric detector can suppress the false alarms resulting from the sidelobes at some degree and its detection speed is fast.

[42] 2402.18630

GNSS Positioning using Cost Function Regulated Multilateration and Graph Neural Networks

In urban environments, where line-of-sight signals from GNSS satellites are frequently blocked by high-rise objects, GNSS receivers are subject to large errors in measuring satellite ranges. Heuristic methods are commonly used to estimate these errors and reduce the impact of noisy measurements on localization accuracy. In our work, we replace these error estimation heuristics with a deep learning model based on Graph Neural Networks. Additionally, by analyzing the cost function of the multilateration process, we derive an optimal method to utilize the estimated errors. Our approach guarantees that the multilateration converges to the receiver's location as the error estimation accuracy increases. We evaluate our solution on a real-world dataset containing more than 100k GNSS epochs, collected from multiple cities with diverse characteristics. The empirical results show improvements from 40% to 80% in the horizontal localization error against recent deep learning baselines as well as classical localization approaches.

[43] 2402.18677

Fault Tolerant Neural Control Barrier Functions for Robotic Systems under Sensor Faults and Attacks

Safety is a fundamental requirement of many robotic systems. Control barrier function (CBF)-based approaches have been proposed to guarantee the safety of robotic systems. However, the effectiveness of these approaches highly relies on the choice of CBFs. Inspired by the universal approximation power of neural networks, there is a growing trend toward representing CBFs using neural networks, leading to the notion of neural CBFs (NCBFs). Current NCBFs, however, are trained and deployed in benign environments, making them ineffective for scenarios where robotic systems experience sensor faults and attacks. In this paper, we study safety-critical control synthesis for robotic systems under sensor faults and attacks. Our main contribution is the development and synthesis of a new class of CBFs that we term fault tolerant neural control barrier function (FT-NCBF). We derive the necessary and sufficient conditions for FT-NCBFs to guarantee safety, and develop a data-driven method to learn FT-NCBFs by minimizing a loss function constructed using the derived conditions. Using the learned FT-NCBF, we synthesize a control input and formally prove the safety guarantee provided by our approach. We demonstrate our proposed approach using two case studies: obstacle avoidance problem for an autonomous mobile robot and spacecraft rendezvous problem, with code available via

[44] 2402.18683

Integrated Sensing and Communication Meets Smart Propagation Engineering: Opportunities and Challenges

Both smart propagation engineering as well as integrated sensing and communication (ISAC) constitute promising candidates for next-generation (NG) mobile networks. We provide a synergistic view of these technologies, and explore their mutual benefits. First, moving beyond just intelligent surfaces, we provide a holistic view of the engineering aspects of smart propagation environments. By delving into the fundamental characteristics of intelligent surfaces, fluid antennas, and unmanned aerial vehicles, we reveal that more efficient control of the pathloss and fading can be achieved, thus facilitating intrinsic integration and mutual assistance between sensing and communication functionalities. In turn, with the exploitation of the sensing capabilities of ISAC to orchestrate the efficient configuration of radio environments, both the computational effort and signaling overheads can be reduced. We present indicative simulation results, which verify that cooperative smart propagation environment design significantly enhances the ISAC performance. Finally, some promising directions are outlined for combining ISAC with smart propagation engineering.

[45] 2402.18775

How to Evaluate Human-likeness of Interaction-aware Driver Models

This study proposes a method for qualitatively evaluating and designing human-like driver models for autonomous vehicles. While most existing research on human-likeness has been focused on quantitative evaluation, it is crucial to consider qualitative measures to accurately capture human perception. To this end, we conducted surveys utilizing both video study and human experience-based study. The findings of this research can significantly contribute to the development of naturalistic and human-like driver models for autonomous vehicles, enabling them to safely and efficiently coexist with human-driven vehicles in diverse driving scenarios.

[46] 2402.18781

Conjectural Online Learning with First-order Beliefs in Asymmetric Information Stochastic Games

Stochastic games arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures, where information asymmetry presents challenges for decision-making entities (players). Existing computational methods for asymmetric information stochastic games (AISG) are primarily offline, targeting special classes of AISGs to avoid belief hierarchies, and lack online adaptability to deviations from equilibrium. To address this limitation, we propose a conjectural online learning (COL), a learning scheme for generic AISGs. COL, structured as a forecaster-actor-critic (FAC) architecture, utilizes first-order beliefs over the hidden states and subjective forecasts of the opponent's strategies. Against the conjectured opponent, COL updates strategies in an actor-critic approach using online rollout and calibrates conjectures through Bayesian learning. We prove that conjecture in COL is asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. The resulting empirical strategy profile converges to the Berk-Nash equilibrium, a solution concept characterizing rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's superiority over state-of-the-art reinforcement learning methods against nonstationary attacks.

[47] 2402.18847

Flexible Precoding for Multi-User Movable Antenna Communications

This letter rethinks traditional precoding in multi-user wireless communications with movable antennas (MAs). Utilizing MAs for optimal antenna positioning, we introduce a sparse optimization (SO)-based approach focusing on regularized zero-forcing (RZF). This framework targets the optimization of antenna positions and the precoding matrix to minimize inter-user interference and transmit power. We propose an off-grid regularized least squares-based orthogonal matching pursuit (RLS-OMP) method for this purpose. Moreover, we provide deeper insights into antenna position optimization using RLS-OMP, viewed from a subspace projection angle. Overall, our proposed flexible precoding scheme demonstrates a sum rate that exceeds more than twice that of fixed antenna positions.

[48] 2402.18859

Taking Second-life Batteries from Exhausted to Empowered using Experiments, Data Analysis, and Health Estimation

The reuse of retired electric vehicle (EV) batteries in electric grid energy storage emerges as a promising strategy to address environmental concerns and boost economic value. This study concentrates on devising health monitoring algorithms for retired batteries (BMS$_2$) deployed in grid storage applications. Over 15 months of testing, we compile, analyze, and publicly share a dataset of second-life (SL) batteries, implementing a cycling protocol simulating grid energy storage load profiles within a 3 V-4 V voltage window. Four machine learning-based health estimation models, relying on BMS$_2$ features and initial capacity, are developed and compared, with the selected model achieving a Mean Absolute Percentage Error (MAPE) below 2.3% on test data. Additionally, an adaptive online health estimation algorithm is proposed by integrating a clustering-based method, limiting estimation errors during online deployment. These results constitute an initial proof of concept, showcasing the feasibility of repurposing retired batteries for second-life applications. Based on obtained data and representative power demand, these SL batteries exhibit the potential, under specific conditions, for over a decade of grid energy storage use.

[49] 2402.18923

Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition

Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. According to the newly designed task, we label pause locations at the text level and their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independent of ASR performance. Our experiments show that the proposed method better detects inappropriate pauses in dysarthric speech than baselines. (Inappropriate Pause Error Rate: 14.47%)

[50] 2402.18936

Energy-Efficient UAV Swarm Assisted MEC with Dynamic Clustering and Scheduling

In this paper, the energy-efficient unmanned aerial vehicle (UAV) swarm assisted mobile edge computing (MEC) with dynamic clustering and scheduling is studied. In the considered system model, UAVs are divided into multiple swarms, with each swarm consisting of a leader UAV and several follower UAVs to provide computing services to end-users. Unlike existing work, we allow UAVs to dynamically cluster into different swarms, i.e., each follower UAV can change its leader based on the time-varying spatial positions, updated application placement, etc. in a dynamic manner. Meanwhile, UAVs are required to dynamically schedule their energy replenishment, application placement, trajectory planning and task delegation. With the aim of maximizing the long-term energy efficiency of the UAV swarm assisted MEC system, a joint optimization problem of dynamic clustering and scheduling is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we further reformulate this optimization problem as a combination of a series of strongly coupled multi-agent stochastic games, and then propose a novel reinforcement learning-based UAV swarm dynamic coordination (RLDC) algorithm for obtaining the equilibrium. Simulations are conducted to evaluate the performance of the RLDC algorithm and demonstrate its superiority over counterparts.

[51] 2402.18946

Real-Time Adaptive Safety-Critical Control with Gaussian Processes in High-Order Uncertain Models

This paper presents an adaptive online learning framework for systems with uncertain parameters to ensure safety-critical control in non-stationary environments. Our approach consists of two phases. The initial phase is centered on a novel sparse Gaussian process (GP) framework. We first integrate a forgetting factor to refine a variational sparse GP algorithm, thus enhancing its adaptability. Subsequently, the hyperparameters of the Gaussian model are trained with a specially compound kernel, and the Gaussian model's online inferential capability and computational efficiency are strengthened by updating a solitary inducing point derived from new samples, in conjunction with the learned hyperparameters. In the second phase, we propose a safety filter based on high-order control barrier functions (HOCBFs), synergized with the previously trained learning model. By leveraging the compound kernel from the first phase, we effectively address the inherent limitations of GPs in handling high-dimensional problems for real-time applications. The derived controller ensures a rigorous lower bound on the probability of satisfying the safety specification. Finally, the efficacy of our proposed algorithm is demonstrated through real-time obstacle avoidance experiments executed using both a simulation platform and a real-world 7-DOF robot.

[52] 2402.19004

RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

The development of high-resolution remote sensing satellites has provided great convenience for research work related to remote sensing. Segmentation and extraction of specific targets are essential tasks when facing the vast and complex remote sensing images. Recently, the introduction of Segment Anything Model (SAM) provides a universal pre-training model for image segmentation tasks. While the direct application of SAM to remote sensing image segmentation tasks does not yield satisfactory results, we propose RSAM-Seg, which stands for Remote Sensing SAM with Semantic Segmentation, as a tailored modification of SAM for the remote sensing field and eliminates the need for manual intervention to provide prompts. Adapter-Scale, a set of supplementary scaling modules, are proposed in the multi-head attention blocks of the encoder part of SAM. Furthermore, Adapter-Feature are inserted between the Vision Transformer (ViT) blocks. These modules aim to incorporate high-frequency image information and image embedding features to generate image-informed prompts. Experiments are conducted on four distinct remote sensing scenarios, encompassing cloud detection, field monitoring, building detection and road mapping tasks . The experimental results not only showcase the improvement over the original SAM and U-Net across cloud, buildings, fields and roads scenarios, but also highlight the capacity of RSAM-Seg to discern absent areas within the ground truth of certain datasets, affirming its potential as an auxiliary annotation method. In addition, the performance in few-shot scenarios is commendable, underscores its potential in dealing with limited datasets.

[53] 2402.19041

Atmospheric Turbulence Removal with Video Sequence Deep Visual Priors

Atmospheric turbulence poses a challenge for the interpretation and visual perception of visual imagery due to its distortion effects. Model-based approaches have been used to address this, but such methods often suffer from artefacts associated with moving content. Conversely, deep learning based methods are dependent on large and diverse datasets that may not effectively represent any specific content. In this paper, we address these problems with a self-supervised learning method that does not require ground truth. The proposed method is not dependent on any dataset outside of the single data sequence being processed but is also able to improve the quality of any input raw sequences or pre-processed sequences. Specifically, our method is based on an accelerated Deep Image Prior (DIP), but integrates temporal information using pixel shuffling and a temporal sliding window. This efficiently learns spatio-temporal priors leading to a system that effectively mitigates atmospheric turbulence distortions. The experiments show that our method improves visual quality results qualitatively and quantitatively.

[54] 2402.19085

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

Alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax" -a compromise where enhancements in alignment within one objective (e.g.,harmlessness) can diminish performance in others (e.g.,helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue the prominence of grounding LLMs with evident preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving Pareto improvements in multi-objective alignment.

[55] 2402.19128

ARMCHAIR: integrated inverse reinforcement learning and model predictive control for human-robot collaboration

One of the key issues in human-robot collaboration is the development of computational models that allow robots to predict and adapt to human behavior. Much progress has been achieved in developing such models, as well as control techniques that address the autonomy problems of motion planning and decision-making in robotics. However, the integration of computational models of human behavior with such control techniques still poses a major challenge, resulting in a bottleneck for efficient collaborative human-robot teams. In this context, we present a novel architecture for human-robot collaboration: Adaptive Robot Motion for Collaboration with Humans using Adversarial Inverse Reinforcement learning (ARMCHAIR). Our solution leverages adversarial inverse reinforcement learning and model predictive control to compute optimal trajectories and decisions for a mobile multi-robot system that collaborates with a human in an exploration task. During the mission, ARMCHAIR operates without human intervention, autonomously identifying the necessity to support and acting accordingly. Our approach also explicitly addresses the network connectivity requirement of the human-robot team. Extensive simulation-based evaluations demonstrate that ARMCHAIR allows a group of robots to safely support a simulated human in an exploration scenario, preventing collisions and network disconnections, and improving the overall performance of the task.

[56] 2402.19176

Proximal Dogleg Opportunistic Majorization for Nonconvex and Nonsmooth Optimization

We consider minimizing a function consisting of a quadratic term and a proximable term which is possibly nonconvex and nonsmooth. This problem is also known as scaled proximal operator. Despite its simple form, existing methods suffer from slow convergence or high implementation complexity or both. To overcome these limitations, we develop a fast and user-friendly second-order proximal algorithm. Key innovation involves building and solving a series of opportunistically majorized problems along a hybrid Newton direction. The approach directly uses the precise Hessian of the quadratic term, and calculates the inverse only once, eliminating the iterative numerical approximation of the Hessian, a common practice in quasi-Newton methods. The algorithm's convergence to a critical point is established, and local convergence rate is derived based on the Kurdyka-Lojasiewicz property of the objective function. Numerical comparisons are conducted on well-known optimization problems. The results demonstrate that the proposed algorithm not only achieves a faster convergence but also tends to converge to a better local optimum compare to benchmark algorithms.

[57] 2402.19290

Estimation and Deconvolution of Second Order Cyclostationary Signals

This method solves the dual problem of blind deconvolution and estimation of the time waveform of noisy second-order cyclo-stationary (CS2) signals that traverse a Transfer Function (TF) en route to a sensor. We have proven that the deconvolution filter exists and eliminates the TF effect from signals whose statistics vary over time. This method is blind, meaning it does not require prior knowledge about the signals or TF. Simulations demonstrate the algorithm high precision across various signal types, TFs, and Signal-to-Noise Ratios (SNRs). In this study, the CS2 signals family is restricted to the product of a deterministic periodic function and white noise. Furthermore, this method has the potential to improve the training of Machine Learning models where the aggregation of signals from identical systems but with different TFs is required.

[58] 2402.19315

On the Existence of Static Equilibria of a Cable-Suspended Load with Non-stopping Flying Carriers

Aerial cooperative robotic manipulation of cable-suspended objects has been largely studied as it allows handling large and heavy objects, and cables offer multiple advantages, such as their low weight and cost efficiency. Multirotors have been typically considered, which, however, can be unsuitable for long-lasting manipulation tasks due to their scarce endurance. Hence, this work investigates whether non-stop flights are possible for maintaining constant the pose of cable-suspended objects. First, we show that one or two flying carriers alone cannot perform non-stop flights while maintaining a constant pose of the suspended object. Instead, we demonstrate that \emph{three} flying carriers can achieve this task provided that the orientation of the load at the equilibrium is such that the components of the cable forces that balance the external force (typically gravity) do not belong to the plane of the cable anchoring points on the load. Numerical tests are presented in support of the analytical results.

[59] 2402.19325

Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes vector representations of the speakers in a conversation - attractors. Our analysis shows that, attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom allowing them to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution to guide the community to make more informed decisions when designing new systems.

[60] 2402.19333

Compact Speech Translation Models via Discrete Speech Units Pretraining

Using Self-Supervised Learning (SSL) as model initialization is now common to obtain strong results in Speech Translation (ST). However, they also impose a large memory footprint, hindering on-device deployment. In this paper, we leverage the SSL models by pretraining smaller models on their Discrete Speech Units (DSU). We pretrain encoder-decoder models on 1) Filterbank-to-DSU and 2) DSU-to-Translation data, and take the encoder from 1) and the decoder from 2) to initialise a new model, finetuning this on limited speech-translation data. The final model becomes compact by using the DSU pretraining to distil the knowledge of the SSL model. Our method has several benefits over using DSU as model inputs, such as shorter inference pipeline and robustness over (DSU) tokenization. In contrast to ASR pretraining, it does not require transcripts, making it applicable to low-resource settings. Evaluation on CoVoST-2 X-En shows that our method is >$0.5$ BLEU better than a ST model that directly finetune the SSL model, given only half the model size, and on a par with ASR pretraining.

[61] 2402.19345

Multi-frequency tracking via group-sparse optimal transport

In this work, we introduce an optimal transport framework for inferring power distributions over both spatial location and temporal frequency. Recently, it has been shown that optimal transport is a powerful tool for estimating spatial spectra that change smoothly over time. In this work, we consider the tracking of the spatio-temporal spectrum corresponding to a small number of moving broad-band signal sources. Typically, such tracking problems are addressed by treating the spatio-temporal power distribution in a frequency-by-frequency manner, allowing to use well-understood models for narrow-band signals. This however leads to decreased target resolution due to inefficient use of the available information. We propose an extension of the optimal transport framework that exploits information from several frequencies simultaneously by estimating a spatio-temporal distribution penalized by a group-sparsity regularizer. This approach finds a spatial spectrum that changes smoothly over time, and at each time instance has a small support that is similar across frequencies. To the best of the authors knowledge, this is the first formulation combining optimal transport and sparsity for solving inverse problems. As is shown on simulated and real data, our method can successfully track targets in scenarios where information from separate frequency bands alone is insufficient.

[62] 2402.19355

Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectures. Additionally, we introduce a method for identifying the victim model on which the adversarial attack is carried out. To achieve this, we generate a new dataset containing multiple attacks performed against various victim models. We achieve an AUC of 0.982 for attack detection, with no more than a 0.03 drop in performance for unknown attacks. Our attack classification accuracy (excluding benign) reaches 86.48% across eight attack types using our LightResNet34 architecture, while our victim model classification accuracy reaches 72.28% across four victim models.

[63] 2402.19360

Joint Chance Constrained Optimal Control via Linear Programming

We establish a linear programming formulation for the solution of joint chance constrained optimal control problems over finite time horizons. The joint chance constraint may represent an invariance, reachability or reach-avoid specification that the trajectory must satisfy with a predefined probability. Compared to the existing literature, the formulation is computationally tractable and the solution exact.

[64] 2402.19410

Genie: Smart ROS-based Caching for Connected Autonomous Robots

Despite the promising future of autonomous robots, several key issues currently remain that can lead to compromised performance and safety. One such issue is latency, where we find that even the latest embedded platforms from NVIDIA fail to execute intelligence tasks (e.g., object detection) of autonomous vehicles in a real-time fashion. One remedy to this problem is the promising paradigm of edge computing. Through collaboration with our industry partner, we identify key prohibitive limitations of the current edge mindset: (1) servers are not distributed enough and thus, are not close enough to vehicles, (2) current proposed edge solutions do not provide substantially better performance and extra information specific to autonomous vehicles to warrant their cost to the user, and (3) the state-of-the-art solutions are not compatible with popular frameworks used in autonomous systems, particularly the Robot Operating System (ROS). To remedy these issues, we provide Genie, an encapsulation technique that can enable transparent caching in ROS in a non-intrusive way (i.e., without modifying the source code), can build the cache in a distributed manner (in contrast to traditional central caching methods), and can construct a collective three-dimensional object map to provide substantially better latency (even on low-power edge servers) and higher quality data to all vehicles in a certain locality. We fully implement our design on state-of-the-art industry-adopted embedded and edge platforms, using the prominent autonomous driving software Autoware, and find that Genie can enhance the latency of Autoware Vision Detector by 82% on average, enable object reusability 31% of the time on average and as much as 67% for the incoming requests, and boost the confidence in its object map considerably over time.

[65] 2402.19416

Vision-Radio Experimental Infrastructure Architecture Towards 6G

Telecommunications and computer vision have evolved separately so far. Yet, with the shift to sub-terahertz (sub-THz) and terahertz (THz) radio communications, there is an opportunity to explore computer vision technologies together with radio communications, considering the dependency of both technologies on Line of Sight. The combination of radio sensing and computer vision can address challenges such as obstructions and poor lighting. Also, machine learning algorithms, capable of processing multimodal data, play a crucial role in deriving insights from raw and low-level sensing data, offering a new level of abstraction that can enhance various applications and use cases such as beamforming and terminal handovers. This paper introduces CONVERGE, a pioneering vision-radio paradigm that bridges this gap by leveraging Integrated Sensing and Communication (ISAC) to facilitate a dual "View-to-Communicate, Communicate-to-View" approach. CONVERGE offers tools that merge wireless communications and computer vision, establishing a novel Research Infrastructure (RI) that will be open to the scientific community and capable of providing open datasets. This new infrastructure will support future research in 6G and beyond concerning multiple verticals, such as telecommunications, automotive, manufacturing, media, and health.

[66] 2402.19434

Digital Twin Aided Massive MIMO: CSI Compression and Feedback

Deep learning (DL) approaches have demonstrated high performance in compressing and reconstructing the channel state information (CSI) and reducing the CSI feedback overhead in massive MIMO systems. One key challenge, however, with the DL approaches is the demand for extensive training data. Collecting this real-world CSI data incurs significant overhead that hinders the DL approaches from scaling to a large number of communication sites. To address this challenge, we propose a novel direction that utilizes site-specific \textit{digital twins} to aid the training of DL models. The proposed digital twin approach generates site-specific synthetic CSI data from the EM 3D model and ray tracing, which can then be used to train the DL model without real-world data collection. To further improve the performance, we adopt online data selection to refine the DL model training with a small real-world CSI dataset. Results show that a DL model trained solely on the digital twin data can achieve high performance when tested in a real-world deployment. Further, leveraging domain adaptation techniques, the proposed approach requires orders of magnitude less real-world data to approach the same performance of the model trained completely on a real-world CSI dataset.

[67] 2402.19443

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for acoustic modeling, now integrating deep neural network architectures. However, these performance gains have translated into increased complexity regarding the information learned and conveyed through these black-box architectures. Following many researches in neural networks interpretability, we propose in this article a protocol that aims to determine which and where information is located in an ASR acoustic model (AM). To do so, we propose to evaluate AM performance on a determined set of tasks using intermediate representations (here, at different layer levels). Regarding the performance variation and targeted tasks, we can emit hypothesis about which information is enhanced or perturbed at different architecture steps. Experiments are performed on both speaker verification, acoustic environment classification, gender classification, tempo-distortion detection systems and speech sentiment/emotion identification. Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment or speaker identity. The low-level hidden layers globally appears useful for the structuring of information while the upper ones would tend to delete useless information for phoneme recognition.

[68] 2402.19455

Listening to the Noise: Blind Denoising with Gibbs Diffusion

In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.