We study the Compressed Sensing (CS) problem, which is the problem of finding the most sparse vector that satisfies a set of linear measurements up to some numerical tolerance. CS is a central problem in Statistics, Operations Research and Machine Learning which arises in applications such as signal processing, data compression and image reconstruction. We introduce an $\ell_2$ regularized formulation of CS which we reformulate as a mixed integer second order cone program. We derive a second order cone relaxation of this problem and show that under mild conditions on the regularization parameter, the resulting relaxation is equivalent to the well studied basis pursuit denoising problem. We present a semidefinite relaxation that strengthens the second order cone relaxation and develop a custom branch-and-bound algorithm that leverages our second order cone relaxation to solve instances of CS to certifiable optimality. Our numerical results show that our approach produces solutions that are on average $6.22\%$ more sparse than solutions returned by state of the art benchmark methods on synthetic data in minutes. On real world ECG data, for a given $\ell_2$ reconstruction error our approach produces solutions that are on average $9.95\%$ more sparse than benchmark methods, while for a given sparsity level our approach produces solutions that have on average $10.77\%$ lower reconstruction error than benchmark methods in minutes.
Modulation classification is an essential step of signal processing and has been regularly applied in the field of tele-communication. Since variations of frequency with respect to time remains a vital distinction among radio signals having different modulation formats, these variations can be used for feature extraction by converting 1-D radio signals into frequency domain. In this paper, we propose a scheme for Automatic Modulation Classification (AMC) using modern architectures of Convolutional Neural Networks (CNN), through generating spectrum images of eleven different modulation types. Additionally, we perform resolution transformation of spectrograms that results up to 99.61% of computational load reduction and 8x faster conversion from the received I/Q data. This proposed AMC is implemented on CPU and GPU, to recognize digital as well as analogue signal modulation schemes on signals. The performance is evaluated on existing CNN models including SqueezeNet, Resnet-50, InceptionResnet-V2, Inception-V3, VGG-16 and Densenet-201. Best results of 91.2% are achieved in presence of AWGN and other noise impairments in the signals, stating that the transformed spectrogram-based AMC has good classification accuracy as the spectral features are highly discriminant, and CNN based models have capability to extract these high-dimensional features. The spectrograms were created under different SNRs ranging from 5 to 30db with a step size of 5db to observe the experimental results at various SNR levels. The proposed methodology is efficient to be applied in wireless communication networks for real-time applications.
As machine learning becomes increasingly prevalent in critical fields such as healthcare, ensuring the safety and reliability of machine learning systems becomes paramount. A key component of reliability is the ability to estimate uncertainty, which enables the identification of areas of high and low confidence and helps to minimize the risk of error. In this study, we propose a machine learning pipeline called U-PASS tailored for clinical applications that incorporates uncertainty estimation at every stage of the process, including data acquisition, training, and model deployment. The training process is divided into a supervised pre-training step and a semi-supervised finetuning step. We apply our uncertainty-guided deep learning pipeline to the challenging problem of sleep staging and demonstrate that it systematically improves performance at every stage. By optimizing the training dataset, actively seeking informative samples, and deferring the most uncertain samples to an expert, we achieve an expert-level accuracy of 85% on a challenging clinical dataset of elderly sleep apnea patients, representing a significant improvement over the baseline accuracy of 75%. U-PASS represents a promising approach to incorporating uncertainty estimation into machine learning pipelines, thereby improving their reliability and unlocking their potential in clinical settings.
Positron emission tomography (PET) is an important functional medical imaging technique often used in the evaluation of certain brain disorders, whose reconstruction problem is ill-posed. The vast majority of reconstruction methods in PET imaging, both iterative and deep learning, return a single estimate without quantifying the associated uncertainty. Due to ill-posedness and noise, a single solution can be misleading or inaccurate. Thus, providing a measure of uncertainty in PET image reconstruction can help medical practitioners in making critical decisions. This paper proposes a deep learning-based method for uncertainty quantification in PET image reconstruction via posterior sampling. The method is based on training a conditional generative adversarial network whose generator approximates sampling from the posterior in Bayesian inversion. The generator is conditioned on reconstruction from a low-dose PET scan obtained by a conventional reconstruction method and a high-quality magnetic resonance image and learned to estimate a corresponding standard-dose PET scan reconstruction. We show that the proposed model generates high-quality posterior samples and yields physically-meaningful uncertainty estimates.
We propose to formulate point cloud extraction from ultrasound volumes as an image segmentation problem. Through this convenient formulation, a quick prototype exploring various variants of the U-Net architecture was developed and evaluated. This report documents the experimental results compiled using a training dataset of 5 labelled ultrasound volumes and 84 unlabelled volumes that got completed in a two-week period as part of a challenge submission to an open challenge entitled ``Deep Learning in Ultrasound Image Analysis''. Source code is shared with the research community at this GitHub URL \url{https://github.com/lisatwyw/smrvis}.
Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes and discusses convergence guarantees for stochastic natural thresholding algorithms by extending the NT from the deterministic version with linear measurements to the stochastic version with a general objective function. We also conduct various numerical experiments on linear and nonlinear measurements to demonstrate the performance of StoNT.
Computational modeling of Multiresolution- Fractional Brownian motion (fBm) has been effective in stochastic multiscale fractal texture feature extraction and machine learning of abnormal brain tissue segmentation. Further, deep multiresolution methods have been used for pixel-wise brain tissue segmentation. Robust tissue segmentation and volumetric measurement may provide more objective quantification of disease burden and offer improved tracking of treatment response for the disease. However, we posit that computational modeling of deep multiresolution fractal texture features may offer elegant feature learning. Consequently, this work proposes novel modeling of Multiresolution Fractal Deep Neural Network (MFDNN) and its computational implementation that mathematically combines a multiresolution fBm model and deep multiresolution analysis. The proposed full 3D MFDNN model offers the desirable properties of estimating multiresolution stochastic texture features by analyzing large amount of raw MRI image data for brain tumor segmentation. We apply the proposed MFDNN to estimate stochastic deep multiresolution fractal texture features for tumor tissues in brain MRI images. The MFDNN model is evaluated using 1251 patient cases for brain tumor segmentation using the most recent BRATS 2021 Challenges dataset. The evaluation of the proposed model using Dice overlap score, Husdorff distance and associated uncertainty estimation offers either better or comparable performances in abnormal brain tissue segmentation when compared to the state-of-the-art methods in the literature. Index Terms: Computational Modeling, Multiresolution Fractional Brownian Motion (fBm), Deep Multiresolution Analysis, Fractal Dimension (FD), Texture Features, Brain tumor segmentation, Deep Learning.
Presenting whole slide images (WSIs) as graph will enable a more efficient and accurate learning framework for cancer diagnosis. Due to the fact that a single WSI consists of billions of pixels and there is a lack of vast annotated datasets required for computational pathology, the problem of learning from WSIs using typical deep learning approaches such as convolutional neural network (CNN) is challenging. Additionally, WSIs down-sampling may lead to the loss of data that is essential for cancer detection. A novel two-stage learning technique is presented in this work. Since context, such as topological features in the tumor surroundings, may hold important information for cancer grading and diagnosis, a graph representation capturing all dependencies among regions in the WSI is very intuitive. Graph convolutional network (GCN) is deployed to include context from the tumor and adjacent tissues, and self-supervised learning is used to enhance training through unlabeled data. More specifically, the entire slide is presented as a graph, where the nodes correspond to the patches from the WSI. The proposed framework is then tested using WSIs from prostate and kidney cancers. To assess the performance improvement through self-supervised mechanism, the proposed context-aware model is tested with and without use of pre-trained self-supervised layer. The overall model is also compared with multi-instance learning (MIL) based and other existing approaches.
Adverse road conditions can cause vehicle yaw instability and loss of traction. To compensate for the instability under such conditions, corrective actions must be taken. In comparison to a mechanical differential, an electronic differential can independently control the two drive wheels and provide means of generating more effective corrective actions. As a solution for traction and stability issues in automobiles, this paper has developed a controller for a vehicle electronic differential consisting of two program-controlled rear motors. The control algorithm adjusts to changing road conditions. Traction was controlled using a motor reaction torque observer-based slip ratio estimation, and yaw stability was achieved by tracking a reference yaw rate calculated using estimated tyre cornering stiffnesses. A recursive least squares algorithm was used to estimate cornering stiffness. The yaw rate of the vehicle, as well as its longitudinal and lateral accelerations, were measured, and the body slip angle was estimated using an observer. A fuzzy inference system was used to integrate the independently developed traction control and yaw control schemes. The fuzzy inference system modifies the commanded voltage generated by the driver's input to account for the traction and yaw stability controller outputs. A vehicle simulator was used to numerically simulate the integrated controller.
We present JDLL, an agile Java library that offers a comprehensive toolset/API to unify the development of high-end applications of DL for bioimage analysis and to streamline their installation and maintenance. JDLL provides all the functions required to consume DL models seamlessly, without being burdened by the configuration of the Python-based DL frameworks, within Java bioimage informatics platforms. Moreover, it allows the deployment of pre-trained models in the Bioimage Model Zoo (BMZ) by shipping the logic to connect to the BMZ website, download and run a selected model inference.
This work presents a novel methodology for identification of the most critical cyberattacks that may disrupt the operation of an industrial system. Application of the proposed framework can enable the design and development of advanced cybersecurity functionality for a wide range of industrial applications. The attacks are assessed taking into direct consideration how they impact the system operation as measured by a defined Key Performance Indicator (KPI). A simulation, or Digital Twin (DT), of the industrial process is employed for calculation of the KPI based on operating conditions. Such DT is augmented with a layer of information describing the computer network topology, connected devices, and potential actions an adversary can take associated to each device or network link. Each possible action is associated with an abstract measure of effort, which is interpreted as a cost. It is assumed that the adversary has a corresponding budget that constrains the selection of the sequence of actions defining the progression of the attack. A dynamical system comprising a set of states associated to the cyberattack (cyber states) and transition logic for updating their values is also proposed. The resulting Augmented Digital Twin (ADT) is then employed in a sequential decision-making optimization formulated to yield the most critical attack scenarios as measured by the defined KPI. The methodology is successfully tested based on an electrical power distribution system simulation.
Optimal control schemes have achieved remarkable performance in numerous engineering applications. However, they typically require high computational cost, which has limited their use in real-world engineering systems with fast dynamics and/or limited computation power. To address this challenge, Neighboring Extremal (NE) has been developed as an efficient optimal adaption strategy to adapt a pre-computed nominal control solution to perturbations from the nominal trajectory. The resulting control law is a time-varying feedback gain that can be pre-computed along with the original optimal control problem, and it takes negligible online computation. However, existing NE frameworks only deal with state perturbations while in modern applications, optimal controllers (e.g., predictive controllers) frequently incorporate preview information. Therefore, a new NE framework is needed to adapt to such preview perturbations. In this work, an extended NE (ENE) framework is developed to systematically adapt the nominal control to both state and preview perturbations. We show that the derived ENE law is two time-varying feedback gains on the state perturbation and the preview perturbation. We also develop schemes to handle nominal non-optimal solutions and large perturbations to retain optimal performance and constraint satisfaction. Case study on nonlinear model predictive control is presented due to its popularity but it can be easily extended to other optimal control schemes. Promising simulation results on the cart inverted pendulum problem demonstrate the efficacy of the ENE algorithm.
There has been recent interest in leveraging federated learning (FL) for radio signal classification tasks. In FL, model parameters are periodically communicated from participating devices, which train on local datasets, to a central server which aggregates them into a global model. While FL has privacy/security advantages due to raw data not leaving the devices, it is still susceptible to adversarial attacks. In this work, we first reveal the susceptibility of FL-based signal classifiers to model poisoning attacks, which compromise the training process despite not observing data transmissions. In this capacity, we develop an attack framework that significantly degrades the training process of the global model. Our attack framework induces a more potent model poisoning attack to the global classifier than existing baselines while also being able to compromise existing server-driven defenses. In response to this gap, we develop Underlying Server Defense of Federated Learning (USD-FL), a novel defense methodology for FL-based signal classifiers. We subsequently compare the defensive efficacy, runtimes, and false positive detection rates of USD-FL relative to existing server-driven defenses, showing that USD-FL has notable advantages over the baseline defenses in all three areas.
Deep neural networks have been widely used in medical image analysis and medical image segmentation is one of the most important tasks. U-shaped neural networks with encoder-decoder are prevailing and have succeeded greatly in various segmentation tasks. While CNNs treat an image as a grid of pixels in Euclidean space and Transformers recognize an image as a sequence of patches, graph-based representation is more generalized and can construct connections for each part of an image. In this paper, we propose a novel ViG-UNet, a graph neural network-based U-shaped architecture with the encoder, the decoder, the bottleneck, and skip connections. The downsampling and upsampling modules are also carefully designed. The experimental results on ISIC 2016, ISIC 2017 and Kvasir-SEG datasets demonstrate that our proposed architecture outperforms most existing classic and state-of-the-art U-shaped networks.
Road departure detection systems (RDDSs) for eliminating unintentional road departure collisions have been developed and equipped on some commercial vehicles in recent years. In order to provide a standardized and objective performance evaluation of RDDSs without the affections of systems complex nature of RDDSs and the design requirements, this paper proposes the development of the scoring method for evaluating vehicle RDDSs. Both flat road edge and vertical road edge are considered in the proposed scoring method, which combines two key variables: 1) the lateral distance of vehicle from road edge when RDW triggers; 2) the lateral distance of vehicle from road edge when RKA triggers. Two main criteria of road departure warning (RDW) and Road Keeping Assistance (RKA) are used to describe the performance of RDDSs.
Many modern autonomous systems, particularly multi-agent systems, are time-critical and need to be robust against timing uncertainties. Previous works have studied left and right time robustness of signal temporal logic specifications by considering time shifts in the predicates that are either only to the left or only to the right. We propose a combined notion of temporal robustness which simultaneously considers left and right time shifts. For instance, in a scenario where a robot plans a trajectory around a pedestrian, this combined notion can now capture uncertainty of the pedestrian arriving earlier or later than anticipated. We first derive desirable properties of this new notion with respect to left and right time shifts and then design control laws for linear systems that maximize temporal robustness using mixed-integer linear programming. Finally, we present two case studies to illustrate how the proposed temporal robustness accounts for timing uncertainties.
3D speech enhancement has attracted much attention in recent years with the development of augmented reality technology. Traditional denoising convolutional autoencoders have limitations in extracting dynamic voice information. In this paper, we propose a two-stage autoencoder neural network for 3D speech enhancement. We incorporate a dual-path recurrent neural network block into the convolutional autoencoder to iteratively apply time-domain and frequency-domain modeling in an alternate fashion. And an attention mechanism for fusing the high-dimension features is proposed. We also introduce a loss function to simultaneously optimize the network in the time-frequency and time domains. Experimental results show that our system outperforms the state-of-the-art systems on the dataset of ICASSP L3DAS23 challenge.
We propose to use a liquid time constant (LTC) network to predict the future blockage status of a millimeter wave (mmWave) link using only the received signal power as the input to the system. The LTC network is based on an ordinary differential equation (ODE) system inspired by biology and specialized for near-future prediction for time sequence observation as the input. Using an experimental dataset at 60 GHz, we show that our proposed use of LTC can reliably predict the occurrence of blockage and the length of the blockage without the need for scenario-specific data. The results show that the proposed LTC can predict with upwards of 97.85\% accuracy without prior knowledge of the outdoor scenario or retraining/tuning. These results highlight the promising gains of using LTC networks to predict time series-dependent signals, which can lead to more reliable and low-latency communication.
The goal of DCASE 2023 Challenge Task 7 is to generate various sound clips for Foley sound synthesis (FSS) by "category-to-sound" approach. "Category" is expressed by a single index while corresponding "sound" covers diverse and different sound examples. To generate diverse sounds for a given category, we adopt VITS, a text-to-speech (TTS) model with variational inference. In addition, we apply various techniques from speech synthesis including PhaseAug and Avocodo. Different from TTS models which generate short pronunciation from phonemes and speaker identity, the category-to-sound problem requires generating diverse sounds just from a category index. To compensate for the difference while maintaining consistency within each audio clip, we heavily modified the prior encoder to enhance consistency with posterior latent variables. This introduced additional Gaussian on the prior encoder which promotes variance within the category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which generates diverse high-quality sounds.
This paper presents a novel Sequence-to-Sequence (Seq2Seq) model based on a transformer-based attention mechanism and temporal pooling for Non-Intrusive Load Monitoring (NILM) of smart buildings. The paper aims to improve the accuracy of NILM by using a deep learning-based method. The proposed method uses a Seq2Seq model with a transformer-based attention mechanism to capture the long-term dependencies of NILM data. Additionally, temporal pooling is used to improve the model's accuracy by capturing both the steady-state and transient behavior of appliances. The paper evaluates the proposed method on a publicly available dataset and compares the results with other state-of-the-art NILM techniques. The results demonstrate that the proposed method outperforms the existing methods in terms of both accuracy and computational efficiency.
Demand-side management now encompasses more residential loads. To efficiently apply demand response strategies, it's essential to periodically observe the contribution of various domestic appliances to total energy consumption. Non-intrusive load monitoring (NILM), also known as load disaggregation, is a method for decomposing the total energy consumption profile into individual appliance load profiles within the household. It has multiple applications in demand-side management, energy consumption monitoring, and analysis. Various methods, including machine learning and deep learning, have been used to implement and improve NILM algorithms. This paper reviews some recent NILM methods based on deep learning and introduces the most accurate methods for residential loads. It summarizes public databases for NILM evaluation and compares methods using standard performance metrics.
This work aims for transferring a Transformer-based image compression codec from human vision to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, we propose an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the codec to various machine tasks and outshining the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task.
Neuromorphic sampling is a paradigm shift in analog-to-digital conversion where the acquisition strategy is opportunistic and measurements are recorded only when there is a significant change in the signal. Neuromorphic sampling has given rise to a new class of event-based sensors called dynamic vision sensors or neuromorphic cameras. The neuromorphic sampling mechanism utilizes low power and provides high-dynamic range sensing with low latency and high temporal resolution. The measurements are sparse and have low redundancy making it convenient for downstream tasks. In this paper, we present a sampling-theoretic perspective to neuromorphic sensing of continuous-time signals. We establish a close connection between neuromorphic sampling and time-based sampling - where signals are encoded temporally. We analyse neuromorphic sampling of signals in shift-invariant spaces, in particular, bandlimited signals and polynomial splines. We present an iterative technique for perfect reconstruction subject to the events satisfying a density criterion. We also provide necessary and sufficient conditions for perfect reconstruction. Owing to practical limitations in meeting the sufficient conditions for perfect reconstruction, we extend the analysis to approximate reconstruction from sparse events. In the latter setting, we pose signal reconstruction as a continuous-domain linear inverse problem whose solution can be obtained by solving an equivalent finite-dimensional convex optimization program using a variable-splitting approach. We demonstrate the performance of the proposed algorithm and validate our claims via experiments on synthetic signals.
High-bandwidth signals are needed in many applications like radar, sensing, measurement and communications. Especially in optical networks, the sampling rate and analog bandwidth of digital-to-analog converters (DACs) is a bottleneck for further increasing data rates. To circumvent the sampling rate and bandwidth problem of electronic DACs, we demonstrate the generation of wide-band signals with low-bandwidth electronics. This generation is based on orthogonal sampling with sinc-pulse sequences in N parallel branches. The method not only reduces the sampling rate and bandwidth, at the same time the effective number of bits (ENOB) is improved, dramatically reducing the requirements on the electronic signal processing. In proof of concept experiments the generation of analog signals, as well as Nyquist shaped and normal data will be shown. In simulations we investigate the performance of 60 GHz data generation by 20 and 12 GHz electronics. The method can easily be integrated together with already existing electronic DAC designs and would be of great interest for all high-bandwidth applications.
This paper investigates the safety guaranteed problem in spacecraft inspection missions, considering multiple position obstacles and logical attitude forbidden zones. In order to address this issue, we propose a control strategy based on control barrier functions, summarized as "safety check on kinematics" and "velocity tracking on dynamics" approach. The proposed approach employs control barrier functions to describe the obstacles and to generate safe velocities via the solution of a quadratic programming problem. Subsequently, we design a proportional-like controller based on the generated velocity, which, despite its simplicity, can ensure safety even in the presence of velocity tracking errors. The stability and safety of the system are rigorously analyzed in this paper. Furthermore, to account for model uncertainties and external disturbances, we incorporate an immersion and invariance-based disturbance observer in our design. Finally, numerical simulations are performed to demonstrate the effectiveness of the proposed control strategy.
This paper presents a modeling framework to optimize the two-dimensional placement of powertrain elements inside the vehicle, explicitly accounting for the rotation, relative placement and alignment. Specifically, we first capture the multi-level nature of the system mathematically, and construct a model that captures different powertrain component orientations. Second, we include the relative element placement as variables in the model and derive alignment constraints for both child components and parent subsystems to automatically connect mechanical ports. Finally, we showcase our framework on a four-wheel driven electric vehicle. Our results demonstrate that our framework is capable of efficiently generating system design solutions in a fully automated manner, only using basic component properties.
This paper considers a data detection problem in multiple-input multiple-output (MIMO) communication systems with hardware impairments. To address challenges posed by nonlinear and unknown distortion in received signals, two learning-based detection methods, referred to as model-driven and data-driven, are presented. The model-driven method employs a generalized Gaussian distortion model to approximate the conditional distribution of the distorted received signal. By using the outputs of coarse data detection as noisy training data, the model-driven method avoids the need for additional training overhead beyond traditional pilot overhead for channel estimation. An expectation-maximization algorithm is devised to accurately learn the parameters of the distortion model from noisy training data. To resolve a model mismatch problem in the model-driven method, the data-driven method employs a deep neural network (DNN) for approximating a-posteriori probabilities for each received signal. This method uses the outputs of the model-driven method as noisy labels and therefore does not require extra training overhead. To avoid the overfitting problem caused by noisy labels, a robust DNN training algorithm is devised, which involves a warm-up period, sample selection, and loss correction. Simulation results demonstrate that the two proposed methods outperform existing solutions with the same overhead under various hardware impairment scenarios.
Beamforming techniques utilized either at the transmitter or the receiver terminals have achieved superior quality-of-service performances from both the multi-antenna wireless communications systems, communications intelligence and radar target detection perspectives. Despite the overwhelming advantages in ideal operating conditions, beamforming approaches have been shown to face substantial performance degradations due to unknown mutual coupling effects and miscalibrated array elements. As a promising solution, blind beamformers have been proposed as a class of receiver beamformers that do not require a reference signal to operate. In this paper, a novel gradient-based blind beamformer is introduced with the aim of mitigating the deteriorating effects of unknown mutual coupling or miscalibration effects. The proposed approach is shown to find the optimal weights in different antenna array configurations in the presence of several unknown imperfections (e.g., mutual coupling effects, miscalibration effects due to gain and phase variations, inaccurate antenna positions). By providing numerical results related to the proposed algorithm for different array configurations, and bench-marking with the other existing approaches, the proposed scheme has been shown to achieve superior performance in many aspects. Additionally, a measurement-based analysis has been included with validation purposes.
Over the past decades, interest in enhancing the safety of cyber-physical systems (CPSs) has risen. The systems and control research society has recognised that the embedded closed-loop in integrated systems may be damaged if attackers can execute a successful malicious attack. This article examines the resilient control problem for CPSs with numerous transmission channels under Denial-of-Service (DoS). First, a partial observer technique is developed in response to the Multi-Channel DoS (MCDoS) condition. The changing frequency of MCDoS is characterized while maintaining the Global Asymptotic Stability (GAS) of the closed loop system. The partial observer is modified then to reduce the effect of the changing frequency of MCDoS in the system. Then a resilient event-based feedback control scheme is developed to address the Full-Scale DoS (FSDoS). We depict the changing frequency of MCDoS and the frequency and duration of FSDoS, allowing the feedback system's Global Asymptotic Stability (GAS) to be maintained. We regard event-based controllers for which a minimal inter-sample time is precisely formulated in response to the existence of digital channels.
Transmit antenna muting (TAM) in multiple-user multiple-input multiple-output (MU-MIMO) networks allows reducing the power consumption of the base station (BS) by properly utilizing only a subset of antennas in the BS. In this paper, we consider the downlink transmission of an MU-MIMO network where TAM is formulated to minimize the number of active antennas in the BS while guaranteeing the per-user throughput requirements. To address the computational complexity of the combinatorial optimization problem, we propose an algorithm called neural antenna muting (NAM) with an asymmetric custom loss function. NAM is a classification neural network trained in a supervised manner. The classification error in this scheme leads to either sub-optimal energy consumption or lower quality of service (QoS) for the communication link. We control the classification error probability distribution by designing an asymmetric loss function such that the erroneous classification outputs are more likely to result in fulfilling the QoS requirements. Furthermore, we present three heuristic algorithms and compare them with the NAM. Using a 3GPP compliant system-level simulator, we show that NAM achieves $\sim73\%$ energy saving compared to the full antenna configuration in the BS with $\sim95\%$ reliability in achieving the user throughput requirements while being around $1000\times$ and $24\times$ less computationally intensive than the greedy heuristic algorithm and the fixed column antenna muting algorithm, respectively.
Characteristics such as low contrast and significant organ shape variations are often exhibited in medical images. The improvement of segmentation performance in medical imaging is limited by the generally insufficient adaptive capabilities of existing attention mechanisms. An efficient Channel Prior Convolutional Attention (CPCA) method is proposed in this paper, supporting the dynamic distribution of attention weights in both channel and spatial dimensions. Spatial relationships are effectively extracted while preserving the channel prior by employing a multi-scale depth-wise convolutional module. The ability to focus on informative channels and important regions is possessed by CPCA. A segmentation network called CPCANet for medical image segmentation is proposed based on CPCA. CPCANet is validated on two publicly available datasets. Improved segmentation performance is achieved by CPCANet while requiring fewer computational resources through comparisons with state-of-the-art algorithms. Our code is publicly available at \url{https://github.com/Cuthbert-Huang/CPCANet}.
A new paradigm is proposed for the robustification of the LQG controller against distributional uncertainties on the noise process. Our controller optimizes the closed-loop performances in the worst possible scenario under the constraint that the noise distributional aberrance does not exceed a certain threshold limiting the relative entropy pseudo-distance between the actual noise distribution the nominal one. The main novelty is that the bounds on the distributional aberrance can be arbitrarily distributed along the whole disturbance trajectory. We discuss why this can, in principle, be a substantial advantage and we provide simulation results that substantiate such a principle.
Existing approaches to robust global asymptotic stabilization of a pair of antipodal points on unit $n$-sphere $\mathbb{S}^n$ typically involve the non-centrally synergistic hybrid controllers for attitude tracking on unit quaternion space. However, when switching faults occur due to parameter errors, the non-centrally synergistic property can lead to the unwinding problem or in some cases, destabilize the desired set. In this work, a hybrid controller is first proposed based on a novel centrally synergistic family of potential functions on $\mathbb{S}^n$, which is generated from a basic potential function through angular warping. The synergistic parameter can be explicitly expressed if the warping angle has a positive lower bound at the undesired critical points of the family. Next, the proposed approach induces a new quaternion-based controller for global attitude tracking. It has three advantageous features over existing synergistic designs: 1) it is consistent, i.e., free from the ambiguity of unit quaternion representation; 2) it is switching-fault-tolerant, i.e., the desired closed-loop equilibria remain asymptotically stable even when the switching mechanism does not work; 3) it relaxes the assumption on the parameter of the basic potential function in literature. Comprehensive simulation confirms the high robustness of the proposed centrally synergistic approach compared with existing non-centrally synergistic approaches.
Integrated sensing and communication (ISAC), with sensing and communication sharing the same wireless resources and hardware, has the advantages of high spectrum efficiency and low hardware cost, which is regarded as one of the key technologies of the fifth generation advanced (5G-A) and sixth generation (6G) mobile communication systems. ISAC has the potential to be applied in the intelligent applications requiring both communication and high accurate sensing capabilities. The fundamental challenges of ISAC system are the ISAC signal design and ISAC signal processing. However, the existing ISAC signal has low anti-noise capability. And the existing ISAC signal processing algorithms have the disadvantages of quantization errors and high complexity, resulting in large energy consumption. In this paper, phase coding is applied in ISAC signal design to improve the anti-noise performance of ISAC signal. Then, the effect of phase coding method on improving the sensing accuracy is analyzed. In order to improve the sensing accuracy with low-complexity algorithm, the iterative ISAC signal processing methods are proposed. The proposed methods improve the sensing accuracy with low computational complexity, realizing energy efficient ISAC signal processing. Taking the scenarios of short distance and long distance sensing into account, the iterative two-dimensional (2D) fast Fourier transform (FFT) and iterative cyclic cross-correlation (CC) methods are proposed, respectively, realizing high sensing accuracy and low computational complexity. Finally, the feasibility of the proposed ISAC signal processing methods are verified by simulation results.
Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence using the monotonic alignment of spoken content. Our proposed model consists of an encoder block to get audio and text embeddings, a projector block to project individual embeddings to a common latent space, and an audio-text aligner containing a novel DSP algorithm, which aligns the audio and text embeddings to determine if the spoken content is the same as the text. Experimental results show that our DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4 % and 28.9%, respectively.
To improve the performance in identifying the faults under strong noise for rotating machinery, this paper presents a dynamic feature reconstruction signal graph method, which plays the key role of the proposed end-to-end fault diagnosis model. Specifically, the original mechanical signal is first decomposed by wavelet packet decomposition (WPD) to obtain multiple subbands including coefficient matrix. Then, with originally defined two feature extraction factors MDD and DDD, a dynamic feature selection method based on L2 energy norm (DFSL) is proposed, which can dynamically select the feature coefficient matrix of WPD based on the difference in the distribution of norm energy, enabling each sub-signal to take adaptive signal reconstruction. Next the coefficient matrices of the optimal feature sub-bands are reconstructed and reorganized to obtain the feature signal graphs. Finally, deep features are extracted from the feature signal graphs by 2D-Convolutional neural network (2D-CNN). Experimental results on a public data platform of a bearing and our laboratory platform of robot grinding show that this method is better than the existing methods under different noise intensities.
A heart murmur is an atypical sound produced by the flow of blood through the heart. It can be a sign of a serious heart condition, so detecting heart murmurs is critical for identifying and managing cardiovascular diseases. However, current methods for identifying murmurous heart sounds do not fully utilize the valuable insights that can be gained by exploring intrinsic properties of heart sound signals. To address this issue, this study proposes a new discriminatory set of multiscale features based on the self-similarity and complexity properties of heart sounds, as derived in the wavelet domain. Self-similarity is characterized by assessing fractal behaviors, while complexity is explored by calculating wavelet entropy. We evaluated the diagnostic performance of these proposed features for detecting murmurs using a set of standard classifiers. When applied to a publicly available heart sound dataset, our proposed wavelet-based multiscale features achieved comparable performance to existing methods with fewer features. This suggests that self-similarity and complexity properties in heart sounds could be potential biomarkers for improving the accuracy of murmur detection.
Recognizing human activities from sensor data is a vital task in various domains, but obtaining diverse and labeled sensor data remains challenging and costly. In this paper, we propose an unsupervised statistical feature-guided diffusion model for sensor-based human activity recognition. The proposed method aims to generate synthetic time-series sensor data without relying on labeled data, addressing the scarcity and annotation difficulties associated with real-world sensor data. By conditioning the diffusion model on statistical information such as mean, standard deviation, Z-score, and skewness, we generate diverse and representative synthetic sensor data. We conducted experiments on public human activity recognition datasets and compared the proposed method to conventional oversampling methods and state-of-the-art generative adversarial network methods. The experimental results demonstrate that the proposed method can improve the performance of human activity recognition and outperform existing techniques.
This work presents a novel and promising approach to the clinical management of acute stroke. Using machine learning techniques, our research has succeeded in developing accurate diagnosis and prediction real-time models from hemodynamic data. These models are able to diagnose stroke subtype with 30 minutes of monitoring, to predict the exitus during the first 3 hours of monitoring, and to predict the stroke recurrence in just 15 minutes of monitoring. Patients with difficult access to a \acrshort{CT} scan, and all patients that arrive at the stroke unit of a specialized hospital will benefit from these positive results. The results obtained from the real-time developed models are the following: stroke diagnosis around $98\%$ precision ($97.8\%$ Sensitivity, $99.5\%$ Specificity), exitus prediction with $99.8\%$ precision ($99.8\%$ Sens., $99.9\%$ Spec.) and $98\%$ precision predicting stroke recurrence ($98\%$ Sens., $99\%$ Spec.).
Concentration of drivers on traffic is a vital safety issue; thus, monitoring a driver being on road becomes an essential requirement. The key purpose of supervision is to detect abnormal behaviours of the driver and promptly send warnings to him her for avoiding incidents related to traffic accidents. In this paper, to meet the requirement, based on radar sensors applications, the authors first use a small sized millimetre wave radar installed at the steering wheel of the vehicle to collect signals from different head movements of the driver. The received signals consist of the reflection patterns that change in response to the head movements of the driver. Then, in order to distinguish these different movements, a classifier based on the measured signal of the radar sensor is designed. However, since the collected data set is not large, in this paper, the authors propose One shot learning to classify four cases of driver's head movements. The experimental results indicate that the proposed method can classify the four types of cases according to the various head movements of the driver with a high accuracy reaching up to 100. In addition, the classification performance of the proposed method is significantly better than that of the convolutional neural network model.
The Fibonacci sequence (FS) possesses exceptional mathematical properties that have captivated mathematicians, scientists, and artists across centuries. Its intriguing nature lies in its profound connection to the golden ratio, as well as its prevalence in the natural world, exhibited through phenomena such as spiral galaxies, plant seeds, the arrangement of petals, and branching structures. This report delves into the fundamental characteristics of the FS, explores its relationship with the golden ratio using Linear Time Invariant (LTI) systems, and investigates its diverse applications in various fields. Approaching the topic from the standpoint of a digital signal processing instructor in a grade course, we depict the FS as the consequential outcome of an LTI system when subjected to the unit impulse function. This LTI system can be regarded as the original source from which one of the most renowned formulas in mathematics emerges, and its parametric definition, along with the associated systems, is intricately tied to the golden ratio, symbolized by the irrational number Phi. This perspective naturally elucidates the well-established intricate relationship between the FS and Phi. Furthermore, building upon this perspective, we showcase other LTI systems that exhibit the same magnitude in the frequency domain. These systems are characterized by either an impulse response or a difference equation, resulting in a comparable or equivalent FS in terms of absolute value. By exploring these connections, we shed light on the remarkable similarities and variations that arise within the FS under different LTI systems.
In this paper, we address the problem of Received Signal Strength map reconstruction based on location-dependent radio measurements and utilizing side knowledge about the local region; for example, city plan, terrain height, gateway position. Depending on the quantity of such prior side information, we employ Neural Architecture Search to find an optimized Neural Network model with the best architecture for each of the supposed settings. We demonstrate that using additional side information enhances the final accuracy of the Received Signal Strength map reconstruction on three datasets that correspond to three major cities, particularly in sub-areas near the gateways where larger variations of the average received signal power are typically observed.
Structural magnetic resonance imaging (sMRI) has shown great clinical value and has been widely used in deep learning (DL) based computer-aided brain disease diagnosis. Previous approaches focused on local shapes and textures in sMRI that may be significant only within a particular domain. The learned representations are likely to contain spurious information and have a poor generalization ability in other diseases and datasets. To facilitate capturing meaningful and robust features, it is necessary to first comprehensively understand the intrinsic pattern of the brain that is not restricted within a single data/task domain. Considering that the brain is a complex connectome of interlinked neurons, the connectional properties in the brain have strong biological significance, which is shared across multiple domains and covers most pathological information. In this work, we propose a connectional style contextual representation learning model (CS-CRL) to capture the intrinsic pattern of the brain, used for multiple brain disease diagnosis. Specifically, it has a vision transformer (ViT) encoder and leverages mask reconstruction as the proxy task and Gram matrices to guide the representation of connectional information. It facilitates the capture of global context and the aggregation of features with biological plausibility. The results indicate that CS-CRL achieves superior accuracy in multiple brain disease diagnosis tasks across six datasets and three diseases and outperforms state-of-the-art models. Furthermore, we demonstrate that CS-CRL captures more brain-network-like properties, better aggregates features, is easier to optimize and is more robust to noise, which explains its superiority in theory. Our source code will be released soon.
In the Global Navigation Satellite System (GNSS) context, the growing number of available satellites has lead to many challenges when it comes to choosing the most accurate pseudorange contributions, given the strong impact of biased measurements on positioning accuracy, particularly in single-epoch scenarios. This work leverages the potential of machine learning in predicting link-wise measurement quality factors and, hence, optimize measurement weighting. For this purpose, we use a customized matrix composed of heterogeneous features such as conditional pseudorange residuals and per-link satellite metrics (e.g., carrier-to-noise power density ratio and its empirical statistics, satellite elevation, carrier phase lock time). This matrix is then fed as an input to a recurrent neural network (RNN) (i.e., a long-short term memory (LSTM) network). Our experimental results on real data, obtained from extensive field measurements, demonstrate the high potential of our proposed solution being able to outperform traditional measurements weighting and selection strategies from state-of-the-art.
CMOS Image Sensors are experiencing significant growth due to their capabilities to be integrated in smartphones with refined image quality. One of the major contributions to the growth of image sensors is the innovation brought about in their fabrication processes. This paper presents a detailed review of the different fabrication processes of the CMOS Image Sensors and its impact on the image quality of smartphone pictures. Fabrication of CMOS image sensors using wafer bonding technologies such as Through Silicon Vias and CuCu hybrid bonding along with their experimental results are discussed. A 2 layer architecture of photodiode and pixel transistors has adopted the 3D sequential integration, by which the wafers are bonded together one after the other in the fabrication process. Electrical characteristics and reliability test results are presented for the former two fabrication processes and the improvements in the pixels performance such as conversion gain, quantum efficiency, full well capacity and dynamic range for the 2 layer architecture are discussed.
The high frequency communication bands (mmWave and sub-THz) promise tremendous data rates, however, they also have very high power consumption which is particularly significant for battery-power-limited user-equipment (UE). In this context, we design an energy aware band assignment system which reduces the power consumption while also achieving a target sum rate of M in T time-slots. We do this by using 1) Rate forecaster(s); 2) Channel forecaster(s) which forecasts T direct multistep ahead using a stacked (long short term memory) LSTM architecture. We propose an iterative rate updating algorithm which updates the target rate based on current rate and future predicted rates in a frame. The proposed approach is validated on the publicly available `DeepMIMO' dataset. Research findings shows that the rate forecaster based approach performs better than the channel forecaster. Furthermore, LSTM based predictions outperforms well celebrated Transformer predictions in terms of NRMSE and NMAE. Research findings reveals that the power consumption with this approach is ~ 300 mW lower compared to a greedy band assignment at a 1.5Gb/s target rate.
Precise localization is one key element of the Internet of Things (IoT). Especially concepts for position estimation when Global Navigation Satellite Systems (GNSS) are unavailable have moved into the focus. One crucial component for localization systems in general and precise runtime-based positioning, in particular, is the necessity of ultra-precise clock synchronization between the receiving base stations. Our work presents a software-based approach for the wireless synchronization of spatially separated base stations using a low-cost off-the-shelf frontend architecture. The proposed system estimates the time synchronization, sampling clock offset, and carrier frequency offset using broadcast signals as Signals of Opportunity. In this paper, we derive the theoretical lower bound for the estimation variance according to the Modified Cramer-Rao Bound. We show that a theoretical time synchronization accuracy in the range of ps and a frequency synchronization precision in the range of milli-Hertz is achievable. An algorithm is presented that estimates the desired parameter based on evaluating the Cross-Correlation Function between base stations. Initial measurements are conducted in a real-world environment. It is shown that the presented estimator nearly reaches the theoretical bound within a time and frequency synchronization accuracy of down to 200 ps and 6 mHz, respectively.
The emergence of smart cities demands harnessing advanced technologies like the Internet of Things (IoT) and Artificial Intelligence (AI) and promises to unlock cities' potential to become more sustainable, efficient, and ultimately livable for their inhabitants. This work introduces an intelligent city management system that provides a data-driven approach to three use cases: (i) analyze traffic information to reduce the risk of traffic collisions and improve driver and pedestrian safety, (ii) identify when and where energy consumption can be reduced to improve cost savings, and (iii) detect maintenance issues like potholes in the city's roads and sidewalks, as well as the beginning of hazards like floods and fires. A case study in Aveiro City demonstrates the system's effectiveness in generating actionable insights that enhance security, energy efficiency, and sustainability, while highlighting the potential of AI and IoT-driven solutions for smart city development.
With the rapid advancements of the text-to-image generative model, AI-generated images (AGIs) have been widely applied to entertainment, education, social media, etc. However, considering the large quality variance among different AGIs, there is an urgent need for quality models that are consistent with human subjective ratings. To address this issue, we extensively consider various popular AGI models, generated AGI through different prompts and model parameters, and collected subjective scores at the perceptual quality and text-to-image alignment, thus building the most comprehensive AGI subjective quality database AGIQA-3K so far. Furthermore, we conduct a benchmark experiment on this database to evaluate the consistency between the current Image Quality Assessment (IQA) model and human perception, while proposing StairReward that significantly improves the assessment performance of subjective text-to-image alignment. We believe that the fine-grained subjective scores in AGIQA-3K will inspire subsequent AGI quality models to fit human subjective perception mechanisms at both perception and alignment levels and to optimize the generation result of future AGI models. The database is released on \url{https://github.com/lcysyzxdxc/AGIQA-3k-Database}.
Reconfigurable intelligent surface (RIS) has shown its great potential in facilitating device-based integrated sensing and communication (ISAC), where sensing and communication tasks are mostly conducted on different time-frequency resources. While the more challenging scenarios of simultaneous sensing and communication (SSC) have so far drawn little attention. In this paper, we propose a novel RIS-aided ISAC framework where the inherent location information in the received communication signals from a blind-zone user equipment is exploited to enable SSC. We first design a two-phase ISAC transmission protocol. In the first phase, communication and coarse-grained location sensing are performed concurrently by exploiting the very limited channel state information, while in the second phase, by using the coarse-grained sensing information obtained from the first phase, simple-yet-efficient sensing-based beamforming designs are proposed to realize both higher-rate communication and fine-grained location sensing. We demonstrate that our proposed framework can achieve almost the same performance as the communication-only frameworks, while providing up to millimeter-level positioning accuracy. In addition, we show how the communication and sensing performance can be simultaneously boosted through our proposed sensing-based beamforming designs. The results presented in this work provide valuable insights into the design and implementation of other ISAC systems considering SSC.
In the domain of remote sensing image interpretation, road extraction from high-resolution aerial imagery has already been a hot research topic. Although deep CNNs have presented excellent results for semantic segmentation, the efficiency and capabilities of vision transformers are yet to be fully researched. As such, for accurate road extraction, a deep semantic segmentation neural network that utilizes the abilities of residual learning, HetConvs, UNet, and vision transformers, which is called \texttt{ResUNetFormer}, is proposed in this letter. The developed \texttt{ResUNetFormer} is evaluated on various cutting-edge deep learning-based road extraction techniques on the public Massachusetts road dataset. Statistical and visual results demonstrate the superiority of the \texttt{ResUNetFormer} over the state-of-the-art CNNs and vision transformers for segmentation. The code will be made available publicly at \url{https://github.com/aj1365/ResUNetFormer}.
The rapid advancement of spoofing algorithms necessitates the development of robust detection methods capable of accurately identifying emerging fake audio. Traditional approaches, such as finetuning on new datasets containing these novel spoofing algorithms, are computationally intensive and pose a risk of impairing the acquired knowledge of known fake audio types. To address these challenges, this paper proposes an innovative approach that mitigates the limitations associated with finetuning. We introduce the concept of training low-rank adaptation matrices tailored specifically to the newly emerging fake audio types. During the inference stage, these adaptation matrices are combined with the existing model to generate the final prediction output. Extensive experimentation is conducted to evaluate the efficacy of the proposed method. The results demonstrate that our approach effectively preserves the prediction accuracy of the existing model for known fake audio types. Furthermore, our approach offers several advantages, including reduced storage memory requirements and lower equal error rates compared to conventional finetuning methods, particularly on specific spoofing algorithms.
This paper studies the motion planning problem of the pick-and-place of an aerial manipulator that consists of a quadcopter flying base and a Delta arm. We propose a novel partially decoupled motion planning framework to solve this problem. Compared to the state-of-the-art approaches, the proposed one has two novel features. First, it does not suffer from increased computation in high-dimensional configuration spaces. That is because it calculates the trajectories of the quadcopter base and the end-effector separately in the Cartesian space based on proposed geometric feasibility constraints. The geometric feasibility constraints can ensure the resulting trajectories satisfy the aerial manipulator's geometry. Second, collision avoidance for the Delta arm is achieved through an iterative approach based on a pinhole mapping method, so that the feasible trajectory can be found in an efficient manner. The proposed approach is verified by three experiments on a real aerial manipulation platform. The experimental results show the effectiveness of the proposed method for the aerial pick-and-place task.
This work introduces approaches to assessing phrase breaks in ESL learners' speech using pre-trained language models (PLMs) and large language models (LLMs). There are two tasks: overall assessment of phrase break for a speech clip and fine-grained assessment of every possible phrase break position. To leverage NLP models, speech input is first force-aligned with texts, and then pre-processed into a token sequence, including words and phrase break information. To utilize PLMs, we propose a pre-training and fine-tuning pipeline with the processed tokens. This process includes pre-training with a replaced break token detection module and fine-tuning with text classification and sequence labeling. To employ LLMs, we design prompts for ChatGPT. The experiments show that with the PLMs, the dependence on labeled training data has been greatly reduced, and the performance has improved. Meanwhile, we verify that ChatGPT, a renowned LLM, has potential for further advancement in this area.
Each year, wildfires destroy larger areas of Spain, threatening numerous ecosystems. Humans cause 90% of them (negligence or provoked) and the behaviour of individuals is unpredictable. However, atmospheric and environmental variables affect the spread of wildfires, and they can be analysed by using deep learning. In order to mitigate the damage of these events we proposed the novel Wildfire Assessment Model (WAM). Our aim is to anticipate the economic and ecological impact of a wildfire, assisting managers resource allocation and decision making for dangerous regions in Spain, Castilla y Le\'on and Andaluc\'ia. The WAM uses a residual-style convolutional network architecture to perform regression over atmospheric variables and the greenness index, computing necessary resources, the control and extinction time, and the expected burnt surface area. It is first pre-trained with self-supervision over 100,000 examples of unlabelled data with a masked patch prediction objective and fine-tuned using 311 samples of wildfires. The pretraining allows the model to understand situations, outclassing baselines with a 1,4%, 3,7% and 9% improvement estimating human, heavy and aerial resources; 21% and 10,2% in expected extinction and control time; and 18,8% in expected burnt area. Using the WAM we provide an example assessment map of Castilla y Le\'on, visualizing the expected resources over an entire region.
Deep Learning models are a standard solution for sensor-based Human Activity Recognition (HAR), but their deployment is often limited by labeled data scarcity and models' opacity. Neuro-Symbolic AI (NeSy) provides an interesting research direction to mitigate these issues by infusing knowledge about context information into HAR deep learning classifiers. However, existing NeSy methods for context-aware HAR require computationally expensive symbolic reasoners during classification, making them less suitable for deployment on resource-constrained devices (e.g., mobile devices). Additionally, NeSy approaches for context-aware HAR have never been evaluated on in-the-wild datasets, and their generalization capabilities in real-world scenarios are questionable. In this work, we propose a novel approach based on a semantic loss function that infuses knowledge constraints in the HAR model during the training phase, avoiding symbolic reasoning during classification. Our results on scripted and in-the-wild datasets show the impact of different semantic loss functions in outperforming a purely data-driven model. We also compare our solution with existing NeSy methods and analyze each approach's strengths and weaknesses. Our semantic loss remains the only NeSy solution that can be deployed as a single DNN without the need for symbolic reasoning modules, reaching recognition rates close (and better in some cases) to existing approaches.
Phonetic convergence describes the automatic and unconscious speech adaptation of two interlocutors in a conversation. This paper proposes a Siamese recurrent neural network (RNN) architecture to measure the convergence of the holistic spectral characteristics of speech sounds in an L2-L2 interaction. We extend an alternating reading task (the ART) dataset by adding 20 native Slovak L2 English speakers. We train and test the Siamese RNN model to measure phonetic convergence of L2 English speech from three different native language groups: Italian (9 dyads), French (10 dyads) and Slovak (10 dyads). Our results indicate that the Siamese RNN model effectively captures the dynamics of phonetic convergence and the speaker's imitation ability. Moreover, this text-independent model is scalable and capable of handling L1-induced speaker variability.
This paper presents a review of techniques used for the detection and tracking of UAVs or drones. There are different techniques that depend on collecting measurements of the position, velocity, and image of the UAV and then using them in detection and tracking. Hybrid detection techniques are also presented. The paper is a quick reference for a wide spectrum of methods that are used in the drone detection process.
Detecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the ``least favorable'' distribution for robust change detection. The algorithm and its theoretical analysis are demonstrated through simulation studies.
Holographic MIMO (hMIMO) systems with a massive number of individually controlled antennas N make minimum mean square error (MMSE) channel estimation particularly challenging, due to its computational complexity that scales as $N^3$ . This paper investigates uniform linear arrays and proposes a low-complexity method based on the discrete Fourier transform (DFT) approximation, which follows from replacing the covariance matrix by a suitable circulant matrix. Numerical results show that, already for arrays with moderate size (in the order of tens of wavelengths), it achieves the same performance of the optimal MMSE, but with a significant lower computational load that scales as $N \log N$. Interestingly, the proposed method provides also increased robustness in case of imperfect knowledge of the covariance matrix.
Predicting the performance of traveling-wave parametric amplifiers (TWPAs) based on nonlinear elements like superconducting Josephson junctions (JJs) is vital for qubit read-out in quantum computers. The purpose of this article is twofold: (a) to demonstrate how nonlinear inductors based on combinations of JJs can be modeled in commercial circuit simulators, and (b) to show how the harmonic balance (HB) is used in the reliable prediction of the amplifier performance e.g., gain and pump harmonic power conversion. Experimental characterization of two types of TWPA architectures is compared with simulations to showcase the reliability of the HB method. We disseminate the modeling know-how and techniques to new designers of parametric amplifiers.
Machine learning head models (MLHMs) are developed to estimate brain deformation for early detection of traumatic brain injury (TBI). However, the overfitting to simulated impacts and the lack of generalizability caused by distributional shift of different head impact datasets hinders the broad clinical applications of current MLHMs. We propose brain deformation estimators that integrates unsupervised domain adaptation with a deep neural network to predict whole-brain maximum principal strain (MPS) and MPS rate (MPSR). With 12,780 simulated head impacts, we performed unsupervised domain adaptation on on-field head impacts from 302 college football (CF) impacts and 457 mixed martial arts (MMA) impacts using domain regularized component analysis (DRCA) and cycle-GAN-based methods. The new model improved the MPS/MPSR estimation accuracy, with the DRCA method significantly outperforming other domain adaptation methods in prediction accuracy (p<0.001): MPS RMSE: 0.027 (CF) and 0.037 (MMA); MPSR RMSE: 7.159 (CF) and 13.022 (MMA). On another two hold-out test sets with 195 college football impacts and 260 boxing impacts, the DRCA model significantly outperformed the baseline model without domain adaptation in MPS and MPSR estimation accuracy (p<0.001). The DRCA domain adaptation reduces the MPS/MPSR estimation error to be well below TBI thresholds, enabling accurate brain deformation estimation to detect TBI in future clinical applications.
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft.
Trajectory optimization of a robot manipulator consists of both optimization of the robot movement as well as optimization of the robot end-effector path. This paper aims to find optimum movement parameters including movement type, speed, and acceleration to minimize robot energy. Trajectory optimization by minimizing the energy would increase the longevity of robotic manipulators. We utilized the particle swarm optimization method to find the movement parameters leading to minimum energy consumption. The effectiveness of the proposed method is demonstrated on different trajectories. Experimental results show that 49% efficiency was obtained using a UR5 robotic arm.
Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations using self-supervised or weakly-supervised learning objectives from large-scale datasets. Despite the significant advances made in SER through the use of pre-trained architecture, fine-tuning these large pre-trained models for different datasets requires saving copies of entire weight parameters, rendering them impractical to deploy in real-world settings. As an alternative, this work explores parameter-efficient fine-tuning (PEFT) approaches for adapting pre-trained speech models for emotion recognition. Specifically, we evaluate the efficacy of adapter tuning, embedding prompt tuning, and LoRa (Low-rank approximation) on four popular SER testbeds. Our results reveal that LoRa achieves the best fine-tuning performance in emotion recognition while enhancing fairness and requiring only a minimal extra amount of weight parameters. Furthermore, our findings offer novel insights into future research directions in SER, distinct from existing approaches focusing on directly fine-tuning the model architecture. Our code is publicly available under: https://github.com/usc-sail/peft-ser.
There are increasing concerns about malicious attacks on autonomous vehicles. In particular, inaudible voice command attacks pose a significant threat as voice commands become available in autonomous driving systems. How to empirically defend against these inaudible attacks remains an open question. Previous research investigates utilizing deep learning-based multimodal fusion for defense, without considering the model uncertainty in trustworthiness. As deep learning has been applied to increasingly sensitive tasks, uncertainty measurement is crucial in helping improve model robustness, especially in mission-critical scenarios. In this paper, we propose the Multimodal Fusion Framework (MFF) as an intelligent security system to defend against inaudible voice command attacks. MFF fuses heterogeneous audio-vision modalities using VGG family neural networks and achieves the detection accuracy of 92.25% in the comparative fusion method empirical study. Additionally, extensive experiments on audio-vision tasks reveal the model's uncertainty. Using Expected Calibration Errors, we measure calibration errors and Monte-Carlo Dropout to estimate the predictive distribution for the proposed models. Our findings show empirically to train robust multimodal models, improve standard accuracy and provide a further step toward interpretability. Finally, we discuss the pros and cons of our approach and its applicability for Advanced Driver Assistance Systems.
Previous initial research has already been carried out to propose speech-based BCI using brain signals (e.g.~non-invasive EEG and invasive sEEG / ECoG), but there is a lack of combined methods that investigate non-invasive brain, articulation, and speech signals together and analyze the cognitive processes in the brain, the kinematics of the articulatory movement and the resulting speech signal. In this paper, we describe our multimodal (electroencephalography, ultrasound tongue imaging, and speech) analysis and synthesis experiments, as a feasibility study. We extend the analysis of brain signals recorded during speech production with ultrasound-based articulation data. From the brain signal measured with EEG, we predict ultrasound images of the tongue with a fully connected deep neural network. The results show that there is a weak but noticeable relationship between EEG and ultrasound tongue images, i.e. the network can differentiate articulated speech and neutral tongue position.
We consider the federated submodel learning (FSL) problem in a distributed storage system. In the FSL framework, the full learning model at the server side is divided into multiple submodels such that each selected client needs to download only the required submodel(s) and upload the corresponding update(s) in accordance with its local training data. The server comprises multiple independent databases and the full model is stored across these databases. An eavesdropper passively observes all the storage and listens to all the communicated data, of its controlled databases, to gain knowledge about the remote client data and the submodel information. In addition, a subset of databases may fail, negatively affecting the FSL process, as FSL process may take a non-negligible amount of time for large models. To resolve these two issues together (i.e., security and database repair), we propose a novel coding mechanism coined ramp secure regenerating coding (RSRC), to store the full model in a distributed manner. Using our new RSRC method, the eavesdropper is permitted to learn a controllable amount of submodel information for the sake of reducing the communication and storage costs. Further, during the database repair process, in the construction of the replacement database, the submodels to be updated are stored in the form of their latest version from updating clients, while the remaining submodels are obtained from the previous version in other databases through routing clients. Our new RSRC-based distributed FSL approach is constructed on top of our earlier two-database FSL scheme which uses private set union (PSU). A complete one-round FSL process consists of FSL-PSU phase, FSL-write phase and additional auxiliary phases. Our proposed FSL scheme is also robust against database drop-outs, client drop-outs, client late-arrivals and an active adversary controlling databases.