New articles on Electrical Engineering and Systems Science


[1] 2412.18614

Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection

Previous studies have demonstrated that emotional features from a single acoustic sentiment label can enhance depression diagnosis accuracy. Additionally, according to the Emotion Context-Insensitivity theory and our pilot study, individuals with depression might convey negative emotional content in an unexpectedly calm manner, showing a high degree of inconsistency in emotional expressions during natural conversations. So far, few studies have recognized and leveraged the emotional expression inconsistency for depression detection. In this paper, a multimodal cross-attention method is presented to capture the Acoustic-Textual Emotional Inconsistency (ATEI) information. This is achieved by analyzing the intricate local and long-term dependencies of emotional expressions across acoustic and textual domains, as well as the mismatch between the emotional content within both domains. A Transformer-based model is then proposed to integrate this ATEI information with various fusion strategies for detecting depression. Furthermore, a scaling technique is employed to adjust the ATEI feature degree during the fusion process, thereby enhancing the model's ability to discern patients with depression across varying levels of severity. To the best of our knowledge, this work is the first to incorporate emotional expression inconsistency information into depression detection. Experimental results on a counseling conversational dataset illustrate the effectiveness of our method.


[2] 2412.18649

FITS: Ensuring Safe and Effective Touchscreen Use in Moving Vehicles

Touch interfaces are replacing physical buttons, dials, and switches in the new generation of cars, aircraft, and vessels. However, vehicle vibrations and accelerations perturb finger movements and cause erroneous touchscreen inputs by users. Furthermore, unlike physical buttons, touchscreens cannot be operated by touch alone and always require users' visual focus. Hence, despite their numerous benefits, touchscreens are not inherently suited for use in vehicles, which results in an increased risk of accidents. In a recently awarded research project titled "Right Touch Right Time: Future In-vehicle Touchscreens (FITS)", we aim to address these problems by developing novel in-vehicle touchscreens that actively predict and correct perturbed finger movements and simulate physical touch interactions with artificial tactile feedback.


[3] 2412.18667

State-of-the-Art Underwater Vehicles and Technologies Enabling Smart Ocean: Survey and Classifications

The exploration and sustainable use of marine environments have become increasingly critical as oceans cover over 70% of the Earth's surface. This paper provides a comprehensive survey and classification of state-of-the-art underwater vehicles (UVs) and supporting technologies essential for enabling a smart ocean. We categorize UVs into several types, including remotely operated vehicles (ROVs), autonomous underwater vehicles (AUVs), hybrid underwater vehicles (HUVs), unmanned surface vehicles (USVs), and underwater bionic vehicles (UBVs). These technologies are fundamental in a wide range of applications, such as environmental monitoring, deep-sea exploration, defense, and underwater infrastructure inspection. Additionally, the paper explores advancements in underwater communication technologies, namely acoustic, optical, and hybrid systems, as well as key support facilities, including submerged buoys, underwater docking stations, and wearable underwater localization systems. By classifying the vehicles and analyzing their technological capabilities and limitations, this work aims to guide future developments in underwater exploration and monitoring, addressing challenges such as energy efficiency, communication limitations, and environmental adaptability. The paper concludes by discussing the integration of artificial intelligence and machine learning in enhancing the autonomy and operational efficiency of these systems, paving the way for the realization of a fully interconnected and sustainable Smart Ocean.


[4] 2412.18668

Pruning Unrolled Networks (PUN) at Initialization for MRI Reconstruction Improves Generalization

Deep learning methods are highly effective for many image reconstruction tasks. However, the performance of supervised learned models can degrade when applied to distinct experimental settings at test time or in the presence of distribution shifts. In this study, we demonstrate that pruning deep image reconstruction networks at training time can improve their robustness to distribution shifts. In particular, we consider unrolled reconstruction architectures for accelerated magnetic resonance imaging and introduce a method for pruning unrolled networks (PUN) at initialization. Our experiments demonstrate that when compared to traditional dense networks, PUN offers improved generalization across a variety of experimental settings and even slight performance gains on in-distribution data.


[5] 2412.18723

MRI Reconstruction with Regularized 3D Diffusion Model (R3DM)

Magnetic Resonance Imaging (MRI) is a powerful imaging technique widely used for visualizing structures within the human body and in other fields such as plant sciences. However, there is a demand to develop fast 3D-MRI reconstruction algorithms to show the fine structure of objects from under-sampled acquisition data, i.e., k-space data. This emphasizes the need for efficient solutions that can handle limited input while maintaining high-quality imaging. In contrast to previous methods that operate only in 2D, we propose a 3D MRI reconstruction method that leverages a regularized 3D diffusion model combined with an optimization method. By incorporating diffusion-based priors, our method improves image quality, reduces noise, and enhances the overall fidelity of 3D MRI reconstructions. We conduct a comprehensive experimental analysis on clinical and plant science MRI datasets. To evaluate the algorithm's effectiveness for under-sampled k-space data, we also demonstrate its reconstruction performance with several undersampling patterns, as well as with in- and out-of-distribution pre-trained data. In experiments, we show that our method improves upon tested competitors.


[6] 2412.18749

RIS-Assisted Simultaneous Legitimate Monitoring and Jamming for Industrial Wireless Networks

In this paper, we study reconfigurable intelligent surface (RIS)-assisted simultaneous legitimate monitoring and jamming techniques for industrial environments, so that a legitimate monitor (LM) and legitimate jammers (LJs) can sustainably monitor and interfere with suspicious communications with minimum transmission power. Specifically, we propose a Block Coordinate Descent-Particle Swarm Optimization (BCD-PSO) based scheme to optimize the RIS's phase shift matrix and minimize the LJs' transmission power, while successfully jamming and stably monitoring unauthorized communications. Simulation results demonstrate that the proposed BCD-PSO can enhance performance in terms of monitoring, resource utilization, and robustness. Moreover, we examine the best deployment of the RIS for diverse objectives.


[7] 2412.18762

Experimental Study of RCS Diversity with Novel No-divergent OAM Beams

This research proposes a novel approach utilizing Orbital Angular Momentum (OAM) beams to enhance Radar Cross Section (RCS) diversity for target detection in future transportation systems. Unlike conventional OAM beams with hollow-shaped divergence patterns, the newly proposed OAM beams provide uniform illumination across the target without a central energy void, while retaining the inherent phase gradient of the vortex property. We utilize waveguide slot antennas to generate four different modes of these novel OAM beams at X-band frequency. Furthermore, these different-mode OAM beams are used to illuminate metal models, and the resulting RCS is compared with that obtained using plane waves. The findings reveal that the novel OAM beams produce significant azimuthal RCS diversity, providing a new approach for the detection of weak and small targets. This study not only reveals the RCS diversity phenomenon based on novel OAM beams of different modes but also addresses the issue of energy divergence that hinders traditional OAM beams in long-range detection applications.


[8] 2412.18784

Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Computational music research plays a critical role in advancing music production, distribution, and understanding across various musical styles worldwide. Despite their immense cultural and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chants are relatively underrepresented in computational music research. This paper contributes to this field by introducing a new dataset specifically tailored for analyzing EOTC chants, also known as Yaredawi Zema. This work provides a comprehensive overview of the creation and curation process of a 10-hour dataset of 369 instances, including rigorous quality assurance measures. Our dataset has detailed word-level temporal boundary and reading tone annotations, along with the corresponding chanting mode label for each audio recording. Moreover, we have also identified the chanting options associated with multiple chanting notations in the manuscript by annotating them accordingly. Our goal in making this dataset available to the public is to encourage more research and study of EOTC chants, including lyrics transcription, lyric-to-audio alignment, and music generation tasks. Such research work will advance knowledge and efforts to preserve this distinctive liturgical music, a priceless cultural artifact for the Ethiopian people.


[9] 2412.18788

Computational Analysis of Yaredawi YeZema Silt in Ethiopian Orthodox Tewahedo Church Chants

Despite its musicological, cultural, and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chant is relatively underrepresented in music research. Historical records, including manuscripts, research papers, and oral traditions, confirm Saint Yared's establishment of three canonical EOTC chanting modes during the 6th century. This paper attempts to investigate the EOTC chants using music information retrieval (MIR) techniques. Among the research questions regarding the analysis and understanding of EOTC chants, Yaredawi YeZema Silt, namely the mode of chanting adhering to Saint Yared's standards, is of primary importance. Therefore, we consider the task of Yaredawi YeZema Silt classification in EOTC chants by introducing a new dataset and showcasing a series of classification experiments for this task. Results show that using the distribution of stabilized pitch contours as the feature representation with a simple neural network-based classifier is an effective solution. The musicological implications and insights of such results are further discussed through a comparative study with the previous ethnomusicology literature on EOTC chants. By making this dataset publicly accessible, we aim to promote future exploration and analysis of EOTC chants and highlight potential directions for further research, thereby fostering a deeper understanding and preservation of this unique spiritual and cultural heritage.


[10] 2412.18832

Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition

Data-intensive fine-tuning of speech foundation models (SFMs) to scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker and speech deficiency invariant SFMs were constructed in their supervised adaptive fine-tuning stage to reduce undue bias to training data speakers, and serve as a more neutral and robust starting point for test time unsupervised adaptation. Speech variability attributed to speaker identity and speech impairment severity, or aging induced neurocognitive decline, is modelled using separate adapters that can be combined to model any seen or unseen speaker. Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single attribute adapters modelling speaker or deficiency labels alone, with statistically significant WER reductions of up to 3.01% and 1.50% absolute (10.86% and 6.94% relative) on the two tasks respectively. The lowest published WER of 19.45% (49.34% on very low intelligibility, 33.17% on unseen words) is obtained on the UASpeech test set of 16 dysarthric speakers.
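
The absolute and relative WER reductions quoted above are tied together by the baseline WER. As a rough arithmetic check, assuming the usual definition (relative reduction = absolute reduction / baseline WER), the baselines implied by the quoted figures can be sketched as follows. The abstract does not state these baselines, and the function name is hypothetical, so the results are only implied values.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction: float) -> float:
    """Baseline WER (in %) implied by an absolute and a relative reduction (both in %).

    Assumes relative = absolute / baseline, which is the common convention
    but is not spelled out in the abstract.
    """
    return absolute_reduction / (relative_reduction / 100.0)


# Figures quoted in the abstract: (absolute %, relative %) per task.
uaspeech_baseline = implied_baseline_wer(3.01, 10.86)  # dysarthric speech task
pitt_baseline = implied_baseline_wer(1.50, 6.94)       # elderly speech task

print(f"Implied UASpeech baseline WER: {uaspeech_baseline:.2f}%")
print(f"Implied DementiaBank Pitt baseline WER: {pitt_baseline:.2f}%")
```

Because the reductions are reported as "up to" values, the implied baselines need not correspond to the configuration that reaches the lowest published WER.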


[11] 2412.18876

Towards Compatible Semantic Communication: A Perspective on Digital Coding and Modulation

Semantic communication (SC) is emerging as a pivotal innovation within the 6G framework, aimed at enabling more intelligent transmission. This development has led to numerous studies focused on designing advanced systems through powerful deep learning techniques. Nevertheless, many of these approaches envision an analog transmission manner by formulating the transmitted signals as continuous-valued semantic representation vectors, limiting their compatibility with existing digital systems. To enhance compatibility, it is essential to explore digitized SC systems. This article systematically identifies two promising paradigms for designing digital SC: probabilistic and deterministic approaches, according to the modulation strategies. For both, we first provide a comprehensive analysis of the methodologies. Then, we put forward the principles of designing digital SC systems with a specific focus on informativeness and robustness of semantic representations to enhance performance, along with constellation design. Additionally, we present a case study to demonstrate the effectiveness of these methods. Moreover, this article also explores the intrinsic advantages and opportunities provided by digital SC systems, and then outlines several potential research directions for future investigation.


[12] 2412.18887

Preventing output saturation in active noise control: An output-constrained Kalman filter approach

The Kalman filter (KF)-based active noise control (ANC) system demonstrates superior tracking and faster convergence compared to the least mean square (LMS) method, particularly in dynamic noise cancellation scenarios. However, in environments with extremely high noise levels, the power of the control signal can exceed the system's rated output power due to hardware limitations, leading to output saturation and subsequent non-linearity. To mitigate this issue, a modified KF with an output constraint is proposed. In this approach, the disturbance treated as a measurement is re-scaled by a constraint factor, which is determined by the system's rated power, the secondary path gain, and the disturbance power. As a result, the output power of the system, i.e., the control signal, is indirectly constrained within the maximum output of the system, ensuring stability. Simulation results indicate that the proposed algorithm not only achieves rapid suppression of dynamic noise but also effectively prevents non-linearity due to output saturation, highlighting its practical significance.


[13] 2412.18894

Comprehensive Study on Lumbar Disc Segmentation Techniques Using MRI Data

Lumbar disk segmentation is essential for diagnosing and treating spinal disorders by enabling precise detection of disk boundaries in medical imaging. The advent of deep learning has resulted in the development of many segmentation methods, offering differing levels of accuracy and effectiveness. This study assesses the effectiveness of several sophisticated deep learning architectures, including ResUnext, Ef3 Net, UNet, and TransUNet, for lumbar disk segmentation, highlighting key metrics such as Pixel Accuracy, Mean Intersection over Union (Mean IoU), and Dice Coefficient. The findings indicate that ResUnext achieved the highest segmentation accuracy, with a Pixel Accuracy of 0.9492 and a Dice Coefficient of 0.8425, with TransUNet following closely. Filtering techniques somewhat enhanced the performance of most models, particularly Dense UNet, improving stability and segmentation quality. The findings underscore the efficacy of these models in lumbar disk segmentation and highlight potential areas for improvement.
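
For readers unfamiliar with the metrics named above, the following is a minimal sketch of Pixel Accuracy, IoU, and the Dice Coefficient for a single foreground class on flat binary masks. The function name and the toy masks are illustrative assumptions and not taken from the paper; Mean IoU would additionally average the IoU over all classes.

```python
def segmentation_metrics(pred, target):
    """Return (pixel accuracy, IoU, Dice) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    pairs = list(zip(pred, target))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # foreground predicted correctly
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)  # background predicted correctly
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # false foreground
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # missed foreground
    pixel_accuracy = (tp + tn) / len(pairs)
    union = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return pixel_accuracy, iou, dice


# Toy example: 8 pixels, 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
target = [1, 0, 0, 0, 1, 1, 0, 0]
accuracy, iou, dice = segmentation_metrics(pred, target)
print(accuracy, iou, dice)
```

The toy example also shows why Pixel Accuracy can look much higher than Dice or IoU: correctly labelled background pixels count toward accuracy but not toward the overlap-based scores.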


[14] 2412.18905

External Bias and Opinion Clustering in Cooperative Networks

In this work, we consider a group of n agents which interact with each other in a cooperative framework. A Laplacian-based model is proposed to govern the evolution of opinions in the group when the agents are subjected to external biases like agents' traits, news, etc. The objective of the paper is to design a control input which leads to any desired opinion clustering even in the presence of external bias factors. Further, we also determine the conditions which ensure the reachability of arbitrary opinion states. Note that all of these results hold for any kind of graph structure. Finally, some numerical simulations are discussed to validate these results.
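
To illustrate how an external bias can prevent consensus in Laplacian dynamics, here is a minimal sketch assuming the generic form x' = -L x + b, integrated with forward Euler on a three-agent path graph. This generic form, the graph, the bias values, and the function name are all assumptions made for illustration; the abstract does not give the authors' exact model, and the control input they design is omitted here.

```python
def simulate_opinions(laplacian, x0, bias, dt=0.01, steps=5000):
    """Forward-Euler integration of x' = -L x + b (assumed generic model)."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        lx = [sum(laplacian[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * (-lx[i] + bias[i]) for i in range(n)]
    return x


# Laplacian of the undirected path graph 1 - 2 - 3.
L = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
x0 = [0.9, 0.3, -0.6]  # initial opinions, average 0.2

# Without bias, the agents reach consensus at the initial average.
consensus = simulate_opinions(L, x0, bias=[0.0, 0.0, 0.0])

# With a zero-sum bias, the opinions settle at distinct values instead.
biased = simulate_opinions(L, x0, bias=[1.0, 0.0, -1.0])

print([round(v, 4) for v in consensus])
print([round(v, 4) for v in biased])
```

In this toy setting the bias sums to zero, so the average opinion is preserved while the individual opinions separate; a bias with nonzero sum would make the average drift under this assumed model.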


[15] 2412.18931

An Approximated Model of Wildfire Propagation on Slope

The increasing frequency and intensity of wildfires underscore the need for accurate predictive models to enhance wildfire management. Traditional models, such as Rothermel and FARSITE, provide foundational insights but often oversimplify the complex dynamics of wildfire spread. Advanced methods, employing sophisticated mathematical techniques, offer more precise modeling by accounting for real-world complexities and dynamic environmental factors. This paper focuses on wildfire propagation over inclined terrains and combines the Rothermel model, Huygens' principle, and advanced mathematical techniques to provide a more precise model of propagation. Environmental parameters and vegetation factors are directly incorporated into formulas and equations to improve the reliability and effectiveness of wildfire management strategies. The practical application of these results is demonstrated through MATLAB simulations, specifically examining wildfire spread under wind conditions that do not impede upwind fire advancement. The findings of this work contribute to both wildfire research and the development of more effective management strategies.


[16] 2412.18957

RIS-Assisted Aerial Non-Terrestrial Networks: An Intelligent Synergy with Deep Reinforcement Learning

Reconfigurable intelligent surface (RIS)-assisted aerial non-terrestrial networks (NTNs) offer a promising paradigm for enhancing wireless communications in the era of 6G and beyond. By integrating RIS with aerial platforms such as unmanned aerial vehicles (UAVs) and high-altitude platforms (HAPs), these networks can intelligently control signal propagation, extending coverage, improving capacity, and enhancing link reliability. This article explores the application of deep reinforcement learning (DRL) as a powerful tool for optimizing RIS-assisted aerial NTNs. We focus on hybrid proximal policy optimization (H-PPO), a robust DRL algorithm well-suited for handling the complex, hybrid action spaces inherent in these networks. Through a case study of an aerial RIS (ARIS)-aided coordinated multi-point non-orthogonal multiple access (CoMP-NOMA) network, we demonstrate how H-PPO can effectively optimize the system and maximize the sum rate while adhering to system constraints. Finally, we discuss key challenges and promising research directions for DRL-powered RIS-assisted aerial NTNs, highlighting their potential to transform next-generation wireless networks.


[17] 2412.18983

Deep Learning-Based Traffic-Aware Base Station Sleep Mode and Cell Zooming Strategy in RIS-Aided Multi-Cell Networks

Advances in wireless technology have significantly increased the number of wireless connections, leading to higher energy consumption in networks. Among these, base stations (BSs) in radio access networks (RANs) account for over half of the total energy usage. To address this, we propose a multi-cell sleep strategy combined with adaptive cell zooming, user association, and reconfigurable intelligent surface (RIS) to minimize BS energy consumption. This approach allows BSs to enter sleep mode during low traffic, while adaptive cell zooming and user association dynamically adjust coverage to balance traffic load and enhance data rates through RIS, minimizing the number of active BSs. However, it is important to note that the proposed method may achieve energy savings at the cost of increased delay, requiring a trade-off between these two factors. Moreover, minimizing BS energy consumption under the delay constraint is a complicated non-convex problem. To address this issue, we model the RIS-aided multi-cell network as a Markov decision process (MDP) and use the proximal policy optimization (PPO) algorithm to optimize sleep mode (SM), cell zooming, and user association. Besides, we utilize a double cascade correlation network (DCCN) algorithm to optimize the RIS reflection coefficients. Simulation results demonstrate that PPO balances energy savings and delay, while DCCN-optimized RIS enhances BS energy savings. Compared to systems optimized by the benchmark DQN algorithm, energy consumption is reduced by 49.61%.


[18] 2412.18996

WaveDiffUR: A diffusion SDE-based solver for ultra magnification super-resolution in remote sensing images

Deep neural networks have recently achieved significant advancements in remote sensing super-resolution (SR). However, most existing methods are limited to low magnification rates (e.g., 2 or 4) due to the escalating ill-posedness at higher magnification scales. To tackle this challenge, we redefine high-magnification SR as the ultra-resolution (UR) problem, reframing it as solving a conditional diffusion stochastic differential equation (SDE). In this context, we propose WaveDiffUR, a novel wavelet-domain diffusion UR solver that decomposes the UR process into sequential sub-processes addressing conditional wavelet components. WaveDiffUR iteratively reconstructs low-frequency wavelet details (ensuring global consistency) and high-frequency components (enhancing local fidelity) by incorporating pre-trained SR models as plug-and-play modules. This modularity mitigates the ill-posedness of the SDE and ensures scalability across diverse applications. To address limitations in fixed boundary conditions at extreme magnifications, we introduce the cross-scale pyramid (CSP) constraint, a dynamic and adaptive framework that guides WaveDiffUR in generating fine-grained wavelet details, ensuring consistent and high-fidelity outputs even at extreme magnification rates.
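
The split into low-frequency and high-frequency wavelet components mentioned above can be illustrated with a one-level Haar transform on a 1D signal. This is only a sketch of the general wavelet idea under simplifying assumptions (Haar basis, 1D, hypothetical function names); it is not the WaveDiffUR solver, and the abstract does not say which wavelet the authors use.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One-level Haar transform: return (low-frequency, high-frequency) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return low, high


def inverse_haar_step(low, high):
    """Rebuild the signal from its low- and high-frequency Haar coefficients."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


signal = [4.0, 2.0, 5.0, 5.0]
low, high = haar_step(signal)          # low: coarse structure, high: local detail
restored = inverse_haar_step(low, high)
print(low, high, restored)
```

The low band carries the coarse, globally consistent structure, while the high band is zero wherever neighbouring samples are equal and therefore encodes only local detail.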


[19] 2412.19005

Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data by simulating common errors that occur in AV-ASR from two focal points: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.


[20] 2412.19017

Brain Ageing Prediction using Isolation Forest Technique and Residual Neural Network (ResNet)

Brain aging is a complex and dynamic process, leading to functional and structural changes in the brain. These changes can lead to an increased risk of neurodegenerative diseases and cognitive decline. Accurate brain-age estimation utilizing neuroimaging data has become necessary for detecting initial signs of neurodegeneration. Here, we propose a novel deep learning approach using the Residual Neural Network 101 Version 2 (ResNet101V2) model to predict brain age from MRI scans. To train, validate and test our proposed model, we used a large dataset of 2102 images which were selected randomly from the International Consortium for Brain Mapping (ICBM). Next, we applied data preprocessing techniques, including normalizing the images and outlier detection via the Isolation Forest method. Then, we evaluated various pre-trained approaches (namely: MobileNetV2, ResNet50V2, ResNet101V2, Xception). The results demonstrated that the ResNet101V2 model has higher performance compared with the other models, attaining MAEs of 0.9136 and 0.8242 years before and after applying the Isolation Forest process, respectively. Our method achieved high accuracy in brain age estimation on the ICBM dataset and provides reliable brain age prediction.


[21] 2412.19026

Modality-Projection Universal Model for Comprehensive Full-Body Medical Imaging Segmentation

The integration of deep learning in medical imaging has shown great promise for enhancing diagnostic, therapeutic, and research outcomes. However, applying universal models across multiple modalities remains challenging due to the inherent variability in data characteristics. This study aims to introduce and evaluate a Modality Projection Universal Model (MPUM). MPUM employs a novel modality-projection strategy, which allows the model to dynamically adjust its parameters to optimize performance across different imaging modalities. The MPUM demonstrated superior accuracy in identifying anatomical structures, enabling precise quantification for improved clinical decision-making. It also identifies metabolic associations within the brain-body axis, advancing research on brain-body physiological correlations. Furthermore, MPUM's unique controller-based convolution layer enables visualization of saliency maps across all network layers, significantly enhancing the model's interpretability.


[22] 2412.19036

Unifying Tree-Reweighted Belief Propagation and Mean Field for Tracking Extended Targets

This paper proposes a unified tree-reweighted belief propagation (BP) and mean field (MF) approach for scalable detection and tracking of extended targets within a factor graph framework. The factor graph is partitioned into a BP region and an MF region so that the messages in each region are updated according to the corresponding region rules. The BP region exploits the tree-reweighted BP, which offers better convergence than the standard BP for graphs with massive cycles, to resolve data association. The MF region approximates the posterior densities of the measurement rate, kinematic state and extent. For linear Gaussian target models and gamma Gaussian inverse Wishart distributed state density, the unified approach provides a closed-form recursion for the state density. Hence, the proposed algorithm is more efficient than particle-based BP algorithms for extended target tracking. This method also avoids measurement clustering and gating since it solves the data association problem in a probabilistic fashion. We compare the proposed approach with algorithms such as the Poisson multi-Bernoulli mixture filter and the BP-based Poisson multi-Bernoulli filter. Simulation results demonstrate that the proposed algorithm achieves enhanced tracking performance.


[23] 2412.19068

Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference

This study focuses on the First VoicePrivacy Attacker Challenge within the ICASSP 2025 Signal Processing Grand Challenge, which aims to develop speaker verification systems capable of determining whether two anonymized speech signals are from the same speaker. However, differences between feature distributions of original and anonymized speech complicate this task. To address this challenge, we propose an attacker system that combines Data Augmentation enhanced feature representation and Speaker Identity Difference enhanced classifier to improve verification performance, termed DA-SID. Specifically, data augmentation strategies (i.e., data fusion and SpecAugment) are utilized to mitigate feature distribution gaps, while probabilistic linear discriminant analysis (PLDA) is employed to further enhance speaker identity difference. Our system significantly outperforms the baseline, demonstrating exceptional effectiveness and robustness against various voice anonymization systems, ultimately securing a top-5 ranking in the challenge.


[24] 2412.19071

Movable Intelligent Surface (MIS) for Wireless Communications: Architecture, Modeling, Algorithm, and Prototyping

Reconfigurable intelligent surfaces enhance wireless systems by reshaping propagation environments. However, dynamic metasurfaces (MSs) with numerous phase-shift elements incur undesired control and hardware costs. In contrast, static MSs (SMSs), configured with static phase shifts pre-designed for specific communication demands, offer a cost-effective alternative by eliminating element-wise tuning. Nevertheless, SMSs typically support a single beam pattern with limited flexibility. In this paper, we propose a novel Movable Intelligent Surface (MIS) technology that enables dynamic beamforming while maintaining static phase shifts. Specifically, we design a MIS architecture comprising two closely stacked transmissive MSs: a larger fixed-position MS 1 and a smaller movable MS 2. By differentially shifting MS 2's position relative to MS 1, the MIS synthesizes distinct beam patterns. Then, we model the interaction between MS 2 and MS 1 using binary selection matrices and padding vectors and formulate a new optimization problem that jointly designs the MIS phase shifts and selects shifting positions for worst-case signal-to-noise ratio maximization. This position selection, equal to beam pattern scheduling, offers a new degree of freedom for RIS-aided systems. To solve the intractable problem, we develop an efficient algorithm that handles unit-modulus and binary constraints and employs manifold optimization methods. Finally, extensive validation results are provided. We implement a MIS prototype and perform proof-of-concept experiments, demonstrating the MIS's ability to synthesize desired beam patterns that achieve one-dimensional beam steering. Numerical results show that by introducing MS 2 with a few elements, MIS effectively offers beamforming flexibility for significantly improved performance. We also draw insights into the optimal MIS configuration and element allocation strategy.


[25] 2412.19072

Robust Speech and Natural Language Processing Models for Depression Screening

Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We describe two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification show that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.


[26] 2412.19078

Graph-Enhanced Dual-Stream Feature Fusion with Pre-Trained Model for Acoustic Traffic Monitoring

Microphone array techniques are widely used in sound source localization and smart city acoustic-based traffic monitoring, but these applications face significant challenges due to the scarcity of labeled real-world traffic audio data and the complexity and diversity of application scenarios. The DCASE Challenge's Task 10 focuses on using multi-channel audio signals to count vehicles (cars or commercial vehicles) and identify their directions (left-to-right or vice versa). In this paper, we propose a graph-enhanced dual-stream feature fusion network (GEDF-Net) for acoustic traffic monitoring, which simultaneously considers vehicle type and direction to improve detection. We propose a graph-enhanced dual-stream feature fusion strategy which consists of a vehicle type feature extraction (VTFE) branch, a vehicle direction feature extraction (VDFE) branch, and a frame-level feature fusion module to combine the type and direction features for enhanced performance. A pre-trained model (PANNs) is used in the VTFE branch to mitigate data scarcity and enhance the type features, followed by a graph attention mechanism to exploit temporal relationships and highlight important audio events within these features. The frame-level fusion of direction and type features enables fine-grained feature representation, resulting in better detection performance. Experiments demonstrate the effectiveness of our proposed method. Our GEDF-Net submission achieved 1st place in the DCASE 2024 Challenge Task 10.


[27] 2412.19131

Synthetic Discrete Inertia

This letter demonstrates how synthetic inertia can be obtained with the control of flexible discrete devices to keep the power balance of power systems, even if the system does not include any synchronous generator or conventional grid-forming converter. The letter also discusses solutions to cycling issues, which can arise due to the interaction of uncoordinated discrete inertia controllers. The effectiveness, dynamic performance, and challenges of the proposed approach are validated through simulations using modified versions of the WSCC 9-bus test system and of the all-island Irish transmission system.


[28] 2412.19156

Advancements in Terahertz Antenna Design

A promising way to provide sufficient transmission capacity is to access transmission bands at higher carrier frequencies. This desire for higher carrier frequencies or more bandwidth has led researchers to take advantage of the terahertz (THz) spectrum. The large bandwidth available in the THz band makes high data rate transmission possible. Despite these advantages, the THz band suffers from large free-space path loss. In the development of THz communication systems, the antenna is the most significant component. The focus is especially on designing highly directive antennas because they enhance the performance of the overall system by compensating for the large path loss at THz and thus improving the signal-to-noise ratio. This chapter presents different types of THz antennas, including planar, reflectarray, horn, and lens antennas. Emphasis is placed on the latest trend of designing THz antennas using carbon-based materials, such as graphene and carbon nanotubes. The performance of these antennas is compared with that of traditional copper-based THz antennas by critically analyzing their properties. A brief discussion of THz power sources is included in this chapter for completeness. A comprehensive discussion of different fabrication techniques is provided to apprise the reader of the general fabrication processes of THz components.


[29] 2412.19161

Robust $H_{\infty}$ Position Controller for Steering Systems

This paper presents a robust position controller for electric power assisted steering and steer-by-wire force-feedback systems. A position controller is required in steering systems for haptic feedback control, advanced driver assistance systems and automated driving. However, the driver's physical arm impedance causes an inertial uncertainty during coupling. Consequently, a typical position controller, i.e., one based on a single variable, becomes less robust and suffers tracking performance loss. Therefore, a robust position controller is investigated. The proposed solution is based on the multi-variable concept such that the sensed driver torque signal is also included in the position controller. The subsequent solution is obtained by solving the LMI-$H_{\infty}$ optimization problem. As a result, the desired loop gain shape is achieved, i.e., large gain at low frequencies for performance and small gain at high frequencies for robustness. Finally, a frequency response comparison of different position controllers on real hardware is presented. Experiments and simulation results clearly illustrate the improvements in reference tracking and robustness with the proposed $H_\infty$ controller.


[30] 2412.19221

Interference-Robust Broadband Rapidly-Varying MIMO Communications: A Knowledge-Data Dual Driven Framework

A novel time-efficient framework is proposed for improving the robustness of a broadband multiple-input multiple-output (MIMO) system against unknown interference under rapidly-varying channels. A mean-squared error (MSE) minimization problem is formulated by optimizing the beamformers employed. Since the unknown interference statistics are the premise for solving the formulated problem, an interference statistics tracking (IST) module is first designed. The IST module exploits both the time- and spatial-domain correlations of the interference-plus-noise (IPN) covariance for the future predictions with data training. Compared to the conventional signal-free space sampling approach, the IST module can realize zero-pilot and low-latency estimation. Subsequently, an interference-resistant hybrid beamforming (IR-HBF) module is presented, which incorporates both the prior knowledge of the theoretical optimization method as well as the data-fed training. Taking advantage of the interpretable network structure, the IR-HBF module enables the simplified mapping from the interference statistics to the beamforming weights. The simulations are executed in high-mobility scenarios, where the numerical results unveil that: 1) the proposed IST module attains promising prediction accuracy compared to the conventional counterparts under different snapshot sampling errors; and 2) the proposed IR-HBF module achieves lower MSE with significantly reduced computational complexity.


[31] 2412.19248

Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features

Real-time speech enhancement (SE) is essential to online speech communication. Causal SE models use only the previous context, yet predicting future information, such as phoneme continuation, may help perform causal SE. The phonetic information is often represented by quantizing latent features of self-supervised learning (SSL) models. This work is the first to incorporate SSL features with causality into an SE model. The causal SSL features are encoded and combined with spectrogram features using feature-wise linear modulation to estimate a mask for enhancing the noisy input speech. Simultaneously, we quantize the causal SSL features using vector quantization to represent phonetic characteristics as semantic tokens. The model not only encodes SSL features but also predicts the future semantic tokens in multi-task learning (MTL). The experimental results on the VoiceBank + DEMAND dataset show that our proposed method achieves 2.88 in PESQ, especially with semantic prediction MTL, confirming that the semantic prediction plays an important role in causal SE.


[32] 2412.19259

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.


[33] 2412.19280

Parametrizations of All Stable Closed-loop Responses: From Theory to Neural Network Control Design

The complexity of modern control systems necessitates architectures that achieve high performance while ensuring robust stability, particularly for nonlinear systems. In this work, we tackle the challenge of designing optimal output-feedback controllers to boost the performance of $\ell_p$-stable discrete-time nonlinear systems while preserving closed-loop stability from external disturbances to input and output channels. Leveraging operator theory and neural network representations, we parametrize the achievable closed-loop maps for a given system and propose novel parametrizations of all $\ell_p$-stabilizing controllers, unifying frameworks such as nonlinear Youla and Internal Model Control. Contributing to a rapidly growing research line, our approach enables unconstrained optimization exclusively over stabilizing output-feedback controllers and provides sufficient conditions to ensure robustness against model mismatch. Additionally, our methods reveal that stronger notions of stability can be imposed on the closed-loop maps if disturbance realizations are available after one time step. Last, our approaches are compatible with the design of nonlinear distributed controllers. Numerical experiments on cooperative robotics demonstrate the flexibility of our framework, allowing cost functions to be freely designed for achieving complex behaviors while preserving stability.


[34] 2412.19315

Towards a Single ASR Model That Generalizes to Disordered Speech

This study investigates the impact of integrating a dataset of disordered speech recordings ($\sim$1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. Contrary to what one might expect, despite the data being less than 1% of the training data of the ASR system, we find a considerable improvement in disordered speech recognition accuracy. Specifically, we observe a 33% improvement on prompted speech, and a 26% improvement on a newly gathered spontaneous, conversational dataset of disordered speech. Importantly, there is no significant performance decline on standard speech recognition benchmarks. Further, we observe that the proposed tuning strategy helps close the gap between the baseline system and personalized models by 64%, highlighting both the significant progress and the remaining room for improvement. Given the substantial benefits of our findings, this experiment suggests that from a fairness perspective, incorporating a small fraction of high quality disordered speech data in a training recipe is an easy step that could be done to make speech technology more accessible for users with speech disabilities.


[35] 2412.19345

Advanced Scheduling of Electrolyzer Modules for Grid Flexibility

As the transition to sustainable power generation progresses, green hydrogen production via electrolysis is expected to gain importance as a means for energy storage and flexible load to complement variable renewable generation. With the increasing need for cost-effective and efficient hydrogen production, electrolyzer optimization is essential to improve both energy efficiency and profitability. This paper analyzes how the efficiency and modular setup of alkaline hydrogen electrolyzers can improve hydrogen output of systems linked to a fluctuating renewable power supply. To explore this, we propose a day-ahead optimal scheduling problem of a hybrid wind and electrolyzer system. The novelty of our approach lies in modeling the number and capacity of electrolyzer modules, and capturing the modules' impact on the hydrogen production and efficiency. We solve the resulting mixed-integer optimization problem with several different combinations of number of modules, efficiency and operating range parameters, using day-ahead market data from a wind farm generator in the ERCOT system as an input. Our results demonstrate that the proposed approach ensures that electrolyzer owners can better optimize the operation of their systems, achieving greater hydrogen production and higher revenue. Key findings include that as the number of modules in a system with the same overall capacity increases, hydrogen production and revenue increase.


[36] 2412.19362

Evaluating Convolutional Neural Networks for COVID-19 classification in chest X-ray images

The Coronavirus Disease 2019 (COVID-19) pandemic spread rapidly across the globe, impacting the lives of billions of people. Effective screening and treatment of infected patients is a critical step in combating COVID-19 and preventing the disease from spreading quickly. The need for automated and scalable methods has increased due to the unavailability of accurate automated toolkits. Recent research using chest X-ray images suggests that they contain relevant information about the COVID-19 virus. Hence, applying machine learning techniques combined with radiological imaging promises to identify this disease accurately. These images are straightforward to collect, since they are widely shared and analyzed around the world. This paper presents a method for automatic COVID-19 detection using chest X-ray images through four convolutional neural networks, namely AlexNet, VGG-11, SqueezeNet, and DenseNet-121. This method provides accurate diagnostics for positive or negative COVID-19 classification. We validate our experiments using a ten-fold cross-validation procedure over the training and test sets. Our findings include shallow fine-tuning and data augmentation strategies that can help deal with the low number of positive COVID-19 images publicly available. The accuracy for all CNNs is higher than 97.00%, and the SqueezeNet model achieved the best result with 99.20%.


[37] 2412.19365

Neuromorphic Dual-channel Encoding of Luminance and Contrast

There is perceptual and physiological evidence that the retina registers and signals luminance and luminance contrast using dual-channel mechanisms. This process begins in the retina, wherein the luminance of a uniform zone and differentials of luminance in neighboring zones determine the degree of brightness or darkness of the zones. The neurons that process the information can be classified as "bright" or "dark" channels. The present paper provides an overview of these retinal mechanisms along with evidence that they provide brightness judgments that are log-linear across roughly seven orders of magnitude.


[38] 2412.19374

A Review of Resilience Enhancement Measures for Hydrogen-penetrated Multi-energy Systems

Energy supply for the electricity and heat sectors accounted for more than 40% of global carbon emissions in 2023, which puts great pressure on achieving net-zero carbon emission targets in the future. Against this background, hydrogen-penetrated multi-energy systems (HMESs) have received wide attention due to their potential low-carbon attribute. However, HMESs still face the challenge of how to survive and quickly recover from extreme and unexpected events (e.g., natural disasters, extreme weather, and cyber-physical attacks). To enable this resilience attribute, many studies on HMES resilience enhancement have been carried out. However, a systematic overview of different resilience enhancement measures for HMESs is still lacking. To fill this research gap, this paper provides a comprehensive overview of resilience enhancement strategies for HMESs from the perspective of hydrogen-related planning and operation. To be specific, we propose a comprehensive resilience enhancement framework for HMESs. Under the proposed framework, the widely used resilience metrics and event-oriented contingency models in existing works are summarized. Then, we classify the hydrogen-related planning measures for HMES resilience enhancement according to the type of hydrogen-related facilities and provide some insights into the formulation of planning problems. Moreover, we categorize the hydrogen-related operation measures for HMES resilience enhancement according to the three kinds of operation response stages involved, including preventive response, emergency response, and restoration response. Finally, we identify some research gaps and point out possible future directions.


[39] 2412.19382

Preventive Energy Management for Distribution Systems Under Uncertain Events: A Deep Reinforcement Learning Approach

As power systems become more complex with the continuous integration of intelligent distributed energy resources (DERs), new risks and uncertainties arise. Consequently, to enhance system resiliency, it is essential to account for various uncertain events when implementing the optimization problem for the energy management system (EMS). This paper presents a preventive EMS considering the probability of failure (PoF) of each system component across different scenarios. A conditional-value-at-risk (CVaR)-based framework is proposed to integrate the uncertainties of the distribution network. Loads are classified into critical, semi-critical, and non-critical categories to prioritize essential loads during generation resource shortages. A proximal policy optimization (PPO)-based reinforcement learning (RL) agent is used to solve the formulated problem and generate the control decisions. The proposed framework is evaluated on a notional MVDC ship system and a modified IEEE 30-bus system, where the results demonstrate that the PPO agent can successfully optimize the objective function while maintaining the network and operational constraints. For validation, the RL-based method is benchmarked against a traditional optimization approach, further highlighting its effectiveness and robustness. This comparison shows that RL agents can offer more resiliency against future uncertain events compared to the traditional solution methods due to their adaptability and learning capacity.
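
To make the risk measure named above concrete, the sketch below shows one common empirical estimator of conditional value-at-risk: the average of the worst (1 - alpha) fraction of sampled losses. It is a generic illustration under stated assumptions, not the paper's CVaR-based framework; the function name `empirical_cvar` and the scenario losses are hypothetical.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of losses (larger loss = worse).

    This is one simple sample-based estimator; other conventions interpolate
    at the quantile boundary and can give slightly different values.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")

    # Round before ceil so that floating-point noise such as 3.0000000000000004
    # does not push the tail size up by one.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:tail_size]
    return sum(worst) / tail_size


# Hypothetical per-scenario losses (e.g. unserved load), for illustration only.
scenario_losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
cvar_80 = empirical_cvar(scenario_losses, alpha=0.8)
print(cvar_80)
```

With alpha = 0 the estimator reduces to the plain mean, so raising alpha shifts the objective from average-case toward worst-case scenarios.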


[40] 2412.19399

Online distributed algorithms for mixed equilibrium problems in dynamic environments

In this paper, the mixed equilibrium problem with coupled inequality constraints in dynamic environments is solved by employing a multi-agent system, where each agent only has access to its own bifunction and its own constraint function, and can only communicate with its immediate neighbors via a time-varying digraph. At each time, the goal of agents is to cooperatively find a point in the constraint set such that the sum of local bifunctions with a free variable is non-negative. Different from existing works, here the bifunctions and the constraint functions are time-varying and only available to agents after decisions are made. To tackle this problem, first, an online distributed algorithm involving accurate gradient information is proposed based on mirror descent algorithms and primal-dual strategies. Of particular interest is that dynamic regrets, whose offline benchmarks are to find the solution at each time, are employed to measure the performance of the algorithm. Under mild assumptions on the graph and the bifunctions, we prove that if the deviation in the solution sequence grows within a certain rate, then both the dynamic regret and the violation of coupled inequality constraints increase sublinearly. Second, considering the case where each agent only has access to a noisy estimate of the accurate gradient, we propose an online distributed algorithm involving stochastic gradients. The result shows that under the same conditions as in the first case, if the noise distribution satisfies the sub-Gaussian condition, then dynamic regrets, as well as constraint violations, increase sublinearly with high probability. Finally, several simulation examples are presented to corroborate the validity of our results.


[41] 2412.19401

Joint Optimization of Multimodal Transit Frequency and Shared Autonomous Vehicle Fleet Size with Hybrid Metaheuristic and Nonlinear Programming

This paper presents an optimization framework for the joint multimodal transit frequency and shared autonomous vehicle (SAV) fleet size optimization, a problem variant of the transit network frequency setting problem (TNFSP) that explicitly considers mode choice behavior and route selection. To address the non-linear non-convex optimization problem, we develop a hybrid solution approach that combines metaheuristics (particle swarm optimization, PSO) with local nonlinear programming (NLP) improvement, incorporating approximation models for SAV waiting time, multimodal route choice, and mode choice. Applied to the Chicago metropolitan area, our method achieves a 33.3% increase in transit ridership.


[42] 2412.19471

Meta-Learning-Based Delayless Subband Adaptive Filter using Complex Self-Attention for Active Noise Control

Active noise control typically employs adaptive filtering to generate secondary noise, where the least mean square algorithm is the most widely used. However, traditional updating rules are linear and exhibit limited effectiveness in addressing nonlinear environments and nonstationary noise. To tackle this challenge, we reformulate the active noise control problem as a meta-learning problem and propose a meta-learning-based delayless subband adaptive filter with deep neural networks. The core idea is to utilize a neural network as an adaptive algorithm that can adapt to different environments and types of noise. The neural network is trained on noisy observations, so it learns the optimized updating rule without true labels. A single-headed attention recurrent neural network is devised with learnable feature embedding to update the adaptive filter weights efficiently, enabling accurate computation of the secondary source to attenuate the unwanted primary noise. In order to relax the time constraint on updating the adaptive filter weights, the delayless subband architecture is employed, which allows the system to be updated less frequently as the downsampling factor increases. In addition, the delayless subband architecture does not introduce additional time delays in active noise control systems. A skip updating strategy is introduced to decrease the updating frequency further so that machines with limited resources are better able to run our meta-learning-based model. Extensive multi-condition training ensures generalization and robustness against various types of noise and environments. Simulation results demonstrate that our meta-learning-based model achieves superior noise reduction performance compared to traditional methods.
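
For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of the plain least mean square (LMS) update, applied to a noise-free system identification toy problem. It is not the paper's meta-learned update rule and not a full active noise control loop (which would normally also model the secondary path); the names `lms_identify`, `true_weights`, and the step size are assumptions made for illustration.

```python
import random


def lms_identify(inputs, desired, num_taps, step_size):
    """Plain LMS: w <- w + step_size * e[n] * x[n], with e[n] = d[n] - w.x[n]."""
    weights = [0.0] * num_taps
    window = [0.0] * num_taps  # most recent input sample first
    errors = []
    for x_n, d_n in zip(inputs, desired):
        window = [x_n] + window[:-1]
        y_n = sum(w * x for w, x in zip(weights, window))
        e_n = d_n - y_n
        weights = [w + step_size * e_n * x for w, x in zip(weights, window)]
        errors.append(e_n)
    return weights, errors


# Hypothetical 3-tap FIR path to be identified, for illustration only.
true_weights = [0.5, -0.3, 0.1]

rng = random.Random(0)
input_signal = [rng.uniform(-1.0, 1.0) for _ in range(5000)]

# Desired signal: the input filtered through the unknown path (no added noise).
desired_signal = []
window = [0.0] * len(true_weights)
for sample in input_signal:
    window = [sample] + window[:-1]
    desired_signal.append(sum(w * x for w, x in zip(true_weights, window)))

estimated_weights, error_history = lms_identify(
    input_signal, desired_signal, num_taps=3, step_size=0.05
)
print(estimated_weights)
```

The update is linear in the error and the input window, which is the property the abstract identifies as limiting in nonlinear or nonstationary settings.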


[43] 2412.19475

Exploiting Dynamic Sparsity for Near-Field Spatial Non-Stationary XL-MIMO Channel Tracking

This work considers a spatial non-stationary channel tracking problem in broadband extremely large-scale multiple-input-multiple-output (XL-MIMO) systems. In the case of spatial non-stationarity, each scatterer has a certain visibility region (VR) over antennas and power change may occur among visible antennas. Concentrating on the temporal correlation of XL-MIMO channels, we design a three-layer Markov prior model and a hierarchical two-dimensional (2D) Markov model to exploit the dynamic sparsity of sparse channel vectors and VRs, respectively. Then, we formulate the channel tracking problem as a bilinear measurement process, and a novel dynamic alternating maximum a posteriori (DA-MAP) framework is developed to solve the problem. The DA-MAP contains four basic modules: a channel estimation module, a VR detection module, a grid update module, and a temporal correlation module. Specifically, the first module is an inverse-free variational Bayesian inference (IF-VBI) estimator that avoids a computationally intensive matrix inversion in each iteration; the second module is a turbo compressive sensing (Turbo-CS) algorithm that only needs small-scale matrix operations in a parallel fashion; the third module refines the polar-delay domain grid; and the fourth module processes the temporal prior information to ensure high-efficiency channel tracking. Simulations show that the proposed method achieves significant channel tracking performance with low computational overhead.


[44] 2412.19497

Multi-Condition Fault Diagnosis of Dynamic Systems: A Survey, Insights, and Prospects

With the increasing complexity of industrial production systems, accurate fault diagnosis is essential to ensure safe and efficient system operation. However, due to changes in production demands, dynamic process adjustments, and complex external environmental disturbances, multiple operating conditions frequently arise during production. The multi-condition characteristics pose significant challenges to traditional fault diagnosis methods. In this context, multi-condition fault diagnosis has gradually become a key area of research, attracting extensive attention from both academia and industry. This paper aims to provide a systematic and comprehensive review of existing research in the field. Firstly, the mathematical definition of the problem is presented, followed by an overview of the current research status. Subsequently, the existing literature is reviewed and categorized from the perspectives of single-model and multi-model approaches. In addition, standard evaluation metrics and typical real-world application scenarios are summarized and analyzed. Finally, the key challenges and prospects in the field are thoroughly discussed.


[45] 2412.19527

Real-time Reflectance Generation for UAV Multispectral Imagery using an Onboard Downwelling Spectrometer in Varied Weather Conditions

Advancements in unmanned aerial vehicle (UAV) remote sensing with spectral imaging enable efficient assessment of critical agronomic traits. However, existing reflectance calibration or generation methods suffer from limited prediction accuracy and practical flexibility. This study explores reliable and cost-efficient methods for the accurate conversion of digital number values acquired from a multispectral imager into reflectance, leveraging real-time solar spectra as references. To ensure consistent measurements of incident light, an upward gimbal-mounted downwelling spectrometer was attached to the UAV, and a sinusoidal model was developed to correct for solar position variability. Using principal component analysis on the reference solar spectrum for band selection, a multiple linear regression model with four sensitive bands (4-Band MLR) and a 30 nm bandwidth achieved performance comparable to the direct correction method. The root mean square error (RMSE) for reflectance prediction improved by 86.1% compared to the empirical line method under fluctuating cloudy conditions and by 59.6% compared to the downwelling light sensor method averaged across different weather conditions. The RMSE was calculated as 2.24% in a ground-based diurnal validation, and 2.03% in a UAV campaign conducted at various times throughout a sunny day. Implementing the 4-Band MLR model enhanced the consistency of canopy reflectance within a homogeneous vegetation area by 95.0% during spectral imaging in a large rice field under significant cloud fluctuations. Additionally, improvements of 86.0% and 90.3% were noted for two vegetation indices: the normalized difference vegetation index (NDVI; a ratio index) and the difference vegetation index (DVI; a non-ratio index), respectively.
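
The distinction drawn above between a ratio index and a non-ratio index can be illustrated with the standard definitions NDVI = (NIR - Red) / (NIR + Red) and DVI = NIR - Red. The sketch below uses hypothetical reflectance values and a hypothetical common scaling factor; it does not reproduce the paper's data or calibration model.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total


def dvi(nir: float, red: float) -> float:
    """Difference vegetation index: NIR - Red."""
    return nir - red


# Hypothetical band reflectances, chosen only for illustration.
nir_reflectance = 0.5
red_reflectance = 0.25

# A common multiplicative error in both bands (for example, an uncorrected
# change in incident light) cancels in the ratio index but not in the
# difference index.
scale = 2.0
ndvi_reference = ndvi(nir_reflectance, red_reflectance)
ndvi_scaled = ndvi(scale * nir_reflectance, scale * red_reflectance)
dvi_reference = dvi(nir_reflectance, red_reflectance)
dvi_scaled = dvi(scale * nir_reflectance, scale * red_reflectance)

print(ndvi_reference, ndvi_scaled)
print(dvi_reference, dvi_scaled)
```

Band-dependent errors do not cancel in this way, so a ratio index is not immune to illumination changes in general.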


[46] 2412.19601

Arbitrarily Fast Tracking Multivariable Least-squares MRAC

A novel least-squares model-reference direct adaptive control (LS MRAC) algorithm for multivariable (MIMO) plants is presented. The controller parameters are directly updated based on the output tracking error. The control law is crucially modified to reduce the relative degree of the error model to zero. A comprehensive Lyapunov-based stability analysis as well as a tracking error convergence characterization is provided demonstrating that the LS MRAC can achieve arbitrarily fast tracking while maintaining satisfactory parameter convergence for quite large adaptation gains. Simulation results show a significant improvement in tracking performance compared to previous methods.


[47] 2412.19656

Movable Antenna Aided Physical Layer Security with No Eavesdropper CSI

A novel movable antenna (MA)-aided secure transmission framework is proposed to enhance the secrecy transmission rate without relying on the eavesdropper's channel state information. Within this framework, a joint beamforming and jamming scheme is proposed, where the power of the confidential signal is minimized by optimizing the positions of the MAs, and the residual power is used to jam the eavesdropper. An efficient gradient-based method is employed to solve this non-convex problem. Numerical results are provided to demonstrate the superiority of the MA-based framework over systems using traditional fixed-position antennas in secure transmission.


[48] 2412.19688

A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation

Artificial intelligence (AI) has emerged as a powerful tool to enhance decision-making and optimize treatment protocols in in vitro fertilization (IVF). In particular, AI shows significant promise in supporting decision-making during the ovarian stimulation phase of the IVF process. This review evaluates studies focused on the applications of AI combined with medical imaging in ovarian stimulation, examining methodologies, outcomes, and current limitations. Our analysis of 13 studies on this topic reveals that, while AI algorithms demonstrated notable potential in predicting optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the medical imaging data utilized predominantly came from two-dimensional (2D) ultrasound, which mainly involved basic quantifications, such as follicle size and number, with limited use of direct feature extraction or advanced image analysis techniques. This points to an underexplored opportunity where advanced image analysis approaches, such as deep learning, and more diverse imaging modalities, like three-dimensional (3D) ultrasound, could unlock deeper insights. Additionally, the lack of explainable AI (XAI) in most studies raises concerns about the transparency and traceability of AI-driven decisions - key factors for clinical adoption and trust. Furthermore, many studies relied on single-center designs and small datasets, which limit the generalizability of their findings. This review highlights the need for integrating advanced imaging analysis techniques with explainable AI methodologies, as well as the importance of leveraging multicenter collaborations and larger datasets. Addressing these gaps has the potential to enhance ovarian stimulation management, paving the way for efficient, personalized, and data-driven treatment pathways that improve IVF outcomes.


[49] 2412.19713

ProKAN: Progressive Stacking of Kolmogorov-Arnold Networks for Efficient Liver Segmentation

The growing need for accurate and efficient 3D identification of tumors, particularly in liver segmentation, has spurred considerable research into deep learning models. While many existing architectures offer strong performance, they often face challenges such as overfitting and excessive computational costs. An adjustable and flexible architecture that strikes a balance between time efficiency and model complexity remains an unmet requirement. In this paper, we introduce proKAN, a progressive stacking methodology for Kolmogorov-Arnold Networks (KANs) designed to address these challenges. Unlike traditional architectures, proKAN dynamically adjusts its complexity by progressively adding KAN blocks during training, based on overfitting behavior. This approach allows the network to stop growing when overfitting is detected, preventing unnecessary computational overhead while maintaining high accuracy. Additionally, proKAN utilizes KAN's learnable activation functions modeled through B-splines, which provide enhanced flexibility in learning complex relationships in 3D medical data. Our proposed architecture achieves state-of-the-art performance in liver segmentation tasks, outperforming standard Multi-Layer Perceptrons (MLPs) and fixed KAN architectures. The dynamic nature of proKAN ensures efficient training times and high accuracy without the risk of overfitting. Furthermore, proKAN provides better interpretability by allowing insight into the decision-making process through its learnable coefficients. The experimental results demonstrate a significant improvement in accuracy, Dice score, and time efficiency, making proKAN a compelling solution for 3D medical image segmentation tasks.


[50] 2412.19719

Trading Off Energy Storage and Payload -- An Analytical Model for Freight Train Configuration

To support planning of alternative fuel technology (e.g., battery-electric locomotives) deployment for decarbonizing non-electrified freight rail, we develop a convex optimization formulation with a closed-form solution to determine the optimal number of energy storage tender cars in a train. The formulation shares a similar structure to an Economic Order Quantity (EOQ) model. For given market characteristics, cost forecasts, and technology parameters, our model captures the trade-offs between inventory carrying costs associated with trip times (including delays due to charging/refueling) and ordering costs associated with train dispatch and operation (energy, amortized equipment, and labor costs). To illustrate the framework, we find the optimal number of battery-electric energy tender cars in 22,501 freight markets (origin-destination pairs and commodities) for U.S. Class I railroads. The results display heterogeneity in optimal configurations with lighter, yet more time-sensitive shipments (e.g., intermodal) utilizing more battery tender cars. For heavier commodities (e.g., coal) with lower holding costs, single battery tender car configurations are generally optimal. The results also show that the optimal train configurations are sensitive to delays associated with recharging or swapping tender cars.
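The abstract likens the formulation to an Economic Order Quantity (EOQ) model. As a point of reference only, the sketch below implements the textbook EOQ closed form, which balances a per-order cost against a per-unit holding cost. It is not the paper's tender-car model: the abstract does not give that formulation, and the names `demand_rate`, `order_cost`, and `holding_cost` are illustrative assumptions.

```python
import math


def eoq(demand_rate: float, order_cost: float, holding_cost: float) -> float:
    """Textbook EOQ: the q that minimizes order_cost*demand_rate/q + holding_cost*q/2.

    Illustrative only; how these terms map to tender cars is not stated in the abstract.
    """
    if min(demand_rate, order_cost, holding_cost) <= 0:
        raise ValueError("all parameters must be positive")
    return math.sqrt(2.0 * demand_rate * order_cost / holding_cost)


def total_cost(q: float, demand_rate: float, order_cost: float, holding_cost: float) -> float:
    """Ordering cost plus average holding cost for order quantity q."""
    return order_cost * demand_rate / q + holding_cost * q / 2.0


if __name__ == "__main__":
    q_star = eoq(demand_rate=100, order_cost=8, holding_cost=4)
    print(q_star, total_cost(q_star, 100, 8, 4))
```

The cost function is convex in `q`, so the stationary point is the global minimum. That convexity is the structural property the abstract relies on for its closed-form solution.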


[51] 2412.19763

Multi-population Differential Evolution for RSS based Cooperative Localization in Wireless Sensor Networks with Limited Communication Range

This paper presents a novel approach to deal with the cooperative localization problem in wireless sensor networks based on received signal strength measurements. In cooperative scenarios, the cost function of the localization problem becomes increasingly nonlinear and nonconvex due to the heightened interaction between sensor nodes, making the estimation of the positions of the target nodes more challenging. Although most existing cooperative localization algorithms assure acceptable localization accuracy, their computational complexity increases dramatically, which may restrict their applicability. To reduce the computational complexity and provide competitive localization accuracy at the same time, we propose a localization algorithm based on differential evolution with multiple populations, opposite-based learning, redirection, and anchoring. In this work, the cooperative localization cost function is split into several simpler cost functions, each of which accounts only for one individual target node. Then, each cost function is solved by a dedicated population of the proposed algorithm. In addition, an enhanced version of the proposed algorithm which incorporates the population midpoint scheme for further improvement in the localization accuracy is devised. Simulation results demonstrate that the proposed algorithms provide comparable localization accuracy with much lower computational complexity compared with the state-of-the-art algorithms.


[52] 2412.18619

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming the multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: Multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction


[53] 2412.18710

Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis

Generating sound effects with controllable variations is a challenging task, traditionally addressed using sophisticated physical models that require in-depth knowledge of signal processing parameters and algorithms. In the era of generative and large language models, text has emerged as a common, human-interpretable interface for controlling sound synthesis. However, the discrete and qualitative nature of language tokens makes it difficult to capture subtle timbral variations across different sounds. In this research, we propose a novel similarity-based conditioning method for sound synthesis, leveraging differentiable digital signal processing (DDSP). This approach combines the use of latent space for learning and controlling audio timbre with an intuitive guiding vector, normalized within the range [0,1], to encode categorical acoustic information. By utilizing pre-trained audio representation models, our method achieves expressive and fine-grained timbre control. To benchmark our approach, we introduce two sound effect datasets--Footstep-set and Impact-set--designed to evaluate both controllability and sound quality. Regression analysis demonstrates that the proposed similarity score effectively controls timbre variations and enables creative applications such as timbre interpolation between discrete classes. Our work provides a robust and versatile framework for sound effect synthesis, bridging the gap between traditional signal processing and modern machine learning techniques.


[54] 2412.18727

SAFLITE: Fuzzing Autonomous Systems via Large Language Models

Fuzz testing effectively uncovers software vulnerabilities; however, it faces challenges with Autonomous Systems (AS) due to their vast search spaces and complex state spaces, which reflect the unpredictability and complexity of real-world environments. This paper presents a universal framework aimed at improving the efficiency of fuzz testing for AS. At its core is SaFliTe, a predictive component that evaluates whether a test case meets predefined safety criteria. By leveraging the large language model (LLM) with information about the test objective and the AS state, SaFliTe assesses the relevance of each test case. We evaluated SaFliTe by instantiating it with various LLMs, including GPT-3.5, Mistral-7B, and Llama2-7B, and integrating it into four fuzz testing tools: PGFuzz, DeepHyperion-UAV, CAMBA, and TUMB. These tools are designed specifically for testing autonomous drone control systems, such as ArduPilot, PX4, and PX4-Avoidance. The experimental results demonstrate that, compared to PGFuzz, SaFliTe increased the likelihood of selecting operations that triggered bug occurrences in each fuzzing iteration by an average of 93.1%. Additionally, after integrating SaFliTe, the ability of DeepHyperion-UAV, CAMBA, and TUMB to generate test cases that caused system violations increased by 234.5%, 33.3%, and 17.8%, respectively. The benchmark for this evaluation was sourced from a UAV Testing Competition.


[55] 2412.18733

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Conversational Speech Synthesis (CSS) aims to effectively use the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.


[56] 2412.18748

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.


[57] 2412.18771

RIS-Assisted MIMO CV-QKD at THz Frequencies: Channel Estimation and SKR Analysis

In this paper, a multiple-input multiple-output (MIMO) wireless system incorporating a reconfigurable intelligent surface (RIS) to efficiently operate at terahertz (THz) frequencies is considered. The transmitter, Alice, employs continuous-variable quantum key distribution (CV-QKD) to communicate secret keys to the receiver, Bob, which utilizes either homodyne or heterodyne detection. The latter node applies the least-squares approach to estimate the effective MIMO channel gain matrix prior to receiving the secret key, and this estimate is made available to Alice via an error-free feedback channel. An eavesdropper, Eve, is assumed to employ a collective Gaussian entanglement attack on the feedback channel to obtain the estimated channel state information. We present a novel closed-form expression for the secret key rate (SKR) performance of the proposed RIS-assisted THz CV-QKD system. The effect of various system parameters, such as the number of RIS elements and their phase configurations, the channel estimation error, and the detector noise, on the SKR performance is studied via numerical evaluation of the derived formula. It is demonstrated that the RIS contributes to larger SKR for larger link distances, and that heterodyne detection is preferable over homodyne at lower pilot symbol powers.


[58] 2412.18812

A Tractable Approach for Queueing Analysis on Buffer-Aware Scheduling

Low-latency communication has recently attracted considerable attention owing to its potential of enabling delay-sensitive services in next-generation industrial cyber-physical systems. To achieve target average or maximum delay given random arrivals and time-varying channels, buffer-aware scheduling is expected to play a vital role. Evaluating and optimizing buffer-aware scheduling relies on its queueing analysis, while existing tools are not sufficiently tractable. Particularly, Markov chain and Monte-Carlo based approaches are computationally intensive, while large deviation theory (LDT) and extreme value theory (EVT) fail to provide satisfactory accuracy in the small-queue-length (SQL) regime. To tackle these challenges, a tractable yet accurate queueing analysis is presented by judiciously bridging Markovian analysis for the computationally manageable SQL regime and LDT/EVT for the large-queue-length (LQL) regime, where approximation error diminishes asymptotically. Specifically, we leverage censored Markov chain augmentation to approximate the original one in the SQL regime, while a piecewise approach is conceived to apply LDT/EVT across various queue-length intervals with different scheduling parameters. Furthermore, we derive closed-form bounds on approximation errors, validating the rigor and accuracy of our approach. As a case study, the approach is applied to analyze a Lyapunov-drift-based cross-layer scheduling for wireless transmissions. Numerical results demonstrate its potential in balancing accuracy and complexity.


[59] 2412.18817

Wireless Communication with Flexible Reflector: Joint Placement and Rotation Optimization for Coverage Enhancement

Passive metal reflectors for communication enhancement have appealing advantages such as ultra low cost, zero energy expenditure, maintenance-free operation, long life span, and full compatibility with legacy wireless systems. To unleash the full potential of passive reflectors for wireless communications, this paper proposes a new passive reflector architecture, termed flexible reflector (FR), for enabling the flexible adjustment of beamforming direction via the FR placement and rotation optimization. We consider the multi-FR aided area coverage enhancement and aim to maximize the minimum expected receive power over all locations within the target coverage area, by jointly optimizing the placement positions and rotation angles of multiple FRs. To gain useful insights, the special case of movable reflector (MR) with fixed rotation is first studied to maximize the expected receive power at a target location, where the optimal single-MR placement positions for electrically large and small reflectors are derived in closed form, respectively. It is shown that an electrically large reflector should be placed at the specular reflection point. For area coverage enhancement, the optimal placement is obtained for the single-MR case and a sequential placement algorithm is proposed for the multi-MR case. Moreover, for the general case of FR, joint placement and rotation design is considered for the single-/multi-FR aided coverage enhancement, respectively. Numerical results are presented which demonstrate significant performance gains of FRs over various benchmark schemes under different practical setups in terms of receive power enhancement.


[60] 2412.18831

Data-driven $H_{\infty}$ predictive control for constrained systems: a Lagrange duality approach

This article proposes a data-driven $H_{\infty}$ control scheme for time-domain constrained systems based on model predictive control formulation. The scheme combines $H_{\infty}$ control and minimax model predictive control, enabling more effective handling of external disturbances and time-domain constraints. First, by leveraging input-output-disturbance data, the scheme ensures $H_{\infty}$ performance of the closed-loop system. Then, a minimax optimization problem is converted to a more manageable minimization problem employing Lagrange duality, which reduces conservatism typically associated with ellipsoidal evaluations of time-domain constraints. The study examines key closed-loop properties, including stability, disturbance attenuation, and constraint satisfaction, achieved by the proposed data-driven moving horizon predictive control algorithm. The effectiveness and advantages of the proposed method are demonstrated through numerical simulations involving a batch reactor system, confirming its robustness and feasibility under noisy conditions.


[61] 2412.18836

MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI

Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method's generalization ability to unseen speakers. We assess our framework's performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a 15.18% Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at https://mri2speech.github.io/MRI2Speech/
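The headline result above is reported as a Word Error Rate. For readers unfamiliar with the metric, the sketch below computes WER in its standard form: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The whitespace tokenization and the function name `word_error_rate` are assumptions for illustration; the abstract does not describe the authors' scoring pipeline or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes simple whitespace tokenization with no case or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.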


[62] 2412.18839

Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/


[63] 2412.18856

Digital Twin Enhanced Deep Reinforcement Learning for Intelligent Omni-Surface Configurations in MU-MIMO Systems

Intelligent omni-surface (IOS) is a promising technique to enhance the capacity of wireless networks, by reflecting and refracting the incident signal simultaneously. Traditional IOS configuration schemes, relying on all sub-channels' channel state information and user equipments' mobility, are difficult to implement in complex realistic systems. Existing works attempt to address this issue by employing deep reinforcement learning (DRL), but this method requires many trial-and-error interactions with the external environment to achieve efficient results and thus cannot satisfy real-time decision-making requirements. To enable model-free and real-time IOS control, this paper puts forth a new framework that integrates DRL and digital twins. DeepIOS, a DRL based IOS configuration scheme with the goal of maximizing the sum data rate, is first developed to jointly optimize the phase-shift and amplitude of IOS in multi-user multiple-input-multiple-output systems. Thereafter, to further reduce the computational complexity, DeepIOS introduces an action branch architecture, which separately decides two optimization variables in parallel. Finally, a digital twin module is constructed through supervised learning as a pre-verification platform for DeepIOS, such that real-time decision-making can be guaranteed. The formulated framework is a closed-loop system, in which the physical space provides data to establish and calibrate the digital space, while the digital space generates experience samples for DeepIOS training and sends the trained parameters to the IOS controller for configurations. Numerical results show that compared with random and MAB schemes, the proposed framework attains a higher data rate and is more robust to different settings. Furthermore, the action branch architecture reduces DeepIOS's computational complexity, and the digital twin module improves the convergence speed and run-time.


[64] 2412.18883

MotionMap: Representing Multimodality in Human Pose Forecasting

Human pose forecasting is inherently multimodal since multiple futures exist for an observed pose sequence. However, evaluating multimodality is challenging since the task is ill-posed. Therefore, we first propose an alternative paradigm to make the task well-posed. Next, while state-of-the-art methods predict multimodality, this requires oversampling a large volume of predictions. This raises key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. MotionMap can capture a variable number of modes per observation and provide confidence measures for different modes. Further, MotionMap allows us to introduce the notion of uncertainty and controllability over the forecasted pose sequence. Finally, MotionMap captures rare modes that are non-trivial to evaluate yet critical for safety. We support our claims through multiple qualitative and quantitative experiments using popular 3D human pose datasets: Human3.6M and AMASS, highlighting the strengths and limitations of our proposed method. Project Page: https://www.epfl.ch/labs/vita/research/prediction/motionmap/


[65] 2412.18913

Robust Target Speaker Direction of Arrival Estimation

In multi-speaker environments, the direction of arrival (DOA) of a target speaker is key for improving speech clarity and extracting the target speaker's voice. However, traditional DOA estimation methods often struggle in the presence of noise, reverberation, and particularly when competing speakers are present. To address these challenges, we propose RTS-DOA, a robust real-time DOA estimation system. This system innovatively uses the registered speech of the target speaker as a reference and leverages full-band and sub-band spectral information from a microphone array to estimate the DOA of the target speaker's voice. Specifically, the system comprises a speech enhancement module for initially improving speech quality, a spatial module for learning spatial information, and a speaker module for extracting voiceprint features. Experimental results on the LibriSpeech dataset demonstrate that our RTS-DOA system effectively tackles multi-speaker scenarios and establishes new optimal benchmarks.


[66] 2412.18933

TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment

Blind video quality assessment (BVQA) has been actively researched for user-generated content (UGC) videos. Recently, super-resolution (SR) techniques have been widely applied in UGC. Therefore, an effective BVQA method for both UGC and SR scenarios is essential. Temporal inconsistency, referring to irregularities between consecutive frames, is relevant to video quality. Current BVQA approaches typically model temporal relationships in UGC videos using statistics of motion information, but inconsistencies remain unexplored. Additionally, different from temporal inconsistency in UGC videos, such inconsistency in SR videos is amplified due to upscaling algorithms. In this paper, we introduce the Temporal Inconsistency Guided Blind Video Quality Assessment (TINQ) metric, demonstrating that exploring temporal inconsistency is crucial for effective BVQA. Since temporal inconsistencies vary between UGC and SR videos, they are calculated in different ways. Based on this, a spatial module highlights inconsistent areas across consecutive frames at coarse and fine granularities. In addition, a temporal module aggregates features over time in two stages. The first stage employs a visual memory capacity block to adaptively segment the time dimension based on estimated complexity, while the second stage focuses on selecting key features. The stages work together through Consistency-aware Fusion Units to regress cross-time-scale video quality. Extensive experiments on UGC and SR video quality datasets show that our method outperforms existing state-of-the-art BVQA methods. Code is available at https://github.com/Lighting-YXLI/TINQ.


[67] 2412.18955

Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations

Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation and the resulting learnt invariances pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave One EquiVariant (LOEV) framework, which introduces a flexible, task-adaptive approach compared to previous work by selectively preserving information about specific augmentations, allowing the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation-related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner, and enables targeted retrieval based on augmentation-related attributes.


[68] 2412.18994

Geospatial Data Fusion: Combining Lidar, SAR, and Optical Imagery with AI for Enhanced Urban Mapping

This study explores the integration of Lidar, Synthetic Aperture Radar (SAR), and optical imagery through advanced artificial intelligence techniques for enhanced urban mapping. By fusing these diverse geospatial datasets, we aim to overcome the limitations associated with single-sensor data, achieving a more comprehensive representation of urban environments. The research employs Fully Convolutional Networks (FCNs) as the primary deep learning model for urban feature extraction, enabling precise pixel-wise classification of essential urban elements, including buildings, roads, and vegetation. To optimize the performance of the FCN model, we utilize Particle Swarm Optimization (PSO) for hyperparameter tuning, significantly enhancing model accuracy. Key findings indicate that the FCN-PSO model achieved a pixel accuracy of 92.3% and a mean Intersection over Union (IoU) of 87.6%, surpassing traditional single-sensor approaches. These results underscore the potential of fused geospatial data and AI-driven methodologies in urban mapping, providing valuable insights for urban planning and management. The implications of this research pave the way for future developments in real-time mapping and adaptive urban infrastructure planning.
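The two figures quoted above, pixel accuracy and mean Intersection over Union, follow standard segmentation definitions. The sketch below shows how they are commonly computed from a predicted and a reference label map. The integer class labels, the nested-list input format, and the choice to skip classes absent from both maps when averaging are assumptions; the abstract does not state the authors' evaluation conventions.

```python
from typing import List

LabelMap = List[List[int]]  # rows of integer class labels (hypothetical format)


def pixel_accuracy(pred: LabelMap, target: LabelMap) -> float:
    """Fraction of pixels whose predicted label equals the reference label."""
    total = 0
    correct = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            correct += int(p == t)
    if total == 0:
        raise ValueError("label maps are empty")
    return correct / total


def mean_iou(pred: LabelMap, target: LabelMap, num_classes: int) -> float:
    """Average of per-class intersection/union, skipping classes absent from both maps."""
    ious = []
    for cls in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == cls
                in_target = t == cls
                intersection += int(in_pred and in_target)
                union += int(in_pred or in_target)
        if union > 0:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class appears in either label map")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    prediction = [[0, 0], [1, 1]]
    reference = [[0, 1], [1, 1]]
    print(pixel_accuracy(prediction, reference), mean_iou(prediction, reference, 3))
```

Mean IoU penalizes errors on small classes more heavily than pixel accuracy does, which is why the two numbers can differ noticeably on the same prediction.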


[69] 2412.19000

MGAN-CRCM: A Novel Multiple Generative Adversarial Network and Coarse-Refinement Based Cognizant Method for Image Inpainting

Image inpainting is a widely used technique in computer vision for reconstructing missing or damaged pixels in images. Recent advancements with Generative Adversarial Networks (GANs) have demonstrated superior performance over traditional methods due to their deep learning capabilities and adaptability across diverse image domains. Residual Networks (ResNet) have also gained prominence for their ability to enhance feature representation and compatibility with other architectures. This paper introduces a novel architecture combining GAN and ResNet models to improve image inpainting outcomes. Our framework integrates three components: Transpose Convolution-based GAN for guided and blind inpainting, Fast ResNet-Convolutional Neural Network (FR-CNN) for object removal, and Co-Modulation GAN (Co-Mod GAN) for refinement. The model's performance was evaluated on benchmark datasets, achieving accuracies of 96.59% on ImageNet, 96.70% on Places2, and 96.16% on CelebA. Comparative analyses demonstrate that the proposed architecture outperforms existing methods, highlighting its effectiveness in both qualitative and quantitative evaluations.


[70] 2412.19041

Revealing the Self: Brainwave-Based Human Trait Identification

People exhibit unique emotional responses. In the same scenario, the emotional reactions of two individuals can be either similar or vastly different. For instance, consider one person's reaction to an invitation to smoke versus another person's response to a query about their sleep quality. The identification of these individual traits through the observation of common physical parameters opens the door to a wide range of applications, including psychological analysis, criminology, disease prediction, addiction control, and more. While there has been previous research in the fields of psychometrics, inertial sensors, computer vision, and audio analysis, this paper introduces a novel technique for identifying human traits in real time using brainwave data. To achieve this, we begin with an extensive study of brainwave data collected from 80 participants using a portable EEG headset. We also conduct a statistical analysis of the collected data utilizing box plots. Our analysis uncovers several new insights, leading us to a groundbreaking unified approach for identifying diverse human traits by leveraging machine learning techniques on EEG data. Our analysis demonstrates that this proposed solution achieves high accuracy. Moreover, we explore two deep-learning models to compare the performance of our solution. Consequently, we have developed an integrated, real-time trait identification solution using EEG data, based on the insights from our analysis. To validate our approach, we conducted a rigorous user evaluation with an additional 20 participants. The outcomes of this evaluation illustrate both high accuracy and favorable user ratings, emphasizing the robust potential of our proposed method to serve as a versatile solution for human trait identification.


[71] 2412.19043

Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID

Multilingual text-to-speech systems convert text into speech across multiple languages. In many cases, text sentences may contain segments in different languages, a phenomenon known as code-switching. This is particularly common in Indonesia, especially between Indonesian and English. Despite its significance, no research has yet developed a multilingual TTS system capable of handling code-switching between these two languages. This study addresses Indonesian-English code-switching in STEN-TTS. Key modifications include adding a language identification component to the text-to-phoneme conversion using finetuned BERT for per-word language identification, as well as removing language embedding from the base model. Experimental results demonstrate that the code-switching model achieves superior naturalness and improved speech intelligibility compared to the Indonesian and English baseline STEN-TTS models.


[72] 2412.19099

BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

Although complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we propose a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we design a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.


[73] 2412.19106

ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Approximation-based spectral graph neural networks, which construct graph filters with function approximation, have shown substantial performance in graph learning tasks. Despite their great success, existing works primarily employ polynomial approximation to construct the filters, whereas another superior option, namely rational approximation, remains underexplored. Although a handful of prior works have attempted to deploy the rational approximation, their implementations often involve intensive computational demands or still resort to polynomial approximations, hindering the full potential of the rational graph filters. To address these issues, this paper introduces ERGNN, a novel spectral GNN with explicitly-optimized rational filters. ERGNN adopts a unique two-step framework that sequentially applies the numerator filter and the denominator filter to the input signals, thus streamlining the model paradigm while enabling explicit optimization of both the numerator and the denominator of the rational filter. Extensive experiments validate the superiority of ERGNN over state-of-the-art methods, establishing it as a practical solution for deploying rational-based GNNs.


[74] 2412.19110

A Selective Secure Precoding Framework for MU-MIMO Rate-Splitting Multiple Access Networks Under Limited CSIT

In this paper, we propose a robust and adaptable secure precoding framework designed to encapsulate an intricate scenario where legitimate users have different information security requirements: secure private or normal public information. Leveraging rate-splitting multiple access (RSMA), we formulate the sum secrecy spectral efficiency (SE) maximization problem in downlink multi-user multiple-input multiple-output (MIMO) systems with multiple eavesdroppers. To resolve the challenges including the heterogeneity of security, non-convexity, and non-smoothness of the problem, we initially approximate the problem using a LogSumExp technique. Subsequently, we derive the first-order optimality condition in the form of a generalized eigenvalue problem. We utilize a power iteration-based method to solve the condition, thereby achieving a superior local optimal solution. The proposed algorithm is further extended to a more realistic scenario involving limited channel state information at the transmitter (CSIT). To effectively utilize the limited channel information, we employ a conditional average rate approach. Handling the conditional average by deriving useful bounds, we establish a lower bound for the objective function under the conditional average. Then we apply a similar optimization method as for the perfect CSIT case. In simulations, we validate the proposed algorithm in terms of the sum secrecy SE.
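
As a rough illustration of the LogSumExp step mentioned above, the sketch below shows the generic smooth approximation of a maximum or minimum, which is the usual way a non-smooth min/max term is made differentiable. It is not the authors' formulation: the function names and the sharpness parameter `alpha` are assumptions introduced here.

```python
import math

def smooth_max(values, alpha=10.0):
    """LogSumExp approximation of max(values).

    Computes (1/alpha) * log(sum(exp(alpha * v))). The largest value is
    subtracted before exponentiating for numerical stability.
    `alpha` is a hypothetical sharpness parameter: larger means tighter.
    """
    m = max(values)
    total = sum(math.exp(alpha * (v - m)) for v in values)
    return m + math.log(total) / alpha

def smooth_min(values, alpha=10.0):
    """LogSumExp approximation of min(values), via min(x) = -max(-x)."""
    return -smooth_max([-v for v in values], alpha)

rates = [1.0, 2.0, 3.0]  # illustrative numbers only
print(smooth_max(rates, alpha=50.0), smooth_min(rates, alpha=50.0))
```

The approximation overestimates the true maximum by at most log(n)/alpha for n values (and underestimates the minimum by the same amount), so a larger `alpha` trades smoothness for accuracy.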


[75] 2412.19111

Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. Current studies rely on unsupervised modality transformations as well as inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle the limitations of the above approaches, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at https://github.com/1024AILab/ReID-SEPG.


[76] 2412.19123

CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Dance generation is crucial and challenging, particularly in domains like dance performance and virtual gaming. In the current body of literature, most methodologies focus on Solo Music2Dance. While there are efforts directed towards Group Music2Dance, these often suffer from a lack of coherence, resulting in aesthetically poor dance performances. Thus, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CoheDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code will be released.


[77] 2412.19134

Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification

Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, the existing methods lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to perform reliable modality-invariant feature learning. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, we design ECUL to naturally integrate intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters the neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge in terms of the clustering algorithm. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.


[78] 2412.19200

Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.


[79] 2412.19225

Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion

In this paper, we introduce the Selective Image Guided Network (SigNet), a novel degradation-aware framework that transforms depth completion into depth enhancement for the first time. Moving beyond direct completion using convolutional neural networks (CNNs), SigNet initially densifies sparse depth data through non-CNN densification tools to obtain coarse yet dense depth. This approach eliminates the mismatch and ambiguity caused by direct convolution over irregularly sampled sparse data. Subsequently, SigNet redefines completion as enhancement, establishing a self-supervised degradation bridge between the coarse depth and the targeted dense depth for effective RGB-D fusion. To achieve this, SigNet leverages the implicit degradation to adaptively select high-frequency components (e.g., edges) of RGB data to compensate for the coarse depth. This degradation is further integrated into a multi-modal conditional Mamba, dynamically generating the state parameters to enable efficient global high-frequency information interaction. We conduct extensive experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the state-of-the-art (SOTA) performance of SigNet.


[80] 2412.19238

FineVQ: Fine-Grained User Generated Content Video Quality Assessment

The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ will be made publicly available.


[81] 2412.19279

Improving Generalization for AI-Synthesized Voice Detection

AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited by predefined vocoders and sensitivity to factors like background noise and speaker identity. In this work, we introduce an innovative disentanglement framework aimed at extracting domain-agnostic artifact features related to vocoders. Utilizing these features, we enhance model learning in a flat loss landscape, enabling escape from suboptimal solutions and improving generalization. Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.
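
For readers unfamiliar with the equal error rate (EER) reported above, the following sketch shows one common way to estimate it from detector scores: sweep a decision threshold and take the point where the false acceptance and false rejection rates are closest. The function name, the score convention (higher means "more likely genuine"), and the sample scores are assumptions for illustration, not taken from the paper.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    A sample is accepted as genuine when its score is >= the threshold.
    FRR: fraction of genuine samples rejected.
    FAR: fraction of spoofed samples accepted.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap = float("inf")
    eer = None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer

# Hypothetical scores: one genuine and one spoofed sample overlap.
print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))
```

A lower EER indicates a better detector; with finite data the crossing point is only approximated, which is why the sketch averages the two error rates at the closest threshold.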


[82] 2412.19351

ETTA: Elucidating the Design Space of Text-to-Audio Models

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.


[83] 2412.19392

Asymptotically Optimal Search for a Change Point Anomaly under a Composite Hypothesis Model

We address the problem of searching for a change point in an anomalous process among a finite set of M processes. Specifically, we address a composite hypothesis model in which each process generates measurements following a common distribution with an unknown parameter (vector). This parameter belongs to either a normal or abnormal space depending on the current state of the process. Before the change point, all processes, including the anomalous one, are in a normal state; after the change point, the anomalous process transitions to an abnormal state. Our goal is to design a sequential search strategy that minimizes the Bayes risk by balancing sample complexity and detection accuracy. We propose a deterministic search algorithm with the following notable properties. First, we analytically demonstrate that when the distributions of both normal and abnormal processes are unknown, the algorithm is asymptotically optimal in minimizing the Bayes risk as the error probability approaches zero. In the second setting, where the parameter under the null hypothesis is known, the algorithm achieves asymptotic optimality with improved detection time based on the true normal state. Simulation results are presented to validate the theoretical findings.


[84] 2412.19459

A Prototype Unit for Image De-raining using Time-Lapse Data

We address the challenge of single-image de-raining, a task that involves recovering rain-free background information from a single rain image. While recent advancements have utilized real-world time-lapse data for training, enabling the estimation of consistent backgrounds and realistic rain streaks, these methods often suffer from high computational and memory consumption, limiting their applicability in real-world scenarios. In this paper, we introduce a novel solution: the Rain Streak Prototype Unit (RsPU). The RsPU efficiently encodes rain streak-relevant features as real-time prototypes derived from time-lapse data, eliminating the need for excessive memory resources. Our de-raining network combines encoder-decoder networks with the RsPU, allowing us to learn and encapsulate diverse rain streak-relevant features as concise prototypes, employing an attention-based approach. To ensure the effectiveness of our approach, we propose a feature prototype loss encompassing cohesion and divergence components. This loss function captures both the compactness and diversity aspects of the prototypical rain streak features within the RsPU. Our method is evaluated on various de-raining benchmarks, accompanied by comprehensive ablation studies. We show that it can achieve competitive results on various rain images compared to state-of-the-art methods.


[85] 2412.19470

Movable Antenna-Aided Near-Field Integrated Sensing and Communication

Integrated sensing and communication (ISAC) is emerging as a pivotal technology for next-generation wireless networks. However, existing ISAC systems are based on fixed-position antennas (FPAs), which inevitably incur a loss in performance when balancing the trade-off between sensing and communication. Movable antenna (MA) technology offers promising potential to enhance ISAC performance by enabling flexible antenna movement. Nevertheless, exploiting more spatial channel variations requires larger antenna moving regions, which may invalidate the conventional far-field assumption for channels between transceivers. Therefore, this paper utilizes the MA to enhance sensing and communication capabilities in near-field ISAC systems, where a full-duplex base station (BS) is equipped with multiple transmit and receive MAs movable in large-size regions to simultaneously sense multiple targets and serve multiple uplink (UL) and downlink (DL) users for communication. We aim to maximize the weighted sum of sensing and communication rates (WSR) by jointly designing the transmit beamformers, sensing signal covariance matrices, receive beamformers, and MA positions at the BS, as well as the UL power allocation. The resulting optimization problem is challenging to solve, and we propose an efficient two-layer random position (RP) algorithm to tackle it. In addition, to reduce movement delay and cost, we design an antenna position matching (APM) algorithm based on the greedy strategy to minimize the total MA movement distance. Extensive simulation results demonstrate the substantial performance improvement achieved by deploying MAs in near-field ISAC systems. Moreover, the results show the effectiveness of the proposed APM algorithm in reducing the antenna movement distance, which is helpful for energy saving and time overhead reduction for MA-aided near-field ISAC systems with large moving regions.


[86] 2412.19478

An Overview of Machine Learning-Driven Resource Allocation in IoT Networks

In the wake of disruptive IoT technologies generating massive amounts of diverse data, Machine Learning (ML) will play a crucial role in bringing intelligence to Internet of Things (IoT) networks. This paper provides a comprehensive analysis of the current state of resource allocation within IoT networks, focusing specifically on two key categories: Low-Power IoT Networks and Mobile IoT Networks. We delve into the resource allocation strategies that are crucial for optimizing network performance and energy efficiency in these environments. Furthermore, the paper explores the transformative role of Machine Learning (ML), Deep Learning (DL), and Reinforcement Learning (RL) in enhancing IoT functionalities. We highlight a range of applications and use cases where these advanced technologies can significantly improve decision-making and optimization processes. In addition to the opportunities presented by ML, DL, and RL, we also address the potential challenges that organizations may face when implementing these technologies in IoT settings. These challenges include crucial accuracy, low flexibility and adaptability, and high computational cost, etc. Finally, the paper identifies promising avenues for future research, emphasizing the need for innovative solutions to overcome existing hurdles and improve the integration of ML, DL, and RL into IoT networks. By providing this holistic perspective, we aim to contribute to the ongoing discourse on resource allocation strategies and the application of intelligent technologies in the IoT landscape.


[87] 2412.19479

Generative Adversarial Network on Motion-Blur Image Restoration

In everyday life, photographs taken with a camera often suffer from motion blur due to hand vibrations or sudden movements. This phenomenon can significantly detract from the quality of the images captured, making it an interesting challenge to develop a deep learning model that utilizes the principles of adversarial networks to restore clarity to these blurred pixels. In this project, we focus on leveraging Generative Adversarial Networks (GANs) to effectively deblur images affected by motion blur. A GAN-based TensorFlow model is defined, then trained and evaluated on the GoPro dataset, which comprises paired street view images featuring both clear and blurred versions. This adversarial training process between the Discriminator and the Generator helps to produce increasingly realistic images over time. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are the two evaluation metrics used to provide quantitative measures of image quality, allowing us to evaluate the effectiveness of the deblurring process. A mean PSNR of 29.1644 and a mean SSIM of 0.7459, with an average deblurring time of 4.6921 seconds, are achieved in this project. The blurry pixels are sharper in the output of the GAN model, which shows a good image restoration effect in real-world applications.
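
To make the PSNR metric above concrete, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), applied to two flattened pixel sequences. It is not the project's evaluation code; the function name, the 8-bit peak value of 255, and the toy pixel values are assumptions.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists.

    `max_value` is the peak pixel value (255 assumed for 8-bit images).
    Returns infinity when the two inputs are identical.
    """
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 10 grey levels, so the MSE is 100.
print(psnr([100, 100], [110, 90]))
```

Higher values mean the restored image is closer to the sharp reference; SSIM, the second metric, compares local structure rather than raw pixel error and is not covered by this sketch.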


[88] 2412.19494

Retrieval-augmented Generation for GenAI-enabled Semantic Communications

Semantic communication (SemCom) is an emerging paradigm aiming at transmitting only task-relevant semantic information to the receiver, which can significantly improve communication efficiency. Recent advancements in generative artificial intelligence (GenAI) have empowered GenAI-enabled SemCom (GenSemCom) to further expand its potential in various applications. However, current GenSemCom systems still face challenges such as semantic inconsistency, limited adaptability to diverse tasks and dynamic environments, and the inability to leverage insights from past transmission. Motivated by the success of retrieval-augmented generation (RAG) in the domain of GenAI, this paper explores the integration of RAG in GenSemCom systems. Specifically, we first provide a comprehensive review of existing GenSemCom systems and the fundamentals of RAG techniques. We then discuss how RAG can be integrated into GenSemCom. Following this, we conduct a case study on semantic image transmission using an RAG-enabled diffusion-based SemCom system, demonstrating the effectiveness of the proposed integration. Finally, we outline future directions for advancing RAG-enabled GenSemCom systems.


[89] 2412.19549

Performance Evaluation of IoT LoRa Networks on Mars Through ns-3 Simulations

In recent years, there has been a significant surge of interest in Mars exploration, driven by the planet's potential for human settlement and its proximity to Earth. In this paper, we explore the performance of the LoRaWAN technology on Mars, to study whether commercial off-the-shelf IoT products, designed and developed on Earth, can be deployed on the Martian surface. We use the ns-3 simulator to model various environmental conditions, primarily focusing on the Free Space Path Loss (FSPL) and the impact of Martian dust storms. Simulation results are given with respect to Earth, as a function of the distance, packet size, offered traffic, and the impact of Mars' atmospheric perturbations. We show that LoRaWAN can be a viable communication solution on Mars, although the performance is heavily affected by the extreme Martian environment over long distances.
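
The Free Space Path Loss term mentioned above follows the standard formula FSPL = 20 log10(4 pi d f / c). The sketch below evaluates it in plain Python rather than ns-3; the function name and the 868 MHz carrier (a common LoRa band on Earth) are assumptions, since the abstract does not state the frequency used.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def fspl_db(distance_m, frequency_hz):
    """Free Space Path Loss in dB for a link of `distance_m` metres."""
    if distance_m <= 0 or frequency_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    return 20.0 * math.log10(4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT)

# Assumed carrier frequency of 868 MHz, evaluated at a few distances.
for d in (1_000, 2_000, 10_000):
    print(d, round(fspl_db(d, 868e6), 2))
```

Each doubling of distance adds about 6 dB of loss. Atmospheric effects such as dust storms would be modelled as additional attenuation on top of this term.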


[90] 2412.19552

Contrast-Optimized Basis Functions for Self-Navigated Motion Correction in Quantitative MRI

Purpose: The long scan times of quantitative MRI techniques make motion artifacts more likely. For MR-Fingerprinting-like approaches, this problem can be addressed with self-navigated retrospective motion correction based on reconstructions in a singular value decomposition (SVD) subspace. However, the SVD promotes high signal intensity in all tissues, which limits the contrast between tissue types and ultimately reduces the accuracy of registration. The purpose of this paper is to rotate the subspace for maximum contrast between two types of tissue and improve the accuracy of motion estimates. Methods: A subspace is derived that promotes contrasts between brain parenchyma and CSF, achieved through the generalized eigendecomposition of mean autocorrelation matrices, followed by a Gram-Schmidt process to maintain orthogonality. We tested our motion correction method on 85 scans with varying motion levels, acquired with a 3D hybrid-state sequence optimized for quantitative magnetization transfer imaging. Results: A comparative analysis shows that the contrast-optimized basis significantly improves the parenchyma-CSF contrast, leading to smoother motion estimates and reduced artifacts in the quantitative maps. Conclusion: The proposed contrast-optimized subspace improves the accuracy of the motion estimation.


[91] 2412.19553

Structural Similarity in Deep Features: Image Quality Assessment Robust to Geometrically Disparate Reference

Image Quality Assessment (IQA) with references plays an important role in optimizing and evaluating computer vision tasks. Traditional methods assume that all pixels of the reference and test images are fully aligned. Such Aligned-Reference IQA (AR-IQA) approaches fail to address many real-world problems with various geometric deformations between the two images. Although significant effort has been made to attack the Geometrically-Disparate-Reference IQA (GDR-IQA) problem, it has been addressed in a task-dependent fashion, for example, by dedicated designs for image super-resolution and retargeting, or by assuming the geometric distortions to be small enough to be countered by translation-robust filters or by explicit image registrations. Here we rethink this problem and propose a unified, non-training-based Deep Structural Similarity (DeepSSIM) approach to address the above problems in a single framework, which assesses structural similarity of deep features in a simple but efficient way and uses an attention calibration strategy to alleviate attention deviation. The proposed method, without application-specific design, achieves state-of-the-art performance on AR-IQA datasets and meanwhile shows strong robustness to various GDR-IQA test cases. Interestingly, our test also shows the effectiveness of DeepSSIM as an optimization tool for training image super-resolution, enhancement and restoration, implying an even wider generalizability. Source code will be made public after the review is completed.


[92] 2412.19585

Ultralight Signal Classification Model for Automatic Modulation Recognition

The growing complexity of radar signals demands responsive and accurate detection systems that can operate efficiently on resource-constrained edge devices. Existing models, while effective, often rely on substantial computational resources and large datasets, making them impractical for edge deployment. In this work, we propose an ultralight hybrid neural network optimized for edge applications, delivering robust performance across unfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using fewer than 100 samples per class, and significantly reducing computational overhead.


[93] 2412.19705

Noise Sensitivity of the Semidefinite Programs for Direct Data-Driven LQR

In this paper, we study the noise sensitivity of the semidefinite program (SDP) proposed for the direct data-driven infinite-horizon linear quadratic regulator (LQR) problem for discrete-time linear time-invariant systems. While this SDP is shown to find the true LQR controller in the noise-free setting, we show that it leads to a trivial solution with zero gain matrices when data is corrupted by noise, even when the noise is arbitrarily small. We then study a variant of the SDP that includes a robustness-promoting regularization term and prove that regularization does not fully eliminate the sensitivity issue. In particular, the solution of the regularized SDP also converges in probability to a trivial solution.


[94] 2412.19748

UAV-Enabled Secure ISAC Against Dual Eavesdropping Threats: Joint Beamforming and Trajectory Design

In this work, we study an unmanned aerial vehicle (UAV)-enabled secure integrated sensing and communication (ISAC) system, where a UAV serves as an aerial base station (BS) to simultaneously perform communication with a user and detect a target on the ground, while a dual-functional eavesdropper attempts to intercept the signals for both sensing and communication. Facing the dual eavesdropping threats, we aim to enhance the average achievable secrecy rate for the communication user by jointly designing the UAV trajectory together with the transmit information and sensing beamforming, while satisfying the requirements on sensing performance and sensing security, as well as the UAV power and flight constraints. To address the non-convex nature of the optimization problem, we employ the alternating optimization (AO) strategy, jointly with the successive convex approximation (SCA) and semidefinite relaxation (SDR) methods. Numerical results validate the proposed approach, demonstrating its ability to achieve a high secrecy rate while meeting the required sensing and security constraints.


[95] 2412.19785

Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization

Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.