Cyclic frosting and defrosting operations are a common characteristic of air-source heat pumps in cold climates during winter. Simulation models that can capture the simultaneous heat and mass transfer phenomena associated with frost/defrost behaviors, and their impact on overall heat pump system performance, are critical for improving control of heat delivery and frost mitigation. This paper presents a novel frost formulation using an enthalpy method to systematically capture all phase-change behaviors, including frost formation and melting, retained-water refreezing and melting, and water drainage during cyclic frosting and defrosting operations. A fuzzy modeling approach is proposed to smoothly switch source terms when evaluating the dynamics of the frost and water media for numerical robustness. The proposed frost/defrost model is incorporated into a flat-tube outdoor heat exchanger model of an automotive heat pump system model to investigate system responses under cyclic operations of frosting and reverse-cycle defrosting.
Many existing TTS systems cannot accurately synthesize text containing a variety of numerical formats, resulting in reduced intelligibility of the synthesized speech. This research aims to develop a numerical format classifier that can classify six types of numeric contexts. Experiments were carried out using the proposed context-based feature extraction technique, which focuses on extracting keywords, punctuation marks, and symbols as the features of the numbers. Support Vector Machine, K-Nearest Neighbors, Linear Discriminant Analysis, and Decision Tree were used as classifiers. We used 10-fold cross-validation to determine the classification accuracy in terms of recall and precision. The proposed solution outperforms the existing feature extraction technique, improving classification accuracy by 30% to 37%. The use of number format classification can increase the intelligibility of TTS systems.
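As a hedged illustration of this kind of pipeline, the sketch below trains a linear SVM on context tokens (keywords, punctuation, symbols) surrounding a number and scores it with 10-fold cross-validation. The toy snippets, labels, and tokenization are invented for illustration and are not the paper's dataset or exact feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy labeled snippets: the words/symbols around a number determine its class.
texts = ["call me at 555 0123", "the year 1999 was great", "it costs $ 12.50",
         "meet at 10 : 30 am", "room number 42", "a 3.5 % interest rate"] * 10
labels = ["phone", "year", "currency", "time", "cardinal", "percentage"] * 10

# Keywords, punctuation marks, and symbols serve as features; the permissive
# token_pattern keeps '$', '%', and ':' as tokens instead of dropping them.
clf = make_pipeline(CountVectorizer(token_pattern=r"[^\s]+"), SVC(kernel="linear"))
scores = cross_val_score(clf, texts, labels, cv=10)
print("10-fold accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```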
Automatic Speech Recognition (ASR) systems in the clinical domain face significant challenges, notably the need to recognise specialised medical vocabulary accurately and meet stringent precision requirements. We introduce United-MedASR, a novel architecture that addresses these challenges by integrating synthetic data generation, precision ASR fine-tuning, and advanced semantic enhancement techniques. United-MedASR constructs a specialised medical vocabulary by synthesising data from authoritative sources such as ICD-10 (International Classification of Diseases, 10th Revision), MIMS (Monthly Index of Medical Specialties), and FDA databases. This enriched vocabulary helps fine-tune the Whisper ASR model to better cater to clinical needs. To enhance processing speed, we incorporate Faster Whisper, ensuring streamlined and high-speed ASR performance. Additionally, we employ a customised BART-based semantic enhancer to handle intricate medical terminology, thereby increasing accuracy efficiently. Our layered approach establishes new benchmarks in ASR performance, achieving a Word Error Rate (WER) of 0.985% on LibriSpeech test-clean, 0.26% on Europarl-ASR EN Guest-test, and demonstrating robust performance on Tedlium (0.29% WER) and FLEURS (0.336% WER). Furthermore, we present an adaptable architecture that can be replicated across different domains, making it a versatile solution for domain-specific ASR systems.
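A minimal sketch of such a two-stage pipeline is shown below, assuming the publicly documented faster-whisper and Hugging Face transformers APIs. The audio file name is hypothetical, and facebook/bart-base merely stands in for the paper's custom medical BART checkpoint, which is not public.

```python
from faster_whisper import WhisperModel
from transformers import AutoTokenizer, BartForConditionalGeneration

# Step 1: high-speed ASR with Faster Whisper (CTranslate2 backend).
asr = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = asr.transcribe("clinical_note.wav", beam_size=5)  # hypothetical file
raw_text = " ".join(s.text for s in segments)

# Step 2: semantic enhancement. facebook/bart-base is a placeholder for a BART
# checkpoint fine-tuned on medical vocabulary (ICD-10, MIMS, FDA terms);
# the paper's actual checkpoint is not public.
tok = AutoTokenizer.from_pretrained("facebook/bart-base")
corrector = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
ids = tok(raw_text, return_tensors="pt", truncation=True).input_ids
out = corrector.generate(ids, max_length=256, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
```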
Ensuring accurate call prioritisation is essential for optimising the efficiency and responsiveness of mental health helplines. Currently, call operators rely entirely on the caller's statements to determine the priority of the calls. It has been shown that entirely subjective assessment can lead to errors. Furthermore, not utilising the voice properties readily available during the call is a missed opportunity to aid in the evaluation. Incorrect prioritisation can result in delayed assistance for high-risk individuals, resource misallocation, increased mental health deterioration, loss of trust, and potential legal consequences. It is vital to address these risks to guarantee the reliability and effectiveness of mental health services. This study investigates the potential of using machine learning, a branch of Artificial Intelligence, to estimate call priority from the callers' voices for users of mental health phone helplines. After analysing 459 call records from a mental health helpline, we achieved a balanced accuracy of 92%, showing promise in aiding the call operators' efficiency in call handling processes and improving customer satisfaction.
Three-dimensional (3D) ultrasound imaging can overcome the limitations of conventional two-dimensional (2D) ultrasound imaging in structural observation and measurement. However, conducting volumetric ultrasound imaging for large-sized organs still faces difficulties, including long acquisition time, inevitable patient movement, and 3D feature recognition. In this study, we proposed a real-time volumetric free-hand ultrasound imaging system that addresses the above issues and applied it to the clinical diagnosis of scoliosis. This study employed an incremental imaging method coupled with algorithmic acceleration to enable real-time processing and visualization of the large amounts of data generated when scanning large-sized organs. Furthermore, to deal with the difficulty of image feature recognition, we proposed two tissue segmentation algorithms to reconstruct and visualize the spinal anatomy in 3D space by approximating the depth at which the bone structures are located and segmenting the ultrasound images at different depths. We validated the adaptability of our system by deploying it to multiple models of ultrasound equipment and conducting experiments using different types of ultrasound probes. We also conducted experiments on 6 scoliosis patients and 10 normal volunteers to evaluate the performance of our proposed method. Ultrasound imaging of a volunteer spine from shoulder to crotch (more than 500 mm) was performed in 2 minutes, and the 3D imaging results displayed in real time were compared with the corresponding X-ray images, yielding a correlation coefficient of 0.96 in spinal curvature. Our proposed volumetric ultrasound imaging system might hold the potential to be clinically applied to other large-sized organs.
Time-resolved high-resolution X-ray Computed Tomography (4D $\mu$CT) is an imaging technique that offers insight into the evolution of dynamic processes inside materials that are opaque to visible light. Conventional tomographic reconstruction techniques are based on recording a sequence of 3D images that represent the sample state at different moments in time. This frame-based approach limits the temporal resolution compared to dynamic radiography experiments due to the time needed to make CT scans. Moreover, it leads to an inflation of the amount of data and thus to costly post-processing computations to quantify the dynamic behaviour from the sequence of time frames, thereby often ignoring the temporal correlations of the sample structure. Our proposed 4D $\mu$CT reconstruction technique, named DYRECT, estimates individual attenuation evolution profiles for each position in the sample. This leads to a novel memory-efficient event-based representation of the sample, using as few as three image volumes: its initial attenuation, its final attenuation and the transition times. This third volume represents local events on a continuous timescale instead of the discrete global time frames. We propose a method to iteratively reconstruct the transition times and the attenuation volumes. The dynamic reconstruction technique was validated on synthetic ground truth data and experimental data, and was found to effectively pinpoint the transition times in the synthetic dataset with a time resolution corresponding to less than a tenth of the number of projections required to reconstruct traditional $\mu$CT time frames.
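To make the three-volume event representation concrete, the sketch below evaluates the sample's attenuation at an arbitrary time from an initial volume, a final volume, and a per-voxel transition-time volume. The smooth sigmoid transition profile and all the toy data are assumptions; DYRECT's exact transition model may differ.

```python
import numpy as np

def attenuation_at(t, mu0, mu1, t_trans, width=1.0):
    """Evaluate the attenuation volume at time t from the event-based
    representation (initial, final, transition time). The sigmoid profile
    is an illustrative assumption."""
    s = 1.0 / (1.0 + np.exp(-(t - t_trans) / width))
    return mu0 + s * (mu1 - mu0)

# Toy 32^3-voxel volumes: each voxel switches at its own time.
rng = np.random.default_rng(0)
mu0 = rng.uniform(0.0, 0.2, (32, 32, 32))     # initial attenuation
mu1 = mu0 + rng.uniform(0.0, 0.5, mu0.shape)  # final attenuation
t_trans = rng.uniform(0, 100, mu0.shape)      # per-voxel transition times

vol_at_50 = attenuation_at(50.0, mu0, mu1, t_trans)
print(vol_at_50.shape, float(vol_at_50.mean()))
```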
We tackle the problem of providing closed-loop stability guarantees with a scalable data-driven design. We combine virtual reference feedback tuning with dissipativity constraints on the controller for closed-loop stability. The constraints are formulated as a set of linear inequalities in the frequency domain. This leads to a convex problem that is scalable with respect to the length of the data and the complexity of the controller. An extension of virtual reference feedback tuning to include disturbance dynamics is also discussed. The proposed data-driven control design is illustrated by a soft gripper impedance control example.
Dynamic metasurface antennas (DMAs) beamform through low-powered components that enable reconfiguration of each radiating element. Previous research on a single-user multiple-input-single-output (MISO) system with a dynamic metasurface antenna at the transmitter has focused on maximizing the beamforming gain at a fixed operating frequency. The DMA, however, has a frequency-selective response that leads to magnitude degradation for frequencies away from the resonant frequency of each element. This causes a reduction in beamforming gain if the DMA only operates at a fixed frequency. We exploit the frequency reconfigurability of the DMA to dynamically optimize both the operating frequency and the element configuration, maximizing the beamforming gain. We leverage this approach to develop a single-shot beam training procedure using a DMA sub-array architecture that estimates the receiver's angular direction with a single OFDM pilot signal. We evaluate the beamforming gain performance of the DMA array using the receiver's angular direction estimate obtained from beam training. Our results show that it is sufficient to use a limited number of resonant frequency states to do both beam training and beamforming instead of using an infinite-resolution DMA beamformer.
This paper investigates the maximum downlink spectral efficiency of low earth orbit (LEO) constellations. Spectral efficiency, in this context, refers to the sum rate of the entire network per unit spectrum per unit area on the earth's surface. For practicality, all links employ single-user codebooks and treat interference as noise. To estimate the maximum achievable spectral efficiency, we propose and analyze a regular configuration, which deploys satellites and ground terminals in hexagonal lattices. Additionally, for wideband networks with arbitrary satellite configurations, we introduce a subband allocation algorithm aimed at maximizing the overall spectral efficiency. Simulation results indicate that the regular configuration is more efficient than random configurations. As the number of randomly placed satellites increases within an area, the subband allocation algorithm achieves a spectral efficiency approaching that of the regular configuration. Further improvements are demonstrated by reconfiguring associations so that nearby transmitters avoid pointing to the same area.
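The toy sketch below illustrates one plausible greedy flavour of subband allocation: each link is assigned, in turn, the subband that maximizes the current sum rate under interference-as-noise. The gains, noise level, and greedy ordering are invented for illustration and are not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
L, S = 12, 4                       # links, subbands
G = rng.exponential(1.0, (L, L))   # G[i, j]: gain from transmitter j to receiver i
np.fill_diagonal(G, 10.0)          # strong desired-link gains (toy values)
noise = 0.1
band = np.full(L, -1)

def sum_rate(band):
    """Sum of log2(1 + SINR) over assigned links, treating interference as noise."""
    rate = 0.0
    for i in range(L):
        if band[i] < 0:
            continue
        same = [j for j in range(L) if j != i and band[j] == band[i]]
        sinr = G[i, i] / (noise + sum(G[i, j] for j in same))
        rate += np.log2(1 + sinr)
    return rate

# Greedy: give each link the subband with the largest marginal sum rate.
for i in range(L):
    best, best_rate = 0, -np.inf
    for s in range(S):
        band[i] = s
        r = sum_rate(band)
        if r > best_rate:
            best, best_rate = s, r
    band[i] = best
print("subband assignment:", band, " sum SE:", round(sum_rate(band), 2))
```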
We study phenomena where some eigenvectors of a graph Laplacian are largely confined to small subsets of the graph. These localization phenomena are similar to those generally termed Anderson localization in the physics literature, and are related to the complexity of the structure of large graphs in still unexplored ways. Using spectral perturbation theory and pseudo-spectrum analysis, we explain how the presence of localized eigenvectors gives rise to fragilities (low robustness margins) to unmodeled node or link dynamics. Our analysis is demonstrated by examples of networks with relatively low complexity, but with features that appear to induce eigenvector localization. The implications of this newly discovered fragility phenomenon are briefly discussed.
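Eigenvector localization is easy to probe numerically. The sketch below computes Laplacian eigenvectors of a barbell graph (two cliques joined by a path, a low-complexity structure that tends to localize eigenvectors) and scores each by its inverse participation ratio, a standard diagnostic; the graph choice is illustrative, not necessarily one of the paper's examples.

```python
import numpy as np
import networkx as nx

# Two dense cliques joined by a long path.
G = nx.barbell_graph(10, 20)
L = nx.laplacian_matrix(G).toarray().astype(float)
vals, vecs = np.linalg.eigh(L)   # columns of vecs are unit-norm eigenvectors

# Inverse participation ratio: ~1/n for a delocalized eigenvector,
# approaching 1 when the eigenvector is confined to a few nodes.
ipr = (vecs ** 4).sum(axis=0)
for k in np.argsort(ipr)[-3:]:
    print(f"eigenvalue {vals[k]:.3f}  IPR {ipr[k]:.3f}")
```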
Speech is a hierarchical collection of text, prosody, emotions, dysfluencies, etc. Automatic transcription of speech that goes beyond text (words) is an underexplored problem. We focus on transcribing speech along with non-fluencies (dysfluencies). The current state-of-the-art pipeline SSDM suffers from complex architecture design, training complexity, and significant shortcomings in the local sequence aligner, and it does not explore in-context learning capacity. In this work, we propose SSDM 2.0, which tackles those shortcomings via four main contributions: (1) We propose a novel \textit{neural articulatory flow} to derive highly scalable speech representations. (2) We developed a \textit{full-stack connectionist subsequence aligner} that captures all types of dysfluencies. (3) We introduced a mispronunciation prompt pipeline and consistency learning module into LLM to leverage dysfluency \textit{in-context pronunciation learning} abilities. (4) We curated Libri-Dys and open-sourced the current largest-scale co-dysfluency corpus, \textit{Libri-Co-Dys}, for future research endeavors. In clinical experiments on pathological speech transcription, we tested SSDM 2.0 using an nfvPPA corpus primarily characterized by \textit{articulatory dysfluencies}. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin. See our project demo page at \url{https://berkeley-speech-group.github.io/SSDM2.0/}.
Driven by global climate goals, an increasing amount of Renewable Energy Sources (RES) is currently being installed worldwide. Especially in the context of offshore wind integration, hybrid AC/DC grids are considered to be the most effective technology to transmit this RES power over long distances. As hybrid AC/DC systems develop, they are expected to become increasingly complex and as meshed as the current AC system. Nevertheless, there is still limited literature on how to optimize hybrid AC/DC topologies while minimizing the total power generation cost. For this reason, this paper proposes a methodology to optimize the steady-state switching states of transmission lines and busbar configurations in hybrid AC/DC grids. The proposed optimization model includes optimal transmission switching (OTS) and busbar splitting (BS), which can be applied to both the AC and DC parts of hybrid AC/DC grids. To solve the problem, a scalable and exact nonlinear, non-convex model using a big-M approach is formulated. In addition, convex relaxations and linear approximations of the model are tested, and their accuracy, feasibility, and optimality are analyzed. The numerical experiments show that a solution to the combined OTS/BS problem can be found in acceptable computation time and that the investigated relaxations and linearisations provide AC-feasible results.
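To make the big-M switching idea concrete, here is a hedged toy DC optimal transmission switching model in PuLP. The three-bus data, bounds, and big-M value are invented, and the sketch omits busbar splitting and everything AC/DC-specific in the paper's full formulation.

```python
import pulp

# Toy 3-bus DC OTS with big-M constraints (illustrative data only).
buses = [1, 2, 3]
lines = {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 0.5}   # susceptances (assumed)
gen_cost, demand = {1: 10.0, 2: 30.0}, {3: 1.0}
M = 10.0

prob = pulp.LpProblem("OTS", pulp.LpMinimize)
theta = pulp.LpVariable.dicts("theta", buses, -1, 1)   # bus angles
f = pulp.LpVariable.dicts("f", lines, -1, 1)           # line flows
z = pulp.LpVariable.dicts("z", lines, cat="Binary")    # switching states
g = pulp.LpVariable.dicts("g", gen_cost, 0, 2)         # generator outputs

prob += pulp.lpSum(gen_cost[i] * g[i] for i in gen_cost)
for (i, j), b in lines.items():
    # Big-M: f = b*(theta_i - theta_j) when z = 1, f = 0 when z = 0.
    prob += f[(i, j)] - b * (theta[i] - theta[j]) <= M * (1 - z[(i, j)])
    prob += f[(i, j)] - b * (theta[i] - theta[j]) >= -M * (1 - z[(i, j)])
    prob += f[(i, j)] <= M * z[(i, j)]
    prob += f[(i, j)] >= -M * z[(i, j)]
for n in buses:   # nodal power balance
    inflow = pulp.lpSum(f[l] for l in lines if l[1] == n) - \
             pulp.lpSum(f[l] for l in lines if l[0] == n)
    prob += g.get(n, 0) + inflow == demand.get(n, 0)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({l: int(z[l].value()) for l in lines}, pulp.value(prob.objective))
```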
This work studies the problem of jointly estimating unknown parameters from Kronecker-structured multidimensional signals, which arises in applications like intelligent reflecting surface (IRS)-aided channel estimation. Exploiting the Kronecker structure, we decompose the estimation problem into smaller, independent subproblems across each dimension. Each subproblem is posed as a sparse recovery problem using basis expansion and solved using a novel off-grid sparse Bayesian learning (SBL)-based algorithm. Additionally, we derive probabilistic error bounds for the decomposition, quantify its denoising effect, and provide convergence analysis for off-grid SBL. Our simulations show that applying the algorithm to IRS-aided channel estimation improves accuracy and runtime compared to state-of-the-art methods through the low-complexity and denoising benefits of the decomposition step and the high-resolution estimation capabilities of off-grid SBL.
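The decomposition step rests on the Kronecker-vec identity, which the snippet below verifies and then exploits: instead of inverting one Kronecker-sized operator, the coefficients are recovered with two small per-dimension pseudo-inverses. The noiseless least-squares recovery merely illustrates the dimension split; the paper's actual subproblems are sparse recoveries solved with off-grid SBL.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))    # dictionary along dimension 1 (e.g. angle grid)
B = rng.standard_normal((5, 3))    # dictionary along dimension 2
X = rng.standard_normal((3, 4))    # coefficients coupling the two dimensions

# Column-major vec identity: (A kron B) vec(X) = vec(B X A^T).
y = np.kron(A, B) @ X.flatten(order="F")
assert np.allclose(y, (B @ X @ A.T).flatten(order="F"))

# Decomposition: reshape the long measurement back into a matrix, then solve
# one small problem per dimension instead of one Kronecker-sized problem.
Y = y.reshape(5, 6, order="F")
X_hat = np.linalg.pinv(B) @ Y @ np.linalg.pinv(A).T
print("recovered:", np.allclose(X_hat, X))
```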
Reconfigurable intelligent surfaces (RISs) have shown the potential to improve signal-to-interference-plus-noise ratio (SINR) related coverage, especially at high-frequency communications. However, assessing electromagnetic field exposure (EMFE) and establishing EMFE regulations in RIS-assisted large-scale networks are still open issues. This paper proposes a framework to characterize SINR and EMFE in such networks for downlink and uplink scenarios. Particularly, we carefully consider the association rule in the presence of RISs, accurate antenna pattern at base stations (BSs), fading model, and power control mechanism at mobile devices in the system model. Under the proposed framework, we derive the marginal and joint distributions of SINR and EMFE in downlink and uplink, respectively. The first moment of EMFE is also provided. Additionally, we design the compliance distance (CD) between a BS/RIS and a user to comply with the EMFE regulations. To facilitate efficient identification, we further provide approximate closed-form expressions for CDs. From numerical results of the marginal distributions, we find that in the downlink scenario, deploying RISs may not always be beneficial, as the improved SINR comes at the cost of increased EMFE. However, in the uplink scenario, RIS deployment is promising to enhance coverage while still maintaining EMFE compliance. By simultaneously evaluating coverage and compliance metrics through joint distributions, we demonstrate the feasibility of RISs in improving uplink and downlink performance. Insights from this framework can contribute to establishing EMFE guidelines and achieving a balance between coverage and compliance when deploying RISs.
The advent of deep learning and recurrent neural networks revolutionized the field of time-series processing. Therefore, recent research on spectrum prediction has focused on the use of these tools. However, spectrum prediction, which involves forecasting wireless spectrum availability, is an older field where many "classical" tools were considered around the 2010s, such as Markov models. This work revisits high-order Markov models for spectrum prediction in dynamic wireless environments. We introduce a framework to address mismatches between the sensing length and the model order, as well as the state-space complexity that arises with large orders. Furthermore, we extend this Markov framework by enabling fine-tuning of the probability transition matrix through gradient-based supervised learning, offering a hybrid approach that bridges probabilistic modeling and modern machine learning. Simulations on real-world Wi-Fi traffic demonstrate the competitive performance of high-order Markov models compared to deep learning methods, particularly in scenarios with constrained datasets containing outliers.
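The core of an order-k Markov predictor is small enough to sketch directly: encode the last k sensed occupancy bits as a state, estimate the transition matrix by smoothed counting, and predict the most likely next bit. The synthetic trace, Laplace smoothing, and order are illustrative; the paper's sensing-length/order mismatch handling and gradient-based fine-tuning of the transition matrix (e.g., via a softmax parametrization) are omitted here.

```python
import numpy as np

def fit_markov(seq, k):
    """Estimate order-k transition probabilities for a binary occupancy sequence
    with Laplace smoothing; states are the 2^k possible k-bit histories."""
    counts = np.ones((2 ** k, 2))
    for t in range(k, len(seq)):
        state = int("".join(map(str, seq[t - k:t])), 2)
        counts[state, seq[t]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict(seq, k, P):
    state = int("".join(map(str, seq[-k:])), 2)
    return int(P[state].argmax())

rng = np.random.default_rng(0)
seq = list((rng.random(2000) < 0.3).astype(int))   # toy occupancy trace
P = fit_markov(seq[:1500], k=4)
hits = sum(predict(seq[:t], 4, P) == seq[t] for t in range(1500, 2000))
print("hold-out accuracy:", hits / 500)
```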
Traditional deep learning methods in medical imaging often focus solely on segmentation or classification, limiting their ability to leverage shared information. Multi-task learning (MTL) addresses this by combining both tasks through shared representations but often struggles to balance local spatial features for segmentation and global semantic features for classification, leading to suboptimal performance. In this paper, we propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask. The model introduces an advanced encoder incorporating a novel ResFormer block that integrates local context from convolutional feature extraction with long-range dependencies modeled by the Transformer. This design captures broader contextual relationships and fine-grained details, improving classification and segmentation accuracy. To enhance classification performance, multi-scale features from different encoder levels are combined to leverage the hierarchical representation of the input image. For segmentation, the features passed to the decoder via skip connections are refined using a novel dilated feature enhancement (DFE) module, which captures information at different scales through three parallel convolution branches with varying dilation rates. This allows the decoder to detect lesions of varying sizes with greater accuracy. Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks, compared to state-of-the-art single-task and multi-task learning methods.
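A hedged sketch of the dilated feature enhancement idea follows: three parallel 3x3 convolutions with different dilation rates whose outputs are fused by a 1x1 convolution, so the decoder sees information at several receptive-field scales. The dilation rates, channel counts, and fusion choice are assumptions, not the paper's exact DFE module.

```python
import torch
import torch.nn as nn

class DilatedFeatureEnhancement(nn.Module):
    """Three parallel dilated 3x3 convs fused by a 1x1 conv (illustrative DFE)."""
    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        # Concatenate the multi-rate branch outputs, then mix them channel-wise.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 32, 64, 64)
print(DilatedFeatureEnhancement(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```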
In the field of Maritime Autonomous Surface Ships (MASS), accurate modeling of ship maneuvering motion for harbor maneuvers is a crucial technology. Non-parametric system identification (SI) methods, which do not require prior knowledge of the target ship, have the potential to produce accurate maneuvering models using observed data. However, the modeling accuracy significantly depends on the distribution of the available data. To address this issue, we propose a probabilistic prediction method of maneuvering motion that incorporates ensemble learning into non-parametric SI using feedforward neural networks. This approach captures the epistemic uncertainty caused by insufficient or unevenly distributed data. In this paper, we show prediction accuracy and uncertainty estimation results for various unknown scenarios, including port navigation, zigzag, turning, and random control maneuvers, assuming that only port navigation data is available. Furthermore, this paper demonstrates the utility of the proposed method as a maneuvering simulator for assessing heading-keeping PD control. As a result, it was confirmed that the proposed method can achieve high accuracy if training data with similar state distributions is provided, and that it can also predict high uncertainty for states that deviate from the training data distribution. In the performance evaluation of PD control, it was confirmed that considering worst-case scenarios reduces the possibility of overestimating performance compared to the true system. Finally, we show the results of applying the proposed method to full-scale ship data, demonstrating its applicability to full-scale ships.
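The ensemble mechanism behind this epistemic-uncertainty estimate can be shown on a one-dimensional toy problem: several independently initialized networks agree inside the training range and disagree outside it, so their prediction spread serves as the uncertainty signal. The data, architecture, and ensemble size below are illustrative stand-ins for the paper's ship-dynamics networks.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)

# Deep ensemble: same data, different random initializations.
ensemble = [MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=s).fit(X, y) for s in range(5)]

X_test = np.linspace(-6, 6, 200)[:, None]          # extends beyond the data
preds = np.stack([m.predict(X_test) for m in ensemble])
mean, std = preds.mean(axis=0), preds.std(axis=0)  # std ~ epistemic uncertainty
inside = np.abs(X_test[:, 0]) < 3
print("max std inside data range: %.3f" % std[inside].max())
print("max std outside data range: %.3f" % std[~inside].max())
```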
As the demand for underwater communication continues to grow, underwater acoustic RIS (UARIS), as an emerging paradigm in underwater acoustic communication (UAC), can significantly improve the communication rate of underwater acoustic systems. However, in open underwater environments, the location of the source node is highly susceptible to being obtained by eavesdropping nodes through correlation analysis, leading to the issue of location privacy in underwater communication systems, which has been overlooked by many studies. To enhance underwater communication and protect location privacy, we propose a novel UARIS architecture integrated with an artificial noise (AN) module. This architecture not only improves communication quality but also introduces noise to interfere with the eavesdroppers' attempts to locate the source node. We derive the Cram\'er-Rao Lower Bound (CRLB) for the localization method deployed by the eavesdroppers and, based on this, model the UARIS-assisted communication scenario as a multi-objective optimization problem. This problem optimizes transmission beamforming, reflective precoding, and noise factors to maximize communication performance and location privacy protection. To efficiently solve this non-convex optimization problem, we develop an iterative algorithm based on fractional programming. Simulation results validate that the proposed system significantly enhances data transmission rates while effectively maintaining the location privacy of the source node in UAC systems.
This paper investigates an intelligent reflecting surface (IRS) aided wireless federated learning (FL) system, where an access point (AP) coordinates multiple edge devices to train a machine learning model without sharing their own raw data. During the training process, we exploit the joint channel reconfiguration via IRS and resource allocation design to reduce the latency of an FL task. In particular, we propose three transmission protocols for assisting the local model uploading from multiple devices to an AP, namely IRS aided time division multiple access (I-TDMA), IRS aided frequency division multiple access (I-FDMA), and IRS aided non-orthogonal multiple access (I-NOMA), to investigate the impact of IRS on multiple access for FL. Under the three protocols, we minimize the per-round latency subject to a given training loss by jointly optimizing the device scheduling, IRS phase shifts, and communication-computation resource allocation. For the associated problem under I-TDMA, an efficient algorithm is proposed to solve it optimally by exploiting its intrinsic structure, whereas high-quality solutions of the problems under I-FDMA and I-NOMA are obtained by invoking a successive convex approximation (SCA) based approach. Then, we further develop a theoretical framework for the performance comparison of the proposed three transmission protocols. Sufficient conditions under which I-TDMA outperforms I-NOMA, and vice versa, are unveiled; this is fundamentally different from systems without IRS, where NOMA always outperforms TDMA. Simulation results validate our theoretical findings and also demonstrate the usefulness of IRS for enhancing the fundamental tradeoff between the learning latency and learning accuracy.
Scalable coding, which can adapt to channel bandwidth variation, performs well in today's complex network environment. However, most existing scalable compression methods face two challenges: reduced compression performance and insufficient scalability. To overcome the above problems, this paper proposes a learned fine-grained scalable image compression framework, namely DeepFGS. Specifically, we introduce a feature separation backbone to divide the image information into basic and scalable features, then redistribute the features channel by channel through an information rearrangement strategy. In this way, we can generate a continuously scalable bitstream via one-pass encoding. For entropy coding, we design a mutual entropy model to fully explore the correlation between the basic and scalable features. In addition, we reuse the decoder to reduce the parameters and computational complexity. Experiments demonstrate that our proposed DeepFGS outperforms previous learning-based scalable image compression models and traditional scalable image codecs in both PSNR and MS-SSIM metrics.
In the millimeter wave (mmWave) bands considered for 5G and beyond, the use of very high frequencies results in the interruption of communication whenever there is no line of sight between the transmitter and the receiver. Blockages have been modeled in the literature so far using tools such as stochastic geometry and random shape theory. Using these tools, in this paper, we statistically characterize the lengths of the line-of-sight (LOS) and non-line-of-sight (NLOS) segments in an urban scenario where buildings (with random positions, lengths, and heights) are aligned in parallel directions forming streets.
The mmWave bands, considered to support the forthcoming generation of mobile communications technologies, have a well-known vulnerability to blockages. Recent works in the literature analyze the blockage probability considering independence or correlation among the blocking elements of the different links. In this letter, we characterize the effect of blockages and their correlation on the ergodic capacity. We carry out the analysis for urban scenarios, where the considered blocking elements are buildings that are primarily parallel to the streets. We also present numerical simulations based on actual building features of the city of Chicago to validate the obtained expressions.
As irregularly structured data representations, graphs have received a large amount of attention in recent years and have been widely applied to various real-world scenarios such as social, traffic, and energy settings. Compared to non-graph algorithms, numerous graph-based methodologies have benefited from the strong power of graphs for representing high-dimensional and non-Euclidean data. In the field of Graph Signal Processing (GSP), analogies of classical signal processing concepts, such as shifting, convolution, filtering, and transformations, have been developed. However, many GSP techniques postulate that the graph is static in both signal and topology. This assumption hinders the effectiveness of GSP methodologies, as it ignores the time-varying properties of numerous real-world systems. For example, in a traffic network, the signal on each node varies over time and contains underlying temporal correlations and patterns worthy of analysis. To tackle this challenge, a growing body of recent work investigates the processing of time-varying graph signals. These works cope with time-varying challenges from three main directions: 1) graph time-spectral filtering, 2) multi-variate time-series forecasting, and 3) spatiotemporal graph data mining by neural networks, where non-negligible progress has been achieved. Despite the success of signal processing and learning over time-varying graphs, there is no survey that compares and summarizes the current methodologies for GSP and graph learning. To fill this gap, in this paper we review the development and recent progress of signal processing and learning over time-varying graphs, compare their advantages and disadvantages from both the methodological and experimental sides, and outline the challenges and potential research directions for future research.
In recent years, large language models have made significant advancements in the field of natural language processing, yet there are still inadequacies in specific domain knowledge and applications. This paper proposes MaintAGT, a professional large model for intelligent operations and maintenance, aimed at addressing this issue. The system comprises three key components: a signal-to-text model, a pure text model, and a multimodal model. Firstly, the signal-to-text model was designed to convert raw signal data into textual descriptions, bridging the gap between signal data and text-based analysis. Secondly, the pure text model was fine-tuned using the GLM4 model with specialized knowledge to enhance its understanding of domain-specific texts. Finally, these two models were integrated to develop a comprehensive multimodal model that effectively processes and analyzes both signal and textual data. The dataset used for training and evaluation was sourced from academic papers, textbooks, international standards, and vibration analyst training materials, undergoing meticulous preprocessing to ensure high-quality data. As a result, the model has demonstrated outstanding performance across multiple intelligent operations and maintenance tasks, providing a low-cost, high-quality method for constructing large-scale monitoring signal-text description-fault pattern datasets. Experimental results indicate that the model holds significant advantages in condition monitoring, signal processing, and fault diagnosis. On the constructed general test set, MaintAGT achieved an accuracy of 70%, surpassing all existing general large language models and reaching the level of an ISO Level III human vibration analyst. This advancement signifies a crucial step forward from traditional maintenance practices toward intelligent and AI-driven maintenance solutions.
Online model predictive control (MPC) for piecewise affine (PWA) systems requires the online solution to an optimization problem that implicitly optimizes over the switching sequence of PWA regions, for which the computational burden can be prohibitive. Alternatively, the computation can be moved offline using explicit MPC; however, the online memory requirements and the offline computation can then become excessive. In this work, we propose a solution in between online and explicit MPC, addressing the above issues by dividing the computation between online and offline stages. To solve the underlying MPC problem, a policy, learned offline, specifies the sequence of PWA regions that the dynamics must follow, thus reducing the complexity of the remaining optimization problem that solves over only the continuous states and control inputs. We provide a condition, verifiable during learning, that guarantees feasibility of the learned policy's output, such that an optimal continuous control input can always be found online. Furthermore, a method for iteratively generating training data offline allows the feasible policy to be learned efficiently, reducing the offline computational burden. A numerical experiment demonstrates the effectiveness of the method compared to both online and explicit MPC.
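Once the learned policy has fixed the PWA region sequence, the remaining problem is convex. The CVXPY sketch below solves this reduced QP for a toy two-region PWA system; the dynamics, costs, fixed sequence, and the omission of region-membership constraints are all simplifying assumptions, not the paper's formulation.

```python
import cvxpy as cp
import numpy as np

# Toy PWA system: two regions, x+ = A_i x + B_i u + c_i (assumed dynamics).
A = [np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[1.0, 0.1], [0.0, 0.9]])]
B = [np.array([[0.0], [0.1]])] * 2
c = [np.zeros(2), np.array([0.0, 0.05])]

def solve_fixed_sequence(x0, seq):
    """With the region sequence fixed (e.g. by the learned policy), the MPC
    problem reduces to a convex QP over continuous states and inputs."""
    N = len(seq)
    x = cp.Variable((N + 1, 2))
    u = cp.Variable((N, 1))
    cons = [x[0] == x0]
    for k, i in enumerate(seq):
        cons += [x[k + 1] == A[i] @ x[k] + B[i] @ u[k] + c[i],
                 cp.abs(u[k]) <= 1]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(x) + 0.1 * cp.sum_squares(u)), cons)
    prob.solve()
    return u.value, prob.value

u_opt, J = solve_fixed_sequence(np.array([1.0, 0.0]), seq=[0, 0, 1, 1])
print("first input:", u_opt[0], " cost:", round(J, 3))
```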
The Sinusoidal Input Describing Function (SIDF) is an effective tool for control system analysis and design, with its reliability directly impacting the performance of the designed control systems. This study enhances the reliability of SIDF analysis and the performance of closed-loop reset feedback control systems, presenting two main contributions. First, it introduces a method to identify frequency ranges where SIDF analysis becomes inaccurate. Second, these identified ranges correlate with high-magnitude, high-order harmonics that can degrade system performance. To address this, a shaped reset control strategy is proposed, which incorporates a shaping filter to tune reset actions and reduce high-order harmonics. Then, a frequency-domain design procedure of a PID shaping filter in a reset control system is outlined as a case study. The PID filter effectively reduces high-order harmonics and resolves limit-cycle issues under step inputs. Finally, simulations and experimental results on a precision motion stage validate the efficacy of the proposed shaped reset control, showing enhanced SIDF analysis accuracy, improved steady-state precision over linear and reset controllers, and elimination of limit cycles under step inputs.
The development of sixth-generation (6G) communication technologies is confronted with the significant challenge of spectrum resource shortage. To alleviate this issue, we propose a novel simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided multiple-input multiple-output (MIMO) cognitive radio (CR) system. Specifically, the underlying secondary network in the proposed CR system reuses the same frequency resources occupied by the primary network with the help of the STAR-RIS. The secondary network sum rate maximization problem is first formulated for the STAR-RIS aided MIMO CR system. The adoption of STAR-RIS necessitates an intricate beamforming design for the considered system due to its large number of coupled coefficients. The block coordinate descent method is employed to address the formulated optimization problem. In each iteration, the beamformers at the secondary base station (SBS) are optimized by solving a quadratically constrained quadratic program (QCQP) problem. Concurrently, the STAR-RIS passive beamforming problem is resolved using tailored algorithms designed for the two phase-shift models: 1) For the independent phase-shift model, a successive convex approximation-based algorithm is proposed. 2) For the coupled phase-shift model, a penalty dual decomposition-based algorithm is conceived, in which the phase shifts and amplitudes of the STAR-RIS elements are optimized using closed-form solutions. Simulation results show that: 1) The proposed STAR-RIS aided CR communication framework can significantly enhance the sum rate of the secondary system. 2) The coupled phase-shift model results in limited performance degradation compared to the independent phase-shift model.
Lumbar spine problems are ubiquitous, motivating research into targeted imaging for treatment planning and guided interventions. While high-resolution, high-contrast CT has been the modality of choice, MRI can capture both bone and soft tissue without the ionizing radiation of CT, albeit with a longer acquisition time. The critical trade-off between contrast quality and acquisition time has motivated 'thick slice MRI', which prioritises faster imaging with high in-plane resolution but variable contrast and low through-plane resolution. We investigate a recently developed post-acquisition pipeline which segments vertebrae from thick-slice acquisitions and uses a variational autoencoder to enhance quality after an initial 3D reconstruction. We instead propose a latent-space diffusion energy-based prior that leverages diffusion models, which exhibit high-quality image generation. Crucially, we mitigate their high computational cost and low sample efficiency by learning an energy-based latent representation in which to perform the diffusion processes. Our resulting method outperforms existing approaches across metrics including Dice and VS scores, and more faithfully captures 3D features.
A simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided integrated sensing, computing, and communication (ISCC) Internet of Robotic Things (IoRT) framework is proposed. Specifically, the full-duplex (FD) base station (BS) simultaneously receives the offloading signals from decision robots (DRs) and carries out target robot (TR) sensing. A computation rate maximization problem is formulated to optimize the sensing and receive beamformers at the BS and the STAR-RIS coefficients under the BS power constraint, the sensing signal-to-noise ratio constraint, and STAR-RIS coefficients constraints. The alternating optimization (AO) method is adopted to solve the proposed optimization problem. With fixed STAR-RIS coefficients, the sub-problem with respect to the sensing and receive beamformers at the BS is tackled with the weighted minimum mean-square error method. Given beamformers at the BS, the sub-problem with respect to STAR-RIS coefficients is tackled with the penalty method and successive convex approximation method. The overall algorithm is guaranteed to converge to at least a stationary point of the computation rate maximization problem. Our simulation results validate that the proposed STAR-RIS aided ISCC IoRT system can enhance the sum computation rate compared with the benchmark schemes.
The reliability of the electric grid has in recent years become a larger concern for regulators, planners, and consumers due to several high-impact outage events, as well as the potential for even more impactful events in the future. These concerns are largely the result of decades-old resource adequacy (RA) planning frameworks being insufficiently adapted to the current types of uncertainty faced by planners, including many sources of deep uncertainty for which probability distributions cannot be defensibly assigned. There are emerging methodologies for dealing with these new types of uncertainty in RA assessment and procurement frameworks, but their adoption has been hindered by the lack of consistent understanding of terminology related to RA and the related concept of resilience, as well as a lack of syntheses of such available methodologies. Here we provide an overview of RA and its relationship to resilience, a summary of available methods for dealing with emerging types of uncertainty faced by RA assessment, and an overview of procurement methodologies for operationalizing RA in the context of these types of uncertainty. This paper provides a synthesis and guide for both researchers and practitioners seeking to navigate a new, much more uncertain era of power system planning.
Wideband spectrum sensing (WSS) is critical for orchestrating multitudinous wireless transmissions via spectrum sharing, but may incur excessive costs of hardware, power and computation due to the high sampling rate. In this article, a deep learning based WSS framework embedding the multicoset preprocessing is proposed to enable the low-cost sub-Nyquist sampling. A pruned convolutional attention WSS network (PCA-WSSNet) is designed to organically integrate the multicoset preprocessing and the convolutional attention mechanism as well as to reduce the model complexity remarkably via the selective weight pruning without the performance loss. Furthermore, a transfer learning (TL) strategy benefiting from the model pruning is developed to improve the robustness of PCA-WSSNet with few adaptation samples of new scenarios. Simulation results show the performance superiority of PCA-WSSNet over the state of the art. Compared with direct TL, the pruned TL strategy can simultaneously improve the prediction accuracy in unseen scenarios, reduce the model size, and accelerate the model inference.
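As a concrete, hedged illustration of the selective weight pruning step, the snippet below applies PyTorch's built-in L1 unstructured pruning to a stand-in convolutional network. The architecture and the 50% pruning amount are placeholders; PCA-WSSNet's exact layout and pruning criterion are those of the paper, not this sketch.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in 1-D CNN; PCA-WSSNet itself is not reproduced here.
net = nn.Sequential(nn.Conv1d(1, 16, 5), nn.ReLU(), nn.Conv1d(16, 32, 5))

for m in net.modules():
    if isinstance(m, nn.Conv1d):
        prune.l1_unstructured(m, name="weight", amount=0.5)  # zero 50% smallest weights
        prune.remove(m, "weight")                            # make pruning permanent

zeros = sum((p == 0).sum().item() for p in net.parameters() if p.dim() > 1)
total = sum(p.numel() for p in net.parameters() if p.dim() > 1)
print(f"weight sparsity: {zeros / total:.0%}")
```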
Medical image translation is the process of converting from one imaging modality to another, in order to reduce the need for multiple image acquisitions from the same patient. This can enhance the efficiency of treatment by reducing the time, equipment, and labor needed. In this paper, we introduce a multi-resolution guided Generative Adversarial Network (GAN)-based framework for 3D medical image translation. Our framework uses a 3D multi-resolution Dense-Attention UNet (3D-mDAUNet) as the generator and a 3D multi-resolution UNet as the discriminator, optimized with a unique combination of loss functions including voxel-wise GAN loss and 2.5D perception loss. Our approach yields promising results in volumetric image quality assessment (IQA) across a variety of imaging modalities, body regions, and age groups, demonstrating its robustness. Furthermore, we propose a synthetic-to-real applicability assessment as an additional evaluation to assess the effectiveness of synthetic data in downstream applications such as segmentation. This comprehensive evaluation shows that our method produces synthetic medical images not only of high quality but also potentially useful in clinical applications. Our code is available at github.com/juhha/3D-mADUNet.
Inverter-based distributed energy resources enable advanced online voltage control algorithms thanks to their flexibility in both active and reactive power injections. A key challenge is to continuously track the time-varying global optima while remaining robust to inaccurate dynamics and communication delays. In this paper, we introduce a disturbance-action controller that formulates the voltage drop from loads as the system disturbance. The controller alternately generates the control input and updates its parameters based on interactions with the grid. Under the linearized power flow model, we provide stability conditions for the control policy and bounds on the performance degradation under model inaccuracy. Simulation results on radial distribution networks show the effectiveness of the proposed controller under fluctuating loads and a significant improvement in robustness to these challenges. Furthermore, the ability to incorporate history information and to generalize to various loads is demonstrated through extensive experiments on parameter sensitivity.
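To illustrate the disturbance-action idea, here is a hedged toy rollout in PyTorch: the policy u_t = -K x_t + sum_i M_i w_{t-i} reacts to a history of past disturbances, and the matrices M_i are updated by gradient descent on the rollout cost. The linear model, horizon, and cost weights are invented and do not correspond to the paper's grid model or update rule.

```python
import torch

# Toy linear system x+ = A x + B u + w with a disturbance-action policy.
A = torch.tensor([[0.9, 0.1], [0.0, 0.8]])
B, K = torch.eye(2), 0.3 * torch.eye(2)
H = 5                                              # disturbance-history length
M = [torch.zeros(2, 2, requires_grad=True) for _ in range(H)]
opt = torch.optim.SGD(M, lr=0.05)

for episode in range(200):
    x = torch.zeros(2)
    w_hist = [torch.zeros(2)] * H
    loss = 0.0
    for t in range(30):
        # u_t = -K x_t + sum_i M_i w_{t-i}
        u = -K @ x + sum(Mi @ wi for Mi, wi in zip(M, w_hist))
        w = 0.1 * torch.randn(2)                   # load-induced disturbance
        x = A @ x + B @ u + w
        w_hist = [w] + w_hist[:-1]
        loss = loss + x @ x + 0.1 * (u @ u)        # quadratic rollout cost
    opt.zero_grad(); loss.backward(); opt.step()   # update the M_i parameters
print("final rollout cost:", loss.item())
```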
Unit maintenance and unit commitment are two critical and interrelated aspects of electric power system operation, both of which face the challenge of coordinating efforts to enhance reliability and economic performance. This challenge becomes increasingly pronounced in the context of increased integration of renewable energy and flexible loads, such as wind power and electric vehicles, into the power system, where high uncertainty is prevalent. To tackle this issue, this paper develops a two-stage adaptive robust optimization model for the joint unit maintenance and unit commitment strategy. The first stage focuses on making joint decisions regarding unit maintenance and unit commitment, while the second stage addresses economic dispatch under the worst-case scenarios of wind power and load demand. Then a practical solution methodology is proposed to solve this model efficiently, which combines the inexact column-and-constraint generation algorithm with an outer approximation method. Finally, the economic viability and adaptability of the proposed method are demonstrated based on the RTS-79 test system.
Accurate segmentation of gross tumor volume (GTV) is essential for effective MRI-guided adaptive radiotherapy (MRgART) in head and neck cancer. However, manual segmentation of the GTV over the course of therapy is time-consuming and prone to interobserver variability. Deep learning (DL) has the potential to overcome these challenges by automatically delineating GTVs. In this study, our team, $\textit{UW LAIR}$, tackled the challenges of both pre-radiotherapy (pre-RT) (Task 1) and mid-radiotherapy (mid-RT) (Task 2) tumor volume segmentation. To this end, we developed a series of DL models for longitudinal GTV segmentation. The backbone of our models for both tasks was SegResNet with deep supervision. For Task 1, we trained the model using a combined dataset of pre-RT and mid-RT MRI data, which resulted in an improved aggregated Dice similarity coefficient (DSCagg) on an internal testing set compared to models trained solely on pre-RT MRI data. In Task 2, we introduced mask-aware attention modules, enabling pre-RT GTV masks to influence intermediate features learned from mid-RT data. This attention-based approach yielded slight improvements over the baseline method, which concatenated mid-RT MRI with pre-RT GTV masks as input. In the final testing phase, the ensemble of 10 pre-RT segmentation models achieved an average DSCagg of 0.794, with 0.745 for primary GTV (GTVp) and 0.844 for metastatic lymph nodes (GTVn) in Task 1. For Task 2, the ensemble of 10 mid-RT segmentation models attained an average DSCagg of 0.733, with 0.607 for GTVp and 0.859 for GTVn, leading us to $\textbf{achieve 1st place}$. In summary, we presented a collection of DL models that could facilitate GTV segmentation in MRgART, offering the potential to streamline radiation oncology workflows. Our code and model weights are available at https://github.com/xtie97/HNTS-MRG24-UWLAIR.
Segmenting internal structure from echocardiography is essential for the diagnosis and treatment of various heart diseases. Semi-supervised learning shows its ability to alleviate annotation scarcity. While existing semi-supervised methods have been successful in image segmentation across various medical imaging modalities, few have attempted to design methods specifically addressing the challenges posed by the poor contrast, blurred edge details, and noise of echocardiography. These characteristics pose challenges to the generation of high-quality pseudo-labels in semi-supervised segmentation based on Mean Teacher. Inspired by human reflection on erroneous practices, we devise an error reflection strategy for an echocardiography semi-supervised segmentation architecture. The process triggers the model to reflect on inaccuracies in unlabeled image segmentation, thereby enhancing the robustness of pseudo-label generation. Specifically, the strategy is divided into two steps. The first step is called reconstruction reflection. The network is tasked with reconstructing authentic proxy images from the semantic masks of unlabeled images and their auxiliary sketches, while maximizing the structural similarity between the original inputs and the proxies. The second step is called guidance correction. Reconstruction error maps decouple unreliable segmentation regions. Then, reliable data that are more likely to occur near high-density areas are leveraged to guide the optimization of unreliable data potentially located around decision boundaries. Additionally, we introduce an effective data augmentation strategy, termed the multi-scale mix-up strategy, to minimize the empirical distribution gap between labeled and unlabeled images and perceive diverse scales of cardiac anatomical structures. Extensive experiments demonstrate the competitiveness of the proposed method.
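The abstract does not spell out the multi-scale mix-up, so the sketch below shows one plausible reading: standard mix-up applied at a randomly chosen spatial scale, then resized back. The Beta parameter, scale set, and bilinear resizing are all assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multiscale_mixup(x1, x2, y1, y2, scales=(1.0, 0.5, 0.25), alpha=0.4):
    """Mix two image/label pairs at a random scale (illustrative reading only)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    s = scales[torch.randint(len(scales), (1,)).item()]
    size = [max(1, int(d * s)) for d in x1.shape[-2:]]
    down = lambda x: F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    mixed = lam * down(x1) + (1 - lam) * down(x2)
    mixed = F.interpolate(mixed, size=x1.shape[-2:], mode="bilinear",
                          align_corners=False)
    return mixed, lam * y1 + (1 - lam) * y2

x1, x2 = torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)
y1, y2 = torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)
xm, ym = multiscale_mixup(x1, x2, y1, y2)
print(xm.shape, ym.shape)
```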
Quantum computing comes with the potential to push computational boundaries in various domains including, e.g., cryptography, simulation, optimization, and machine learning. Exploiting the principles of quantum mechanics, new algorithms can be developed with capabilities that are unprecedented by classical computers. However, the experimental realization of quantum devices is an active field of research with enormous open challenges, including robustness against noise and scalability. While systems and control theory plays a crucial role in tackling these challenges, the principles of quantum physics lead to a (perceived) high entry barrier for entering the field of quantum computing. This tutorial paper aims at lowering the barrier by introducing basic concepts required to understand and solve research problems in quantum systems. First, we introduce fundamentals of quantum algorithms, ranging from basic ingredients such as qubits and quantum logic gates to prominent examples and more advanced concepts, e.g., variational quantum algorithms. Next, we formalize some engineering questions for building quantum devices in the real world, which requires the careful manipulation of microscopic quantities obeying quantum effects. To this end, for N-level systems, we introduce basic concepts of (bilinear) quantum systems and control theory, including controllability, observability, and optimal control, in a unified framework. Finally, we address the problem of noise in real-world quantum systems via robust quantum control, which relies on a set-membership uncertainty description frequently employed in control.
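As a first concrete taste of the qubit and gate ingredients such a tutorial introduces, the sketch below simulates the canonical Hadamard-plus-CNOT circuit that prepares a Bell state, using plain NumPy linear algebra; this is textbook material, not code from the paper.

```python
import numpy as np

# Single-qubit states as 2-vectors; gates as unitary matrices.
ket0 = np.array([1, 0], dtype=complex)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)          # Hadamard gate
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

# Bell state: apply H to qubit 0, then CNOT with qubit 0 as control.
state = np.kron(H @ ket0, ket0)                       # (H tensor I)|00>
state = CNOT @ state
print(np.round(np.abs(state) ** 2, 3))                # [0.5 0. 0. 0.5]
```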
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings, which is crucial for characterizing teaching tasks. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. However, analyzing and quantifying this feedback is challenging due to its unstructured and specialized nature. Automated systems are essential to manage these complexities at scale, allowing for the creation of structured datasets that enhance feedback analysis and improve surgical education. Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that 1) removes hallucinations (non-existent utterances generated during speech recognition fueled by noise in the operating room) and 2) separates speech from trainers and trainees using few-shot voice samples. These aspects are vital for reconstructing accurate surgical dialogues and understanding the roles of operating room participants. Using data from 33 real-world surgeries, we demonstrated the system's capability to reconstruct surgical teaching dialogues and detect feedback instances effectively (F1 score of 0.79+/-0.07). Moreover, our hallucination removal step improves feedback detection performance by ~14%. Evaluation on downstream clinically relevant tasks of predicting Behavioral Adjustment of trainees and classifying Technical feedback showed performances comparable to manual annotations, with F1 scores of 0.82+/-0.03 and 0.81+/-0.03 respectively. These results highlight the effectiveness of our framework in supporting clinically relevant tasks and improving over manual methods.
Adenoid hypertrophy stands as a common cause of obstructive sleep apnea-hypopnea syndrome in children. It is characterized by snoring, nasal congestion, and growth disorders. Computed Tomography (CT) emerges as a pivotal medical imaging modality, utilizing X-rays and advanced computational techniques to generate detailed cross-sectional images. Within the realm of pediatric airway assessments, CT imaging provides an insightful perspective on the shape and volume of enlarged adenoids. Despite the advances of deep learning methods for medical imaging analysis, there remains a gap in the segmentation of adenoid hypertrophy in CT scans. To address this research gap, we introduce TSUBF-Net (Trans-Spatial UNet-like Network based on Bi-direction Fusion), a 3D medical image segmentation framework. TSUBF-Net is engineered to effectively discern intricate 3D spatial interlayer features in CT scans and enhance the extraction of boundary-blurring features. Notably, we propose two innovative modules within the U-shaped network architecture: the Trans-Spatial Perception module (TSP) and the Bi-directional Sampling Collaborated Fusion module (BSCF). These two modules operate during the sampling process and strategically fuse down-sampled and up-sampled features, respectively. Furthermore, we introduce the Sobel loss term, which optimizes the smoothness of the segmentation results and enhances model accuracy. Extensive 3D segmentation experiments are conducted on several datasets. TSUBF-Net is superior to the state-of-the-art methods, achieving an HD95 of 7.03, an IoU of 85.63, and a DSC of 92.26 on our own AHSD dataset. The results on the other two public datasets also demonstrate that our method can robustly and effectively address the challenges of 3D segmentation in CT scans.
A control framework is presented to solve the rendezvous and proximity operations (RPO) problem of the EP-Gemini mission. In this mission, a CubeSat chaser is controlled to approach and circumnavigate an uncooperative CubeSat target. Such a problem is challenging because the chaser operates on a single electric propulsion thruster, so the coupling between attitude control and the thrust vector, as well as the charging of the electric propulsion system, must be taken into consideration. In addition, real-time access to relative states is not achievable due to the onboard hardware constraints of the two CubeSats. The developed control framework addresses these limitations by applying four modularized maneuver blocks to correct the chaser's mean orbit elements in sequence. The control framework is based on a relative motion called a safety ellipse to ensure a low collision risk. The complete EP-Gemini mission is demonstrated by implementing the proposed control framework in a numerical simulation that includes high-order perturbations for low Earth orbit. The simulation result shows that a safety ellipse is established after a 41-day RPO maneuver, which consumes 44$\%$ of the total fuel in terms of $\Delta V$. The resulting 3-dimensional safety ellipse circumnavigates the target with an approximate dimension of 14 km $\times$ 27 km $\times$ 8 km.
Vessel dynamics simulation is vital in studying the relationship between geometry and vascular disease progression. Reliable dynamics simulation relies on high-quality vascular meshes. Most existing mesh generation methods highly depend on manual annotation, which is time-consuming and laborious, and usually face challenges such as branch merging and vessel disconnection. This hinders vessel dynamics simulation, especially for population studies. To address this issue, we propose a deep learning-based method, dubbed DVasMesh, to directly generate structured hexahedral vascular meshes from vascular images. Our contributions are threefold. First, we propose to represent each vertex of the vascular graph by a four-element vector comprising the coordinates of the centerline point and the radius. Second, a vectorized graph template is employed to guide DVasMesh in estimating the vascular graph. Specifically, we introduce a sampling operator, which samples the extracted features of the vascular image (obtained by a segmentation network) according to the vertices in the template graph. Third, we employ a graph convolution network (GCN), taking the sampled features as nodes, to estimate the deformation between vertices of the template graph and the target graph, and the deformed graph template is used to build the mesh. Taking advantage of end-to-end learning and discarding direct dependency on annotated labels, our DVasMesh demonstrates outstanding performance in generating structured vascular meshes on cardiac and cerebral vascular images. It shows great potential for clinical applications by reducing mesh generation time from 2 hours (manual) to 30 seconds (automatic).
Low-latency and high-precision vehicle localization plays a significant role in enhancing traffic safety and improving traffic management for intelligent transportation. However, in complex road environments, the low-latency and high-precision requirements cannot always be fulfilled due to the high complexity of localization computation. To tackle this issue, we propose a road-aware localization mechanism in heterogeneous networks (HetNet) of the mobile communication system, which enables real-time acquisition of vehicular position information, including the vehicle's current road, segment within the road, and coordinates. By employing this multi-scale localization approach, the computational complexity can be greatly reduced while ensuring accurate positioning. Specifically, to reduce positioning search complexity and ensure positioning precision, roads are partitioned into low-dimensional segments of unequal lengths by the proposed singular point (SP) segmentation method. To reduce feature-matching complexity, distinctive salient features (SFs) are extracted sparsely, representing roads and segments, which eliminates redundant features while maximizing the feature information gain. The Cram\'er-Rao Lower Bound (CRLB) of vehicle positioning errors is derived to verify the positioning accuracy improvement brought by the segment partition and SF extraction. Additionally, through SF matching that integrates the inclusion and adjacency position relationships, a multi-scale vehicle localization (MSVL) algorithm is proposed to identify vehicular road signal patterns and determine the real-time segment and coordinates. Simulation results show that the proposed multi-scale localization mechanism can achieve lower latency and higher precision compared to the benchmark schemes.
As the adoption of distributed energy resources grows, power systems are becoming increasingly complex and vulnerable to disruptions, such as natural disasters and cyber-physical threats. Peer-to-peer (P2P) energy markets offer a practical solution to enhance reliability and resilience during power outages while providing monetary and technical benefits to prosumers and consumers. This paper explores the advantages of P2P energy exchanges in active distribution networks using a double auction mechanism, focusing on improving system resilience during outages. Two pricing mechanisms, the distribution locational marginal price (DLMP) and the average price mechanism, are used to complement each other in facilitating efficient energy exchange. DLMP serves as a price signal that reflects network conditions and acts as an upper bound for bidding in the P2P market. Meanwhile, prosumers and consumers submit bids in the market and agree on energy transactions based on average transaction prices, ensuring fast matching and fair settlements. Simulation results indicate that during emergency operation modes, DLMP prices increase, leading to higher average transaction prices. Prosumers benefit from increased market clearing prices, while consumers experience uninterrupted electricity supply.
Recent technological advancements and net-zero emission policies, along with declining costs of renewable energy, battery storage, and electric vehicles (EVs), are accelerating the transition toward cleaner, more resilient energy systems. This paper conducts a comprehensive techno-economic analysis of Net-Zero Energy Buildings (NZEBs) within Florida's energy transition by 2050. The analysis focuses on the financial advantages of integrating rooftop photovoltaic (PV) systems, battery storage, and EVs collectively, compared to reliance on grid electricity, for both existing and newly built homes in Orlando, Florida. By leveraging federal incentives like the Investment Tax Credit and considering energy efficiency improvements, residents can achieve significant savings. Simulation results show that existing homes with a 9.5 kW PV system and a 42.2 kWh battery are projected to generate positive returns by 2029, while newly constructed homes meet this threshold as early as 2024. In addition, rooftop solar used to charge an EV can save residents up to 100 dollars per month compared to gasoline. Combining PV and battery storage not only lowers electric bills but also enhances grid independence and resilience against grid outages. Beyond individual savings, NZEBs contribute to grid stability by reducing electricity demand and supporting utility-scale renewable applications. These advancements lower infrastructure costs, help Florida residents and utilities align with national decarbonization goals, retain approximately 23 billion dollars within the state, and foster progress toward a sustainable, low-carbon future.
The decreasing costs of photovoltaic (PV) systems and battery storage, alongside the rapid rise of electric vehicles (EVs), present a unique opportunity to revolutionize energy use in apartment complexes. Generating electricity via PV and batteries is currently cheaper and greener than relying on grid power, which is often expensive. Yet, residents in multi-building apartment complexes typically lack access to fast EV charging infrastructure. To this end, this paper investigates the feasibility and energy management of deploying commercial PV-powered battery storage and EV fast chargers within apartment complexes in Orlando, Florida, operated by complex owners. By modeling the complex as a grid-connected microgrid, it aims to meet residents' energy needs, provide backup power during emergencies, and introduce a profitable business model for property owners. To address PV power generation uncertainty, a distributionally robust chance-constrained optimization method using the Wasserstein metric is employed, ensuring robust and reliable operation. The techno-economic analysis reveals that EVs powered by PV and batteries are more cost-effective and environmentally friendly than gasoline vehicles, with EV owners saving up to 100 dollars per month on fuel costs. The results also show that integrating PV and battery systems reduces operational costs, lowers emissions, increases resilience, and supports EV adoption while offering a profitable business model for property owners. These findings highlight a practical and sustainable framework for advancing clean energy use in residential complexes.
In medical imaging, efficient segmentation of colon polyps plays a pivotal role in minimally invasive solutions for colorectal cancer. This study introduces a novel approach employing two parallel encoder branches within a network for polyp segmentation. One branch of the encoder incorporates dual convolution blocks that can maintain feature information over increased depths, while the other employs single convolution blocks augmented with the previous layer's features, offering diversity in feature extraction within the encoder; the two branches are combined before the transpose layers through a depth-wise concatenation operation. Our model demonstrated superior performance, surpassing several established deep-learning architectures on the Kvasir and CVC-ClinicDB datasets, achieving a Dice score of 0.919 and a mIoU of 0.866 on the Kvasir dataset, and a Dice score of 0.931 and a mIoU of 0.891 on CVC-ClinicDB. The visual and quantitative results highlight the efficacy of our model, potentially setting a new benchmark in medical image segmentation.
In this paper, a novel continuous-aperture array (CAPA)-based wireless communication architecture is proposed, which relies on an electrically large aperture with a continuous current distribution. First, an existing prototype of CAPA is reviewed, followed by the potential benefits and key motivations for employing CAPAs in wireless communications. Then, three practical hardware implementation approaches for CAPAs are introduced, based on electronic, optical, and acoustic materials. Furthermore, several beamforming approaches are proposed to optimize the continuous current distributions of CAPAs, which are fundamentally different from those used for conventional spatially discrete arrays (SPDAs). Numerical results are provided to demonstrate their key features of low complexity and near-optimality. Based on these proposed approaches, the performance gains of CAPAs over SPDAs are revealed in terms of channel capacity as well as diversity-multiplexing gains. Finally, several open research problems in CAPAs are highlighted.
This paper investigates the problem of controlling nonlinear dynamical systems subject to state and input constraints while minimizing time-varying and a priori unknown cost functions. We propose a modular approach that combines the online convex optimization framework and reference governors to solve this problem. Our method is general in the sense that we do not limit our analysis to a specific choice of online convex optimization algorithm or reference governor. We show that the dynamic regret of the proposed framework is bounded linearly in both the dynamic regret and the path length of the chosen online convex optimization algorithm, even though the online convex optimization algorithm does not account for the underlying dynamics. We prove that a linear bound with respect to the online convex optimization algorithm's dynamic regret is optimal, i.e., it cannot be improved upon. Furthermore, for a standard class of online convex optimization algorithms, our proposed framework attains a bound on its dynamic regret that is linear only in the variation of the cost functions, which is known to be an optimal bound. Finally, we demonstrate the implementation and flexibility of the proposed framework by comparing different combinations of online convex optimization algorithms and reference governors to control a nonlinear chemical reactor in a numerical experiment.
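To make the modular structure above concrete, here is a minimal toy sketch (not the paper's algorithm or its reactor example): an online gradient descent step, oblivious to the dynamics, proposes a reference, and a deliberately simple reference governor admits the closest constraint-safe value. The scalar plant, step size, and constraint are illustrative assumptions.

```python
import numpy as np

a, b = 0.9, 0.1                    # toy stable scalar plant: x+ = a*x + b*v
x, r = 0.0, 0.0                    # state and OCO iterate (proposed reference)
eta, x_max = 0.5, 1.0              # OGD step size; constraint |x| <= 1

for k in range(100):
    theta_k = np.sin(0.1 * k)      # a priori unknown, time-varying cost target
    grad = 2.0 * (r - theta_k)     # gradient of f_k(r) = (r - theta_k)^2
    r = r - eta * grad             # online gradient descent, ignores dynamics
    v = np.clip(r, -x_max, x_max)  # trivial governor: nearest safe reference
    x = a * x + b * v              # apply admitted reference to the plant
    assert abs(x) <= x_max + 1e-9  # invariance of |x| <= 1 is maintained
```

For this contractive toy plant, clipping the reference keeps the state constraint invariant; a real reference governor would enforce invariance for general constrained dynamics.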
Non-orthogonal multiple access (NOMA) is widely viewed as a potential candidate for providing enhanced multiple access in future mobile networks by eliminating the orthogonal distribution of radio resources amongst the users. Nevertheless, the performance of NOMA can be significantly improved by combining it with other sophisticated technologies such as wireless data caching and device-to-device (D2D) communications. In this letter, we propose a novel cellular system model which integrates uplink NOMA with cache-based D2D communications. The proposed system enables a cellular user to upload a data file to the base station while simultaneously exchanging useful cache content with another nearby user. We maximize the system sum rate by deriving closed-form solutions for optimal power allocation. Simulation results demonstrate the superior performance of our proposed model over other potential combinations of uplink NOMA and D2D communications.
Despite the success of the O-RAN Alliance in developing a set of interoperable interfaces, the development of unique Radio Access Network (RAN) deployments remains challenging. This is especially true for military communications, where deployments are highly specialized with limited volume. The construction and maintenance of the RAN, which is a real-time embedded system, is an ill-defined NP problem requiring teams of system engineers with specialized knowledge of the hardware platform. In this paper, we introduce a RAN Domain Specific Language (RDSL(TM)) to formally describe use cases, constraints, and multi-vendor hardware/software abstractions, allowing automation of RAN construction. In this DSL, system requirements are declarative, and performance constraints are guaranteed by construction using an automated system solver. Using our RAN system solver platform, Gabriel(TM), we show how a system engineer can confidently modify RAN functionality without knowledge of the underlying hardware. We show benefits for specific system requirements when compared to the manually optimized, default configuration of the Intel FlexRAN(TM), and conclude that DSL/automation-driven construction of the RAN can lead to significant power and latency benefits when the deployment constraints are tuned for a specific case. We give examples of how constraints and requirements can be formatted in a "Kubernetes style" YAML format, which allows the use of other tools, such as Ansible, to integrate the generation of these requirements into higher-level automation flows such as Service Management and Orchestration (SMO).
Hybrid battery thermal management systems (HBTMS) combining active liquid cooling and passive phase change material (PCM) cooling have shown potential for the thermal management of lithium-ion batteries. However, the fill volumes of coolant and PCM in hybrid cooling systems are limited by the size and weight of the HBTMS. At high charge/discharge rates, these limitations reduce the convective heat transfer of the coolant during discharge, accelerate the liquefaction of the PCM, and weaken the passive cooling effect. In this paper, we propose a compact hybrid cooling system with multi-inlet U-shaped microchannels, in which the gaps between channels are filled with PCM/aluminum foam for compactness. Nanofluid cooling (NC) technology with better thermal conductivity is used. A pulsed flow function is further developed for enhanced cooling (EC) with reduced power consumption. An experimentally validated thermal-fluid dynamics model is developed to optimize operating conditions including coolant type, cooling direction, channel height, inlet flow rate, and cooling scheme. The results show that the hybrid cooling solution of NC+PCM+EC adopted by the HBTMS further reduces the maximum temperature of the Li-ion battery by 3.44{\deg}C under a discharge rate of 1C at a room temperature of 25{\deg}C, with only a 5% increase in power consumption, compared to the conventional liquid cooling method for electric vehicles (EVs). The average number of battery charge cycles increases by about 6 to 15 percent. The results of this study can help improve the range as well as the driving safety of new energy EVs.
Platforms that run artificial intelligence (AI) pipelines on edge computing resources are transforming the fields of animal ecology and biodiversity, enabling novel wildlife studies in animals' natural habitats. With emerging remote sensing hardware, e.g., camera traps and drones, and sophisticated AI models in situ, edge computing will be even more significant in future AI-driven animal ecology (ADAE) studies. However, a study's objectives, the species of interest, its behaviors, range, habitat, and camera placement all affect the demand for edge resources at runtime. If edge resources are under-provisioned, studies can miss opportunities to adapt the settings of camera traps and drones to improve the quality and relevance of captured data. This paper presents salient features of ADAE studies that can be used to model latency and throughput objectives and to provision edge resources. Drawing from studies that span over fifty animal species, four geographic locations, and multiple remote sensing methods, we characterize common patterns in ADAE studies, revealing increasingly complex workflows involving various computer vision tasks with strict service level objectives (SLOs). ADAE workflow demands will soon exceed the compute and memory resources of individual edge devices, requiring multiple networked edge devices to meet performance demands. We developed a framework to scale traces from prior studies and replay them offline on representative edge platforms, allowing us to capture throughput and latency data across edge configurations. We used the data to calibrate queuing and machine learning models that predict performance on unseen edge configurations, achieving errors as low as 19%.
Dynamic metasurface antennas (DMAs), surfaces patterned with reconfigurable metamaterial elements (meta-atoms) that couple waves from waveguides or cavities to free space, are a promising technology for realizing 6G wireless base stations and access points with low cost and power consumption. Mutual coupling between the DMA's meta-atoms results in a non-linear dependence of the radiation pattern on the DMA configuration, significantly complicating modeling and optimization. Therefore, mutual coupling has to date been considered a vexing nuance that is frequently neglected in theoretical studies and deliberately mitigated in experimental prototypes. Here, we demonstrate an overlooked property of mutual coupling: its ability to boost control over the DMA's radiation pattern. Based on a physics-compliant DMA model, we demonstrate that the radiation pattern's sensitivity to the DMA configuration depends significantly on the mutual coupling strength. We further evidence how the enhanced sensitivity under strong mutual coupling translates into higher fidelity in radiation pattern synthesis, benefiting applications ranging from dynamic beamforming to end-to-end optimized sensing and imaging. Our insights suggest that DMA design should be fundamentally rethought to embrace the benefits of mutual coupling. We also discuss ensuing future research directions related to the frugal characterization of DMAs based on compact physics-compliant models.
Extremely large-scale multiple-input multiple-output (XL-MIMO) is gaining attention as a prominent technology for enabling sixth-generation (6G) wireless networks. However, the vast antenna array and the huge bandwidth introduce a non-negligible beam squint effect, causing beams of different frequencies to focus at different locations. One approach to cope with this is to employ true-time-delay line (TTD)-based beamforming to control the range and trajectory of near-field beam squint, known as the near-field controllable beam squint (CBS) effect. In this paper, we investigate user localization in near-field wideband XL-MIMO systems under the beam squint effect and spatial non-stationary properties. Firstly, we derive expressions for the Cram\'er-Rao Bounds (CRBs) characterizing the performance of estimating both angle and distance. This analysis aims to assess the potential of leveraging CBS for precise user localization. Secondly, a user localization scheme combining CBS and beam training is proposed. Specifically, we organize multiple subcarriers into groups, directing beams from different groups to distinct angles or distances through the CBS to obtain estimates of users' angles and distances. Furthermore, we design a user localization scheme based on a convolutional neural network model, namely ConvNeXt. This scheme utilizes the inputs and outputs of the CBS-based scheme to generate high-precision estimates of angle and distance. More importantly, our proposed ConvNeXt-based user localization scheme achieves centimeter-level accuracy in localization estimates.
This study focuses on building effective spoofing countermeasures (CMs) for non-native speech, specifically targeting Indonesian and Thai speakers. We constructed a dataset comprising both native and non-native speech to facilitate our research. Three key features (MFCC, LFCC, and CQCC) were extracted from the speech data, and three classic machine learning-based classifiers (CatBoost, XGBoost, and GMM) were employed to develop robust spoofing detection systems using the native and combined (native and non-native) speech data. This resulted in two types of CMs: Native and Combined. The performance of these CMs was evaluated on both native and non-native speech datasets. Our findings reveal significant challenges faced by Native CM in handling non-native speech, highlighting the necessity for domain-specific solutions. The proposed method shows improved detection capabilities, demonstrating the importance of incorporating non-native speech data into the training process. This work lays the foundation for more effective spoofing detection systems in diverse linguistic contexts.
In recent years, frequent extreme events have put forward higher requirements for improving the resilience of distribution networks (DNs). Introducing energy storage integrated with soft open points (E-SOP) is one of the effective ways to improve resilience. However, the widespread application of E-SOP is limited by its high investment cost. To address this, we propose a cost allocation framework and an optimal planning method for E-SOP in resilient DNs. Firstly, a cost allocation mechanism for E-SOP based on a resilience insurance service is designed; the probability of power users purchasing the resilience insurance service is determined based on expected utility theory. Then, a four-layer stochastic distributionally robust optimization (SDRO) model is developed for E-SOP planning and insurance pricing strategy, where the uncertainty in the intensity of contingent extreme events is addressed by a stochastic optimization approach, while the uncertainty in the occurrence of outages and resilience insurance purchases resulting from a specific extreme event is addressed via a distributionally robust optimization approach. Finally, the effectiveness of the proposed model is verified on a modified IEEE 33-bus DN.
This paper presents a machine-learning study for solar inverter power regulation in a remote microgrid. Machine learning models for active and reactive power control are each trained using an ensemble learning method. Then, unlike conventional schemes that make inferences on a central server in the far-end control center, the proposed scheme deploys the trained models on an embedded edge-computing device near the inverter to reduce the communication delay. Experiments on a real embedded device achieve results matching those on a desktop PC, with a time cost of about 0.1 ms per inference.
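As a hedged illustration of the deployment idea in the abstract above (the model type, feature count, and data are assumptions, not the paper's setup), one can train an ensemble regressor offline and then time single-sample inference locally, as would happen on the edge device:

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 8))                  # hypothetical measurements
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(2000)

model = GradientBoostingRegressor().fit(X, y)       # e.g., active-power model

x_new = rng.standard_normal((1, 8))
t0 = time.perf_counter()
p = model.predict(x_new)                            # local inference, no round trip
dt_ms = (time.perf_counter() - t0) * 1e3
print(f"prediction {p[0]:.3f} computed in {dt_ms:.3f} ms")
```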
Common imaging techniques for detecting structural defects typically require sampling at more than twice the spatial frequency to achieve a target resolution. This study introduces a novel framework for imaging structural defects using significantly fewer samples. In this framework, defects are modeled as regions where physical properties shift from their nominal values to resemble those of air, and a linear approximation is formulated to relate these binary shifts in physical properties with corresponding changes in measurements. Recovering a binary vector from linear measurements is generally an NP-hard problem. To address this challenge, this study proposes two algorithmic approaches. The first approach relaxes the binary constraint, using convex optimization to find a solution. The second approach incorporates a binary-inducing prior and employs approximate Bayesian inference to estimate the posterior probability of the binary vector given the measurements. Both algorithmic approaches demonstrate better performance compared to existing compressive sampling methods for binary vector recovery. The framework's effectiveness is illustrated through examples of eddy current sensing to image defects in metal structures.
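The first algorithmic approach above lends itself to a compact sketch. The following is a minimal, hypothetical instance of the box relaxation (the sensing matrix, sparsity level, noise, and rounding threshold are assumptions, not the paper's eddy-current setup):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n, m = 200, 60                                     # unknowns vs. measurements
x_true = (rng.random(n) < 0.05).astype(float)      # sparse binary defect map
A = rng.standard_normal((m, n))                    # linearized sensing operator
y = A @ x_true + 0.01 * rng.standard_normal(m)

# Relaxation: min ||A x - y||^2 subject to 0 <= x <= 1, then round
res = lsq_linear(A, y, bounds=(0.0, 1.0))
x_hat = (res.x > 0.5).astype(float)
print("entries recovered incorrectly:", int(np.sum(x_hat != x_true)))
```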
In this paper, an edge computing-based machine-learning study is conducted for solar inverter power forecasting and droop control in a remote microgrid. The machine learning models and control algorithms are directly deployed on an edge-computing device (a smart meter-concentrator) in the microgrid rather than on a cloud server at the far-end control center, reducing the communication delay the inverters experience. Experimental results on an ARM-based smart meter board demonstrate the feasibility and correctness of the proposed approach through comparison against results on a desktop PC.
Nonlinear optimal control is vital for numerous applications but remains challenging for unknown systems due to the difficulties in accurately modelling dynamics and handling computational demands, particularly in high-dimensional settings. This work develops a theoretically certifiable framework that integrates a modified Koopman operator approach with model-based reinforcement learning to address these challenges. By relaxing the requirements on observable functions, our method incorporates nonlinear terms involving both states and control inputs, significantly enhancing system identification accuracy. Moreover, by leveraging the power of neural networks to solve partial differential equations (PDEs), our approach achieves stabilizing control for high-dimensional dynamical systems of up to nine dimensions. The learned value function and control laws are proven to converge to those of the true system at each iteration. Additionally, the accumulated cost of the learned control closely approximates that of the true system, with errors ranging from $10^{-5}$ to $10^{-3}$.
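The identification step can be illustrated with a generic EDMD-style sketch. This is not the paper's certifiable framework, only a toy least-squares regression with a hypothetical observable dictionary that mixes states and control inputs, echoing the relaxed observable requirements described above:

```python
import numpy as np

def psi(x, u):
    # Hypothetical observable dictionary mixing states and control inputs
    return np.concatenate([x, u, x**2, x * u, np.sin(x)])

rng = np.random.default_rng(1)
f = lambda x, u: 0.9 * x + 0.1 * np.sin(x) + 0.2 * u   # "unknown" toy dynamics
X = rng.uniform(-1, 1, (500, 2))
U = rng.uniform(-1, 1, (500, 2))
Xp = np.array([f(x, u) for x, u in zip(X, U)])

Phi = np.array([psi(x, u) for x, u in zip(X, U)])      # lifted (x, u) pairs
W = np.linalg.lstsq(Phi, Xp, rcond=None)[0]            # linear predictor in lifted space
err = np.linalg.norm(Phi @ W - Xp) / np.linalg.norm(Xp)
print(f"relative one-step prediction error: {err:.2e}")
```

Because the toy dynamics are exactly linear in this dictionary, the fit is near-perfect; with a states-only dictionary the input-dependent term would be unidentifiable, which is the accuracy gain the abstract alludes to.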
Compared to traditional electrodynamic loudspeakers, the parametric array loudspeaker (PAL) offers exceptional directivity for audio applications but suffers from significant nonlinear distortions due to its inherently intricate demodulation process. Volterra filter-based approaches have been widely used to reduce these distortions, but their effectiveness is limited by the capability of the inverse filter. Specifically, a $p$th-order inverse filter can only compensate for nonlinearities up to the $p$th order, while the higher-order nonlinearities it introduces continue to generate lower-order harmonics. In contrast, this paper introduces modern deep learning methods, for the first time, to address nonlinear identification and compensation for PAL systems. Specifically, a feedforward variant of the WaveNet neural network, recognized for its success in audio nonlinear system modeling, is utilized to identify and compensate for distortions in a double sideband amplitude modulation-based PAL system. Experimental measurements from 250 Hz to 8 kHz demonstrate that our proposed approach significantly reduces both the total harmonic distortion and the intermodulation distortion of audio sound generated by PALs, achieving average reductions to 4.55% and 2.47%, respectively. This performance is notably superior to results obtained using the current state-of-the-art Volterra filter-based methods. Our work opens new possibilities for improving the sound reproduction performance of PALs.
The Linear Quadratic Regulator (LQR) is often combined with feedback linearization (FBL) for nonlinear systems whose nonlinearity is additive to the input. Conventional approaches estimate and cancel the nonlinearity based on first principles or on data-driven methods such as Gaussian Processes (GPs). However, the former requires an elaborate modeling process, and the latter provides a fixed learned model, which may suffer when the model dynamics change. In this letter, we use a Deep Neural Network (DNN), trained on a dataset updated in real time, to approximate the unknown nonlinearity while the controller is running. By spectrally normalizing the weights at each time step, we stably incorporate the DNN prediction into an LQR controller and compensate for the nonlinear term. Leveraging the bounded Lipschitz constant of the DNN, we provide theoretical analysis and prove local exponential stability of the proposed controller. Simulation results show that our controller significantly outperforms baseline controllers in trajectory tracking cases.
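A minimal sketch of the key mechanism, spectral normalization of the DNN weights combined with a nominal LQR term, might look as follows; the network size, gain matrix, and update rule are illustrative assumptions, not the letter's exact design:

```python
import torch
import torch.nn as nn

sn = nn.utils.spectral_norm          # bounds each layer's spectral norm (Lipschitz)
net = nn.Sequential(sn(nn.Linear(2, 32)), nn.Tanh(),
                    sn(nn.Linear(32, 32)), nn.Tanh(),
                    sn(nn.Linear(32, 1)))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
K = torch.tensor([[1.0, 1.7]])       # hypothetical LQR gain for the nominal model

def control(x):                      # x: column state vector, shape (2, 1)
    with torch.no_grad():
        return -(K @ x) - net(x.T).T     # nominal LQR input minus DNN estimate

def update(x, residual):             # residual: observed additive nonlinearity
    # Online regression on the latest (state, nonlinearity) pair
    loss = ((net(x.T).T - residual) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

u = control(torch.randn(2, 1))
update(torch.randn(2, 1), torch.tensor([[0.3]]))
```

The spectral normalization wrapper keeps each layer's Lipschitz constant bounded, which is the property the letter's stability analysis relies on.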
Integrating speech into LLMs (speech-LLMs) has gained increasing attention recently. The mainstream solution is to connect a well-trained speech encoder and an LLM with a neural adapter. However, the length mismatch between the speech and text sequences is not well handled, leading to imperfect modality matching between speech and text. In this work, we propose a novel neural adapter, AlignFormer, to reduce the length gap between the two modalities. AlignFormer consists of CTC and dynamic-window QFormer layers, where the CTC alignment provides the dynamic window information for the QFormer layers. The LLM backbone is frozen in training to preserve its text capability, especially the instruction-following capability. When training with only ASR data, the proposed AlignFormer unlocks the instruction-following capability for the speech-LLM, and the model can perform zero-shot speech translation (ST) and speech question answering (SQA) tasks. In principle, a speech-LLM with AlignFormer can perform the speech version of any task that the LLM backbone can handle. To evaluate the effectiveness of the instruction-following speech-LLM, we propose to use the instruction following rate (IFR) and offer a systematic perspective for IFR evaluation. In addition, we find that the audio position in training affects the instruction-following capability of the speech-LLM and conduct an in-depth study on it. Our findings show that audio-first training achieves higher IFR than instruction-first training. AlignFormer achieves a near 100% IFR with audio-first training, and game-changing improvements from zero to non-zero IFR on some evaluation data with instruction-first training. We believe this study is a big step towards perfect speech-text modality matching in the LLM embedding space.
Modern wireless communication systems necessitate the development of cost-effective resource allocation strategies, while ensuring maximal system performance. While commonly realizable via efficient waterfilling schemes, ergodic-optimal policies often exhibit instantaneous resource constraint fluctuations as a result of fading variability, possibly violating prescribed specifications within unacceptable margins and inducing further operational challenges and/or costs. At the other extreme, short-term-optimal policies -- commonly based on deterministic waterfilling -- while strictly maintaining operational specifications, are not only impractical and computationally demanding, but also suboptimal in a long-term sense. To address these challenges, we introduce a novel distributionally robust version of a classical point-to-point interference-free multi-terminal constrained stochastic resource allocation problem, leveraging the Conditional Value-at-Risk (CVaR) as a coherent measure of power policy fluctuation risk. We derive closed-form dual-parameterized expressions for the CVaR-optimal resource policy, along with corresponding optimal CVaR quantile levels, by capitalizing on (sampling) the underlying fading distribution. We subsequently develop two dual-domain schemes -- one model-based and one model-free -- to iteratively determine a globally optimal resource policy. Our numerical simulations confirm the remarkable effectiveness of the proposed approach, also revealing an almost-constant character of the CVaR-optimal policy at rather minimal ergodic rate optimality loss.
Recent speaker verification (SV) systems have shown a trend toward adopting deeper speaker embedding extractors. Although deeper and larger neural networks can significantly improve performance, their substantial memory requirements hinder training on consumer GPUs. In this paper, we explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios. Firstly, we conduct a systematic analysis of GPU memory allocation during SV system training. Empirical observations show that activations and optimizer states are the main sources of memory consumption. For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations during back-propagation, thereby significantly reducing memory usage without performance loss. For optimizer states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type. Experimental results on VoxCeleb demonstrate that the reversible variants of ResNets and DF-ResNets can perform training without the need to cache activations in GPU memory. In addition, the 8-bit versions of SGD and Adam save 75% of memory costs while maintaining performance compared to their 32-bit counterparts. Finally, a detailed comparison of memory usage and performance indicates that our proposed models achieve up to 16.2x memory savings, with nearly identical parameters and performance compared to the vanilla systems. In contrast to the previous need for multiple high-end GPUs such as the A100, we can effectively train deep speaker embedding extractors with just one or two consumer-level 2080Ti GPUs.
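The activation-memory saving in the preceding abstract rests on reversibility, which a minimal additive coupling block can illustrate; this is a generic sketch in the spirit of the reversible ResNet variants described there, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Inputs are recomputed exactly from the outputs, so intermediate
        # activations never need to be cached for back-propagation
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

blk = ReversibleBlock(64).eval()
x1, x2 = torch.randn(8, 64), torch.randn(8, 64)
with torch.no_grad():
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))
```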
Given the spatial heterogeneity of land use patterns in most cities, large-scale UAM will likely be deployed in specific areas, e.g., for transfer traffic between suburbs and city centers. However, large-scale UAM operations connecting multiple origin-destination pairs raise concerns about air traffic safety and efficiency with respect to conflicting movements, particularly at large conflict points similar to roadway junctions. In this work, we propose an operational framework that integrates route guidance and collision avoidance to achieve an elegant trade-off between air traffic safety and efficiency. The route guidance mechanism optimizes aircraft distribution across both spatial and temporal dimensions by regulating their paths (composed of waypoints). Given the optimized paths, the collision avoidance module generates collision-free aircraft trajectories between waypoints in 3D space. To enable large-scale operations, we develop a fast approximation method to solve the optimal path planning problem and employ the velocity obstacle model for collision avoidance. The proposed route guidance strategy significantly reduces the computational requirements for collision avoidance. To the best of our knowledge, this work is one of the first to combine route guidance and collision avoidance for UAM. The results indicate that the framework can enable efficient and flexible UAM operations, such as air traffic assignment, congestion prevention, and dynamic airspace clearance. Compared to a management scheme based on air corridors, the proposed framework achieves considerable improvements in computational efficiency (433%), average travel speed (70.2%), and trip completion rate (130%). The proposed framework has demonstrated great potential for real-time traffic simulation and management in large-scale UAM systems.
In this paper, we propose a novel multi-functional reconfigurable intelligent surface (MF-RIS) that simultaneously supports signal reflection, refraction, amplification, and target sensing. Our MF-RIS aims to enhance integrated sensing and communication (ISAC) systems, particularly in multi-user and multi-target scenarios. Equipped with reflection and refraction components (i.e., amplifiers and phase shifters), the MF-RIS is able to adjust the amplitude and phase shift of both communication and sensing signals on demand. Additionally, with the assistance of sensing elements, the MF-RIS is capable of capturing the echo signals from multiple targets, thereby mitigating the signal attenuation typically associated with multi-hop links. We propose an MF-RIS-enabled multi-user and multi-target ISAC system, and formulate an optimization problem to maximize the signal-to-interference-plus-noise ratio (SINR) of the sensing targets. This problem involves jointly optimizing the transmit beamforming and the MF-RIS configurations, subject to constraints on the communication rate, total power budget, and MF-RIS coefficients. We decompose the formulated non-convex problem into three sub-problems, and then solve them via an efficient iterative algorithm. Simulation results demonstrate that: 1) the performance of the MF-RIS varies under different operating protocols, and energy splitting (ES) exhibits the best performance in the considered MF-RIS-enabled multi-user multi-target ISAC system; 2) under the same total power budget, the proposed MF-RIS with the ES protocol attains 52.2%, 73.5%, and 60.86% sensing SINR gains over active RIS, passive RIS, and simultaneously transmitting and reflecting RIS (STAR-RIS), respectively; 3) increasing the number of sensing elements no longer improves sensing performance beyond a certain point.
Accurate embryo morphology assessment is essential in assisted reproductive technology for selecting the most viable embryo. Artificial intelligence has the potential to enhance this process. However, the limited availability of embryo data presents challenges for training deep learning models. To address this, we trained two generative models using two datasets (one we created and made publicly available, and one existing public dataset) to generate synthetic embryo images at various cell stages, including 2-cell, 4-cell, 8-cell, morula, and blastocyst. These were combined with real images to train classification models for embryo cell stage prediction. Our results demonstrate that incorporating synthetic images alongside real data improved classification performance, with the model achieving 97% accuracy compared to 95% when trained solely on real data. Notably, even when trained exclusively on synthetic data and tested on real data, the model achieved a high accuracy of 94%. Furthermore, combining synthetic data from both generative models yielded better classification results than using data from a single generative model. Four embryologists evaluated the fidelity of the synthetic images through a Turing test, during which they annotated inaccuracies and offered feedback. The analysis showed that the diffusion model outperformed the generative adversarial network model, deceiving embryologists 66.6% of the time versus 25.3%, and achieving lower Fréchet inception distance scores.
This paper introduces a novel approach to quantifying the uncertainties in fault diagnosis of motor drives using Bayesian neural networks (BNNs). Conventional data-driven approaches used for fault diagnosis often rely on point-estimate neural networks, which merely provide deterministic outputs and fail to capture the uncertainty associated with the inference process. In contrast, BNNs offer a principled framework to model uncertainty by treating network weights as probability distributions rather than fixed values. This offers several advantages: (a) improved robustness to noisy data, (b) enhanced interpretability of model predictions, and (c) the ability to quantify uncertainty in the decision-making process. To test the robustness of the proposed BNN, it is first evaluated on a conservative dataset of gear fault data from an experimental prototype covering three fault types, and it is then incrementally trained on new fault classes and datasets to explore its uncertainty quantification features and model interpretability under noisy data and unseen fault scenarios.
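A common, lightweight approximation of BNN-style uncertainty quantification is Monte Carlo sampling of stochastic forward passes; the sketch below uses MC dropout as a stand-in rather than the paper's full Bayesian treatment, and the network shape and class count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
                    nn.Linear(64, 3))            # e.g., 3 gear-fault classes

def predict_with_uncertainty(x, n_samples=50):
    net.train()                                   # keep dropout stochastic
    with torch.no_grad():
        probs = torch.stack([F.softmax(net(x), dim=-1)
                             for _ in range(n_samples)])
    mean = probs.mean(0)                          # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)
    return mean, entropy                          # high entropy = low confidence

mean, entropy = predict_with_uncertainty(torch.randn(4, 128))
print(mean.argmax(-1), entropy)                   # flags unseen fault scenarios
```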
This paper presents a novel two-stage method for constructing channel knowledge maps (CKMs), specifically for aerial-to-ground (A2G) channels, in the presence of non-cooperative interfering nodes (INs). We first estimate the interfering signal strength (ISS) at sampling locations based on total received signal strength measurements and the desired communication signal strength (DSS) map constructed with the environmental topology. Next, an ISS map construction network (IMNet) is proposed, in which a negative value correction module is included to enable precise reconstruction. Subsequently, we further perform signal-to-interference-plus-noise ratio map construction and IN localization. Simulation results demonstrate a lower construction error for the proposed IMNet compared to baselines in the presence of interference.
As large-scale distributed energy resources are integrated into the active distribution networks (ADNs), effective energy management in ADNs becomes increasingly prominent compared to traditional distribution networks. Although advanced reinforcement learning (RL) methods, which alleviate the burden of complicated modelling and optimization, have greatly improved the efficiency of energy management in ADNs, safety becomes a critical concern for RL applications in real-world problems. Since the design and adjustment of penalty functions, which correspond to operational safety constraints, require extensive domain knowledge in RL and power system operation, emerging ADN operators call for a more flexible and customized approach to address the penalty functions so that operational safety and efficiency can be further enhanced. Empowered with strong comprehension, reasoning, and in-context learning capabilities, large language models (LLMs) provide a promising way to assist safe RL for energy management in ADNs. In this paper, we introduce the LLM to comprehend operational safety requirements in ADNs and generate corresponding penalty functions. In addition, we propose an RL2 mechanism to refine the generated functions iteratively and adaptively through multi-round dialogues, in which the LLM agent adjusts the functions' pattern and parameters based on the training and test performance of the downstream RL agent. The proposed method significantly reduces the intervention of the ADN operators. Comprehensive test results demonstrate the effectiveness of the proposed method.
A deployed fiber with in-house and underground sections is interrogated with a coherent correlation OTDR. The origin and propagation speed of a hammer-generated pressure wave in the underground section are detected, and acoustic signals are monitored.
Distributed fiber sensing based on correlation-aided phase-sensitive optical time domain reflectometry is presented. The focus is on correlation as an enabler for high spatial resolution. Results from different applications are presented.
Safety filters ensure that only safe control actions are executed, regardless of the controller in question. Previous work has proposed a simple and stealthy false-data injection attack for deactivating such safety filters. This attack injects false sensor measurements to bias state estimates toward the interior of a safety region, making the safety filter accept unsafe control actions. The attack does, however, require the adversary to know the dynamics of the system, the safety region used in the safety filter, and the observer gain. In this work, we relax these requirements and show how a similar data-injection attack can be performed when the adversary only observes the input and output of the observer used by the safety filter, without any a priori knowledge of the system dynamics, safety region, or observer gain. In particular, the adversary uses the observed data to identify a state-space model that describes the observer dynamics, and then approximates a safety region in the identified embedding. We exemplify the data-driven attack on an inverted pendulum, where we show how the attack can make the system leave a safe set, even when a safety filter is supposed to prevent this from happening.
This paper proposes using textual similarities of audio captions to estimate audio-caption relevances for training text-based audio retrieval systems. Current audio-caption datasets (e.g., Clotho) contain audio samples paired with annotated captions but lack relevance information about audio samples and captions beyond the annotated pairs. Moreover, mainstream approaches (e.g., CLAP) usually treat the annotated pairs as positives and consider all other audio-caption combinations as negatives, assuming a binary relevance between audio samples and captions. To infer the relevance between audio samples and arbitrary captions, we propose a method that computes non-binary audio-caption relevance scores based on the textual similarities of audio captions. We measure the textual similarities of audio captions by calculating the cosine similarity of their Sentence-BERT embeddings and then transform these similarities into audio-caption relevance scores using a logistic function, thereby linking audio samples through their annotated captions to all other captions in the dataset. To integrate the computed relevances into training, we employ a listwise ranking objective, where relevance scores are converted into probabilities of ranking audio samples for a given textual query. We show the effectiveness of the proposed method by demonstrating improvements in text-based audio retrieval compared to methods that use binary audio-caption relevances for training.
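The relevance computation just described reduces to a few lines. The sketch below uses an assumed SBERT checkpoint, hypothetical logistic parameters, and an assumed softmax temperature; only the overall recipe (cosine similarity, logistic transform, listwise probabilities) follows the abstract:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

captions = ["a dog barks twice in the distance",
            "distant barking of a dog",
            "rain falls steadily on a tin roof"]

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed SBERT checkpoint
emb = model.encode(captions, normalize_embeddings=True)
cos = emb @ emb.T                                  # pairwise cosine similarities

a, b = 10.0, 0.5                                   # hypothetical logistic params
relevance = 1.0 / (1.0 + np.exp(-a * (cos - b)))   # similarities -> relevances

# Listwise use: one caption's relevances over all audio as a ranking
# distribution (softmax; the temperature 0.1 is an assumption)
logits = relevance[0] / 0.1
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
print(np.round(relevance, 2), np.round(probs, 2))
```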
In a recent paper, we presented the KU Leuven audiovisual, gaze-controlled auditory attention decoding (AV-GC-AAD) dataset, in which we recorded electroencephalography (EEG) signals of participants attending to one out of two competing speakers under various audiovisual conditions. The main goal of this dataset was to disentangle the direction of gaze from the direction of auditory attention, in order to reveal gaze-related shortcuts in existing spatial AAD algorithms that aim to decode the (direction of) auditory attention directly from the EEG. Various methods based on spatial AAD do not achieve significant above-chance performances on our AV-GC-AAD dataset, indicating that previously reported results were mainly driven by eye gaze confounds in existing datasets. Still, these adverse outcomes are often discarded for reasons that are attributed to the limitations of the AV-GC-AAD dataset, such as the limited amount of data to train a working model, too much data heterogeneity due to different audiovisual conditions, or participants allegedly being unable to focus their auditory attention under the complex instructions. In this paper, we present the results of the linear stimulus reconstruction AAD algorithm and show that high AAD accuracy can be obtained within each individual condition and that the model generalizes across conditions, across new subjects, and even across datasets. Therefore, we eliminate any doubts that the inadequacy of the AV-GC-AAD dataset is the primary reason for the (spatial) AAD algorithms failing to achieve above-chance performance when compared to other datasets. Furthermore, this report provides a simple baseline evaluation procedure (including source code) that can serve as the minimal benchmark for all future AAD algorithms evaluated on this dataset.
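For readers unfamiliar with the baseline named above, a generic linear stimulus reconstruction decoder can be sketched as ridge regression from time-lagged EEG to the attended speech envelope, with attention decoded by correlation; the lag count and regularization below are assumptions, not the report's exact settings:

```python
import numpy as np

def lagged(eeg, n_lags):
    # Stack time-lagged copies of each channel: (T, C) -> (T - n_lags, C * n_lags)
    T = eeg.shape[0]
    return np.hstack([np.roll(eeg, -k, axis=0) for k in range(n_lags)])[:T - n_lags]

def train_decoder(eeg, env_attended, n_lags=32, lam=1e3):
    # Ridge regression from lagged EEG to the attended speech envelope
    X = lagged(eeg, n_lags)
    y = env_attended[:len(X)]
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def decode(eeg, env1, env2, w, n_lags=32):
    rec = lagged(eeg, n_lags) @ w                  # reconstructed envelope
    r1 = np.corrcoef(rec, env1[:len(rec)])[0, 1]
    r2 = np.corrcoef(rec, env2[:len(rec)])[0, 1]
    return 1 if r1 >= r2 else 2                    # attended = higher correlation
```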
This paper focuses on the identification of the state noise density of a linear time-varying system described by a state-space model with a known measurement noise density. For this purpose, a novel method extending the capabilities of the measurement difference method (MDM) is proposed. The proposed method is based on the enhanced MDM residue, calculated as a sum of the state and measurement noise, and on the construction of a kernel density of the residue samples. The state noise density is then estimated by a density deconvolution algorithm utilising the Fourier transform. The developed method is supplemented with an automatic selection of the user-defined deconvolution parameters based on the proposed noise moment equality method. The state noise density estimation performance is evaluated in numerical examples and supplemented with a MATLAB example implementation.
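The deconvolution idea admits a schematic sketch: divide the empirical characteristic function of the residue by the known measurement noise characteristic function and invert the transform. The frequency cutoff below plays the role of a user-defined regularization parameter and, like the Gaussian noise example, is an assumption rather than the paper's setup:

```python
import numpy as np

def deconvolve_density(residue_samples, noise_cf, grid, cutoff=5.0):
    t = np.linspace(-cutoff, cutoff, 512)          # frequency grid (regularizer)
    # Empirical characteristic function of the residue (state + meas. noise)
    cf_res = np.mean(np.exp(1j * np.outer(t, residue_samples)), axis=1)
    cf_state = cf_res / noise_cf(t)                # deconvolve the known noise
    # Inverse Fourier transform back to a density estimate on `grid`
    dens = np.real(np.trapz(np.exp(-1j * np.outer(grid, t)) * cf_state,
                            t, axis=1)) / (2.0 * np.pi)
    return np.clip(dens, 0.0, None)

# Example: zero-mean Gaussian measurement noise with known std sigma_v
sigma_v = 0.5
gauss_cf = lambda t: np.exp(-0.5 * (sigma_v * t) ** 2)
grid = np.linspace(-3, 3, 200)
residues = np.random.default_rng(2).normal(0.0, 1.0, 5000)  # toy residue data
state_density = deconvolve_density(residues, gauss_cf, grid)
```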
Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
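The co-factorization step described above can be sketched with off-the-shelf NMF; the feature matrices below are random stand-ins for the frozen pre-trained audio and image models, and the concept count is an assumption:

```python
import numpy as np
from sklearn.decomposition import NMF

T, Da, Dv, K = 100, 128, 256, 8            # frames, feature dims, concept count
rng = np.random.default_rng(3)
audio_feats = np.abs(rng.standard_normal((T, Da)))    # stand-in: frozen audio model
visual_feats = np.abs(rng.standard_normal((T, Dv)))   # stand-in: frozen image model

X = np.hstack([audio_feats, visual_feats])  # joint non-negative matrix (T, Da+Dv)
model = NMF(n_components=K, init="nndsvda", max_iter=500)
H = model.fit_transform(X)                  # shared concept activations (T, K)
W = model.components_                       # concept bases over both modalities
W_audio, W_visual = W[:, :Da], W[:, Da:]    # per-modality concept dictionaries
```

Factorizing the stacked matrix forces both modalities to share one set of activations H, which is what lets a concept discovered in the audio features be grounded in the visual features.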
Delay-Doppler (DD) signal processing has emerged as a powerful tool for analyzing multipath and time-varying channel effects. Due to the inherent sparsity of the wireless channel in the DD domain, compressed sensing (CS) based techniques, such as orthogonal matching pursuit (OMP), are commonly used for channel estimation. However, many of these methods assume integer Doppler shifts, which can lead to performance degradation in the presence of fractional Doppler. In this paper, we propose a windowed dictionary design technique and develop a delay-aware orthogonal matching pursuit (DA-OMP) algorithm that mitigates the impact of fractional Doppler shifts on DD domain channel estimation. First, we apply receiver windowing to reduce the correlation between the columns of our proposed dictionary matrix. Second, we introduce a delay-aware interference block to quantify the interference caused by fractional Doppler. This approach removes the need for a pre-determined stopping criterion, which is typically based on the number of propagation paths, in the conventional OMP algorithm. Our simulation results confirm the effectiveness of the proposed DA-OMP algorithm with the proposed windowed dictionary in terms of the normalized mean square error (NMSE) of the channel estimate. In particular, the proposed DA-OMP algorithm demonstrates substantial gains over the standard OMP algorithm in terms of channel estimation NMSE, both with and without the windowed dictionary.
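For reference, the standard OMP baseline that DA-OMP builds on is sketched below (the delay-aware interference block itself is not reproduced); note how the conventional algorithm needs the number of iterations, i.e., a path-count-based stopping criterion, as an input:

```python
import numpy as np

def omp(D, y, n_iters):
    # Greedy sparse recovery of h in y ~ D @ h
    r, support = y.astype(complex), []
    for _ in range(n_iters):
        # Select the dictionary column most correlated with the residual
        support.append(int(np.argmax(np.abs(D.conj().T @ r))))
        Ds = D[:, support]
        h_s, *_ = np.linalg.lstsq(Ds, y, rcond=None)
        r = y - Ds @ h_s                        # update residual
    h = np.zeros(D.shape[1], dtype=complex)
    h[support] = h_s
    return h
```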
Scoliosis is traditionally assessed based solely on 2D lateral deviations, but recent studies have also revealed the importance of other imaging planes in understanding the deformation of the spine. Consequently, extracting the spinal geometry in 3D would help quantify these spinal deformations and aid diagnosis. In this study, we propose an automated general framework to estimate the 3D spine shape from 2D DXA scans. We achieve this by explicitly predicting the sagittal view of the spine from the DXA scan. Using these two orthogonal projections of the spine (coronal in DXA, and sagittal from the prediction), we are able to describe the 3D shape of the spine. The prediction is learnt from over 30k paired images of DXA and MRI scans. We assess the performance of the method on a held-out test set and achieve high accuracy.
Deep learning models are widely used to process Computed Tomography (CT) data in the automated screening of pulmonary diseases, significantly reducing the workload of physicians. However, the three-dimensional nature of CT volumes involves an excessive number of voxels, which significantly increases the complexity of model processing. Previous screening approaches often overlook this issue, which undoubtedly reduces screening efficiency. Towards efficient and effective screening, we design a hierarchical approach to reduce the computational cost of pulmonary disease screening. The new approach re-organizes the screening workflow into three steps. First, we propose a Computed Tomography Volume Compression (CTVC) method to select a small slice subset that comprehensively represents the whole CT volume. Second, the selected CT slices are used to detect pulmonary diseases coarsely via a lightweight classification model. Third, an uncertainty measurement strategy is applied to identify samples with low diagnostic confidence, which are re-detected by radiologists. Experiments on two public pulmonary disease datasets demonstrate that our approach achieves comparable accuracy and recall while reducing the time by 50%-70% compared with counterparts using full CT volumes. In addition, our approach outperforms previous cutting-edge CTVC methods in retaining important indications after compression.
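The third step admits a simple sketch: flag predictions whose softmax output has high predictive entropy for radiologist review. The entropy threshold here is an assumption, not the paper's calibrated value:

```python
import numpy as np

def triage(softmax_probs, entropy_threshold=0.5):
    # softmax_probs: (n_samples, n_classes) outputs of the lightweight model
    p = np.clip(softmax_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return entropy > entropy_threshold           # True -> send to radiologist

probs = np.array([[0.97, 0.02, 0.01],            # confident: auto-accept
                  [0.40, 0.35, 0.25]])           # uncertain: re-detect manually
print(triage(probs))                             # [False  True]
```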
Developing methods to process irregularly structured data is crucial in applications like gene-regulatory, brain, power, and socioeconomic networks. Graphs have been the go-to algebraic tool for modeling the structure via nodes and edges capturing their interactions, leading to the establishment of the fields of graph signal processing (GSP) and graph machine learning (GML). Key graph-aware methods include Fourier transform, filtering, sampling, as well as topology identification and spatiotemporal processing. Although versatile, graphs can model only pairwise dependencies in the data. To this end, topological structures such as simplicial and cell complexes have emerged as algebraic representations for more intricate structure modeling in data-driven systems, fueling the rapid development of novel topological-based processing and learning methods. This paper first presents the core principles of topological signal processing through the Hodge theory, a framework instrumental in propelling the field forward thanks to principled connections with GSP-GML. It then outlines advances in topological signal representation, filtering, and sampling, as well as inferring topological structures from data, processing spatiotemporal topological signals, and connections with topological machine learning. The impact of topological signal processing and learning is finally highlighted in applications dealing with flow data over networks, geometric processing, statistical ranking, biology, and semantic communication.
An analogue of the describing function method is developed using square waves rather than sinusoids. Static nonlinearities map square waves to square waves, and their behavior is characterized by their response to square waves of varying amplitude - their amplitude response. The output of an LTI system to a square wave input is approximated by a square wave, to give an analogue of the describing function. The classical describing function method for predicting oscillations in feedback interconnections is generalized to this square wave setting, and gives accurate predictions when oscillations are approximately square.
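A toy numeric illustration of the amplitude response and the square-wave fit described above, with tanh as an assumed odd static nonlinearity:

```python
import numpy as np

phi = np.tanh                                   # assumed odd static nonlinearity
amp_response = lambda A: phi(A)                 # output square-wave amplitude

t = np.linspace(0, 1, 1000, endpoint=False)
s = np.sign(np.sin(2 * np.pi * 4 * t))          # unit square wave at 4 Hz
A = 2.0
out = phi(A * s)                                # square wave with values +/- phi(A)

def squarewave_fit(y, s):
    # Least-squares amplitude of the square-wave approximation of y
    return float(y @ s / (s @ s))

print(amp_response(A), squarewave_fit(out, s))  # identical for a static phi
```

For a static nonlinearity the fit recovers the amplitude response exactly; applying the same least-squares fit to an LTI system's response to a square wave gives the square-wave analogue of the describing function.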
Recent studies showed that network slices (NSs), which are logical networks supported by shared physical networks, can experience service interference due to the sharing of physical and virtual resources. Thus, from the perspective of providing end-to-end (E2E) service quality assurance in 5G/6G systems, it is crucial to discover possible service interference among the NSs in a timely manner and isolate the potential issues before they can lead to violations of service quality agreements. We study the problem of detecting service interference among NSs in 5G/6G systems using only E2E key performance indicator measurements, and propose a new algorithm. Our numerical studies demonstrate that, even when the service interference among NSs is weak to moderate, provided that a reasonable number of measurements are available, the proposed algorithm can correctly identify most of the shared resources that can cause service interference among the NSs that utilize them.
Private 5G networks provide enhanced security, a wide range of optimized services through network slicing, reduced latency, and support for many IoT devices in a specific area, all under the owner's full control. Higher security and privacy to protect sensitive data are the most significant advantages of private networks in, e.g., smart hospitals. For the long-term sustainability and cost-effectiveness of private 5G networks, analyzing and understanding energy consumption variation is of great significance for moving toward a green private network architecture for 6G. This paper addresses this research gap by providing energy profiling of network components using an experimental laboratory setup that mimics real private 5G networks under various network conditions, an aspect missing in the existing literature.
This paper presents a novel approach to designing millimeter-wave (mmWave) cellular communication systems based on a joint phase time array (JPTA) radio frequency (RF) frontend architecture. The JPTA architecture comprises time-delay components appended to conventional phase shifters, which offer extra degrees of freedom that can be exploited to design frequency-selective analog beams. Hence, a mmWave device equipped with JPTA can receive and transmit signals in multiple directions in a single time slot per RF chain, one direction per frequency subband, which alleviates the traditional constraint of one analog beam per transceiver chain per time slot. The utilization of subband-specific analog beams offers new opportunities in designing mmWave systems, allowing for enhanced cell capacity and reduced pilot overhead. To understand the practical feasibility of JPTA, a few challenges and system design considerations are discussed in relation to the performance and complexity of JPTA systems. For example, frequency-selective beam gain losses are present for the subband analog beams, e.g., up to 1 dB of loss in the two-subband case, even with the state-of-the-art JPTA delay and phase optimization methods. Despite these side effects, system-level analysis reveals that the JPTA system is capable of improving cell capacity, increasing the 5th-percentile throughput by up to 65%.
The communication control delay between the inverters and the power plant controller can be caused by several factors related to the communication link. Under undesirable conditions, high delay values can produce oscillations in the wind power plant that can affect the rest of the power system. In this work, we present a new robust methodology to estimate the communication control delay of wind turbines using PMU data. Several scenarios are considered in which external faults are simulated, and the performance of the algorithm is evaluated based on dynamic state estimation of the mathematical model of the wind turbine. We show that the delay can be characterized, offering the transmission system operator an online tool to identify the most suitable communication delay for the plant controller models used in dynamic studies.
Transfer learning is an umbrella term for machine learning approaches that leverage knowledge gained from solving one problem (the source domain) to improve speed, efficiency, and data requirements in solving a different but related problem (the target domain). The performance of the transferred model in the target domain is typically measured via some notion of loss function in the target domain. This paper focuses on effectively transferring control logic from a source control system to a target control system while providing approximately similar behavioral guarantees in both domains. However, in the absence of a complete characterization of behavioral specifications, this problem cannot be captured in terms of loss functions. To overcome this challenge, we use (approximate) simulation relations to characterize observational equivalence between the behaviors of two systems. Simulation relations ensure that the outputs of both systems, equipped with their corresponding controllers, remain close to each other over time, and their closeness can be quantified {\it a priori}. By parameterizing simulation relations with neural networks, we introduce the notion of \emph{neural simulation relations}, which provides a data-driven approach to transfer any synthesized controller, regardless of the specification of interest, along with its proof of correctness. Compared with prior approaches, our method eliminates the need for a closed-loop mathematical model and specific requirements for both the source and target systems. We also introduce validity conditions that, when satisfied, guarantee the closeness of the outputs of two systems equipped with their corresponding controllers, thus eliminating the need for post-facto verification. We demonstrate the effectiveness of our approach through case studies involving a vehicle and a double inverted pendulum.
Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of deep learning has propelled the development of methods that process audio-visual data, as reflected in the number of proposals in recent years, motivating a comprehensive survey. Besides analyzing the models used in this context, we also discuss task definitions and paradigms applied in AI multimedia. In addition, we investigate objective functions frequently used and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In particular, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible concepts, can guide the learning process. Most importantly, we summarize the recent progress of Audio-Visual Correlation Learning (AVCL) and discuss future research directions.
EEG signals have emerged as a powerful tool in affective brain-computer interfaces, playing a crucial role in emotion recognition. However, current deep transfer learning-based methods for EEG recognition face challenges due to their reliance on both source and target data during model learning, which significantly affects model performance and generalization. To overcome this limitation, we propose a novel framework (PL-DCP) built on the concepts of feature disentanglement and prototype inference. The dual prototyping mechanism incorporates both domain and class prototypes: domain prototypes capture individual variations across subjects, while class prototypes represent the ideal class distributions within their respective domains. Importantly, the proposed PL-DCP framework operates exclusively with source data during training, meaning that target data remains completely unseen throughout the entire process. To address label noise, we employ a pairwise learning strategy that encodes proximity relationships between sample pairs, effectively reducing the influence of mislabeled data. Experimental validation on the SEED and SEED-IV datasets demonstrates that PL-DCP, despite not utilizing target data during training, achieves performance comparable to deep transfer learning methods that require both source and target data. This highlights the potential of PL-DCP as an effective and robust approach for EEG-based emotion recognition.
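As an illustration of the dual prototyping mechanism, the following sketch (ours, with assumed array shapes and names, not the authors' code) computes domain and class prototypes as simple feature means over the source data:

\begin{verbatim}
import numpy as np

def compute_prototypes(features, class_labels, domain_labels):
    """Illustrative dual prototyping:
    features      (N, D) source-domain feature vectors
    class_labels  (N,)   emotion class per sample
    domain_labels (N,)   subject/domain id per sample"""
    # Domain prototypes: per-subject feature means, capturing individual variation.
    domain_protos = {d: features[domain_labels == d].mean(axis=0)
                     for d in np.unique(domain_labels)}
    # Class prototypes: per-domain class means, i.e. the ideal class
    # distribution within each subject's domain.
    class_protos = {(d, c): features[(domain_labels == d)
                                     & (class_labels == c)].mean(axis=0)
                    for d in np.unique(domain_labels)
                    for c in np.unique(class_labels)}
    return domain_protos, class_protos
\end{verbatim}

At inference, an unseen sample would be compared against these prototypes, so no target data is needed during training.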
Rolling bearings play a crucial role in industrial machinery, directly influencing equipment performance, durability, and safety. However, harsh operating conditions, such as high speeds and temperatures, often lead to bearing malfunctions, resulting in downtime, economic losses, and safety hazards. This paper proposes the Residual Attention Single-Head Vision Transformer Network (RA-SHViT-Net) for fault diagnosis in rolling bearings. Vibration signals are transformed from the time to frequency domain using the Fast Fourier Transform (FFT) before being processed by RA-SHViT-Net. The model employs the Single-Head Vision Transformer (SHViT) to capture local and global features, balancing computational efficiency and predictive accuracy. To enhance feature extraction, the Adaptive Hybrid Attention Block (AHAB) integrates channel and spatial attention mechanisms. The network architecture includes Depthwise Convolution, Single-Head Self-Attention, Residual Feed-Forward Networks (Res-FFN), and AHAB modules, ensuring robust feature representation and mitigating gradient vanishing issues. Evaluation on the Case Western Reserve University and Paderborn University datasets demonstrates the RA-SHViT-Net's superior accuracy and robustness in complex, noisy environments. Ablation studies further validate the contributions of individual components, establishing RA-SHViT-Net as an effective tool for early fault detection and classification, promoting efficient maintenance strategies in industrial settings.
Keywords: rolling bearings, fault diagnosis, Vision Transformer, attention mechanism, noisy environments, Fast Fourier Transform (FFT)
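As a minimal sketch of the preprocessing step (ours; the sampling rate, window length, and normalisation are assumptions), a raw vibration segment is mapped to a one-sided magnitude spectrum before entering the network:

\begin{verbatim}
import numpy as np

def vibration_to_spectrum(segment, fs):
    """Map a raw vibration segment to a one-sided FFT magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(segment)) / len(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    return freqs, spectrum

fs = 12_000                                   # e.g. a 12 kHz accelerometer
t = np.arange(fs) / fs
segment = np.sin(2 * np.pi * 157 * t) + 0.1 * np.random.randn(fs)
freqs, spec = vibration_to_spectrum(segment, fs)
\end{verbatim}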
Real-time monitoring of critical parameters is essential for energy systems' safe and efficient operation. However, traditional sensors often fail or degrade in harsh environments, and some important locations are simply inaccessible to physical sensors. In addition, there are important parameters that cannot be directly measured by sensors at all. Machine learning (ML)-based real-time monitoring is therefore needed in those remote locations to ensure system operations. However, traditional ML models struggle to process continuous sensor profile data to fit model requirements, leading to the loss of spatial relationships. Another challenge for real-time monitoring is ``dataset shift'' and the need for frequent retraining under varying conditions, where extensive retraining prohibits real-time inference. To resolve these challenges, this study enables monitoring in locations where physical sensors are impractical to deploy. Our proposed approach, utilizing Multi-Input Operator Network virtual sensors, leverages deep learning to seamlessly integrate diverse data sources and accurately predict key parameters in real time without the need for additional physical sensors. The approach's effectiveness is demonstrated through thermal-hydraulic monitoring in a nuclear reactor subchannel, achieving remarkable accuracy.
This study presents acoustic-based methods for the control of multiple autonomous underwater vehicles (AUVs), proposing two different models for implementing boundary and path control on low-cost AUVs using acoustic communication and a single central acoustic beacon. The Range Variation-Based (RVB) model relies entirely on range data obtained by acoustic modems, whereas the Heading Estimation-Based (HEB) model uses ranges and range rates to estimate the position of the central boundary beacon and perform assigned behaviors. The models are tested on two boundary control behaviors: Fencing and Milling. Fencing ensures that AUVs return within predefined boundaries, while Milling enables the AUVs to move cyclically on a predefined path around the beacon. The models are validated by successfully performing the boundary control behaviors in simulations, pool tests including artificial underwater currents, and field tests conducted in the ocean. All tests were performed with fully autonomous platforms, and no external input or sensor was provided to the AUVs during validation. Quantitative and qualitative analyses are presented, focusing on the performance of the proposed models and their application to multi-robot systems.
This work tackles the fidelity objective in perceptual super-resolution~(SR). Specifically, we address the shortcomings of the pixel-level $L_\text{p}$ loss ($\mathcal{L}_\text{pix}$) in the GAN-based SR framework. Since $\mathcal{L}_\text{pix}$ is known to have a trade-off relationship against perceptual quality, prior methods often multiply it by a small scale factor or apply low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of $\mathcal{L}_\text{pix}$ that contributes to blurring, and 2) only guiding based on the factor that is free from this trade-off relationship. We show that both can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with $\mathcal{L}_\text{pix}$. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss ($\mathcal{L}_\text{AESOP}$), a novel loss function that measures distance in the AE space instead of the raw pixel space. Note that the AE space indicates the space after the decoder, not the bottleneck. By simply substituting $\mathcal{L}_\text{pix}$ with $\mathcal{L}_\text{AESOP}$, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify that AESOP can lead to favorable results in the perceptual SR task.
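A minimal sketch of the core idea (ours; the frozen-AE interface is an assumption, and the paper's exact formulation may differ): the fidelity loss is computed after passing both the SR output and the ground truth through an AE pretrained with the pixel loss, i.e., in the decoder-output space rather than the raw pixel space:

\begin{verbatim}
import torch
import torch.nn.functional as F

def aesop_loss(sr, hr, autoencoder):
    """Schematic AESOP-style fidelity loss with a frozen, pixel-loss
    pretrained autoencoder."""
    with torch.no_grad():
        target = autoencoder(hr)          # AE reconstruction of ground truth
    return F.l1_loss(autoencoder(sr), target)  # gradients flow through the AE
\end{verbatim}

In a GAN-based SR framework, this term would simply take the place of $\mathcal{L}_\text{pix}$ alongside the adversarial and perceptual losses.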
Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean-sample selection methods have not utilized the powerful pre-trained features of vision foundation models (VFMs) and have assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, and thus classifies the training samples robustly. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMnist, and OrgancMnist datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.
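As a sketch of the clean-sample selection idea (ours, not the released CUFIT code; the probe and agreement criterion are assumptions), a linear classifier is fit on frozen foundation-model features and samples whose prediction agrees with the given label are kept:

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_clean_samples(vfm_features, noisy_labels):
    """Linear probing on frozen VFM features; keep samples whose probe
    prediction agrees with the (possibly noisy) label."""
    probe = LogisticRegression(max_iter=1000).fit(vfm_features, noisy_labels)
    agree = probe.predict(vfm_features) == noisy_labels
    return np.where(agree)[0]   # indices treated as "clean" for the next stage
\end{verbatim}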
Planning safe and efficient trajectories through signal-free intersections presents significant challenges for autonomous vehicles (AVs), particularly in dynamic, multi-task environments with unpredictable interactions and an increased possibility of conflicts. This study aims to address these challenges by developing a robust, adaptive framework to ensure safety in such complex scenarios. Existing approaches often struggle to provide reliable safety mechanisms in dynamic environments and to learn multi-task behaviors from demonstrations in signal-free intersections. This study proposes a safety-critical planning method that integrates Dynamic High-Order Control Barrier Functions (DHOCBF) with a diffusion-based model, called the Dynamic Safety-Critical Diffuser (DSC-Diffuser), offering a robust solution for adaptive, safe, and multi-task driving in signal-free intersections. Our approach incorporates a goal-oriented, task-guided diffusion model, enabling the model to learn multiple driving tasks simultaneously from real-world data. To further ensure driving safety in dynamic environments, the proposed DHOCBF framework dynamically adjusts to account for the movements of surrounding vehicles, offering enhanced adaptability compared to traditional control barrier functions. Validity evaluations of DHOCBF, conducted through numerical simulations, demonstrate its robustness in adapting to variations in obstacle velocities, sizes, uncertainties, and locations, effectively maintaining driving safety across a wide range of complex and uncertain scenarios. Performance evaluations across various scenes confirm that DSC-Diffuser provides realistic, stable, and generalizable policies, equipping it with the flexibility to adapt to diverse driving tasks.
Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection - the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone, we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
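As a schematic of the spurious feature (ours; the energy threshold and frame size are assumptions), the duration of leading silence can be measured with a simple energy gate:

\begin{verbatim}
import numpy as np

def leading_silence_ms(wave, sr, thresh_db=-35.0, frame=512):
    """Duration of leading silence via a frame-wise RMS energy gate."""
    n = len(wave) // frame
    rms_db = 10 * np.log10(np.mean(wave[:n * frame].reshape(-1, frame)**2,
                                   axis=1) + 1e-10)
    voiced = np.nonzero(rms_db > thresh_db)[0]
    first = voiced[0] if voiced.size else n
    return 1000.0 * first * frame / sr
\end{verbatim}

Thresholding this single number is essentially what separates real from fake samples on the affected datasets.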
The integration of hyperspectral imaging (HSI) and LiDAR data within new linear feature spaces offers a promising solution to the challenges posed by the high dimensionality and redundancy inherent in HSIs. This study introduces a dual linear fused space framework that capitalizes on bidirectional reversed convolutional neural network (CNN) pathways, coupled with a specialized spatial analysis block. This approach combines the computational efficiency of CNNs with the adaptability of attention mechanisms, facilitating the effective fusion of spectral and spatial information. The proposed method not only enhances data processing and classification accuracy, but also mitigates the computational burden typically associated with advanced models such as Transformers. Evaluations on the Houston 2013 dataset demonstrate that our approach surpasses existing state-of-the-art models. This advancement underscores the potential of the framework in resource-constrained environments and its significant contributions to the field of remote sensing.
This study explores audio classification from raw waveforms using Convolutional Neural Networks (CNNs), a method that eliminates the need for extracting specialised features in the pre-processing step. Unlike recent trends in the literature, which often focus on designing frontends or filters for only the initial layers of CNNs, our research introduces the Cosine Convolutional Neural Network (CosCovNN), replacing the traditional CNN filters with cosine filters. The CosCovNN surpasses the accuracy of equivalent CNN architectures with approximately $77\%$ fewer parameters. Our research further progresses with the development of an augmented CosCovNN named the Vector Quantised Cosine Convolutional Neural Network with Memory (VQCCM), which incorporates a memory and a vector quantisation layer. VQCCM achieves state-of-the-art (SOTA) performance across five different datasets in comparison with the existing literature. Our findings show that cosine filters can greatly improve the efficiency and accuracy of CNNs in raw audio classification.
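The parameter saving follows from parameterising each filter by a frequency and a phase rather than by free taps. The following sketch (our reading of the idea; windowing and initialisation details are assumptions) illustrates a cosine-parameterised convolution layer:

\begin{verbatim}
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineConv1d(nn.Module):
    """Each filter is a cosine with a learnable frequency and phase, so a
    length-L filter costs 2 parameters instead of L."""
    def __init__(self, out_channels, kernel_size):
        super().__init__()
        self.freq = nn.Parameter(torch.rand(out_channels))   # normalised freq.
        self.phase = nn.Parameter(torch.zeros(out_channels))
        self.register_buffer("n", torch.arange(kernel_size).float())

    def forward(self, x):                                    # x: (B, 1, T)
        filters = torch.cos(2 * math.pi * self.freq[:, None] * self.n
                            + self.phase[:, None])           # (C, L)
        return F.conv1d(x, filters[:, None, :])
\end{verbatim}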
A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker. This technology has paved the way for various personalized applications that cater to individual preferences. A noteworthy challenge faced by SV systems is their ability to perform consistently across a range of emotional spectra. Most existing models exhibit high error rates when dealing with emotional utterances compared to neutral ones. Consequently, this phenomenon often leads to missing out on speech of interest. This issue primarily stems from the limited availability of labeled emotional speech data, impeding the development of robust speaker representations that encompass diverse emotional states. To address this concern, we propose a novel approach employing the CycleGAN framework to serve as a data augmentation method. This technique synthesizes emotional speech segments for each specific speaker while preserving the unique vocal identity. Our experimental findings underscore the effectiveness of incorporating synthetic emotional data into the training process. The models trained using this augmented dataset consistently outperform the baseline models on the task of verifying speakers in emotional speech scenarios, reducing equal error rate by as much as 3.64% relative.
MusicGen is a music generation language model (LM) that can be conditioned on textual descriptions and melodic features. We introduce MusicGen-Chord, which extends this capability by incorporating chord progression features. This model modifies one-hot encoded melody chroma vectors into multi-hot encoded chord chroma vectors, enabling the generation of music that reflects both chord progressions and textual descriptions. Furthermore, we developed MusicGen-Remixer, an application utilizing MusicGen-Chord to generate remixes of input music conditioned on textual descriptions. Both models are integrated into Replicate's web-UI using cog, facilitating broad accessibility and user-friendly controllable interaction for creating and experiencing AI-generated music.
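As an illustration of the conditioning change (our own sketch; the model's exact chord vocabulary and mapping are assumptions), a chord symbol becomes a multi-hot 12-dimensional chroma vector instead of a one-hot melody chroma:

\begin{verbatim}
# Convert a chord symbol into a multi-hot chroma vector using standard
# triad templates (illustrative only).
PITCH = {'C': 0, 'C#': 1, 'D': 2, 'D#': 3, 'E': 4, 'F': 5,
         'F#': 6, 'G': 7, 'G#': 8, 'A': 9, 'A#': 10, 'B': 11}
TRIAD = {'maj': (0, 4, 7), 'min': (0, 3, 7)}

def chord_to_chroma(root, quality='maj'):
    chroma = [0.0] * 12
    for interval in TRIAD[quality]:
        chroma[(PITCH[root] + interval) % 12] = 1.0
    return chroma

chord_to_chroma('A', 'min')   # 1.0 at pitch classes A, C, and E
\end{verbatim}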
The convergence of cross-modal adversarial learning and physics-driven methods represents a cutting-edge direction for tackling challenges in complex multi-modal tasks and scientific computing. This review focuses on systematically analyzing how these two approaches can be synergistically integrated to enhance performance and robustness across diverse application domains. By addressing key obstacles such as modality discrepancies, limited data availability, and insufficient model robustness, this paper highlights the role of physics-based optimization frameworks in facilitating efficient and interpretable adversarial perturbation generation. The review also explores significant advancements in cross-modal adversarial learning, including applications in tasks such as image cross-modal retrieval (e.g., infrared and RGB matching), scientific computing (e.g., solving partial differential equations), and optimization under physical consistency constraints in vision systems. By examining theoretical foundations and experimental outcomes, this study demonstrates the potential of combining these approaches to handle complex scenarios and improve the security of multi-modal systems. Finally, we outline future directions, proposing a novel framework that unifies physical principles with adversarial optimization, providing a pathway for researchers to develop robust and adaptable cross-modal learning methods with both theoretical and practical significance.
Many problems in navigation and tracking require increasingly accurate characterizations of the evolution of uncertainty in nonlinear systems. Nonlinear uncertainty propagation approaches based on Gaussian mixture density approximations offer distinct advantages over sampling-based methods in their computational cost and continuous representation. State-of-the-art Gaussian mixture approaches are adaptive in that individual Gaussian mixands are selectively split into mixtures to yield better approximations of the true propagated distribution. Despite the importance of the splitting process to accuracy and computational efficiency, relatively little work has been devoted to mixand selection and splitting direction optimization. The first part of this work presents splitting methods that preserve the mean and covariance of the original distribution. Then, we present and compare a number of novel heuristics for selecting the splitting direction. The choice of splitting direction is informed by the initial uncertainty distribution, properties of the nonlinear function through which the original distribution is propagated, and a whitening-based natural scaling method to avoid dependence of the splitting direction on the scaling of coordinates. We compare these novel heuristics to existing techniques in three distinct examples involving Cartesian to polar coordinate transformation, Keplerian orbital element propagation, and uncertainty propagation in the circular restricted three-body problem.
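A moment-preserving two-component split has a simple closed form; the sketch below (ours, a textbook construction consistent with the description, while the paper's split libraries are more refined) splits along a chosen unit direction while leaving the mixture mean and covariance unchanged:

\begin{verbatim}
import numpy as np

def split_gaussian(mean, cov, direction, eps=0.5):
    """Two-component moment-preserving split of N(mean, cov) along a unit
    splitting direction; requires 0 < eps < 1 for positive definiteness."""
    lam = direction @ cov @ direction              # variance along direction
    offset = eps * np.sqrt(lam) * direction
    cov_i = cov - np.outer(offset, offset)         # shrink along direction
    # Mixture mean = mean; within- plus between-component covariance = cov.
    return [(0.5, mean + offset, cov_i), (0.5, mean - offset, cov_i)]
\end{verbatim}

The heuristics compared in the paper then choose the splitting direction, e.g., from the covariance eigenstructure or the local nonlinearity, after whitening.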
Data augmentation is a widely adopted technique utilized to improve the robustness of automatic speech recognition (ASR). Employing a fixed data augmentation strategy for all training data is common practice. However, factors such as background noise and speech rate can vary among the samples within a single training batch; with a fixed augmentation strategy, the model risks reaching a suboptimal state. Moreover, the model's capabilities differ across training stages. To address these issues, this paper proposes sample-adaptive data augmentation with progressive scheduling (PS-SapAug). The proposed method applies dynamic data augmentation in a two-stage training approach. It employs hybrid normalization to compute sample-specific augmentation parameters based on each sample's loss. Additionally, the probability of augmentation gradually increases throughout the training progression. Our method is evaluated on popular ASR benchmark datasets, including AISHELL-1 and LibriSpeech-100h, achieving up to 8.13% WER reduction on the LibriSpeech-100h test-clean set, 6.23% on test-other, and 5.26% on the AISHELL-1 test set, demonstrating the efficacy of our approach in enhancing performance and reducing errors.
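As a sketch of loss-driven, sample-adaptive augmentation (names, ranges, and the direction of the mapping are our assumptions, not the paper's exact MinMax-IBF/PS-SapAug policy), per-sample losses are min-max normalised within the batch and mapped to a masking strength:

\begin{verbatim}
import torch

def augmentation_strength(losses, s_min=0.1, s_max=1.0):
    """Map per-sample losses to augmentation strengths so that easier
    samples (lower loss) receive stronger augmentation."""
    lo, hi = losses.min(), losses.max()
    norm = (losses - lo) / (hi - lo + 1e-8)   # 0 = easiest, 1 = hardest
    return s_max - (s_max - s_min) * norm
\end{verbatim}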
Optimizing smart grid operations relies on critical decision-making informed by uncertainty quantification, making probabilistic forecasting a vital tool. Designing such forecasting models involves three key challenges: accurate and unbiased uncertainty quantification, workload reduction for data scientists during the design process, and limitation of the environmental impact of model training. To address these challenges, we introduce AutoPQ, a novel method designed to automate and optimize probabilistic forecasting for smart grid applications. AutoPQ enhances forecast uncertainty quantification by generating quantile forecasts from an existing point forecast using a conditional Invertible Neural Network (cINN). AutoPQ also automates the selection of the underlying point forecasting method and the optimization of hyperparameters, ensuring that the best model and configuration are chosen for each application. For flexible adaptation to various performance needs and available computing power, AutoPQ comes with a default and an advanced configuration, making it suitable for a wide range of smart grid applications. We show that AutoPQ outperforms state-of-the-art probabilistic forecasting methods while effectively limiting computational effort and hence environmental impact; in the context of sustainability, we also quantify the electricity consumption required for performance improvements, providing transparency about this cost.
Renewable energies and their operation are becoming increasingly vital for the stability of electrical power grids, since conventional power plants are progressively being displaced and their contribution to redispatch interventions is thereby diminishing. In order to consider renewable energies like Wind Power (WP) as a substitute for such interventions, day-ahead forecasts are necessary to communicate their availability for redispatch planning. In this context, automated and scalable forecasting models are required for deployment to thousands of locally distributed onshore WP turbines. Furthermore, irregular interventions into WP generation capabilities due to redispatch shutdowns pose challenges for the design and operation of WP forecasting models. Since state-of-the-art forecasting methods consider past WP generation values alongside day-ahead weather forecasts, redispatch shutdowns may impact the forecast. Therefore, the present paper highlights these challenges and analyzes state-of-the-art forecasting methods on data sets with both regular and irregular shutdowns. Specifically, we compare the forecasting accuracy of three autoregressive Deep Learning (DL) methods to methods based on WP curve modeling. Interestingly, the latter achieve lower forecasting errors and have fewer requirements for data cleaning during modeling and operation, while being computationally more efficient, suggesting their advantages in practical applications.
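As a sketch of WP curve modeling (ours; the logistic parameterisation is one common choice and the paper's exact form may differ), a power curve is fit on historical pairs and then driven by the weather forecast alone, so redispatch shutdowns cannot leak in through autoregressive inputs:

\begin{verbatim}
import numpy as np
from scipy.optimize import curve_fit

def power_curve(v, p_rated, k, v_mid):
    """Logistic approximation of a turbine power curve."""
    return p_rated / (1.0 + np.exp(-k * (v - v_mid)))

# Fit on historical (wind-speed forecast, generated power) pairs, then
# forecast day-ahead power directly from the weather forecast.
v_hist = np.random.uniform(0, 25, 500)
p_hist = power_curve(v_hist, 2000.0, 0.9, 9.0) + 25 * np.random.randn(500)
params, _ = curve_fit(power_curve, v_hist, p_hist, p0=[1800.0, 1.0, 10.0])
p_day_ahead = power_curve(np.array([7.5, 12.0]), *params)
\end{verbatim}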
Advanced driver assistance systems (ADAS) enabled by automotive radars have significantly enhanced vehicle safety and driver experience. However, the extensive use of radars in dense road conditions introduces mutual interference, which degrades detection accuracy and reliability. Traditional interference models are limited to simple highway scenarios and cannot characterize the performance of automotive radars in dense urban environments. In our prior work, we employed stochastic geometry (SG) to develop two automotive radar network models: the Poisson line Cox process (PLCP) for dense city centers and smaller urban zones and the binomial line Cox process (BLCP) to encompass both urban cores and suburban areas. In this work, we introduce the meta-distribution (MD) framework upon these two models to distinguish the sources of variability in radar detection metrics. Additionally, we optimize the radar beamwidth and transmission probability to maximize the number of successful detections of a radar node in the network. Further, we employ a computationally efficient Chebyshev-Markov (CM) bound method for reconstructing MDs, achieving higher accuracy than the conventional Gil-Pelaez theorem. Using the framework, we analyze the specific impacts of beamwidth, detection range, and interference on radar detection performance and offer practical insights for developing adaptive radar systems tailored to diverse traffic and environmental conditions.
In this paper, we present a time-domain extension of our strategy for manipulating radiated scalar Helmholtz fields and discuss two important applied scenarios, namely (1) creating personal sound zones inside a bounded domain and (2) shielded localized communication. Our strategy is based on the authors' previous works establishing the possibility and stability of controlling acoustic fields using an array of almost non-radiating coupling sources, and presents a detailed Fourier synthesis approach towards a time-domain effect. We require that the array of acoustic sources creates the desired fields on the control regions while maintaining a zero field beyond a larger circumscribed sphere. This paper recalls the main theoretical results, then presents the underlying Fourier synthesis paradigm and shows, through relevant simulations, the performance of our strategy.
Skin cancer (SC) stands out as one of the most life-threatening forms of cancer, with its danger amplified if not diagnosed and treated promptly. Early intervention is critical, as it allows for more effective treatment approaches. In recent years, Deep Learning (DL) has emerged as a powerful tool for early detection and skin cancer diagnosis (SCD). Although DL seems promising for SCD, ample scope remains for improving model efficiency and accuracy. This paper proposes a novel approach to skin cancer detection, utilizing optimization techniques in conjunction with pre-trained networks and wavelet transformations. First, normalized images are passed through pre-trained networks such as DenseNet-121, Inception, Xception, and MobileNet to extract hierarchical features. After feature extraction, the feature maps are passed through a Discrete Wavelet Transform (DWT) layer to capture low- and high-frequency components. Then a self-attention module is integrated to learn global dependencies between features and focus on the most relevant parts of the feature maps. The number of neurons and the weight vectors are optimized using three new swarm-based techniques: the Modified Gorilla Troops Optimizer (MGTO), Improved Gray Wolf Optimization (IGWO), and the FOX optimization algorithm. Evaluation results demonstrate that optimizing weight vectors with these algorithms can enhance diagnostic accuracy, making this a highly effective approach for SCD. The proposed method demonstrates substantial improvements in accuracy, achieving top rates of 98.11% with the MobileNet + Wavelet + FOX and DenseNet + Wavelet + FOX combinations on the ISIC-2016 dataset and 97.95% with the Inception + Wavelet + MGTO combination on the ISIC-2017 dataset, improving accuracy by at least 1% compared to other methods.
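As a minimal sketch of the DWT step (ours; the wavelet choice is an assumption), a 2-D feature map is split into a low-frequency approximation and high-frequency detail coefficients:

\begin{verbatim}
import numpy as np
import pywt

def dwt_subbands(feature_map):
    """Single-level 2-D DWT: approximation plus detail coefficients."""
    cA, (cH, cV, cD) = pywt.dwt2(feature_map, 'haar')
    return cA, np.stack([cH, cV, cD])

low, high = dwt_subbands(np.random.rand(28, 28))
\end{verbatim}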
Full waveform inversion (FWI) is able to construct high-resolution subsurface models by iteratively minimizing discrepancies between observed and simulated seismic data. However, its implementation can be rather involved for complex wave equations, objective functions, or regularization. Recently, automatic differentiation (AD) has proven to be effective in simplifying solutions of various inverse problems, including FWI. In this study, we present an open-source AD-based FWI framework (ADFWI), designed to simplify the design, development, and evaluation of novel FWI approaches with flexibility. The AD-based framework not only includes forward modeling and associated gradient computations for wave equations in various types of media, from isotropic acoustic to vertically or horizontally transverse isotropic elastic, but also incorporates a suite of objective functions, regularization techniques, and optimization algorithms. By leveraging state-of-the-art AD, objective functions such as soft dynamic time warping and the Wasserstein distance, which are difficult to apply in traditional FWI, are easily integrated into ADFWI. In addition, ADFWI is integrated with deep learning for implicit model reparameterization via neural networks, which not only introduces learned regularization but also allows rapid estimation of uncertainty through dropout. To manage the high memory demands of large-scale inversion under AD, the proposed framework adopts strategies such as mini-batching and checkpointing. Through comprehensive evaluations, we demonstrate the novelty, practicality, and robustness of ADFWI, which can be used to address challenges in FWI and as a workbench for prompt experiments and the development of new inversion strategies.
Supporting immense throughput and ubiquitous connectivity holds paramount importance for future wireless networks. To this end, this letter focuses on how the spatial beams configured for legacy near-field (NF) users can be leveraged to serve extra NF or far-field users while ensuring the rate requirements of legacy NF users. In particular, a flexible rate-splitting multiple access (RSMA) scheme is proposed to efficiently manage interference, which carefully selects a subset of legacy users to decode the common stream. Beam scheduling, power allocation, common rate allocation, and user selection are jointly optimized to maximize the sum rate of the additional users. To solve the formulated discrete non-convex problem, it is split into three subproblems, and accelerated bisection search, quadratic transform, and simulated annealing approaches are developed to tackle them. Simulation results reveal that the proposed transmit scheme and algorithm achieve significant gains over three competing benchmarks.
Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
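As a sketch of the Monte Carlo route to RUL (the second of the two approaches discussed; the rates, drifts, and threshold below are illustrative assumptions), degradation is simulated as Brownian motion whose drift switches with a CTMC severity state until the accuracy threshold is crossed:

\begin{verbatim}
import numpy as np

def mc_rul(x0, threshold, drift_by_state, Q, sigma, dt=0.1, n_paths=2000):
    """Monte Carlo RUL estimate for Brownian degradation with a CTMC-
    modulated drift; Q is the CTMC generator matrix."""
    rng = np.random.default_rng(0)
    hit_times = np.empty(n_paths)
    for p in range(n_paths):
        x, s, t = x0, 0, 0.0
        while x < threshold:
            if rng.random() < -Q[s, s] * dt:        # CTMC state jump
                rates = Q[s].copy(); rates[s] = 0.0
                s = rng.choice(len(rates), p=rates / rates.sum())
            x += drift_by_state[s] * dt + sigma * np.sqrt(dt) * rng.normal()
            t += dt
        hit_times[p] = t
    return hit_times.mean()

Q = np.array([[-0.2, 0.2], [0.5, -0.5]])   # low <-> high severity rates
mc_rul(0.0, 1.0, drift_by_state=[0.05, 0.2], Q=Q, sigma=0.02)
\end{verbatim}

The paper's closed-form Remaining Lifetime Distribution avoids this sampling loop entirely; the two approaches are shown to be equivalent.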
In this paper, we propose a novel lightweight learning from demonstration (LfD) model based on reservoir computing that can learn and generate multiple movement trajectories with prediction intervals, which we call the Context-based Echo State Network with prediction confidence (CESN+). CESN+ can generate movement trajectories that may go beyond the initial LfD training based on a desired set of conditions, while providing confidence on its generated output. To assess the abilities of CESN+, we first evaluate its performance against Conditional Neural Movement Primitives (CNMP), a comparable framework that uses a conditional neural process to generate movement primitives. Our findings indicate that CESN+ not only outperforms CNMP but is also faster to train and demonstrates impressive performance in generating trajectories for extrapolation cases. In human-robot shared control applications, the confidence of the machine-generated trajectory is a key indicator of how to arbitrate control sharing. To show the usability of CESN+ for human-robot adaptive shared control, we have designed a proof-of-concept human-robot shared control task and tested its efficacy in adapting the sharing weight between the human and the robot by comparing it to a fixed-weight control scheme. The simulation experiments show that with CESN+-based adaptive sharing, the total human load in shared control can be significantly reduced. Overall, the developed CESN+ model is a strong lightweight LfD system with desirable properties such as fast training and the ability to extrapolate to new task parameters while producing robust prediction intervals for its output.
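As a minimal sketch of the context-based echo state network idea (ours; reservoir size, spectral radius, and the context interface are assumptions), the context vector is appended to every input and only a linear readout is trained on the resulting states, which is why training is fast:

\begin{verbatim}
import numpy as np

def esn_states(inputs, context, n_res=200, rho=0.9, washout=20, seed=0):
    """Generate reservoir states for a context-conditioned ESN; fit a
    (ridge) regression from the returned states to the target trajectory."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-1, 1, (n_res, inputs.shape[1] + context.size))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # set spectral radius
    x, states = np.zeros(n_res), []
    for u in inputs:
        x = np.tanh(W_in @ np.concatenate([u, context]) + W @ x)
        states.append(x)
    return np.array(states)[washout:]

states = esn_states(np.random.randn(100, 3), np.array([0.5, 1.0]))
\end{verbatim}

The prediction intervals of CESN+ would sit on top of such a readout; this sketch only covers the reservoir state generation.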
Space debris presents a critical challenge for the sustainability of future space missions, emphasizing the need for robust and standardized identification methods. However, a comprehensive benchmark for rocket body classification remains absent. This paper addresses this gap by introducing the RoBo6 dataset for rocket body classification based on light curves. The dataset, derived from the Mini Mega Tortora database, includes light curves for six rocket body classes: CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, and Delta 4. With 5,676 training and 1,404 test samples, it addresses data inconsistencies using resampling, normalization, and filtering techniques. Several machine learning models were evaluated, including CNN and transformer-based approaches, with Astroconformer achieving the best performance. The dataset establishes a common benchmark for future comparisons and advancements in rocket body classification tasks.
The decomposition of a signal is a fundamental tool in many fields of research, including signal processing, geophysics, astrophysics, engineering, medicine, and many more. By breaking down complex signals into simpler oscillatory components, we can enhance the understanding and processing of the data, unveiling the hidden information they contain. Traditional methods such as Fourier analysis and wavelet transforms, which are effective in handling one-dimensional stationary signals, struggle with non-stationary data sets and, in the case of wavelets, require the selection of predefined basis functions. In contrast, the Empirical Mode Decomposition (EMD) method and its variants, such as Iterative Filtering (IF), have emerged as effective nonlinear approaches, adapting to signals without any need for a priori assumptions. To accelerate these methods, the Fast Iterative Filtering (FIF) algorithm was developed, and further extensions, such as Multivariate FIF (MvFIF) and Multidimensional FIF (FIF2), have been proposed to handle higher-dimensional data. In this work, we introduce the Multidimensional and Multivariate Fast Iterative Filtering (MdMvFIF) technique, an innovative method that extends FIF to handle data that vary simultaneously in space and time. The new algorithm is capable of extracting Intrinsic Mode Functions (IMFs) from complex signals that vary in both space and time, overcoming limitations found in prior methods. The potential of the proposed method is demonstrated through applications to artificial and real-life signals, highlighting its versatility and effectiveness in decomposing multidimensional and multivariate nonstationary signals. The MdMvFIF method offers a powerful tool for advanced signal analysis across many scientific and engineering disciplines.
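As a toy illustration of the iterative-filtering principle (ours; FIF and MdMvFIF use FFT-based convolution and data-driven window sizes rather than this fixed moving average), the fastest oscillation is isolated by repeatedly subtracting a local mean:

\begin{verbatim}
import numpy as np

def first_imf(signal, window=31, n_sift=10):
    """Extract the first Intrinsic Mode Function by repeatedly removing a
    moving-average trend; signal - imf carries the slower modes."""
    kernel = np.ones(window) / window
    imf = signal.astype(float)
    for _ in range(n_sift):
        imf = imf - np.convolve(imf, kernel, mode='same')
    return imf
\end{verbatim}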
As Artificial Intelligence (AI) technologies continue to evolve, their use in generating realistic, contextually appropriate content has expanded into various domains. Music, an art form and medium for entertainment deeply rooted in human culture, is seeing increased involvement of AI in its production. However, the unregulated use of AI music generation (AIGM) tools raises concerns about potential negative impacts on the music industry, copyright, and artistic integrity, underscoring the importance of effective AIGM detection. This paper provides an overview of existing AIGM detection methods. To lay a foundation for understanding the general workings and challenges of AIGM detection, we first review the general principles of AIGM, including recent advancements in audio deepfakes, as well as multimodal detection techniques. We further propose a potential pathway for leveraging foundation models from audio deepfake detection for AIGM detection. Additionally, we discuss the implications of these tools and propose directions for future research to address ongoing challenges in the field.
We introduce Audio Atlas, an interactive web application for visualizing audio data using text-audio embeddings. Audio Atlas is designed to facilitate the exploration and analysis of audio datasets using a contrastive embedding model and a vector database for efficient data management and semantic search. The system maps audio embeddings into a two-dimensional space and leverages DeepScatter for dynamic visualization. Designed for extensibility, Audio Atlas allows easy integration of new datasets, enabling users to better understand their audio data and identify both patterns and outliers. We open-source the codebase of Audio Atlas, and provide an initial implementation containing various audio and music datasets.
In this paper, we introduce an algorithm designed to address the problem of time-optimal formation reshaping in three-dimensional environments while preventing collisions between agents. The utility of the proposed approach is particularly evident in mobile robotics, where agents benefit from being organized and navigated in formation for a variety of real-world applications requiring frequent alterations in formation shape for efficient navigation or task completion. Given the constrained operational time inherent to battery-powered mobile robots, the time needed to complete the formation reshaping process is crucial for their efficient operation, especially in the case of multi-rotor Unmanned Aerial Vehicles (UAVs). The proposed Collision-Aware Time-Optimal formation Reshaping Algorithm (CAT-ORA) builds upon the Hungarian algorithm for the solution of the robot-to-goal assignment, implementing inter-agent collision avoidance through direct constraints on mutually exclusive robot-goal pairs, combined with a trajectory generation approach minimizing the duration of the reshaping process. Theoretical validations confirm the optimality of CAT-ORA, with its efficacy further showcased through simulations and a real-world outdoor experiment involving 19 UAVs. Thorough numerical analysis shows the potential of CAT-ORA to decrease the time required to perform complex formation reshaping tasks by up to 49%, and 12% on average, compared to commonly used methods in randomly generated scenarios.
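The assignment core can be sketched in a few lines (ours; CAT-ORA additionally forbids mutually exclusive robot-goal pairs that would lead to collisions, which this sketch omits):

\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def reshaping_assignment(starts, goals):
    """Min-cost robot-to-goal assignment on pairwise distances via the
    Hungarian algorithm; the max assigned distance is a makespan proxy."""
    cost = np.linalg.norm(starts[:, None, :] - goals[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].max()

starts, goals = np.random.rand(19, 3), np.random.rand(19, 3)  # e.g. 19 UAVs
assignment, makespan = reshaping_assignment(starts, goals)
\end{verbatim}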
This letter presents a flexible rate-splitting multiple access (RSMA) framework for near-field (NF) integrated sensing and communications (ISAC). The spatial beams configured to meet the communication rate requirements of NF users are simultaneously leveraged to sense an additional NF target. A key innovation lies in its flexibility to select a subset of users for decoding the common stream, enhancing interference management and system performance. The system is designed by minimizing the Cram\'{e}r-Rao bound (CRB) for joint distance and angle estimation through optimized power allocation, common rate allocation, and user selection. This leads to a discrete, non-convex optimization problem. Remarkably, we demonstrate that the preconfigured beams are sufficient for target sensing, eliminating the need for additional probing signals. To solve the optimization problem, an iterative algorithm is proposed combining the quadratic transform and simulated annealing. Simulation results indicate that the proposed scheme significantly outperforms conventional RSMA and space division multiple access (SDMA), reducing distance and angle estimation errors by approximately 100\% and 20\%, respectively.
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the sizes of the joint state and action spaces grow exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm \texttt{SUBSAMPLE-MFQ} (\textbf{Subsample}-\textbf{M}ean-\textbf{F}ield-\textbf{Q}-learning) and a decentralized randomized policy for a system with $n$ agents. For $k\leq n$, our algorithm learns a policy for the $n$-agent system in time polynomial in $k$. We show that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We validate our method empirically in Gaussian squeeze and global exploration settings.
Remote estimation is a crucial element of real-time monitoring of a stochastic process. While most existing works have concentrated on obtaining optimal sampling strategies, motivated by malicious attacks on cyber-physical systems, we model sensing under surveillance as a game between an attacker and a defender. This introduces strategic elements to conventional remote estimation problems. Additionally, inspired by increasing detection capabilities, we model an element of information leakage for each player. Parameterizing the game in terms of uncertainty on each side, information leakage, and cost of sampling, we consider the Stackelberg Equilibrium (SE) concept where one of the players acts as the leader and the other as the follower. By focusing our attention on stationary probabilistic sampling policies, we characterize the SE of this game and provide simulations to show the efficacy of our results.
Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoders is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in low-resource settings remains underexplored. Hence, in this work, we aim to explore the capability of LLMs in low-resource ASR and Mandarin-English code-switching ASR. We also evaluate and compare the recognition performance of LLM-based ASR systems against the Whisper model. Extensive experiments demonstrate that LLM-based ASR yields a relative gain of 12.8\% over the Whisper model in low-resource ASR, while Whisper performs better in Mandarin-English code-switching ASR. We hope that this study can shed light on ASR for low-resource scenarios.
From 5G onwards, Non-Terrestrial Networks (NTNs) have emerged as a key component of future network architectures. Leveraging Low Earth Orbit (LEO) satellite constellations, NTNs are capable of building a space Internet and present a paradigm shift in delivering mobile services to even the most remote regions on Earth. However, the extensive coverage and rapid movement of LEO satellites pose unique challenges for NTN networking, including user equipment (UE) access and inter-satellite delivery, which directly impact the quality of service (QoS) and data transmission continuity. This paper offers an in-depth review of advanced NTN management technologies in the context of 6G evolution, focusing on radio resource management, mobility management, and dynamic network slicing. Building on this foundation and considering the latest trends in NTN development, we then present some innovative perspectives on emerging challenges in satellite beamforming, handover mechanisms, and inter-satellite transmissions. Lastly, we identify open research issues and propose future directions aimed at advancing satellite Internet deployment and enhancing NTN performance.
Task-oriented communication presents a promising approach to improve the communication efficiency of edge inference systems by optimizing learning-based modules to extract and transmit relevant task information. However, real-time applications face practical challenges, such as incomplete coverage and potential malfunctions of edge servers. This situation necessitates cross-model communication between different inference systems, enabling edge devices from one service provider to collaborate effectively with edge servers from another. Independent optimization of diverse edge systems often leads to incoherent feature spaces, which hinders the cross-model inference for existing task-oriented communication. To facilitate and achieve effective cross-model task-oriented communication, this study introduces a novel framework that utilizes shared anchor data across diverse systems. This approach addresses the challenge of feature alignment in both server-based and on-device scenarios. In particular, by leveraging the linear invariance of visual features, we propose efficient server-based feature alignment techniques to estimate linear transformations using encoded anchor data features. For on-device alignment, we exploit the angle-preserving nature of visual features and propose to encode relative representations with anchor data to streamline cross-model communication without additional alignment procedures during the inference. The experimental results on computer vision benchmarks demonstrate the superior performance of the proposed feature alignment approaches in cross-model task-oriented communications. The runtime and computation overhead analysis further confirm the effectiveness of the proposed feature alignment approaches in real-time applications.
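A minimal sketch of the two alignment mechanisms described above (ours; shapes and the least-squares estimator are assumptions):

\begin{verbatim}
import numpy as np

def estimate_alignment(anchors_src, anchors_tgt):
    """Server-based alignment: least-squares estimate of the linear map W
    taking one encoder's anchor features to the other's (leveraging the
    assumed linear invariance of visual features)."""
    W, *_ = np.linalg.lstsq(anchors_src, anchors_tgt, rcond=None)
    return W                       # at inference: aligned = features @ W

def relative_representation(features, anchors):
    """On-device alternative: cosine similarities to shared anchors, which
    are invariant to angle-preserving transforms, so no explicit alignment
    step is needed during inference."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return f @ a.T
\end{verbatim}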
Throughout the training of an ASR model, the intensity of data augmentation and the way the training loss is calculated are typically fixed by preset parameters. For example, SpecAugment employs a predefined strength of augmentation to mask parts of the time-frequency domain spectrum. Similarly, in CTC-based multi-layer models, the loss is generally determined based on the output of the encoder's final layer during the training process. However, ignoring these dynamic characteristics may lead to suboptimal training. To address this issue, we present a two-stage training method, known as complexity-boosted adaptive (CBA) training. It involves making dynamic adjustments to data augmentation strategies and CTC loss propagation based on the complexity of the training samples. In the first stage, we train the model with intermediate-CTC-based regularization and data augmentation without any adaptive policy. In the second stage, we propose a novel adaptive policy, called MinMax-IBF, which calculates the complexity of samples. We apply the MinMax-IBF policy to data augmentation and intermediate CTC loss regularization to continue training. The proposed CBA training approach shows considerable improvements, up to 13.4% and 14.1% relative reduction in WER on the LibriSpeech 100h test-clean and test-other datasets, and up to 6.3% relative reduction on the AISHELL-1 test set, over the Conformer architecture in WeNet.
Dynamic game theory is an increasingly popular tool for modeling multi-agent (e.g., human-robot) interactions. Game-theoretic models presume that each agent wishes to minimize a private cost function that depends on others' actions. These games typically evolve over a fixed time horizon, which specifies the degree to which all agents care about the distant future. In practical settings, however, decision-makers may vary in their degree of short-sightedness. We conjecture that quantifying and estimating each agent's short-sightedness from online data will enable safer and more efficient interactions with other agents. To this end, we frame this inference problem as an inverse dynamic game. We consider a specific parametrization of each agent's objective function that smoothly interpolates between myopic and farsighted planning. Games of this form are readily transformed into parametric mixed complementarity problems; we exploit the directional differentiability of solutions to these problems with respect to their hidden parameters in order to solve for agents' short-sightedness. We conduct several experiments simulating human behavior at a real-world crosswalk. The results of these experiments clearly demonstrate that by explicitly inferring agents' short-sightedness, we can recover more accurate game-theoretic models, which ultimately allow us to make better predictions of agents' behavior. Specifically, our results show up to 30% more accurate prediction of myopic behavior compared to the baseline.
Intelligent streetlight systems divide the streetlight network into multiple sectors, activating only the streetlights in the corresponding sectors when traffic elements pass by, rather than all streetlights, effectively reducing energy waste. This strategy requires streetlights to understand their neighbor relationships to illuminate only the streetlights in their respective sectors. However, manually configuring the neighbor relationships for a large number of streetlights in complex large-scale road streetlight networks is cumbersome and prone to errors. Due to the crisscrossing nature of roads, it is also difficult to determine the neighbor relationships using GPS or communication positioning. In response to these issues, this article proposes a systematic approach to model the streetlight network as a social network and construct a neighbor-relationship probabilistic graph using IoT event records of streetlights detecting traffic elements. Based on this, a multi-objective genetic-algorithm-based probabilistic graph clustering method is designed to discover the neighbor relationships of streetlights. Considering that pedestrians and vehicles usually move at a constant speed on a section of road, speed consistency is introduced as an optimization objective, which, together with traditional similarity measures, forms a multi-objective function, enhancing the accuracy of neighbor relationship discovery. Extensive experiments on simulation datasets were conducted, comparing the proposed algorithm with other probabilistic graph clustering algorithms. The results demonstrate that the proposed algorithm can more accurately identify the neighbor relationships of streetlights compared to other algorithms, effectively achieving adaptive streetlight control for traffic elements.
Neural speech codecs have gained great attention for their outstanding reconstruction with discrete token representations, a crucial component in generative tasks such as speech coding and large language models (LLMs). However, most works based on residual vector quantization perform worse with fewer tokens due to the low coding efficiency of modeling complex coupled information. In this paper, we propose a neural speech codec named FreeCodec which employs a more effective encoding framework by decomposing the intrinsic properties of speech into different components: 1) a global vector is extracted as the timbre information, 2) a prosody encoder with a long stride level models the prosody information, and 3) a content encoder captures the content information. Using different training strategies, FreeCodec achieves state-of-the-art performance in reconstruction and disentanglement scenarios. Results from subjective and objective experiments demonstrate that our framework outperforms existing methods.
With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow-matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
This paper describes the zero-shot spontaneous style TTS system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous style voice cloning. To improve speech intelligibility, we introduce the Classifier-Free Guidance (CFG) strategy in the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we adopt effective data preprocessing operations and fine-tune our model with selected high-quality spontaneous speech data. The official evaluations in the CoVoC constrained track show that our system achieves the best speech naturalness MOS of 3.80 and obtains considerable speech quality and speaker similarity results.
Coupled tensor decompositions (CTDs) perform data fusion by linking factors from different datasets. Although many CTDs have already been proposed, current works do not address important challenges of data fusion, where: 1) the datasets are often heterogeneous, constituting different "views" of a given phenomenon (multimodality); and 2) each dataset can contain personalized or dataset-specific information, constituting distinct factors that are not coupled with other datasets. In this work, we introduce a personalized CTD framework tackling these challenges. A flexible model is proposed where each dataset is represented as the sum of two components, one related to a common tensor through a multilinear measurement model, and another specific to each dataset. Both the common and distinct components are assumed to admit a polyadic decomposition. This generalizes several existing CTD models. We provide conditions for specific and generic uniqueness of the decomposition that are easy to interpret. These conditions employ uni-mode uniqueness of different individual datasets and properties of the measurement model. Two algorithms are proposed to compute the common and distinct components: a semi-algebraic one and a coordinate-descent optimization method. Experimental results illustrate the advantage of the proposed framework compared with state-of-the-art approaches.
Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times, often exceeding several minutes, or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.
Birth Asphyxia (BA) is a severe condition characterized by an insufficient supply of oxygen to a newborn during delivery. BA is one of the primary causes of neonatal death in the world. Although there has been a decline in neonatal deaths over the past two decades, the developing world, particularly sub-Saharan Africa, continues to experience the highest under-five (<5) mortality rates. While evidence-based methods are commonly used to detect BA in African healthcare settings, they can be subject to physician errors or delays in diagnosis, preventing timely interventions. Centralized Machine Learning (ML) methods have demonstrated good performance in the early detection of BA but require sensitive health data to leave their premises before training, which does not guarantee privacy and security. Healthcare institutions are therefore reluctant to adopt such solutions in Africa. To address this challenge, we propose a federated learning (FL)-based software architecture, a distributed learning method that prioritizes privacy and security by design. We have developed a user-friendly and cost-effective mobile application embedding the FL pipeline for early detection of BA. Our federated SVM model outperformed centralized SVM pipelines and Neural Network (NN)-based methods from the existing literature.
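The abstract does not detail the FL pipeline, but a common pattern for a federated linear SVM is FedAvg-style weight averaging over locally trained hinge-loss models. A hypothetical sketch, with synthetic client data and hyperparameters chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fedavg_linear_svm(client_data, rounds=5):
    """FedAvg-style sketch for a linear SVM (hinge loss): clients train
    locally, the server only averages weights, raw data never moves."""
    coef, intercept = None, None
    for _ in range(rounds):
        local_coefs, local_intercepts = [], []
        for X, y in client_data:
            clf = SGDClassifier(loss="hinge", max_iter=20, tol=None)
            if coef is None:
                clf.fit(X, y)
            else:                                  # warm start from global model
                clf.fit(X, y, coef_init=coef, intercept_init=intercept)
            local_coefs.append(clf.coef_)
            local_intercepts.append(clf.intercept_)
        coef = np.mean(local_coefs, axis=0)        # server-side aggregation
        intercept = np.mean(local_intercepts, axis=0)
    return coef, intercept

# Synthetic stand-in for per-clinic data (features and labels are toys).
rng = np.random.default_rng(0)
clients = []
for _ in range(3):
    X = rng.standard_normal((40, 5))
    y = (X[:, 0] + 0.3 * rng.standard_normal(40) > 0).astype(int)
    clients.append((X, y))
w, b = fedavg_linear_svm(clients)
```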
When learning stable linear dynamical systems from data, three important properties are desirable: i) predictive accuracy, ii) provable stability, and iii) computational efficiency. Unconstrained minimization of reconstruction errors leads to high accuracy and efficiency but cannot guarantee stability. Existing methods to remedy this focus on enforcing stability while also ensuring accuracy, but do so only at the cost of increased computation. In this work, we investigate whether a straightforward approach can \textit{simultaneously} offer all three desiderata of learning stable linear systems. Specifically, we consider a post-hoc approach that manipulates the spectrum of the learned system matrix after it is learned in an unconstrained fashion. We call this approach \textit{spectrum clipping} (SC), as it involves eigendecomposition and subsequent reconstruction of the system matrix after clipping all of its eigenvalues that are larger than one to one (without altering the eigenvectors). Through detailed experiments involving two different applications and publicly available benchmark datasets, we demonstrate that this simple technique can simultaneously learn highly accurate linear systems that are provably stable. Notably, we demonstrate that SC can achieve similar or better performance than strong baselines while being \textit{orders-of-magnitude} faster. We also show that SC can be readily combined with Koopman operators to learn stable nonlinear dynamics, such as those underlying complex dexterous manipulation skills involving multi-fingered robotic hands. Our code and dataset can be found at \href{https://github.com/GT-STAR-Lab/spec_clip}{https://github.com/GT-STAR-Lab/spec\_clip}.
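Spectrum clipping as described is straightforward to reproduce: eigendecompose the learned system matrix, rescale every eigenvalue whose magnitude exceeds one onto the unit circle (leaving the eigenvectors untouched), and reconstruct. A minimal sketch, assuming a diagonalizable matrix:

```python
import numpy as np

def spectrum_clip(A):
    """Post-hoc stabilization: clip eigenvalue magnitudes to <= 1 while
    keeping the eigenvectors, then rebuild the system matrix."""
    eigvals, eigvecs = np.linalg.eig(A)           # assumes A is diagonalizable
    clipped = eigvals / np.maximum(np.abs(eigvals), 1.0)
    A_stable = eigvecs @ np.diag(clipped) @ np.linalg.inv(eigvecs)
    return A_stable.real                          # real A -> imaginary parts are noise

A = np.random.default_rng(1).standard_normal((4, 4))
A_stable = spectrum_clip(A)
assert np.max(np.abs(np.linalg.eigvals(A_stable))) <= 1.0 + 1e-6
```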
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. Code will be available at https://github.com/jacklishufan/OmniFlows.
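The rectified flow objective underlying this line of work trains a velocity field to regress the straight-line displacement between noise and data along a linear interpolation. A single-modality training-step sketch follows; the multi-modal joint formulation and guidance mechanism of OmniFlow are not shown, and the toy model is an assumption.

```python
import torch

def rectified_flow_loss(v_theta, x0, x1):
    """One training step of rectified flow: regress the straight-line
    velocity x1 - x0 at a random point x_t on the interpolation path."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1
    return ((v_theta(x_t, t) - (x1 - x0)) ** 2).mean()

# Toy velocity model; real models are transformers conditioned on t.
net = torch.nn.Linear(3, 2)
v = lambda x, t: net(torch.cat([x, t], dim=-1))
x0, x1 = torch.randn(8, 2), torch.randn(8, 2)   # noise and data samples
rectified_flow_loss(v, x0, x1).backward()
```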
A full-vectorial finite-difference method with perfectly matched layer boundaries is used to identify the single-mode operation region of submicron rib waveguides fabricated in the silicon-on-insulator material system. Achieving high mode power confinement factors is emphasized while maintaining single-mode operation. As opposed to the case of large cross-section rib waveguides, the theoretical single-mode conditions are demonstrated to hold for submicron waveguides with accuracy approaching 100%. Both deeply and shallowly etched rib waveguides are considered, and the single-mode condition for the entire sub-micrometer range is presented while adhering to design-specific mode confinement requirements.
In this letter, we propose a six-dimensional movable antenna (6DMA)-aided cell-free massive multiple-input multiple-output (MIMO) system to fully exploit its macro spatial diversity, where a set of distributed access points (APs), each equipped with multiple 6DMA surfaces, cooperatively serve all users in a given area. Connected to a central processing unit (CPU) via fronthaul links, 6DMA-APs can optimize their combining vectors for decoding the users' information based on either local channel state information (CSI) or global CSI shared among them. We aim to maximize the average achievable sum-rate by jointly optimizing the rotation angles of all 6DMA surfaces at all APs based on the users' spatial distribution. Since the formulated problem is non-convex and highly non-linear, we propose a Bayesian-optimization-based algorithm to solve it efficiently. Simulation results show that, by enhancing signal power and mitigating interference through reduced channel cross-correlation among users, 6DMA-APs with optimized rotations can significantly improve the average sum-rate compared to a conventional cell-free network with fixed-position antennas and to a single centralized AP with optimally rotated 6DMAs, especially when the user distribution is more spatially diverse.
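Bayesian optimization over the rotation angles treats the sum-rate as a black box. A sketch using scikit-optimize follows; simulate_sum_rate is a hypothetical stand-in for the system-level evaluation, and the four-surface setup and angle bounds are illustrative assumptions.

```python
import numpy as np
from skopt import gp_minimize

def simulate_sum_rate(angles_deg):
    """Hypothetical stand-in for the system-level simulator: any black
    box mapping 6DMA rotation angles to an average sum-rate works here."""
    a = np.deg2rad(angles_deg)
    return float(np.sum(np.sin(a)) - 0.1 * np.var(a))   # toy surrogate only

bounds = [(0.0, 360.0)] * 4        # one rotation angle per surface (illustrative)
res = gp_minimize(lambda ang: -simulate_sum_rate(ang),  # BO minimizes
                  bounds, n_calls=40, random_state=0)
best_angles, best_rate = res.x, -res.fun
```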
Silicon is a nonlinear material, and silicon-based optics makes use of these nonlinearities to realize various functionalities required for on-chip communications. This article describes the foundations of these nonlinearities in silicon at length. In particular, self-phase modulation and cross-phase modulation are presented in the context of integrated on-board and on-chip communications. Important published results and the working principles of these nonlinearities are presented in considerable detail for non-expert readers.
Performance optimization of optical modulators requires reasonably accurate predictive models for key figures of merit. The interleaved PN-junction topology offers the maximum mode/junction overlap and is the most efficient modulator in the depletion mode of operation. Due to its structure, accurate modelling must be fully three-dimensional, which is a nontrivial computational problem. This paper presents a rigorous 3D model for the modulation efficiency of silicon-on-insulator interleaved-junction optical phase modulators with submicron dimensions. The drift-diffusion and Poisson equations were solved on a 3D finite-element mesh, and the Maxwell equations were solved using the finite-difference time-domain (FDTD) method on 3D Yee cells. The whole modelling process is detailed, and all the coefficients required in the model are presented. Model validation suggests < 10% RMS error.
Open-set model attribution of deepfake audio in open environments is an emerging research topic that aims to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare against predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, a threshold that effectively distinguishes unknown categories in the current dataset may not be suitable for identifying known and unknown categories under another data distribution. To address these issues, we propose a novel framework for open-set model attribution of deepfake audio with rejection threshold adaptation (ReTA). Specifically, the reconstruction error learning module is trained by combining the representation of system fingerprints with labels corresponding to either the target class or a randomly chosen other class. This process generates matching and non-matching reconstructed samples, establishing the reconstruction error distributions for each class and laying the foundation for the rejection threshold calculation module. The rejection threshold calculation module uses Gaussian probability estimation to fit the distributions of matching and non-matching reconstruction errors. It then computes adaptive rejection thresholds for all classes through probability minimization criteria. The experimental results demonstrate the effectiveness of ReTA in improving open-set model attribution of deepfake audio.
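One plausible reading of the rejection threshold calculation is: fit one Gaussian to the matching reconstruction errors and one to the non-matching errors, then pick the threshold minimizing the combined error probability. The exact criterion in the paper is not specified here, so this sketch should be read as an assumption.

```python
import numpy as np
from scipy.stats import norm

def adaptive_threshold(match_err, nonmatch_err):
    """Fit Gaussians to matching (small) and non-matching (large)
    reconstruction errors, then pick the threshold minimizing the sum
    of false-reject and false-accept probabilities of the fitted
    densities (an assumed criterion)."""
    mu_m, sd_m = norm.fit(match_err)
    mu_n, sd_n = norm.fit(nonmatch_err)
    grid = np.linspace(min(mu_m - 4 * sd_m, mu_n - 4 * sd_n),
                       max(mu_m + 4 * sd_m, mu_n + 4 * sd_n), 10000)
    total = norm.sf(grid, mu_m, sd_m) + norm.cdf(grid, mu_n, sd_n)
    return grid[np.argmin(total)]   # accept a test sample if error < threshold

rng = np.random.default_rng(0)
tau = adaptive_threshold(rng.normal(0.2, 0.05, 500),   # matching errors
                         rng.normal(0.8, 0.10, 500))   # non-matching errors
```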
Quality degradation is observed in underwater images due to light refraction and absorption by water, leading to issues such as color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Extensive ablation studies and comparative analyses show that the proposed approach outperforms existing state-of-the-art (SOTA) methods.
Previous tone mapping methods mainly focus on enhancing tones in low-resolution images and recovering details using high-frequency components extracted from the input image. These methods typically rely on traditional feature pyramids to extract high-frequency components artificially, such as Laplacian and Gaussian pyramids with handcrafted kernels. However, traditional handcrafted features struggle to effectively capture the high-frequency components in HDR images, resulting in excessive smoothing and loss of detail in the output image. To mitigate this issue, we introduce a learnable Differential Pyramid Representation Network (DPRNet). Based on the learnable differential pyramid, DPRNet can capture detailed textures and structures, which is crucial for high-quality tone mapping recovery. In addition, to achieve global consistency and local contrast harmonization, we design a global tone perception module and a local tone tuning module that ensure the consistency of global tuning and the accuracy of local tuning, respectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods, improving PSNR by 2.58 dB on the HDR+ dataset and 3.31 dB on the HDRI Haven dataset, respectively, compared with the second-best method. Notably, our method exhibits the best generalization ability on non-homologous image and video tone mapping. We provide an anonymous online demo at https://xxxxxx2024.github.io/DPRNet/.
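For contrast with the learnable pyramid, the handcrafted baseline the text refers to is the classic Gaussian/Laplacian pyramid, where each level stores the high-frequency residual left after blur, downsample, and upsample. A minimal sketch using OpenCV's fixed kernels:

```python
import numpy as np
import cv2

def laplacian_pyramid(img, levels=4):
    """Handcrafted baseline: each level keeps the high-frequency
    residual between an image and the upsampled version of its
    blurred, downsampled copy (fixed 5x5 Gaussian kernel in OpenCV)."""
    pyramid, current = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)   # high-frequency band
        current = down
    pyramid.append(current)            # low-frequency residual
    return pyramid

bands = laplacian_pyramid(np.random.rand(64, 64))
```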
Learning lighting adaptation is a key step in obtaining good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction), and previous studies have mainly investigated these tasks individually. However, we observe that the light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the differences between channels manifest differently in the time and frequency domains. Guided by these common light properties, we propose the Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce color-separated features that emphasize the light differences among color channels and combine them with traditional color-mixed features via Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features to focus on channel differences, ensuring visual consistency across channels. We introduce dual-domain channel modulation to generate color-separated features and a wavelet transform followed by a vision state-space module to generate color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests while requiring fewer computational resources. We provide an anonymous online demo at https://xxxxxx2025.github.io/LALNet/.
Determining whether two sets of images belong to the same or different domains is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that often results in decreased model performance. This determination is also important for evaluating the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task, such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging, which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful, and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.
In this paper, we explore how radio frequency energy from vehicular communications can be exploited by an energy harvesting device (EHD) placed alongside the road to deliver data packets through a wireless connection to a remote access point. Based on updated local topology knowledge, we propose a cycle-based strategy to balance harvest and transmit phases at the EHD in order to maximize the average throughput. A theoretical derivation is carried out to determine the optimal strategy parameter setting, and it is used to investigate the effectiveness of the proposed approach over different scenarios, taking into account the road traffic intensity, the EHD battery capacity, the transmit power, and the data rate. Results show that regular traffic patterns, such as those created by vehicle platooning, can increase the obtained throughput by more than 30% with respect to irregular ones with the same average intensity. Blackout probability is also derived for the former scenario. The resulting tradeoff between higher average throughput and lower blackout probability shows that the proposed approach can be adopted for different applications by properly tuning the strategy parameters.
This paper proposes a framework that can animate and generate different scenarios for the mobility of a movable anchor following various paths in wireless sensor networks (WSNs). When researchers use NS-2 to simulate a single anchor-assisted localization model, they face the problem of creating the movement file of the movable anchor. The proposed framework solves this problem by allowing them to create movement scenarios for different trajectories. The framework lets the researcher set the parameters needed to simulate various static path models, which can be displayed through the graphical user interface. The researcher can also view the mobility of the movable anchor and control its speed and communication range. The proposed framework has been validated by comparing its results to NS-2 outputs and by comparing it against existing tools. Finally, this framework has been published on the Code Project website and downloaded by many users.
1. Obtaining data to train robust artificial intelligence (AI)-based models for species classification can be challenging, particularly for rare species. Data augmentation can boost classification accuracy by increasing the diversity of training data and is cheaper to obtain than expert-labelled data. However, many classic image-based augmentation techniques are not suitable for audio spectrograms. 2. We investigate two generative AI models as data augmentation tools to synthesise spectrograms and supplement audio data: Auxiliary Classifier Generative Adversarial Networks (ACGAN) and Denoising Diffusion Probabilistic Models (DDPMs). The latter performed particularly well in terms of both the realism of generated spectrograms and accuracy in a resulting classification task. 3. Alongside these new approaches, we present a new audio dataset of 640 hours of bird calls from wind farm sites in Ireland, approximately 800 samples of which have been labelled by experts. Wind farm data are particularly challenging for classification models given the background wind and turbine noise. 4. Training an ensemble of classification models on real and synthetic data combined gave 92.6% accuracy (and 90.5% with just the real data) when compared with highly confident BirdNET predictions. 5. Our approach can be used to augment acoustic signals for more species and other land-use types, and has the potential to bring about a step change in our capacity to develop reliable AI-based detection of rare species. Our code is available at https://github.com/gibbona1/SpectrogramGenAI.
The integration of non-terrestrial networks (NTNs) and terrestrial networks (TNs) is fundamental for extending connectivity to rural and underserved areas that lack coverage from traditional cellular infrastructure. However, this integration presents several challenges. For instance, TNs mainly operate in Time Division Duplexing (TDD), but for NTNs via satellites, TDD is complicated by synchronization problems in large cells and by the significant impact of guard periods and long propagation delays. In this paper, we propose a novel slot allocation mechanism to enable TDD in NTNs. This approach allows additional transmissions to be allocated during the guard period between a downlink slot and the corresponding uplink slot to reduce the overhead, provided that they do not interfere with other concurrent transmissions. Moreover, we propose two scheduling methods to select the users that transmit, based on considerations related to the Signal-to-Noise Ratio (SNR) or the propagation delay. Simulations demonstrate that our proposal can increase the network capacity compared to a benchmark scheme that does not schedule transmissions in guard periods.
A theorem is proved to verify incremental stability of a feedback system via a homotopy from a known incrementally stable system. A first corollary of that result is that incremental stability may be verified by separation of Scaled Relative Graphs, correcting two assumptions in [1, Theorem 2]. A second corollary provides an incremental version of the classical IQC stability theorem.
This paper presents a novel approach for optimal control of nonlinear stochastic systems using infinitesimal generator learning within infinite-dimensional reproducing kernel Hilbert spaces. Our learning framework leverages data samples of system dynamics and stage cost functions, with only control penalties and constraints provided. The proposed method directly learns the diffusion operator of a controlled Fokker-Planck-Kolmogorov equation in an infinite-dimensional hypothesis space. This operator models the continuous-time evolution of the probability measure of the control system's state. We demonstrate that this approach seamlessly integrates with modern convex operator-theoretic Hamilton-Jacobi-Bellman recursions, enabling a data-driven solution to the optimal control problem. Furthermore, our statistical learning framework includes nonparametric estimators for uncontrolled forward infinitesimal generators as a special case. Numerical experiments, ranging from synthetic differential equations to simulated robotic systems, showcase the advantages of our approach compared to both modern data-driven and classical nonlinear programming methods for optimal control.
We study how to synthesize a robust and safe policy for autonomous systems under signal temporal logic (STL) tasks in adversarial settings against unknown dynamic agents. To ensure worst-case STL satisfaction, we propose STLGame, a framework that models the multi-agent system as a two-player zero-sum game, where the ego agents try to maximize the STL satisfaction and the other agents minimize it. STLGame aims to find a Nash equilibrium policy profile, which is the best case in terms of robustness against unseen opponent policies, using the fictitious self-play (FSP) framework. FSP iteratively converges to a Nash profile, even in games set in continuous state-action spaces. We propose a gradient-based method with differentiable STL formulas, which is crucial in continuous settings for approximating the best responses at each iteration of FSP. We demonstrate this key aspect experimentally by comparing against reinforcement-learning-based methods for finding the best response. Experiments on two standard dynamical system benchmarks, Ackermann steering vehicles and autonomous drones, demonstrate that our converged policy is almost unexploitable and robust to various unseen opponents' policies. All code and additional experimental results can be found on our project website: https://sites.google.com/view/stlgame
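Making STL robustness differentiable, as required for the gradient-based best responses, is typically done by replacing the hard min/max in the robustness semantics with smooth counterparts. A sketch for an "always" (G) predicate; the predicate, threshold, and temperature are illustrative assumptions.

```python
import torch

def soft_min(x, temp=10.0):
    """Smooth stand-in for min, keeping STL robustness differentiable."""
    return -torch.logsumexp(-temp * x, dim=-1) / temp

def always_robustness(signal, threshold, temp=10.0):
    """Robustness of G(signal > threshold): min over per-step margins,
    smoothed. Predicate and temperature are illustrative assumptions."""
    return soft_min(signal - threshold, temp)

traj = torch.tensor([0.8, 0.5, 0.9], requires_grad=True)
rho = always_robustness(traj, threshold=0.2)
rho.backward()          # gradients flow back to the trajectory/policy
```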
The importance of quantifying uncertainty in deep networks has become paramount for reliable real-world applications. In this paper, we propose a method to improve uncertainty estimation in medical Image-to-Image (I2I) translation. Our model integrates aleatoric uncertainty and employs Uncertainty-Aware Regularization (UAR) inspired by simple priors to refine uncertainty estimates and enhance reconstruction quality. We show that by leveraging simple priors on parameters, our approach captures more robust uncertainty maps, effectively refining them to indicate precisely where the network encounters difficulties, while being less affected by noise. Our experiments demonstrate that UAR not only improves translation performance, but also provides better uncertainty estimations, particularly in the presence of noise and artifacts. We validate our approach using two medical imaging datasets, showcasing its effectiveness in maintaining high confidence in familiar regions while accurately identifying areas of uncertainty in novel/ambiguous scenarios.
Channel Charting is a dimensionality reduction technique that learns to reconstruct a low-dimensional, physically interpretable map of the radio environment by taking advantage of similarity relationships found in high-dimensional channel state information (CSI). One particular family of Channel Charting methods relies on pseudo-distances between measured CSI datapoints, computed using dissimilarity metrics. We propose several techniques to improve the performance of dissimilarity-metric-based Channel Charting. First, we address an issue related to the discrepancy between Euclidean distances and geodesic distances that occurs when applying dissimilarity-metric-based Channel Charting to datasets with nonconvex low-dimensional structure. Furthermore, we incorporate the uncertainty of dissimilarities into the learning process by modeling dissimilarities not as deterministic quantities, but as probability distributions. Our framework facilitates the combination of multiple dissimilarity metrics in a consistent manner. Additionally, latent-space dynamics such as constrained acceleration due to physical inertia are easily taken into account through changes in the training procedure. We demonstrate the achieved performance improvements for localization applications on a measured channel dataset.
This paper proposes a slot-based energy storage approach for decision-making in the context of an off-grid telecommunication operator. We consider network systems powered by solar panels, where harvested energy is stored in a battery and surplus energy can be sold when the battery is fully charged. To reflect real-world conditions, we account for non-stationary energy arrivals and service demands that depend on the time of day, as well as the failure states of the PV panels. The network operator we model faces two conflicting objectives: maintaining the operation of its infrastructure and selling (or supplying to other networks) surplus energy from fully charged batteries. To address these challenges, we develop a slot-based Markov Decision Process (MDP) model that incorporates positive rewards for energy sales as well as penalties for energy loss and battery depletion. This slot-based MDP follows a specific structure we have previously proven to be efficient in terms of computational performance and accuracy. From this model, we derive the optimal policy that balances these conflicting objectives and maximizes the average reward function. Additionally, we present results comparing different cities and months, which the operator can consider when deploying its infrastructure to maximize rewards based on location-specific energy availability and seasonal variations.
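While the paper optimizes the average reward, a discounted value iteration over a small slot-based MDP conveys the structure of the decision problem. In this sketch the states, actions, transition probabilities, and rewards are toy assumptions, not the paper's model.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Discounted value iteration on a finite MDP with transitions
    P[a, s, s'] and rewards R[a, s]; a stand-in for the paper's
    average-reward slot-based model."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * np.einsum('ast,t->as', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)    # values and greedy policy
        V = V_new

# Toy instance: 2 actions (hold / sell), 3 battery levels (all assumed).
P = np.array([[[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]],
              [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.9, 0.1]]])
R = np.array([[0.0, 0.0, -1.0],    # hold: penalty for losing surplus when full
              [-2.0, 1.0, 3.0]])   # sell: reward grows with stored energy
V, policy = value_iteration(P, R)
```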
Graph Neural Networks (GNNs) have emerged as a promising tool to handle data exhibiting an irregular structure. However, most GNN architectures perform well on homophilic datasets, where the labels of neighboring nodes are likely to be the same. In recent years, an increasing body of work has been devoted to the development of GNN architectures for heterophilic datasets, where labels do not exhibit this low-pass behavior. In this work, we create a new graph in which nodes are connected if they share structural characteristics, implying a higher chance of sharing labels, and then use this new graph in the GNN architecture. To do this, we compute the k-nearest neighbors graph according to distances between structural features, which are either (i) role-based, such as the degree, or (ii) global, such as centrality measures. Experiments show that the labels are smoother in this newly defined graph and that the performance of GNN architectures improves when using this alternative structure.
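The graph construction described above reduces to: compute structural features per node, then connect each node to its k nearest neighbors in feature space. A sketch with degree and two centrality measures as features; the exact feature set and k are assumptions.

```python
import networkx as nx
import numpy as np
from sklearn.neighbors import kneighbors_graph

def structural_knn_graph(G, k=5):
    """Connect nodes with similar structural features: degree plus two
    centrality measures here; the exact feature set is an assumption."""
    nodes = list(G.nodes())
    deg = dict(G.degree())
    btw = nx.betweenness_centrality(G)
    clo = nx.closeness_centrality(G)
    feats = np.array([[deg[v], btw[v], clo[v]] for v in nodes])
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)   # comparable scales
    A = kneighbors_graph(feats, n_neighbors=k, mode='connectivity')
    return nx.from_scipy_sparse_array(A)   # nodes relabeled 0..n-1

G_new = structural_knn_graph(nx.karate_club_graph())
```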
One significant challenge in research is collecting a large amount of data and learning the underlying relationship between the input and output variables. This paper outlines the process of collecting and validating a dataset designed to determine the angle of arrival (AoA) using Bluetooth Low Energy (BLE) technology. The data, collected in a laboratory setting, is intended to approximate real-world industrial scenarios. This paper discusses the data collection process, the structure of the dataset, and the methodology adopted for automating sample labeling for supervised learning. The collected samples and the process of generating ground truth (GT) labels were validated using the Texas Instruments (TI) phase difference of arrival (PDoA) implementation on the data, yielding a mean absolute error (MAE) of $25.71^\circ$ at one of the heights without obstacles. Distance estimation over BLE was implemented using a Gaussian Process Regression algorithm, yielding an MAE of $0.174$ m.
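Distance estimation with Gaussian Process Regression can be sketched with scikit-learn as below; the synthetic features stand in for the BLE-derived inputs (which are not enumerated here), and the kernel choice is an assumption.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for BLE-derived features and distances in metres.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 4))
d = 5.0 + 3.0 * X[:, 0] + 0.2 * rng.standard_normal(200)
X_tr, X_te, d_tr, d_te = X[:150], X[150:], d[:150], d[150:]

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_tr, d_tr)
d_pred, d_std = gpr.predict(X_te, return_std=True)   # mean and uncertainty
print("MAE (m):", np.abs(d_pred - d_te).mean())
```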
Transformers, known for their attention mechanisms, have proven highly effective at focusing on critical elements within complex data. This capability can be used effectively to address time-varying channels in wireless communication systems. In this work, we introduce a channel-aware adaptive framework for semantic communication, where different regions of the image are encoded and compressed based on their semantic content. By employing vision transformers, we interpret the attention mask as a measure of the semantic content of the patches and dynamically categorize the patches to be compressed at various rates as a function of the instantaneous channel bandwidth. Our method enhances communication efficiency by adapting the encoding resolution to the content's relevance, ensuring that even in highly constrained environments, critical information is preserved. We evaluate the proposed adaptive transmission framework using the TinyImageNet dataset, measuring both reconstruction quality and accuracy. The results demonstrate that our approach maintains high semantic fidelity while optimizing bandwidth, providing an effective solution for transmitting multi-resolution data under limited bandwidth conditions.
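A simple way to realize attention-driven rate adaptation is to rank patches by their attention scores and assign higher compression rates to the most salient ones until the instantaneous bandwidth budget is spent. The candidate rate set and greedy allocation below are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def allocate_patch_rates(attention, budget_bits, rates=(0, 2, 4, 8)):
    """Greedy allocation: rank patches by attention and give the most
    salient ones the highest rate the bandwidth budget still allows."""
    alloc = np.zeros(len(attention), dtype=int)
    remaining = budget_bits
    for idx in np.argsort(attention)[::-1]:      # most important first
        rate = max((r for r in rates if r <= remaining), default=0)
        alloc[idx] = rate
        remaining -= rate
    return alloc

att = np.random.default_rng(0).random(16)        # 16 patch scores
print(allocate_patch_rates(att, budget_bits=48)) # six patches get 8 bits
```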