Medical Ultrasound (US) imaging has seen increasing demands over the past years, becoming one of the most preferred imaging modalities in clinical practice due to its affordability, portability, and real-time capabilities. However, it faces several challenges that limit its applicability, such as operator dependency, variability in interpretation, and limited resolution, which are amplified by the low availability of trained experts. This calls for the need of autonomous systems that are capable of reducing the dependency on humans for increased efficiency and throughput. Reinforcement Learning (RL) comes as a rapidly advancing field under Artificial Intelligence (AI) that allows the development of autonomous and intelligent agents that are capable of executing complex tasks through rewarded interactions with their environments. Existing surveys on advancements in the US scanning domain predominantly focus on partially autonomous solutions leveraging AI for scanning guidance, organ identification, plane recognition, and diagnosis. However, none of these surveys explore the intersection between the stages of the US process and the recent advancements in RL solutions. To bridge this gap, this review proposes a comprehensive taxonomy that integrates the stages of the US process with the RL development pipeline. This taxonomy not only highlights recent RL advancements in the US domain but also identifies unresolved challenges crucial for achieving fully autonomous US systems. This work aims to offer a thorough review of current research efforts, highlighting the potential of RL in building autonomous US solutions while identifying limitations and opportunities for further advancements in this field.
Cancer cachexia is a common metabolic disorder characterized by severe muscle atrophy which is associated with poor prognosis and quality of life. Monitoring skeletal muscle area (SMA) longitudinally through computed tomography (CT) scans, an imaging modality routinely acquired in cancer care, is an effective way to identify and track this condition. However, existing tools often lack full automation and exhibit inconsistent accuracy, limiting their potential for integration into clinical workflows. To address these challenges, we developed SMAART-AI (Skeletal Muscle Assessment-Automated and Reliable Tool-based on AI), an end-to-end automated pipeline powered by deep learning models (nnU-Net 2D) trained on mid-third lumbar level CT images with 5-fold cross-validation, ensuring generalizability and robustness. SMAART-AI incorporates an uncertainty-based mechanism to flag high-error SMA predictions for expert review, enhancing reliability. We combined the SMA, skeletal muscle index, BMI, and clinical data to train a multi-layer perceptron (MLP) model designed to predict cachexia at the time of cancer diagnosis. Tested on the gastroesophageal cancer dataset, SMAART-AI achieved a Dice score of 97.80% +/- 0.93%, with SMA estimated across all four datasets in this study at a median absolute error of 2.48% compared to manual annotations with SliceOmatic. Uncertainty metrics-variance, entropy, and coefficient of variation-strongly correlated with SMA prediction errors (0.83, 0.76, and 0.73 respectively). The MLP model predicts cachexia with 79% precision, providing clinicians with a reliable tool for early diagnosis and intervention. By combining automation, accuracy, and uncertainty awareness, SMAART-AI bridges the gap between research and clinical application, offering a transformative approach to managing cancer cachexia.
We investigate the potential of enhancing small (<20 kg) drone endurance by exploiting the high energy density of hydrocarbons using a prototype generator based on commercial-off-the-shelf (COTS) thermoelectric energy conversion technology. A proof-of-concept prototype was developed to vet design and engineering challenges and to bolster validity of resultant conclusions. The combination of the prototype performance and modeling suggests that endurance augmentation remains a difficult technical challenge with no clear immediate remedy despite many expectant alternatives. Across a sample of representative drones including ground- and air-based, multicopter and fixed wing drones, we report the following: from their current maximum values of 12%, thermoelectric (TE) generator module efficiencies must increase by over two times to achieve endurance parity with lithium batteries for VTOL multicopters. On the other hand, current TE efficiencies can compete with lithium batteries for some low power fixed wing and ground-based drones. Technical contributors for these results include weight of non-energy contributing components, low specific power and the associated tradeoff between specific power and specific energy due to fuel mass fraction, and lastly, low efficiencies.
Low-count positron emission tomography (LCPET) imaging can reduce patients' exposure to radiation but often suffers from increased image noise and reduced lesion detectability, necessitating effective denoising techniques. Diffusion models have shown promise in LCPET denoising for recovering degraded image quality. However, training such models requires large and diverse datasets, which are challenging to obtain in the medical domain. To address data scarcity and privacy concerns, we combine diffusion models with federated learning -- a decentralized training approach where models are trained individually at different sites, and their parameters are aggregated on a central server over multiple iterations. The variation in scanner types and image noise levels within and across institutions poses additional challenges for federated learning in LCPET denoising. In this study, we propose a novel noise-embedded federated learning diffusion model (Fed-NDIF) to address these challenges, leveraging a multicenter dataset and varying count levels. Our approach incorporates liver normalized standard deviation (NSTD) noise embedding into a 2.5D diffusion model and utilizes the Federated Averaging (FedAvg) algorithm to aggregate locally trained models into a global model, which is subsequently fine-tuned on local datasets to optimize performance and obtain personalized models. Extensive validation on datasets from the University of Bern, Ruijin Hospital in Shanghai, and Yale-New Haven Hospital demonstrates the superior performance of our method in enhancing image quality and improving lesion quantification. The Fed-NDIF model shows significant improvements in PSNR, SSIM, and NMSE of the entire 3D volume, as well as enhanced lesion detectability and quantification, compared to local diffusion models and federated UNet-based models.
TThe paper proposes the Consensus Augmented Lagrange Alternating Direction Inexact Newton (Consensus ALADIN) algorithm, a novel approach for solving distributed consensus optimization problems (DC). Consensus ALADIN allows each agent to independently solve its own nonlinear programming problem while coordinating with other agents by solving a consensus quadratic programming (QP) problem. Building on this, we propose Broyden-Fletcher-Goldfarb-Shanno (BFGS) Consensus ALADIN, a communication-and-computation-efficient Consensus ALADIN.BFGS Consensus ALADIN improves communication efficiency through BFGS approximation techniques and enhances computational efficiency by deriving a closed form for the consensus QP problem. Additionally, by replacing the BFGS approximation with a scaled identity matrix, we develop Reduced Consensus ALADIN, a more computationally efficient variant. We establish the convergence theory for Consensus ALADIN and demonstrate its effectiveness through application to a non-convex sensor allocation problem.
We study the problem of stabilizing an unknown partially observable linear time-invariant (LTI) system. For fully observable systems, leveraging an unstable/stable subspace decomposition approach, state-of-art sample complexity is independent from system dimension $n$ and only scales with respect to the dimension of the unstable subspace. However, it remains open whether such sample complexity can be achieved for partially observable systems because such systems do not admit a uniquely identifiable unstable subspace. In this paper, we propose LTS-P, a novel technique that leverages compressed singular value decomposition (SVD) on the ''lifted'' Hankel matrix to estimate the unstable subsystem up to an unknown transformation. Then, we design a stabilizing controller that integrates a robust stabilizing controller for the unstable mode and a small-gain-type assumption on the stable subspace. We show that LTS-P stabilizes unknown partially observable LTI systems with state-of-the-art sample complexity that is dimension-free and only scales with the number of unstable modes, which significantly reduces data requirements for high-dimensional systems with many stable modes.
Near-tissue computing requires sensor-level processing of high-resolution images, essential for real-time biomedical diagnostics and surgical guidance. To address this need, we introduce a novel Capacitive Transimpedance Amplifier-based In-Pixel Computing (CTIA-IPC) architecture. Our design leverages CTIA pixels that are widely used for biomedical imaging owing to the inherent advantages of excellent linearity, low noise, and robust operation under low-light conditions. We augment CTIA pixels with IPC to enable precise deep learning computations including multi-channel, multi-bit convolution operations along with integrated batch normalization (BN) and Rectified Linear Unit (ReLU) functionalities in the peripheral ADC (Analog to Digital Converters). This design improves the linearity of Multiply and Accumulate (MAC) operations while enhancing computational efficiency. Leveraging 3D integration to embed pixel circuitry and weight storage, CTIA-IPC maintains pixel density comparable to standard CTIA designs. Moreover, our algorithm-circuit co-design approach enables efficient real-time diagnostics and AI-driven medical analysis. Evaluated on the EndoVis tissu dataset (1280x1024), CTIA-IPC achieves approximately 12x reduction in data bandwidth, yielding segmentation IoUs of 75.91% (parts), and 28.58% (instrument)-a minimal accuracy reduction (1.3%-2.5%) compared to baseline methods. Achieving 1.98 GOPS throughput and 3.39 GOPS/W efficiency, our CTIA-IPC architecture offers a promising computational framework tailored specifically for biomedical near-tissue computing.
System identification is a fundamental problem in control and learning, particularly in high-stakes applications where data efficiency is critical. Classical approaches, such as the ordinary least squares estimator (OLS), achieve an $O(1/\sqrt{T})$ convergence rate under Gaussian noise assumptions, where $T$ is the number of samples. This rate has been shown to match the lower bound. However, in many practical scenarios, noise is known to be bounded, opening the possibility of improving sample complexity. In this work, we establish the minimax lower bound for system identification under bounded noise, proving that the $O(1/T)$ convergence rate is indeed optimal. We further demonstrate that OLS remains limited to an {$\Omega(1/\sqrt{T})$} convergence rate, making it fundamentally suboptimal in the presence of bounded noise. Finally, we instantiate two natural variations of OLS that obtain the optimal sample complexity.
In this paper, we propose a depth-aided color image inpainting method in the quaternion domain, called depth-aided low-rank quaternion matrix completion (D-LRQMC). In conventional quaternion-based inpainting techniques, the color image is expressed as a quaternion matrix by using the three imaginary parts as the color channels, whereas the real part is set to zero and has no information. Our approach incorporates depth information as the real part of the quaternion representations, leveraging the correlation between color and depth to improve the result of inpainting. In the proposed method, we first restore the observed image with the conventional LRQMC and estimate the depth of the restored result. We then incorporate the estimated depth into the real part of the observed image and perform LRQMC again. Simulation results demonstrate that the proposed D-LRQMC can improve restoration accuracy and visual quality for various images compared to the conventional LRQMC. These results suggest the effectiveness of the depth information for color image processing in quaternion domain.
Medical vision foundational models are used for a wide variety of tasks, including medical image segmentation and registration. This work evaluates the ability of these models to predict disease progression using a simple linear probe. We hypothesize that intermediate layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. Beyond demonstrating that these features are useful for disease progression prediction, we also show that registration model features do not require spatially aligned input images. However, for segmentation models, spatial alignment is essential for optimal performance. Our findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.
This paper proposes a novel robust adaptive model predictive controller for on-orbit dislodging. We consider the scenario where a servicer, equipped with a robot arm, must dislodge a client, a time-varying system composed of an underpowered jammed solar panel with a hybrid hinge system on a space station. Our approach leverages online set-membership identification to reduce the uncertainty to provide robust safety guarantees during dislodging despite bounded disturbances while balancing exploration and exploitation effectively in the parameter space. The feasibility of the developed robust adaptive MPC method is also examined through dislodging simulations and hardware experiments in zero-gravity and gravity environments, respectively. In addition, the advantages of our method are shown through comparison experiments with several state-of-the-art control schemes for both accuracy of parameter estimation and control performance.
We develop the mathematical properties of a multifractal analysis of data based on the weak scaling exponent. The advantage of this analysis is that it does not require any a priori global regularity assumption on the analyzed signal, in contrast with the previously used H{\"o}lder or p-exponents. As an illustration, we show that this technique allows one to perform a multifractal analysis of MEG signals, which records electromagnetic brain activity, that was not theoretically valid using the formerly introduced methods based on H{\"o}lder or p-exponents.
This paper presents a method, namely mmTracking, for device trajectory tracking in a millimeter wave (mmWave) communication system. In mmTracking, the base station (BS) relies on one line-of-sight (LoS) path and at least two non-line-of-sight (NLoS) paths, which are reflected off two walls respectively, of the uplink channel to track the location of a mobile device versus time. There are at least three radio frequency (RF) chains at the BS. Analog phased array with narrow and adjustable receive beam is connected to each RF chain to capture one signal path, where the angle of arrival (AoA) can be roughly estimated. Due to the carrier frequency offset between the transmitter and the BS, the Doppler frequency of each path could hardly be estimated accurately. Instead, the differences of Doppler frequencies of the three paths can be estimated with much better accuracy. Therefore, a trajectory tracking method based on the Doppler difference and AoA estimations is proposed in mmTracking. Experimental results in a typical indoor environment demonstrate that the average error of transmitter localization and trajectory tracking is less than 20 cm.
3D non-Cartesian trajectories offer several advantages over rectilinear trajectories for rapid volumetric imaging, including improved sampling efficiency and greater robustness to motion, flow, and aliasing artifacts. In this paper, we present a unified framework for designing three widely used non-Cartesian trajectories: 3D Radial, 3D Cones, and Stack-of-Spirals. Our approach is based on the idea that a non-Cartesian trajectory can be interpreted as a discretized version of an analytic coordinate defined by a set of template trajectories. Equivalently, the analytic coordinate is conceptualized as a non-Cartesian trajectory composed of an infinite number of copies of a set of template trajectories. The discretization is accomplished by constructing a continuous spiral path on a surface and sampling points along this path at unit intervals, leaving only the essential spokes/interleaves, thereby yielding the practical non-Cartesian trajectory from the analytic coordinate. One of the advantages of our approach is that the analytic density compensation factor can be readily derived using Jacobian determinants, which quantify changes in unit areas due to the transformation from the analytic coordinate to the Cartesian grid. Additionally, the proposed approach derives analytic formulae to compute the number of readouts based on prescribed parameters, allowing us to specify the trajectory's acceleration factor for a given total scan time. Furthermore, variable-density sampling can be easily incorporated, and spokes/interleaves are smoothly distributed in k-space along the derived spiral path, even for a small number of readouts. In a preliminary phantom study, the proposed method demonstrated improved sampling efficiency and image quality compared to the conventional approach.
Deep learning (DL) has introduced a new paradigm in multiple-input multiple-output (MIMO) detection, balancing performance and complexity. However, the practical deployment of DL-based detectors is hindered by poor generalization, necessitating costly retraining for different devices and scenarios. To address this challenge, this paper presents a novel knowledge transfer technique, termed learngene, for the design of a DL-based MIMO detector and proposes an efficient deployment framework. The proposed detector, SDNet, leverages zero-forcing detection outputs and least squares-estimated channel state information (CSI) as inputs. It is further optimized through a collective-individual paradigm to enhance knowledge transfer. In this paradigm, learngene, a reusable neural network (NN) segment, encapsulates detection meta-knowledge acquired from large-scale collective models trained by manufacturers. This segment can then be distributed to device-specific teams. By integrating learngene into different lightweight individual models, detection meta-knowledge is efficiently transferred across heterogeneous NNs, enabling adaptation to diverse devices and scenarios. Simulation results demonstrate that the proposed scheme enhances performance, enables rapid adaptation, and ensures high scalability, with transferred parameters comprising only 10.8% of the total model size.
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Accurate segmentation of pulmonary vessels plays a very critical role in diagnosing and assessing various lung diseases. In clinical practice, diagnosis is typically carried out using CTPA images. However, there is a lack of high-precision pulmonary vessel segmentation algorithms for CTPA, and pulmonary vessel segmentation for NCCT poses an even greater challenge. In this study, we propose a 3D image segmentation algorithm for automated pulmonary vessel segmentation from both contrast and non-contrast CT images. In the network, we designed a Vessel Lumen Structure Optimization Module (VLSOM), which extracts the centerline of vessels and adjusts the weights based on the positional information and adds a Cl-Dice-Loss to supervise the stability of the vessels structure. In addition, we designed a method for generating vessel GT from CTPA to NCCT for training models that support both CTPA and NCCT. In this work, we used 427 sets of high-precision annotated CT data from multiple vendors and countries. Finally, our experimental model achieved Cl-Recall, Cl-DICE and Recall values of 0.879, 0.909, 0.934 (CTPA) and 0.928, 0.936, 0.955 (NCCT) respectively. This shows that our model has achieved good performance in both accuracy and completeness of pulmonary vessel segmentation. In clinical visual evaluation, our model also had good segmentation performance on various disease types and can assist doctors in medical diagnosis, verifying the great potential of this method in clinical application.
Computer vision has transformed medical diagnosis, treatment, and research through advanced image processing and machine learning techniques. Fracture classification, a critical area in healthcare, has greatly benefited from these advancements, yet accurate detection is challenged by complex patterns and image noise. Bit plane slicing enhances medical images by reducing noise interference and extracting informative features. This research explores partial denoising techniques to provide practical solutions for improved fracture analysis, ultimately enhancing patient care. The study explores deep learning model DenseNet and handcrafted feature extraction. Decision Tree and Random Forest, were employed to train and evaluate distinct image representations. These include the original image, the concatenation of the four bit planes from the LSB as well as MSB, the fully denoised image, and an image consisting of 6 bit planes from MSB and 2 denoised bit planes from LSB. The purpose of forming these diverse image representations is to analyze SNR as well as classification accuracy and identify the bit planes that contain the most informative features. Moreover, the study delves into the significance of partial denoising techniques in preserving crucial features, leading to improvements in classification results. Notably, this study shows that employing the Random Forest classifier, the partially denoised image representation exhibited a testing accuracy of 95.61% surpassing the performance of other image representations. The outcomes of this research provide valuable insights into the development of efficient preprocessing, feature extraction and classification approaches for fracture identification. By enhancing diagnostic accuracy, these advancements hold the potential to positively impact patient care and overall medical outcomes.
Accurate segmentation of ultrasound (US) images of the cervical muscles is crucial for precision healthcare. The demand for automatic computer-assisted methods is high. However, the scarcity of labeled data hinders the development of these methods. Advanced semi-supervised learning approaches have displayed promise in overcoming this challenge by utilizing labeled and unlabeled data. This study introduces a novel semi-supervised learning (SSL) framework that integrates dual neural networks. This SSL framework utilizes both networks to generate pseudo-labels and cross-supervise each other at the pixel level. Additionally, a self-supervised contrastive learning strategy is introduced, which employs a pair of deep representations to enhance feature learning capabilities, particularly on unlabeled data. Our framework demonstrates competitive performance in cervical segmentation tasks. Our codes are publicly available on https://github.com/13204942/SSL\_Cervical\_Segmentation.
Artificial intelligence (AI) is increasingly being used for medical imaging tasks. However, there can be biases in the resulting models, particularly when they were trained using imbalanced training datasets. One such example has been the strong race bias effect in cardiac magnetic resonance (CMR) image segmentation models. Although this phenomenon has been reported in a number of publications, little is known about the effectiveness of bias mitigation algorithms in this domain. We aim to investigate the impact of common bias mitigation methods to address bias between Black and White subjects in AI-based CMR segmentation models. Specifically, we use oversampling, importance reweighing and Group DRO as well as combinations of these techniques to mitigate the race bias. Furthermore, motivated by recent findings on the root causes of AI-based CMR segmentation bias, we evaluate the same methods using models trained and evaluated on cropped CMR images. We find that bias can be mitigated using oversampling, significantly improving performance for the underrepresented Black subjects whilst not significantly reducing the majority White subjects' performance. Group DRO also improves performance for Black subjects but not significantly, while reweighing decreases performance for Black subjects. Using a combination of oversampling and Group DRO also improves performance for Black subjects but not significantly. Using cropped images increases performance for both races and reduces the bias, whilst adding oversampling as a bias mitigation technique with cropped images reduces the bias further.
Gastric cancer ranks as the fifth most common and fourth most lethal cancer globally, with a dismal 5-year survival rate of approximately 20%. Despite extensive research on its pathobiology, the prognostic predictability remains inadequate, compounded by pathologists' high workload and potential diagnostic errors. Thus, automated, accurate histopathological diagnosis tools are crucial. This study employs Machine Learning and Deep Learning techniques to classify histopathological images into healthy and cancerous categories. Using handcrafted and deep features with shallow learning classifiers on the GasHisSDB dataset, we offer a comparative analysis and insights into the most robust and high-performing combinations of features and classifiers for distinguishing between normal and abnormal histopathological images without fine-tuning strategies. With the RF classifier, our approach can reach F1 of 93.4%, demonstrating its validity.
The detection of blood disorders often hinges upon the quantification of specific blood cell types. Variations in cell counts may indicate the presence of pathological conditions. Thus, the significance of developing precise automatic systems for blood cell enumeration is underscored. The investigation focuses on a novel approach termed DE-ViT. This methodology is employed in a Few-Shot paradigm, wherein training relies on a limited number of images. Two distinct datasets are utilised for experimental purposes: the Raabin-WBC dataset for Leukocyte detection and a local dataset for Schistocyte identification. In addition to the DE-ViT model, two baseline models, Faster R-CNN 50 and Faster R-CNN X 101, are employed, with their outcomes being compared against those of the proposed model. While DE-ViT has demonstrated state-of-the-art performance on the COCO and LVIS datasets, both baseline models surpassed its performance on the Raabin-WBC dataset. Moreover, only Faster R-CNN X 101 yielded satisfactory results on the SC-IDB. The observed disparities in performance may possibly be attributed to domain shift phenomena.
Hyperspectral unmixing is the analytical process of determining the pure materials and estimating the proportions of such materials composed within an observed mixed pixel spectrum. We can unmix mixed pixel spectra using linear and nonlinear mixture models. Ordinary least squares (OLS) regression serves as the foundation for many linear mixture models employed in Hyperspectral Image analysis. Though variations of OLS are implemented, studies rarely address the underlying assumptions that affect results. This paper provides an in depth discussion on the assumptions inherently endorsed by the application of OLS regression. We also examine variations of OLS models stemming from highly effective approaches in spectral unmixing -- sparse regression, iterative feature search strategies and Mathematical programming. These variations are compared to a novel unmixing approach called HySUDeB. We evaluated each approach's performance by computing the average error and precision of each model. Additionally, we provide a taxonomy of the molecular structure of each mineral to derive further understanding into the detection of the target materials.
This paper studies the problem of hybrid holographic beamforming for sum-rate maximization in a communication system assisted by a reconfigurable holographic surface. Existing methodologies predominantly rely on gradient-based or approximation techniques necessitating iterative optimization for each update of the holographic response, which imposes substantial computational overhead. To address these limitations, we establish a mathematical relationship between the mean squared error (MSE) criterion and the holographic response of the RHS to enable alternating optimization based on the minimum MSE (MMSE). Our analysis demonstrates that this relationship exhibits a quadratic dependency on each element of the holographic beamformer. Exploiting this property, we derive closed-form optimal expressions for updating the holographic beamforming weights. Our complexity analysis indicates that the proposed approach exhibits only linear complexity in terms of the RHS size, thus, ensuring scalability for large-scale deployments. The presented simulation results validate the effectiveness of our MMSE-based holographic approach, providing useful insights.
Integrated sensing and communications (ISAC) is expected to play a major role in numerous future applications, e.g., smart cities. Leveraging native radar signals like the frequency modulated continuous wave (FMCW) waveform additionally for data transmission offers a highly efficient use of valuable physical radio frequency (RF) resources allocated for automotive radar applications. In this paper, we propose the adoption of higher-order modulation formats for data modulation onto an FMCW waveform and provide a comprehensive overview of the entire signal processing chain. We evaluate the impact of each component on the overall sensing performance. While alignment algorithms are essential for removing the information signal at the sensing receiver, they also introduce significant dispersion to the received signal. We analyze this effect in detail. Notably, we demonstrate that the impact of non-constant amplitude modulation on sensing performance is statistically negligible when the complete signal processing chain is considered. This finding highlights the potential for achieving high data rates in FMCW-ISAC systems without compromising the sensing capabilities.
Current end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms approximate the maximum a posteriori (MAP) estimate but cannot offer sampling from the posterior distribution, like diffusion models. By contrast, it is challenging for diffusion models to be trained in an E2E fashion. This paper introduces a Deep End-to-End Posterior ENergy (DEEPEN) framework, which enables MAP estimation as well as sampling. We learn the parameters of the posterior, which is the sum of the data consistency error and the negative log-prior distribution, using maximum likelihood optimization in an E2E fashion. The proposed approach does not require algorithm unrolling, and hence has a smaller computational and memory footprint than current E2E methods, while it does not require contraction constraints typically needed by current PnP methods. Our results demonstrate that DEEPEN offers improved performance than current E2E and PnP models in the MAP setting, while it also offers faster sampling compared to diffusion models. In addition, the learned energy-based model is observed to be more robust to changes in image acquisition settings.
Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems, however, existing small-scale and private datasets limit significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation. The dataset and code are available at https://github.com/mj129/CIPA.
In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework that has been deliberately designed to achieve high semantic similarity during image transmission while simultaneously minimizing the demand for bandwidth. By equipping ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into a high semantic content at the transmitter and precisely reconstruct the images, considering real-world fading and noise consideration at the receiver. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolution Neural Network (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The architecture based on the proposed ViT network achieves the Peak Signal-to-noise Ratio (PSNR) of 38 dB, which is higher than other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.
We present a novel cross-band modulation framework that combines 3D modulation in the RF domain with intensity modulation and direct detection in the optical domain, the first such integration to enhance communication reliability. By harnessing cross-band diversity, the framework optimizes symbol mapping across RF and optical links, significantly boosting mutual information (MI) and reducing symbol error probability (SEP). Two practical modulation schemes implement this framework, both using quadrature amplitude modulation in the RF subsystem. The first is a linear cross-band mapping scheme, where RF symbols are mapped to optical intensity values via an analytically tractable optimization that ensures O(1) detection complexity while minimizing SEP. The second employs a deep neural network-generated (DNN-Gen) 3D constellation with a custom loss function that adaptively optimizes symbol placement to maximize MI and minimize SEP. Although DNN-Gen incurs higher computational complexity than the linear approach, it adapts the 3D constellation to varying signal-to-noise ratios, yielding significant performance gains. Furthermore, we derive a theoretical MI benchmark for the linear scheme, offering insights into the fundamental limits of RF-optical cross-band communication. Extensive Monte Carlo simulations confirm that both schemes outperform SoA cross-band modulation techniques, including cross-band pulse amplitude modulation, with notable improvements. Additionally, DNN-Gen maintains high performance over a range of RF SNRs, lessening the need for exhaustive training at every operating condition. Overall, these results establish our cross-band modulation framework as a scalable, high-performance solution for next-generation hybrid RF-optical networks, balancing low complexity with optimized symbol mapping to maximize system reliability and efficiency.
Urban Air Mobility (UAM) offers a solution to current traffic congestion by using electric Vertical Takeoff and Landing (eVTOL) vehicles to provide on-demand air mobility in urban areas. Effective traffic management is crucial for efficient operation of UAM systems, especially for high-demand scenarios. In this paper, we present a centralized framework for conflict-free takeoff scheduling of eVTOLs in on-demand UAM systems. Specifically, we provide a scheduling policy, called VertiSync, which jointly schedules UAM vehicles for servicing trip requests and rebalancing, subject to safety margins and energy requirements. We characterize the system-level throughput of VertiSync, which determines the demand threshold at which the average waiting time transitions from being stable to being increasing over time. We show that the proposed policy maximizes throughput for sufficiently large fleet size and if the UAM network has a certain symmetry property. We demonstrate the performance of VertiSync through a case study for the city of Los Angeles, and show that it significantly reduces average passenger waiting time compared to a first-come first-serve scheduling policy.
Quantitative estimation of human joint motion in daily living spaces is essential for early detection and rehabilitation tracking of neuromusculoskeletal disorders (e.g., Parkinson's) and mitigating trip and fall risks for older adults. Existing approaches involve monitoring devices such as cameras, wearables, and pressure mats, but have operational constraints such as direct line-of-sight, carrying devices, and dense deployment. To overcome these limitations, we leverage gait-induced floor vibration to estimate lower-limb joint motion (e.g., ankle, knee, and hip flexion angles), allowing non-intrusive and contactless gait health monitoring in people's living spaces. To overcome the high uncertainty in lower-limb movement given the limited information provided by the gait-induced floor vibrations, we formulate a physics-informed graph to integrate domain knowledge of gait biomechanics and structural dynamics into the model. Specifically, different types of nodes represent heterogeneous information from joint motions and floor vibrations; Their connecting edges represent the physiological relationships between joints and forces governed by gait biomechanics, as well as the relationships between forces and floor responses governed by the structural dynamics. As a result, our model poses physical constraints to reduce uncertainty while allowing information sharing between the body and the floor to make more accurate predictions. We evaluate our approach with 20 participants through a real-world walking experiment. We achieved an average of 3.7 degrees of mean absolute error in estimating 12 joint flexion angles (38% error reduction from baseline), which is comparable to the performance of cameras and wearables in current medical practices.
With Highly Automated Driving (HAD), the driver can engage in non-driving-related tasks. In the event of a system failure, the driver is expected to reasonably regain control of the Automated Vehicle (AV). Incorrect system understanding may provoke misuse by the driver and can lead to vehicle-level hazards. ISO 21448, referred to as the standard for Safety of the Intended Functionality (SOTIF), defines misuse as usage of the system by the driver in a way not intended by the system manufacturer. Foreseeable Misuse (FM) implies anticipated system misuse based on the best knowledge about the system design and the driver behaviour. This is the underlying motivation to propose simulation-based testing of FM. The vital challenge is to perform a simulation-based testing for a SOTIF-related misuse scenario. Transverse Guidance Assist System (TGAS) is modelled for HAD. In the context of this publication, TGAS is referred to as the "system," and the driver is the human operator of the system. This publication focuses on implementing the Driver-Vehicle Interface (DVI) that permits the interactions between the driver and the system. The implementation and testing of a derived misuse scenario using the driving simulator ensure reasonable usage of the system by supporting the driver with unambiguous information on system functions and states so that the driver can conveniently perceive, comprehend, and act upon the information.
Deep Convolutional Neural Networks (CNNs) have significantly advanced deep learning, driving breakthroughs in computer vision, natural language processing, medical diagnosis, object detection, and speech recognition. Architectural innovations including 1D, 2D, and 3D convolutional models, dilated and grouped convolutions, depthwise separable convolutions, and attention mechanisms address domain-specific challenges and enhance feature representation and computational efficiency. Structural refinements such as spatial-channel exploitation, multi-path design, and feature-map enhancement contribute to robust hierarchical feature extraction and improved generalization, particularly through transfer learning. Efficient preprocessing strategies, including Fourier transforms, structured transforms, low-precision computation, and weight compression, optimize inference speed and facilitate deployment in resource-constrained environments. This survey presents a unified taxonomy that classifies CNN architectures based on spatial exploitation, multi-path structures, depth, width, dimensionality expansion, channel boosting, and attention mechanisms. It systematically reviews CNN applications in face recognition, pose estimation, action recognition, text classification, statistical language modeling, disease diagnosis, radiological analysis, cryptocurrency sentiment prediction, 1D data processing, video analysis, and speech recognition. In addition to consolidating architectural advancements, the review highlights emerging learning paradigms such as few-shot, zero-shot, weakly supervised, federated learning frameworks and future research directions include hybrid CNN-transformer models, vision-language integration, generative learning, etc. This review provides a comprehensive perspective on CNN's evolution from 2015 to 2025, outlining key innovations, challenges, and opportunities.
Control barrier functions (CBFs) play a crucial role in achieving the safety-critical control of robotic systems theoretically. However, most existing methods rely on the analytical expressions of unsafe state regions, which is often impractical for irregular and dynamic unsafe regions. In this paper, a novel CBF construction approach, called CoIn-SafeLink, is proposed based on cost-sensitive incremental random vector functional-link (RVFL) neural networks. By designing an appropriate cost function, CoIn-SafeLink achieves differentiated sensitivities to safe and unsafe samples, effectively achieving zero false-negative risk in unsafe sample classification. Additionally, an incremental update theorem for CoIn-SafeLink is proposed, enabling precise adjustments in response to changes in the unsafe region. Finally, the gradient analytical expression of the CoIn-SafeLink is provided to calculate the control input. The proposed method is validated on a 3-degree-of-freedom drone attitude control system. Experimental results demonstrate that the method can effectively learn the unsafe region boundaries and rapidly adapt as these regions evolve, with an update speed approximately five times faster than comparison methods. The source code is available at https://github.com/songqiaohu/CoIn-SafeLink.
While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.
Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high detection accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose the DEcision Feedback INcontExt Detection (DEFINED) solution as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts as pseudo-labels to improve the detection for subsequent symbols. Furthermore, we proposed another detection method where we combine ICL with Semi-Supervised Learning (SSL) to extract information from both labeled and unlabeled data during inference, thus avoiding the errors propagated during the decision feedback process of the original DEFINED. Extensive experiments across a broad range of wireless communication settings demonstrate that a small Transformer trained with DEFINED or IC-SSL achieves significant performance improvements over conventional methods, in some cases only needing a single pilot pair to achieve similar performance of the latter with more than 4 pilot pairs.
Trusted Execution Environments (TEEs) have emerged at the forefront of edge computing to combat the lack of trust between system components. Field Programmable Gate Arrays (FPGAs) are commonly used as edge computers but were not created with security as a primary consideration. Thus, FPGA-based edge computers are increasingly the target of cyberattacks. We analyze the existing literature to systematize the applications and features of FPGA-based TEEs. We identified 27 primary studies related to different types of System-on-Chip FPGA-based TEEs. Across a wide range of applications and features, the availability of extensible solutions is limited. Most solutions focus on specific features and applications, whereas few solutions focus on feature-rich, comprehensive TEEs that can be utilized across computer systems. Whether TEEs are specific or extensible, the paucity of published studies provides evidence of research gaps. This SoK delineates these gaps revealing opportunities for researchers and developers.
Digital system design lectures are mandatory in the electrical and electronics engineering curriculum. Besides HDL simulators and viewers, FPGA boards are necessary for the real implementation of HDL, which were previously costly for students. With the emergence of low-cost FPGA boards, the use of take-home labs is increasing. The COVID-19 pandemic has further accelerated this process. Traditional lab sessions have limitations, prompting the exploration of take-home lab kits to enhance learning flexibility and engagement. This study aims to evaluate the effectiveness of a low-cost take-home lab kit, consisting of a Tang Nano 9K FPGA board and a Saleae Logic Analyzer, in improving students' practical skills and sparking curiosity in digital system design. The research was conducted in the EEE 303 Digital Design lecture. Students used the Tang Nano 9K FPGA and Saleae Logic Analyzer for a term project involving PWM signal generation. Data was collected through a survey assessing the kit's impact on learning and engagement. Positive Acceptance: 75% of students agreed or strongly agreed that the take-home lab kit was beneficial. Preference for Lab Types: 60% of students preferred classical weekly lab hours over take-home labs. Increased Curiosity: 65% of students conducted additional, unassigned experiments, indicating heightened interest and engagement. The take-home lab kit effectively aids in learning practical aspects of digital system design and stimulates curiosity, though some students prefer traditional lab sessions for group work.
This paper investigates a subgradient-based algorithm to solve the system identification problem for linear time-invariant systems with non-smooth objectives. This is essential for robust system identification in safety-critical applications. While existing work provides theoretical exact recovery guarantees using optimization solvers, the design of fast learning algorithms with convergence guarantees for practical use remains unexplored. We analyze the subgradient method in this setting where the optimization problems to be solved change over time as new measurements are taken, and we establish linear convergence results for both the best and Polyak step sizes after a burn-in period. Additionally, we characterize the asymptotic convergence of the best average sub-optimality gap under diminishing and constant step sizes. Finally, we compare the time complexity of standard solvers with the subgradient algorithm and support our findings with experimental results. This is the first work to analyze subgradient algorithms for system identification with non-smooth objectives.
This paper introduces Q-learning with gradient target tracking, a novel reinforcement learning framework that provides a learned continuous target update mechanism as an alternative to the conventional hard update paradigm. In the standard deep Q-network (DQN), the target network is a copy of the online network's weights, held fixed for a number of iterations before being periodically replaced via a hard update. While this stabilizes training by providing consistent targets, it introduces a new challenge: the hard update period must be carefully tuned to achieve optimal performance. To address this issue, we propose two gradient-based target update methods: DQN with asymmetric gradient target tracking (AGT2-DQN) and DQN with symmetric gradient target tracking (SGT2-DQN). These methods replace the conventional hard target updates with continuous and structured updates using gradient descent, which effectively eliminates the need for manual tuning. We provide a theoretical analysis proving the convergence of these methods in tabular settings. Additionally, empirical evaluations demonstrate their advantages over standard DQN baselines, which suggest that gradient-based target updates can serve as an effective alternative to conventional target update mechanisms in Q-learning.
This paper presents a novel approach to motion planning for two-wheeled drones that can drive on the ground and fly in the air. Conventional methods for two-wheeled drone motion planning typically rely on gradient-based optimization and assume that obstacle shapes can be approximated by a differentiable form. To overcome this limitation, we propose a motion planning method based on Model Predictive Path Integral (MPPI) control, enabling navigation through arbitrarily shaped obstacles by switching between driving and flight modes. To handle the instability and rapid solution changes caused by mode switching, our proposed method switches the control space and utilizes the auxiliary controller for MPPI. Our simulation results demonstrate that the proposed method enables navigation in unstructured environments and achieves effective obstacle avoidance through mode switching.
This work introduces CTorch, a PyTorch-compatible, GPU-accelerated, and auto-differentiable projector toolbox designed to handle various CT geometries with configurable projector algorithms. CTorch provides flexible scanner geometry definition, supporting 2D fan-beam, 3D circular cone-beam, and 3D non-circular cone-beam geometries. Each geometry allows view-specific definitions to accommodate variations during scanning. Both flat- and curved-detector models may be specified to accommodate various clinical devices. CTorch implements four projector algorithms: voxel-driven, ray-driven, distance-driven (DD), and separable footprint (SF), allowing users to balance accuracy and computational efficiency based on their needs. All the projectors are primarily built using CUDA C for GPU acceleration, then compiled as Python-callable functions, and wrapped as PyTorch network module. This design allows direct use of PyTorch tensors, enabling seamless integration into PyTorch's auto-differentiation framework. These features make CTorch an flexible and efficient tool for CT imaging research, with potential applications in accurate CT simulations, efficient iterative reconstruction, and advanced deep-learning-based CT reconstruction.
This letter studies the impact of fluid antenna system (FAS) technology on the performance of unmanned aerial vehicle (UAV)-assisted multiuser communication networks. Specifically, we consider a scenario where a fixed-position antenna (FPA) base station (BS) serves K FAS-equipped users with the assistance of a UAV acting as an aerial relay. The BS employs rate-splitting multiple access (RSMA), while the UAV operates in half-duplex (HD) mode using the decode-and-forward (DF) strategy. For this system, we derive a compact analytical expression for the outage probability (OP) and its asymptotic behavior in the high signal-to-noise ratio (SNR) regime, leveraging the multivariate t-distribution. Our results show how deploying FAS at ground users (GUs) in UAV-aided communications improves overall system performance compared to using FPA GUs.
The rapidly growing demand for on-chip edge intelligence on resource-constrained devices has motivated approaches to reduce energy and latency of deep learning models. Spiking neural networks (SNNs) have gained particular interest due to their promise to reduce energy consumption using event-based processing. We assert that while sigma-delta encoding in SNNs can take advantage of the temporal redundancy across video frames, they still involve a significant amount of redundant computations due to processing insignificant events. In this paper, we propose a region masking strategy that identifies regions of interest at the input of the SNN, thereby eliminating computation and data movement for events arising from unimportant regions. Our approach demonstrates that masking regions at the input not only significantly reduces the overall spiking activity of the network, but also provides significant improvement in throughput and latency. We apply region masking during video object detection on Loihi 2, demonstrating that masking approximately 60% of input regions can reduce energy-delay product by 1.65x over a baseline sigma-delta network, with a degradation in mAP@0.5 by 1.09%.
Physics-informed machine learning provides an approach to combining data and governing physics laws for solving complex partial differential equations (PDEs). However, efficiently solving PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. We propose a hybrid framework that uses a neural network to learn B-spline control points to approximate solutions to PDEs with varying system and ICBC parameters. The proposed network can be trained efficiently as one can directly specify ICBCs without imposing losses, calculate physics-informed loss functions through analytical formulas, and requires only learning the weights of B-spline functions as opposed to both weights and basis as in traditional neural operator learning methods. We provide theoretical guarantees that the proposed B-spline networks serve as universal approximators for the set of solutions of PDEs with varying ICBCs under mild conditions and establish bounds on the generalization errors in physics-informed learning. We also demonstrate in experiments that the proposed B-spline network can solve problems with discontinuous ICBCs and outperforms existing methods, and is able to learn solutions of 3D dynamics with diverse initial conditions.
In this paper, we propose a novel federated framework for constructing the digital twin (DT) model, referring to a living and self-evolving visualization model empowered by artificial intelligence, enabled by distributed sensing under edge-cloud collaboration. In this framework, the DT model to be built at the cloud is regarded as a global one being split into and integrating from multiple functional components, i.e., partial-DTs, created at various edge servers (ESs) using feature data collected by associated sensors. Considering time-varying DT evolutions and heterogeneities among partial-DTs, we formulate an online problem that jointly and dynamically optimizes partial-DT assignments from the cloud to ESs, ES-sensor associations for partial-DT creation, and as well as computation and communication resource allocations for global-DT integration. The problem aims to maximize the constructed DT's model quality while minimizing all induced costs, including energy consumption and configuration costs, in long runs. To this end, we first transform the original problem into an equivalent hierarchical game with an upper-layer two-sided matching game and a lower-layer overlapping coalition formation game. After analyzing these games in detail, we apply the Gale-Shapley algorithm and particularly develop a switch rules-based overlapping coalition formation algorithm to obtain short-term equilibria of upper-layer and lower-layer subgames, respectively. Then, we design a deep reinforcement learning-based solution, called DMO, to extend the result into a long-term equilibrium of the hierarchical game, thereby producing the solution to the original problem. Simulations show the effectiveness of the introduced framework, and demonstrate the superiority of the proposed solution over counterparts.
This paper mainly addresses the distributed online optimization problem where the local objective functions are assumed to be convex or non-convex. First, the distributed algorithms are proposed for the convex and non-convex situations, where the one-point residual feedback technology is introduced to estimate gradient of local objective functions. Then the regret bounds of the proposed algorithms are derived respectively under the assumption that the local objective functions are Lipschitz or smooth, which implies that the regrets are sublinear. Finally, we give two numerical examples of distributed convex optimization and distributed resources allocation problem to illustrate the effectiveness of the proposed algorithm.
Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing and querying the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
Intercepting dynamic objects in uncertain environments involves a significant unresolved challenge in modern robotic systems. Current control approaches rely solely on estimated information, and results lack guarantees of robustness and feasibility. In this work, we introduce a novel method to tackle the interception of targets whose motion is affected by known and bounded uncertainty. Our approach introduces new techniques of reachability analysis for rigid bodies, leveraged to guarantee feasibility of interception under uncertain conditions. We then propose a Reachability-Guaranteed Optimal Control Problem, ensuring robustness and guaranteed reachability to a target set of configurations. We demonstrate the methodology in the case study of an interception maneuver of a tumbling target in space.
Vehicle cybersecurity has emerged as a critical concern, driven by the innovation in the automotive industry, e.g., automomous, electric, or connnected vehicles. Current efforts to address these challenges are constrained by the limited computational resources of vehicles and the reliance on connected infrastructures. This motivated the foundation of Vehicle Security Operations Centers (VSOCs) that extend IT-based Security Operations Centers (SOCs) to cover the entire automotive ecosystem, both the in-vehicle and off-vehicle scopes. Security Orchestration, Automation, and Response (SOAR) tools are considered key for impelementing an effective cybersecurity solution. However, existing state-of-the-art solutions depend on infrastructure networks such as 4G, 5G, and WiFi, which often face scalability and congestion issues. To address these limitations, we propose a novel SOAR architecture EVSOAR that leverages the EV charging stations for connectivity and computing to enhance vehicle cybersecurity. Our EV-specific SOAR architecture enables real-time analysis and automated responses to cybersecurity threats closer to the EV, reducing the cellular latency, bandwidth, and interference limitations. Our experimental results demonstrate a significant improvement in latency, stability, and scalability through the infrastructure and the capacity to deploy computationally intensive applications, that are otherwise infeasible within the resource constraints of individual vehicles.
Frequency response function (FRF) measurements are widely used in Gravitational Wave (GW) detectors, e.g., for the design of controllers, calibrating signals and diagnostic problems with system dynamics. The aim of this paper is to present GraFIT: a toolbox that enables fast, inexpensive, and accurate identification of FRF measurements for GW detectors compared to the commonly used approaches, including common spectral analysis techniques. The toolbox consists of a single function to estimate the frequency response function for both open-loop and closed-loop systems and for arbitrary input and output dimensions. The toolbox is validated on two experimental case studies of the Virgo detector, illustrating more than a factor 3 reduction in standard deviation of the estimate for the same measurement times, and comparable standard deviations with up to 10 times less data for the new method with respect to the currently implemented Spectral Analysis method.
Renewable power-to-hydrogen (ReP2H) systems require rectifiers to supply power to electrolyzers (ELZs). Two main types of rectifiers, insulated-gate bipolar transistor rectifiers (IGBT-Rs) and thyristor rectifiers (TRs), offer distinct tradeoffs. IGBT-Rs provide flexible reactive power control but are costly, whereas TRs are more affordable with lower power loss but consume a large amount of uncontrollable reactive power. A mixed configuration of rectifiers in utility-scale ReP2H systems could achieve an decent tradeoff and increase overall profitability. To explore this potential, this paper proposes an optimal investment portfolio model. First, we model and compare the active and reactive power characteristics of ELZs powered by TRs and IGBT-Rs. Second, we consider the investment of ELZs, rectifiers, and var resources and coordinate the operation of renewables, energy storage, var resources, and the on-off switching and load allocation of multiple ELZs. Subsequently, a two-stage stochastic programming (SP) model based on weighted information gap decision theory (W-IGDT) is developed to address the uncertainties of the renewable power and hydrogen price, and we apply the progressive hedging (PH) algorithm to accelerate its solution. Case studies demonstrate that optimal rectifier configurations increase revenue by at most 2.56% compared with using only TRs or IGBT-Rs, as well as those in existing projects. Under the optimal portfolio, reactive power compensation investment is nearly eliminated, with a preferred TR-to-IGBT-R ratio of 3:1.
Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are seen both or neither, which are encoded in the output event stream. Previous works consider recovering these two visual quantities as separate tasks, which does not fit with the nature of event cameras and neglects the inherent relations between both tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance), with a single network. Starting from the event generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity, which is further combined with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show that our model achieves state-of-the-art performance for both optical flow (achieves 20% and 25% improvement in EPE and AE respectively in the unsupervised learning category) and intensity estimation (produces competitive results with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all the other optical flow models and many of the image reconstruction models, while they output only one quantity. Project page: https://github.com/tub-rip/e2fai
Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver-monitoring, perfusion assessment in surgical settings, affective computing, and more. iPPG involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the heart rate or a full pulse waveform. Some previous iPPG methods impose model-based sparse priors on the pulse signals and use iterative optimization for pulse wave recovery, while others use end-to-end black-box deep learning methods. In contrast, we introduce methods that combine signal processing and deep learning methods in an inverse problem framework. Our methods estimate the underlying pulse signal and heart rate from facial video by learning deep-network-based denoising operators that leverage deep algorithm unfolding and deep equilibrium models. Experiments show that our methods can denoise an acquired signal from the face and infer the correct underlying pulse rate, achieving state-of-the-art heart rate estimation performance on well-known benchmarks, all with less than one-fifth the number of learnable parameters as the closest competing method.
Applied category theory often studies symmetric monoidal categories (SMCs) whose morphisms represent open systems. These structures naturally accommodate complex wiring patterns, leveraging (co)monoidal structures for splitting and merging wires, or compact closed structures for feedback. A key example is the compact closed SMC of design problems (DP), which enables a compositional approach to co-design in engineering. However, in practice, the systems of interest may not be fully known. Recently, Markov categories have emerged as a powerful framework for modeling uncertain processes. In this work, we demonstrate how to integrate this perspective into the study of open systems while preserving consistency with the underlying SMC structure. To this end, we employ the change-of-base construction for enriched categories, replacing the morphisms of a symmetric monoidal $\mathcal{V}$-category $\mathcal{C}$ with parametric maps $A \to \mathcal{C}(X,Y)$ in a Markov category induced by a symmetric monoidal monad. This results in a symmetric monoidal 2-category $N_*\mathcal{C}$ with the same objects as $\mathcal{C}$ and reparametrization 2-cells. By choosing different monads, we capture various types of uncertainty. The category underlying $\mathcal{C}$ embeds into $N_*\mathcal{C}$ via a strict symmetric monoidal functor, allowing (co)monoidal and compact closed structures to be transferred. Applied to DP, this construction leads to categories of practical relevance, such as parametrized design problems for optimization, and parametrized distributions of design problems for decision theory and Bayesian learning.
A flexible recommendation and retrieval system requires music similarity in terms of multiple partial elements of musical pieces to allow users to select the element they want to focus on. A method for music similarity learning using multiple networks with individual instrumental signals is effective but faces the problem that using each clean instrumental signal as a query is impractical for retrieval systems and using separated instrumental sounds reduces accuracy owing to artifacts. In this paper, we present instrumental-part-based music similarity learning with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we designed a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which are trained using the triplet loss with masks. Experimental results showed that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input in the evaluation of an instrument that had low accuracy, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human acceptance, especially when focusing on timbre.