New articles on Computer Science


[1] 2402.18576

Improved Forecasting Using a PSO-RDV Framework to Enhance Artificial Neural Network

Decision making and planning have long relied heavily on AI-driven forecasts. Governments and the general public seek to minimize risks while maximizing benefits in the face of potential future public health uncertainties. This study used an improved forecasting method based on the Random Descending Velocity Inertia Weight (RDV IW) technique to improve the convergence of Particle Swarm Optimization (PSO) and the accuracy of an Artificial Neural Network (ANN). The IW technique, inspired by the motion of a golf ball, modifies the particles' velocities so that they follow a parabolically descending profile as they approach the solution point. Simulation results revealed that the proposed forecasting model with the [0.4, 0.9] combination of alpha and alpha_dump exhibits a 6.36% improvement in position error and an 11.75% improvement in computational time compared to the old model, thus improving its convergence. It reached the optimum level in fewer steps, a 12.50% improvement over the old model, since it provides better velocity averages once speed stabilization occurs at the 24th iteration. Meanwhile, the computed p-values for NRMSE (0.04889174), MAE (0.02829063), MAPE (0.02226053), WAPE (0.01701545), and R2 (0.00000021) of the proposed algorithm are below the 0.05 level of significance, indicating a significant result in terms of accuracy performance. Applying the modified ANN-PSO with the RDV IW technique substantially improved the HIV/AIDS forecasting model compared with the two existing models.
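
A minimal sketch of the underlying idea, a PSO loop whose inertia weight descends parabolically over iterations, is given below. The exact RDV IW update rule and the roles of alpha and alpha_dump are defined in the paper; the decay schedule, toy objective, and all coefficients here are illustrative assumptions only.

```python
# Minimal PSO sketch with a parabolically descending inertia weight.
# The exact RDV IW update and the roles of alpha / alpha_dump are defined in
# the paper; the decay below is only an illustrative stand-in.
import numpy as np

def sphere(x):                      # toy objective to minimize
    return np.sum(x**2, axis=-1)

def pso(obj, dim=2, n_particles=30, iters=100, alpha=0.4, alpha_dump=0.9, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), obj(pos)
    gbest = pbest[np.argmin(pbest_val)].copy()

    for t in range(iters):
        # Assumed inertia schedule: starts near alpha_dump and descends
        # parabolically toward alpha as the swarm approaches convergence.
        w = alpha + (alpha_dump - alpha) * (1.0 - t / iters) ** 2
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        val = obj(pos)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()

print(pso(sphere))
```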


[2] 2402.18577

Motion Guided Token Compression for Efficient Masked Video Modeling

Recent developments in Transformers have achieved notable strides in enhancing video comprehension. Nonetheless, the O($N^2$) computational complexity of attention mechanisms presents substantial hurdles when dealing with the high dimensionality of videos. This challenge becomes particularly pronounced when striving to increase the frames per second (FPS) to enhance motion-capturing capabilities, a pursuit likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we begin by showcasing the enhanced performance achieved through an escalation in the FPS rate. Additionally, we present a novel approach, Motion Guided Token Compression (MGTC), to empower Transformer models to utilize a smaller yet more representative set of tokens for comprehensive video representation. This yields substantial reductions in computational burden and remains seamlessly adaptable to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and scrutinize the variance between patches in consecutive video frames across the temporal dimension. Tokens exhibiting a disparity below a predetermined threshold are then masked. Notably, this masking strategy effectively addresses video redundancy while conserving essential information. Our experiments, conducted on the widely examined video recognition datasets Kinetics-400, UCF101 and HMDB51, demonstrate that elevating the FPS rate yields significant top-1 accuracy improvements of over 1.6, 1.6 and 4.0 points, respectively. By implementing MGTC with a masking ratio of 25%, we further augment accuracy by 0.1 and simultaneously reduce computational costs by over 31% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains when compared to lower FPS settings.
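
A rough sketch of the masking idea described above follows: compute a per-patch change measure between consecutive frames and mask patches whose change falls below a threshold. The patch size, threshold, use of mean absolute difference as the change measure, and the downstream handling of masked tokens are assumptions, not the paper's settings.

```python
# Sketch of motion-guided token masking: tokens (patches) whose change
# relative to the previous frame falls below a threshold are masked.
# Patch size, change measure, and threshold are assumptions.
import numpy as np

def motion_guided_mask(video, patch=16, threshold=10.0):
    """video: (T, H, W, C) array. Returns a (T, H//patch, W//patch) boolean
    array where True marks tokens to KEEP (the first frame is always kept)."""
    T, H, W, C = video.shape
    v = video.astype(np.float32).reshape(T, H // patch, patch, W // patch, patch, C)
    # mean absolute difference between consecutive frames, per patch
    diff = np.abs(v[1:] - v[:-1]).mean(axis=(2, 4, 5))      # (T-1, H/p, W/p)
    keep = np.concatenate([np.ones((1,) + diff.shape[1:], bool),
                           diff >= threshold], axis=0)
    return keep

# Mostly static synthetic clip with a block sweeping to the right.
T, H, W = 8, 224, 224
video = np.zeros((T, H, W, 3), np.float32)
for t in range(T):
    video[t, :, 16 * t:16 * t + 32] = 255.0
mask = motion_guided_mask(video)
print("kept token ratio:", round(float(mask.mean()), 3))
```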


[3] 2402.18579

Wilcoxon Nonparametric CFAR Scheme for Ship Detection in SAR Image

Parametric constant false alarm rate (CFAR) detection algorithms, which are based on various statistical distributions such as the Gaussian, Gamma, Weibull, log-normal, G0 and alpha-stable distributions, are currently the most widely used methods for detecting ship targets in SAR images. However, the clutter background in SAR images is complicated and variable. When the actual clutter background deviates from the assumed statistical distribution, the performance of the parametric CFAR detector deteriorates. In addition to the parametric CFAR schemes, there is another class of nonparametric CFAR detectors that can maintain a constant false alarm rate for target detection without assuming a known clutter distribution. In this work, the Wilcoxon nonparametric CFAR scheme for ship detection in SAR images is proposed and analyzed, and a closed form of the false alarm rate for the Wilcoxon nonparametric detector to determine the decision threshold is presented. By comparison with several typical parametric CFAR schemes on Radarsat-2, ICEYE-X6 and Gaofen-3 SAR images, the robustness of the Wilcoxon nonparametric detector in maintaining a good false alarm performance across different detection backgrounds is revealed, and its detection performance for weak ships on rough sea surfaces is improved to some extent. Moreover, the Wilcoxon nonparametric detector can suppress false alarms resulting from sidelobes to some degree, and its detection speed is fast.
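
The sketch below illustrates the distribution-free principle behind rank-based CFAR on a single synthetic amplitude image: the test statistic is the rank of the cell under test within its reference window, so the decision threshold depends only on the window size and the desired false alarm rate, not on the clutter distribution. The paper's Wilcoxon detector aggregates ranks over multiple looks and derives a closed-form threshold; the single-look version, window geometry, and Pfa below are simplified stand-ins.

```python
# Rank-based (distribution-free) CFAR sketch: under i.i.d. clutter the rank of
# the cell under test among its reference cells is uniform, so the threshold
# follows from the window size and Pfa alone. A handful of false alarms on
# pure noise is expected at the chosen Pfa.
import numpy as np

def rank_cfar(img, guard=2, ref=6, pfa=0.01):
    H, W = img.shape
    detections = np.zeros((H, W), bool)
    half = guard + ref
    for i in range(half, H - half):
        for j in range(half, W - half):
            win = img[i - half:i + half + 1, j - half:j + half + 1].copy()
            win[ref:-ref, ref:-ref] = np.nan          # exclude guard + test cells
            refs = win[~np.isnan(win)]
            n = refs.size
            rank = np.sum(refs < img[i, j])           # rank statistic in [0, n]
            detections[i, j] = rank >= (n + 1) * (1 - pfa)
    return detections

img = np.random.rayleigh(1.0, (64, 64))
img[32, 32] = 12.0                                     # synthetic bright target
det = rank_cfar(img)
print("target detected:", det[32, 32], "| false alarms:", int(det.sum()) - int(det[32, 32]))
```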


[4] 2402.18581

Multi-objective Optimal Roadside Units Deployment in Urban Vehicular Networks

The significance of transportation efficiency, safety, and related services is increasing in urban vehicular networks. Within such networks, roadside units (RSUs) serve as intermediates in facilitating communication. Therefore, the deployment of RSUs is of utmost importance in ensuring the quality of communication services. However, the optimization objectives, such as time delay and deployment cost, are commonly developed from diverse perspectives. As a result, it is possible that conflicts may arise among the objectives. Furthermore, in urban environments, the presence of various obstacles, such as buildings, gardens, lakes, and other infrastructure, poses challenges for the deployment of RSUs. Hence, the deployment encounters significant difficulties due to the existence of multiple objectives, constraints imposed by obstacles, and the necessity to explore a large-scale optimization space. To address this issue, two versions of multi-objective optimization algorithms are proposed in this paper. By utilizing a multi-population strategy and an adaptive exploration technique, the methods efficiently explore a large-scale decision-variable space. In order to mitigate the issue of an overcrowded deployment of RSUs, a calibrating mechanism is adopted to adjust RSU density during the optimization procedures. The proposed methods also take care of data offloading between vehicles and RSUs by setting up an iterative best response sequence game (IBRSG). By comparing the proposed algorithms with several state-of-the-art algorithms, the results demonstrate that our strategies perform better in both high-density and low-density urban scenarios. The results also indicate that the proposed solutions substantially improve the efficiency of vehicular networks.


[5] 2402.18582

Streamlining the Selection Phase of Systematic Literature Reviews (SLRs) Using AI-Enabled GPT-4 Assistant API

The escalating volume of academic literature presents a formidable challenge in staying updated with the newest research developments. To address this challenge, the study introduces a pioneering AI-based tool, configured specifically to improve the efficiency of the article selection phase in Systematic Literature Reviews (SLRs). Utilizing the robust capabilities of OpenAI's GPT-4 Assistant API, the tool successfully homogenizes the article selection process across a broad array of academic disciplines. Implemented through a tripartite approach consisting of data preparation, AI-mediated article assessment, and structured result presentation, this tool significantly accelerates the time-consuming task of literature reviews. Importantly, this tool could be highly beneficial in fields such as management and economics, where the SLR process involves substantial human judgment. The adoption of a standard GPT model can substantially reduce potential biases and enhance the speed and precision of the SLR selection phase. This not only amplifies researcher productivity and accuracy but also denotes a considerable stride forward in the way academic research is conducted amidst the surging body of scholarly publications.


[6] 2402.18584

Adjusting Dynamics of Hopfield Neural Network via Time-variant Stimulus

As a paradigmatic model for nonlinear dynamics studies, the Hopfield Neural Network (HNN) demonstrates a high susceptibility to external disturbances owing to its intricate structure. This paper delves into the challenge of modulating HNN dynamics through time-variant stimuli. The effects of adjustments using two distinct types of time-variant stimuli, namely the Weight Matrix Stimulus (WMS) and the State Variable Stimulus (SVS), along with a Constant Stimulus (CS) are reported. The findings reveal that deploying four WMSs enables the HNN to generate either a four-scroll or a coexisting two-scroll attractor. When combined with one SVS, four WMSs can lead to the formation of an eight-scroll or four-scroll attractor, while the integration of four WMSs and multiple SVSs can induce grid-multi-scroll attractors. Moreover, the introduction of a CS and an SVS can significantly disrupt the dynamic behavior of the HNN. Consequently, suitable adjustment methods are crucial for enhancing the network's dynamics, whereas inappropriate applications can lead to the loss of its chaotic characteristics. To empirically validate these enhancement effects, the study employs an FPGA hardware platform. Subsequently, an image encryption scheme is designed to demonstrate the practical application benefits of the dynamically adjusted HNN in secure multimedia communication. This exploration into the dynamic modulation of HNN via time-variant stimuli offers insightful contributions to the advancement of secure communication technologies.
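
As a rough illustration of how a time-variant stimulus enters the dynamics, the sketch below integrates generic continuous Hopfield dynamics x' = -x + W tanh(x) + s(t) with a square-wave state-variable stimulus. The weight matrix, stimulus waveform, and parameters are placeholders; the paper's WMS/SVS configurations that produce multi-scroll attractors are not reproduced here.

```python
# Generic continuous Hopfield dynamics x' = -x + W tanh(x) + s(t), integrated
# with forward Euler. W and the square-wave state-variable stimulus s(t) are
# placeholders, not the paper's WMS/SVS settings.
import numpy as np

W = np.array([[ 1.0,  1.2, -1.5],
              [-1.1,  2.0,  1.0],
              [ 0.5, -1.8,  1.0]])          # placeholder weight matrix

def svs(t, amp=0.5, period=2.0):
    """Simple time-variant state-variable stimulus applied to neuron 0."""
    return np.array([amp * np.sign(np.sin(2 * np.pi * t / period)), 0.0, 0.0])

def simulate(x0, T=50.0, dt=1e-3):
    steps = int(T / dt)
    x = np.array(x0, float)
    traj = np.empty((steps, x.size))
    for k in range(steps):
        x = x + dt * (-x + W @ np.tanh(x) + svs(k * dt))
        traj[k] = x
    return traj

traj = simulate([0.1, 0.0, -0.1])
print("state range per neuron:", traj.min(axis=0), traj.max(axis=0))
```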


[7] 2402.18587

At the Dawn of Generative AI Era: A Tutorial-cum-Survey on New Frontiers in 6G Wireless Intelligence

The majority of data-driven wireless research leans heavily on discriminative AI (DAI) that requires vast real-world datasets. Unlike DAI, Generative AI (GenAI) pertains to generative models (GMs) capable of discerning the underlying data distribution, patterns, and features of the input data. This makes GenAI a crucial asset in the wireless domain, wherein real-world data is often scarce, incomplete, costly to acquire, and hard to model or comprehend. With these appealing attributes, GenAI can replace or supplement DAI methods in various capacities. Accordingly, this combined tutorial-survey paper commences with preliminaries of 6G and wireless intelligence by outlining candidate 6G applications and services, presenting a taxonomy of state-of-the-art DAI models, exemplifying prominent DAI use cases, and elucidating the multifaceted ways through which GenAI enhances DAI. Subsequently, we present a tutorial on GMs by spotlighting seminal examples such as generative adversarial networks, variational autoencoders, flow-based GMs, diffusion-based GMs, generative transformers, and large language models. Contrary to the prevailing belief that GenAI is a nascent trend, our exhaustive review of approximately 120 technical papers demonstrates the scope of research across core wireless research areas, including physical layer design; network optimization, organization, and management; network traffic analytics; cross-layer network security; and localization & positioning. Furthermore, we outline the central role of GMs in pioneering areas of 6G network research, including semantic/THz/near-field communications, ISAC, extremely large antenna arrays, digital twins, AI-generated content services, mobile edge computing and edge AI, adversarial ML, and trustworthy AI. Lastly, we shed light on the multifarious challenges ahead, suggesting potential strategies and promising remedies.


[8] 2402.18589

Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers

In this paper, we present the current progress of the Verif.ai project, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) that takes the top retrieved answers and generates answers with references to the papers from which each claim was derived, and (3) a verification engine that cross-checks the generated claim against the abstract or paper from which it was derived, verifying whether there were any hallucinations in generating the claim. We reinforce the generative model by providing the abstract in context, but in addition, an independent set of methods and models verifies the answer and checks for hallucinations. Therefore, we believe that our method can make scientists more productive while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated.


[9] 2402.18590

Exploring the Impact of Large Language Models on Recommender Systems: An Extensive Review

The paper underscores the significance of Large Language Models (LLMs) in reshaping recommender systems, attributing their value to unique reasoning abilities absent in traditional recommenders. Unlike conventional systems, LLMs exhibit exceptional proficiency in recommending items even without direct user interaction data, showcasing their adeptness in comprehending the intricacies of language. This marks a fundamental paradigm shift in the realm of recommendations. Amidst the dynamic research landscape, researchers actively harness the language comprehension and generation capabilities of LLMs to redefine the foundations of recommendation tasks. The investigation thoroughly explores the inherent strengths of LLMs within recommendation frameworks, encompassing nuanced contextual comprehension, seamless transitions across diverse domains, adoption of unified approaches, holistic learning strategies leveraging shared data reservoirs, transparent decision-making, and iterative improvements. Despite their transformative potential, challenges persist, including sensitivity to input prompts, occasional misinterpretations, and unforeseen recommendations, necessitating continuous refinement and evolution in LLM-driven recommender systems.


[10] 2402.18591

Stochastic contextual bandits with graph feedback: from independence number to MAS number

We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structures than vanilla contextual bandits, where taking an action reveals the rewards for all neighboring actions in the feedback graph under all contexts. Unlike the multi-armed bandits setting where a growing literature has painted a near-complete understanding of graph feedback, much remains unexplored in the contextual bandits counterpart. In this paper, we make inroads into this inquiry by establishing a regret lower bound $\Omega(\sqrt{\beta_M(G) T})$, where $M$ is the number of contexts, $G$ is the feedback graph, and $\beta_M(G)$ is our proposed graph-theoretical quantity that characterizes the fundamental learning limit for this class of problems. Interestingly, $\beta_M(G)$ interpolates between $\alpha(G)$ (the independence number of the graph) and $\mathsf{m}(G)$ (the maximum acyclic subgraph (MAS) number of the graph) as the number of contexts $M$ varies. We also provide algorithms that achieve near-optimal regrets for important classes of context sequences and/or feedback graphs, such as transitively closed graphs that find applications in auctions and inventory control. In particular, with many contexts, our results show that the MAS number completely characterizes the statistical complexity for contextual bandits, as opposed to the independence number in multi-armed bandits.
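
The two endpoints mentioned above can be illustrated by brute force on a small directed feedback graph: alpha(G) is the largest vertex set with no feedback edges between its members, and m(G) is the largest vertex set inducing an acyclic subgraph. The quantity beta_M(G) itself is defined in the paper and is not reproduced in this sketch.

```python
# Brute-force illustration of alpha(G) (independence number) and m(G) (MAS
# number) on a small directed feedback graph. beta_M(G) is defined in the
# paper and not reproduced here.
from itertools import combinations

def is_independent(vs, edges):
    return not any(u in vs and v in vs for (u, v) in edges)

def induces_dag(vs, edges):
    sub = {v: [w for (u, w) in edges if u == v and w in vs] for v in vs}
    seen, onstack = set(), set()
    def dfs(v):                      # detect a directed cycle in the induced subgraph
        seen.add(v); onstack.add(v)
        for w in sub[v]:
            if w in onstack or (w not in seen and dfs(w)):
                return True
        onstack.discard(v)
        return False
    return not any(dfs(v) for v in vs if v not in seen)

def largest(vertices, edges, test):
    for k in range(len(vertices), 0, -1):
        for vs in combinations(vertices, k):
            if test(set(vs), edges):
                return k
    return 0

# Directed feedback graph: playing u also reveals the reward of v for each edge (u, v).
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 0), (2, 3)]
print("independence number alpha(G):", largest(V, E, is_independent))  # 2
print("MAS number m(G):             ", largest(V, E, induces_dag))     # 3
```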


[11] 2402.18592

A$^3$PIM: An Automated, Analytic and Accurate Processing-in-Memory Offloader

The performance gap between memory and processor has grown rapidly. Consequently, the energy and wall-clock time costs associated with moving data between the CPU and main memory dominate the overall computational cost. The Processing-in-Memory (PIM) paradigm emerges as a promising architecture that mitigates the need for extensive data movement by strategically positioning computing units close to the memory. Despite the abundant efforts devoted to building a robust and highly-available PIM system, identifying PIM-friendly segments of applications poses significant challenges due to the lack of a comprehensive tool to evaluate the intrinsic memory access pattern of each segment. To tackle this challenge, we propose A$^3$PIM: an Automated, Analytic and Accurate Processing-in-Memory offloader. We systematically consider the cross-segment data movement and the intrinsic memory access pattern of each code segment via a static code analyzer. We evaluate A$^3$PIM across a wide range of real-world workloads, including the GAP and PrIM benchmarks, and achieve average speedups of 2.63x and 4.45x (up to 7.14x and 10.64x) when compared to CPU-only and PIM-only executions, respectively.


[12] 2402.18593

Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale

As research and deployment of AI grows, the computational burden to support and sustain its progress inevitably does too. To train or fine-tune state-of-the-art models in NLP, computer vision, etc., some form of AI hardware acceleration is virtually a requirement. Recent large language models require considerable resources to train and deploy, resulting in significant energy usage, potential carbon emissions, and massive demand for GPUs and other hardware accelerators. However, this surge carries large implications for energy sustainability at the HPC/datacenter level. In this paper, we study the aggregate effect of power-capping GPUs on GPU temperature and power draw at a research supercomputing center. With the right amount of power-capping, we show significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware life-span with minimal impact on job performance. While power-capping reduces power draw by design, the aggregate system-wide effect on overall energy consumption is less clear; for instance, if users notice job performance degradation from GPU power-caps, they may request additional GPU-jobs to compensate, negating any energy savings or even worsening energy consumption. To our knowledge, our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale. We hope our work will inspire HPCs/datacenters to further explore, evaluate, and communicate the impact of power-capping AI hardware accelerators for more sustainable AI.
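
For context, the sketch below shows how a power cap can be read and applied through NVML (via the pynvml bindings), alongside the temperature and power-draw readings such a study would monitor. Setting the limit normally requires administrative privileges, and the 250 W cap is an arbitrary example rather than the paper's configuration.

```python
# Sketch: read GPU power draw and temperature through NVML and apply a power
# cap. The 250 W value is illustrative only; setting the limit usually
# requires root privileges.
import pynvml

pynvml.nvmlInit()
try:
    cap_watts = 250
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0          # mW -> W
        temp_c = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)
        print(f"GPU {i}: {power_w:.1f} W, {temp_c} C, cap range {lo/1000:.0f}-{hi/1000:.0f} W")
        target = min(max(cap_watts * 1000, lo), hi)                   # clamp to valid range
        pynvml.nvmlDeviceSetPowerManagementLimit(h, target)           # needs privileges
finally:
    pynvml.nvmlShutdown()
```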


[13] 2402.18595

EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration

Deep neural networks (DNNs) have achieved great breakthroughs in many fields such as image classification and natural language processing. However, the execution of DNNs needs to conduct massive numbers of multiply-accumulate (MAC) operations on hardware and thus incurs a large power consumption. To address this challenge, we propose a novel digital MAC design based on encoding. In this new design, the multipliers are replaced by simple logic gates to project the results onto a wide bit representation. These bits carry individual position weights, which can be trained for specific neural networks to enhance inference accuracy. The outputs of the new multipliers are added by bit-wise weighted accumulation and the accumulation results are compatible with existing computing platforms accelerating neural networks with either uniform or non-uniform quantization. Since the multiplication function is replaced by simple logic projection, the critical paths in the resulting circuits become much shorter. Correspondingly, pipelining stages in the MAC array can be reduced, leading to a significantly smaller area as well as a better power efficiency. The proposed design has been synthesized and verified by ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet. The experimental results confirmed the reduction of circuit area by up to 79.63% and the reduction of power consumption of executing DNNs by up to 70.18%, while the accuracy of the neural networks can still be well maintained.


[14] 2402.18596

Image-To-Mesh Conversion for Biomedical Simulations

Converting a three-dimensional medical image into a 3D mesh that satisfies both the quality and fidelity constraints of predictive simulations and image-guided surgical procedures remains a critical problem. Presented is an image-to-mesh conversion method called CBC3D. It first discretizes a segmented image by generating an adaptive Body-Centered Cubic (BCC) mesh of high-quality elements. Next, the tetrahedral mesh is converted into a mixed-element mesh of tetrahedra, pentahedra, and hexahedra to decrease element count while maintaining quality. Finally, the mesh surfaces are deformed to their corresponding physical image boundaries, improving the mesh's fidelity. The deformation scheme builds upon the ITK open-source library and is based on the concept of energy minimization, relying on a multi-material point-based registration. It uses non-connectivity patterns to implicitly control the number of extracted feature points needed for the registration and, thus, adjusts the trade-off between the achieved mesh fidelity and the deformation speed. We compare CBC3D with four widely used and state-of-the-art homegrown image-to-mesh conversion methods from industry and academia. Results indicate that the CBC3D meshes (i) achieve high fidelity, (ii) keep the element count reasonably low, and (iii) exhibit good element quality.


[15] 2402.18599

Meta-Tasks: An alternative view on Meta-Learning Regularization

Few-shot learning (FSL) is a challenging machine learning problem due to the scarcity of labeled data. The ability to generalize effectively on both novel and training tasks is a significant barrier to FSL. This paper proposes a novel solution that can generalize to both training and novel tasks while also utilizing unlabeled samples. The method refines the embedding model before updating the outer loop using unsupervised techniques as "meta-tasks". The experimental results show that the proposed method performs well on both novel and training tasks, with faster and better convergence and lower generalization error and standard deviation, indicating its potential for practical applications in FSL; it outperforms prototypical networks by 3.9%.


[16] 2402.18603

MMSR: Symbolic Regression is a Multimodal Task

Mathematical formulas are the crystallization of human wisdom accumulated over thousands of years of exploring the laws of nature. Describing the complex laws of nature with a concise mathematical formula is a constant pursuit of scientists and a great challenge for artificial intelligence; this field is called symbolic regression (SR). Symbolic regression was originally formulated as a combinatorial optimization problem, and genetic programming (GP) and reinforcement learning algorithms were used to solve it. However, GP is sensitive to hyperparameters, and both types of algorithms are inefficient. To address this, researchers treat the mapping from data to expressions as a translation problem and introduce correspondingly large-scale pre-trained models. However, data and expression skeletons do not have the clear word-level correspondences that two natural languages do; they are more like two modalities (e.g., image and text). Therefore, in this paper we propose MMSR, which solves the SR problem as a purely multimodal task and introduces contrastive learning during training for modal alignment, facilitating later modal feature fusion. Notably, to better promote modal feature fusion, we train the contrastive learning loss and the other losses at the same time, requiring only one-step training, instead of training the contrastive loss first and the other losses afterwards; our experiments show that joint training lets the feature extraction and feature fusion modules adapt to each other better. Experimental results show that, compared with multiple large-scale pre-training baselines, MMSR achieves state-of-the-art results on multiple mainstream datasets, including SRBench.
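
A minimal sketch of the one-step joint training idea, summing an InfoNCE-style alignment loss between data features and expression features with the task loss in a single backward pass, is given below. The encoders, loss weight, and placeholder task loss are assumptions, not the paper's architecture.

```python
# Sketch of one-step joint training: a contrastive loss aligns data features
# with expression features while the task loss is optimized in the same
# backward pass. Encoders, weights, and the task loss are placeholders.
import torch
import torch.nn.functional as F

data_enc = torch.nn.Linear(64, 128)        # stand-in for the data encoder
expr_enc = torch.nn.Linear(32, 128)        # stand-in for the expression encoder
opt = torch.optim.Adam(list(data_enc.parameters()) + list(expr_enc.parameters()), lr=1e-3)

def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                      # (B, B) similarity matrix
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

for step in range(100):
    x_data = torch.randn(16, 64)                  # batch of numeric (x, y) features
    x_expr = torch.randn(16, 32)                  # batch of expression-skeleton features
    za, zb = data_enc(x_data), expr_enc(x_expr)
    task_loss = za.pow(2).mean()                  # placeholder for the generation loss
    loss = task_loss + 0.1 * info_nce(za, zb)     # single combined objective
    opt.zero_grad(); loss.backward(); opt.step()
```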


[17] 2402.18605

FORML: A Riemannian Hessian-free Method for Meta-learning with Orthogonality Constraint

The meta-learning problem is usually formulated as a bi-level optimization in which the task-specific parameters and the meta-parameters are updated in the inner and outer loops of the optimization, respectively. However, performing the optimization in Riemannian space, where the parameters and meta-parameters lie on Riemannian manifolds, is computationally intensive. Unlike Euclidean methods, Riemannian backpropagation requires computing second-order derivatives that include backward computations through Riemannian operators such as retraction and orthogonal projection. This paper introduces a Hessian-free approach that uses a first-order approximation of derivatives on the Stiefel manifold. Our method significantly reduces the computational load and memory footprint. We show how using a Stiefel fully-connected layer, which enforces an orthogonality constraint on the parameters of the last classification layer as the head of the backbone network, strengthens the representation reuse of gradient-based meta-learning methods. Our experimental results across various few-shot learning datasets demonstrate the superiority of the proposed method compared to state-of-the-art methods, especially MAML, its Euclidean counterpart.
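
For reference, the two Stiefel-manifold operators named above, orthogonal projection of a Euclidean gradient onto the tangent space and a QR-based retraction, are sketched below in numpy. The paper's first-order, Hessian-free meta-update itself is not reproduced.

```python
# The two Stiefel-manifold operators mentioned in the abstract: tangent-space
# projection of a Euclidean gradient and a QR-based retraction.
import numpy as np

def project_tangent(W, G):
    """Project Euclidean gradient G onto the tangent space of the Stiefel
    manifold at W (W^T W = I):  G - W sym(W^T G)."""
    WtG = W.T @ G
    return G - W @ ((WtG + WtG.T) / 2.0)

def retract_qr(W, step):
    """QR retraction: map W + step back onto the manifold."""
    Q, R = np.linalg.qr(W + step)
    return Q * np.sign(np.diag(R))       # fix column signs for uniqueness

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # a point on St(8, 3)
G = rng.standard_normal((8, 3))                    # a Euclidean gradient
W_next = retract_qr(W, -0.1 * project_tangent(W, G))
print("orthogonality error:", np.linalg.norm(W_next.T @ W_next - np.eye(3)))
```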


[18] 2402.18606

Impact of network topology on the performance of Decentralized Federated Learning

Fully decentralized learning is gaining momentum for training AI models at the Internet's edge, addressing infrastructure challenges and privacy concerns. In a decentralized machine learning system, data is distributed across multiple nodes, with each node training a local model based on its respective dataset. The local models are then shared and combined to form a global model capable of making accurate predictions on new data. Our exploration focuses on how different types of network structures influence the spreading of knowledge - the process by which nodes incorporate insights gained from learning patterns in data available on other nodes across the network. Specifically, this study investigates the intricate interplay between network structure and learning performance using three network topologies and six data distribution methods. These methods consider different vertex properties, including degree centrality, betweenness centrality, and clustering coefficient, along with whether nodes exhibit high or low values of these metrics. Our findings underscore the significance of global centrality metrics (degree, betweenness) in correlating with learning performance, while local clustering proves less predictive. We highlight the challenges in transferring knowledge from peripheral to central nodes, attributed to a dilution effect during model aggregation. Additionally, we observe that central nodes exert a pull effect, facilitating the spread of knowledge. In examining degree distribution, hubs in Barabasi-Albert networks positively impact learning for central nodes but exacerbate dilution when knowledge originates from peripheral nodes. Finally, we demonstrate the formidable challenge of knowledge circulation outside of segregated communities.


[19] 2402.18607

Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective

Diffusion models have recently gained significant attention in both academia and industry due to their impressive generative performance in terms of both sampling quality and distribution coverage. Accordingly, proposals are made for sharing pre-trained diffusion models across different organizations, as a way of improving data utilization while enhancing privacy protection by avoiding sharing private data directly. However, the potential risks associated with such an approach have not been comprehensively examined. In this paper, we take an adversarial perspective to investigate the potential privacy and fairness risks associated with the sharing of diffusion models. Specifically, we investigate the circumstances in which one party (the sharer) trains a diffusion model using private data and provides another party (the receiver) black-box access to the pre-trained model for downstream tasks. We demonstrate that the sharer can execute fairness poisoning attacks to undermine the receiver's downstream models by manipulating the training data distribution of the diffusion model. Meanwhile, the receiver can perform property inference attacks to reveal the distribution of sensitive features in the sharer's dataset. Our experiments conducted on real-world datasets demonstrate remarkable attack performance on different types of diffusion models, which highlights the critical importance of robust data auditing and privacy protection protocols in pertinent applications.


[20] 2402.18609

ICE-SEARCH: A Language Model-Driven Feature Selection Approach

This study unveils the In-Context Evolutionary Search (ICE-SEARCH) method, the first work that melds language models (LMs) with evolutionary algorithms for feature selection (FS) tasks and demonstrates its effectiveness in Medical Predictive Analytics (MPA) applications. ICE-SEARCH harnesses the crossover and mutation capabilities inherent in LMs within an evolutionary framework, significantly improving FS through the model's comprehensive world knowledge and its adaptability to a variety of roles. Our evaluation of this methodology spans three crucial MPA tasks: stroke, cardiovascular disease, and diabetes, where ICE-SEARCH outperforms traditional FS methods in pinpointing essential features for medical applications. ICE-SEARCH achieves State-of-the-Art (SOTA) performance in stroke prediction and diabetes prediction; the Decision-Randomized ICE-SEARCH ranks as SOTA in cardiovascular disease prediction. Our results not only demonstrate the efficacy of ICE-SEARCH in medical FS but also underscore the versatility, efficiency, and scalability of integrating LMs in FS tasks. The study emphasizes the critical role of incorporating domain-specific insights, illustrating ICE-SEARCH's robustness, generalizability, and swift convergence. This opens avenues for further research into comprehensive and intricate FS landscapes, marking a significant stride in the application of artificial intelligence in medical predictive analytics.


[21] 2402.18610

Why Attention Graphs Are All We Need: Pioneering Hierarchical Classification of Hematologic Cell Populations with LeukoGraph

In the complex landscape of hematologic samples such as peripheral blood or bone marrow, cell classification, delineating diverse populations into a hierarchical structure, presents profound challenges. This study presents LeukoGraph, a recently developed framework designed explicitly for this purpose, employing graph attention networks (GATs) to navigate the complexities of hierarchical classification (HC). Notably, LeukoGraph stands as a pioneering effort, marking the application of graph neural networks (GNNs) for hierarchical inference on graphs, accommodating up to one million nodes and millions of edges, all derived from flow cytometry data. LeukoGraph addresses a classification paradigm where, for example, four different cell populations undergo flat categorization while a fifth diverges into two distinct child branches, exemplifying the nuanced hierarchical structure inherent in complex datasets; the technique is more general than this example. A hallmark achievement of LeukoGraph is its F-score of 98%, significantly outclassing prevailing state-of-the-art methodologies. Crucially, LeukoGraph's prowess extends beyond theoretical innovation, showcasing remarkable precision in predicting both flat and hierarchical cell types across flow cytometry datasets from 30 distinct patients. This precision is further underscored by LeukoGraph's ability to maintain a correct label ratio, despite the inherent challenges posed by hierarchical classifications.


[22] 2402.18614

Deep Neural Network Models Trained With A Fixed Random Classifier Transfer Better Across Domains

The recently discovered Neural Collapse (NC) phenomenon states that the last-layer weights of Deep Neural Networks (DNNs) converge to the so-called Equiangular Tight Frame (ETF) simplex at the terminal phase of their training. This ETF geometry is equivalent to vanishing within-class variability of the last-layer activations. Inspired by NC properties, we explore in this paper the transferability of DNN models trained with their last-layer weights fixed according to the ETF. This enforces class separation by eliminating class covariance information, effectively providing implicit regularization. We show that DNN models trained with such a fixed classifier significantly improve transfer performance, particularly on out-of-domain datasets. On a broad range of fine-grained image classification datasets, our approach outperforms i) baseline methods that do not perform any covariance regularization (up to 22%), as well as ii) methods that explicitly whiten covariance of activations throughout training (up to 19%). Our findings suggest that DNNs trained with fixed ETF classifiers offer a powerful mechanism for improving transfer learning across domains.
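
A short sketch of the standard simplex-ETF construction used as a fixed, non-trainable classifier head is given below; the feature dimension, class count, and surrounding training setup are placeholders rather than the paper's configuration.

```python
# Standard simplex-ETF construction used as a fixed, non-trainable classifier
# head; feature dimension, class count, and backbone are placeholders.
import torch

def simplex_etf(num_classes: int, feat_dim: int) -> torch.Tensor:
    """Return a (num_classes, feat_dim) ETF weight matrix (requires feat_dim >= num_classes)."""
    U = torch.linalg.qr(torch.randn(feat_dim, num_classes)).Q       # orthonormal columns
    M = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    W = (num_classes / (num_classes - 1)) ** 0.5 * (U @ M)          # (feat_dim, C)
    return W.t()                                                    # rows = class vectors

C, d = 10, 64
head = torch.nn.Linear(d, C, bias=False)
head.weight.data = simplex_etf(C, d)
head.weight.requires_grad_(False)          # classifier stays fixed during training

# Pairwise cosines between class vectors should all equal -1/(C-1).
Wn = torch.nn.functional.normalize(head.weight, dim=1)
print((Wn @ Wn.t())[0, 1].item(), -1 / (C - 1))
```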


[23] 2402.18616

JCLEC-MO: a Java suite for solving many-objective optimization engineering problems

Although metaheuristics have been widely recognized as efficient techniques to solve real-world optimization problems, implementing them from scratch remains difficult for domain-specific experts without programming skills. In this scenario, metaheuristic optimization frameworks are a practical alternative, as they provide a variety of algorithms composed of customized elements, as well as experimental support. Recently, many engineering problems require optimizing multiple or even many objectives, increasing the interest in appropriate metaheuristic algorithms and frameworks that might integrate new specific requirements while maintaining the generality and reusability principles they were conceived for. Based on this idea, this paper introduces JCLEC-MO, a Java framework for both multi- and many-objective optimization that enables engineers to apply, or adapt, a great number of multi-objective algorithms with little coding effort. A case study is developed and explained to show how JCLEC-MO can be used to address many-objective engineering problems, often requiring the inclusion of domain-specific elements, and to analyze experimental outcomes by means of conveniently connected R utilities.


[24] 2402.18617

ELA: Exploited Level Augmentation for Offline Learning in Zero-Sum Games

Offline learning has become widely used due to its ability to derive effective policies from offline datasets gathered by expert demonstrators without interacting with the environment directly. Recent research has explored various ways to enhance offline learning efficiency by considering the characteristics (e.g., expertise level or multiple demonstrators) of the dataset. However, a different approach is necessary in the context of zero-sum games, where outcomes vary significantly based on the strategy of the opponent. In this study, we introduce a novel approach that uses unsupervised learning techniques to estimate the exploited level of each trajectory from the offline dataset of zero-sum games made by diverse demonstrators. Subsequently, we incorporate the estimated exploited level into the offline learning to maximize the influence of the dominant strategy. Our method enables interpretable exploited level estimation in multiple zero-sum games and effectively identifies dominant strategy data. Also, our exploited level augmented offline learning significantly enhances the original offline learning algorithms including imitation learning and offline reinforcement learning for zero-sum games.


[25] 2402.18618

Urban Green Index estimation based on data collected by remote sensing for Romanian cities

The modernization of official statistics involves the use of new data sources, such as data collected through remote sensing. This document describes how an urban green index, derived from the SDG 11.7 objective, was obtained for Romania's 41 county seat cities based on free data sets collected by remote sensing from the European and North American space agencies. The main result is an estimate of the area covered with vegetation for the 40 county seat towns and the municipality of Bucharest, relative to the total surface area. To estimate the vegetated area, we used two data sets obtained by remote sensing, namely data provided by the MODIS mission on the TERRA satellite and data provided by the Sentinel-2 mission of the Copernicus space program. Based on the results obtained, namely the surface area covered with vegetation, estimated in square kilometers, and its percentage of the total surface area (the urban green index), we have produced a national ranking of the county seat cities.


[26] 2402.18621

Unveiling News Publishers Trustworthiness Through Social Interactions

With the primary goal of raising readers' awareness of misinformation phenomena, extensive efforts have been made by both academic institutions and independent organizations to develop methodologies for assessing the trustworthiness of online news publishers. Unfortunately, existing approaches are costly and face critical scalability challenges. This study presents a novel framework for assessing the trustworthiness of online news publishers using user interactions on social media platforms. The proposed methodology provides a versatile solution that serves the dual purpose of i) identifying verifiable online publishers and ii) automatically performing an initial estimation of the trustworthiness of previously unclassified online news outlets.


[27] 2402.18630

GNSS Positioning using Cost Function Regulated Multilateration and Graph Neural Networks

In urban environments, where line-of-sight signals from GNSS satellites are frequently blocked by high-rise objects, GNSS receivers are subject to large errors in measuring satellite ranges. Heuristic methods are commonly used to estimate these errors and reduce the impact of noisy measurements on localization accuracy. In our work, we replace these error estimation heuristics with a deep learning model based on Graph Neural Networks. Additionally, by analyzing the cost function of the multilateration process, we derive an optimal method to utilize the estimated errors. Our approach guarantees that the multilateration converges to the receiver's location as the error estimation accuracy increases. We evaluate our solution on a real-world dataset containing more than 100k GNSS epochs, collected from multiple cities with diverse characteristics. The empirical results show improvements from 40% to 80% in the horizontal localization error against recent deep learning baselines as well as classical localization approaches.
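
The sketch below shows a plain Gauss-Newton multilateration step in which each pseudorange is corrected by an estimated error and weighted by a per-satellite confidence, which is the role such learned error estimates could play. The GNN that would produce the estimates and the paper's cost-function regulation are not reproduced, and all numbers are synthetic.

```python
# Gauss-Newton multilateration sketch: pseudoranges are corrected by estimated
# errors and weighted by per-satellite confidences. All numbers are synthetic.
import numpy as np

def solve_position(sat_pos, pseudoranges, err_est, weights, iters=10):
    """sat_pos: (N, 3); pseudoranges, err_est, weights: (N,).
    Returns the estimated receiver position (3,) and clock bias (meters)."""
    x = np.zeros(4)                              # [px, py, pz, clock_bias]
    rho = pseudoranges - err_est                 # corrected measurements
    W = np.diag(weights)
    for _ in range(iters):
        d = np.linalg.norm(sat_pos - x[:3], axis=1)
        residual = rho - (d + x[3])
        J = np.hstack([-(sat_pos - x[:3]) / d[:, None], np.ones((len(d), 1))])
        dx, *_ = np.linalg.lstsq(J.T @ W @ J, J.T @ W @ residual, rcond=None)
        x += dx
    return x[:3], x[3]

rng = np.random.default_rng(1)
true_pos, true_bias = np.array([1000.0, -2000.0, 500.0]), 30.0
sats = rng.uniform(-2e7, 2e7, (8, 3)) + np.array([0, 0, 2e7])
ranges = np.linalg.norm(sats - true_pos, axis=1) + true_bias + rng.normal(0, 5, 8)
pos, bias = solve_position(sats, ranges, err_est=np.zeros(8), weights=np.ones(8))
print("position error [m]:", np.linalg.norm(pos - true_pos))
```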


[28] 2402.18649

A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems

Large Language Model (LLM) systems are inherently compositional, with an individual LLM serving as the core foundation and additional layers of objects such as plugins, sandboxes, and so on. Along with the great potential, there are also increasing concerns over the security of such probabilistic intelligent systems. However, existing studies on LLM security often focus on individual LLMs without examining the ecosystem through the lens of LLM systems with other objects (e.g., Frontend, Webtool, Sandbox, and so on). In this paper, we systematically analyze the security of LLM systems instead of focusing on individual LLMs. To do so, we build on top of information flow and formulate the security of LLM systems as constraints on the alignment of information flow within the LLM and between the LLM and other objects. Based on this construction and the unique probabilistic nature of LLMs, the attack surface of an LLM system can be decomposed into three key components: (1) multi-layer security analysis, (2) analysis of the existence of constraints, and (3) analysis of the robustness of these constraints. To ground this new attack surface, we propose a multi-layer and multi-step approach and apply it to the state-of-the-art LLM system, OpenAI GPT4. Our investigation exposes several security issues, not just within the LLM model itself but also in its integration with other components. We find that although OpenAI GPT4 has numerous safety constraints designed to improve its safety features, these safety constraints are still vulnerable to attackers. To further demonstrate the real-world threats of our discovered vulnerabilities, we construct an end-to-end attack in which an adversary can illicitly acquire the user's chat history, all without the need to manipulate the user's input or gain direct access to OpenAI GPT4. Our demo is available at: https://fzwark.github.io/LLM-System-Attack-Demo/


[29] 2402.18650

The Grasp Reset Mechanism: An Automated Apparatus for Conducting Grasping Trials

Advancing robotic grasping and manipulation requires the ability to test algorithms and/or train learning models on large numbers of grasps. Towards the goal of more advanced grasping, we present the Grasp Reset Mechanism (GRM), a fully automated apparatus for conducting large-scale grasping trials. The GRM automates the process of resetting a grasping environment, repeatably placing an object in a fixed location and controllable 1-D orientation. It also collects data and swaps between multiple objects enabling robust dataset collection with no human intervention. We also present a standardized state machine interface for control, which allows for integration of most manipulators with minimal effort. In addition to the physical design and corresponding software, we include a dataset of 1,020 grasps. The grasps were created with a Kinova Gen3 robot arm and Robotiq 2F-85 Adaptive Gripper to enable training of learning models and to demonstrate the capabilities of the GRM. The dataset includes ranges of grasps conducted across four objects and a variety of orientations. Manipulator states, object pose, video, and grasp success data are provided for every trial.


[30] 2402.18651

Quantifying Human Priors over Social and Navigation Networks

Human knowledge is largely implicit and relational -- do we have a friend in common? can I walk from here to there? In this work, we leverage the combinatorial structure of graphs to quantify human priors over such relational data. Our experiments focus on two domains that have been continuously relevant over evolutionary timescales: social interaction and spatial navigation. We find that some features of the inferred priors are remarkably consistent, such as the tendency for sparsity as a function of graph size. Other features are domain-specific, such as the propensity for triadic closure in social interactions. More broadly, our work demonstrates how nonclassical statistical analysis of indirect behavioral experiments can be used to efficiently model latent biases in the data.


[31] 2402.18652

Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions

Read disturbance in modern DRAM chips is a widespread phenomenon and is reliably used for breaking memory isolation, a fundamental building block for building robust systems. RowHammer and RowPress are two examples of read disturbance in DRAM where repeatedly accessing (hammering) or keeping active (pressing) a memory location induces bitflips in other memory locations. Unfortunately, shrinking technology node size exacerbates read disturbance in DRAM chips over generations. As a result, existing defense mechanisms suffer from significant performance and energy overheads, limited effectiveness, or prohibitively high hardware complexity. In this paper, we tackle these shortcomings by leveraging the spatial variation in read disturbance across different memory locations in real DRAM chips. To do so, we 1) present the first rigorous real DRAM chip characterization study of spatial variation of read disturbance and 2) propose Svärd, a new mechanism that dynamically adapts the aggressiveness of existing solutions based on the row-level read disturbance profile. Our experimental characterization on 144 real DDR4 DRAM chips representing 10 chip designs demonstrates a large variation in read disturbance vulnerability across different memory locations: in the part of memory with the worst read disturbance vulnerability, 1) up to 2x the number of bitflips can occur and 2) bitflips can occur at an order of magnitude fewer accesses, compared to the memory locations with the least vulnerability to read disturbance. Svärd leverages this variation to reduce the overheads of five state-of-the-art read disturbance solutions, and thus significantly increases system performance.
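
As a toy illustration of the adaptation idea (not the paper's mechanism), the sketch below replaces a single worst-case activation threshold with per-row thresholds scaled from a measured per-row vulnerability profile; all numbers are synthetic placeholders.

```python
# Toy sketch of row-level adaptation (not the paper's mechanism): replace one
# worst-case activation threshold with per-row thresholds derived from a
# measured per-row vulnerability profile. All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)
num_rows = 1 << 16
# Synthetic profile: minimum hammer count at which each row's first bitflip occurs.
hc_first = rng.lognormal(mean=10.5, sigma=0.4, size=num_rows)

SAFETY_MARGIN = 0.5                                           # act well before failure
static_threshold = int(hc_first.min() * SAFETY_MARGIN)        # one value for every row
adaptive_threshold = (hc_first * SAFETY_MARGIN).astype(int)   # one value per row

def needs_preventive_action(row, activation_count, thresholds=adaptive_threshold):
    """Trigger the underlying defense (e.g., a preventive refresh) for this row."""
    return activation_count >= thresholds[row]

print("static threshold (all rows):", static_threshold)
print("adaptive thresholds p1/p50/p99:",
      *(int(np.percentile(adaptive_threshold, p)) for p in (1, 50, 99)))
```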


[32] 2402.18659

Large Language Models and Games: A Survey and Roadmap

Recent years have seen an explosive increase in research on large language models (LLMs), and accompanying public engagement on the topic. While starting as a niche area within natural language processing, LLMs have shown remarkable potential across a broad range of applications and domains, including games. This paper surveys the current state of the art across the various applications of LLMs in and for games, and identifies the different roles LLMs can take within a game. Importantly, we discuss underexplored areas and promising directions for future uses of LLMs in games and we reconcile the potential and limitations of LLMs within the games domain. As the first comprehensive survey and roadmap at the intersection of LLMs and games, we are hopeful that this paper will serve as the basis for groundbreaking research and innovation in this exciting new field.


[33] 2402.18660

Versatile mixed methods for compressible flows

Versatile mixed finite element methods were originally developed by Chen and Williams for isothermal incompressible flows in "Versatile mixed methods for the incompressible Navier-Stokes equations," Computers & Mathematics with Applications, Volume 80, 2020. Thereafter, these methods were extended by Miller, Chen, and Williams to non-isothermal incompressible flows in "Versatile mixed methods for non-isothermal incompressible flows," Computers & Mathematics with Applications, Volume 125, 2022. The main advantage of these methods lies in their flexibility. Unlike traditional mixed methods, they retain the divergence terms in the momentum and temperature equations. As a result, the favorable properties of the schemes are maintained even in the presence of non-zero divergence. This makes them an ideal candidate for an extension to compressible flows, in which the divergence does not generally vanish. In the present article, we finally construct the fully-compressible extension of the methods. In addition, we demonstrate the excellent performance of the resulting methods for weakly-compressible flows that arise near the incompressible limit, as well as more strongly-compressible flows that arise near Mach 0.5.


[34] 2402.18664

Online disinformation in the 2020 U.S. Election: swing vs. safe states

For U.S. presidential elections, most states use the so-called winner-take-all system, in which the state's presidential electors are awarded to the winning political party in the state after a popular vote phase, regardless of the actual margin of victory. Therefore, election campaigns are especially intense in states where there is no clear indication of which party will win. These states are often referred to as swing states. To measure the impact of such an election law on the campaigns, we analyze the Twitter activity surrounding the 2020 US pre-election debate, with a particular focus on the spread of disinformation. We find that about 88% of the online traffic was associated with swing states. In addition, the sharing of links to unreliable news sources is significantly more prevalent in tweets associated with swing states: in this case, untrustworthy tweets are predominantly generated by automated accounts. Furthermore, we observe that the debate is mostly led by two main communities, one with a predominantly Republican affiliation and the other with accounts of different political orientations. Most of the disinformation comes from the former.


[35] 2402.18667

FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability

This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released here at https://github.com/SalesforceAIResearch/FoFo.


[36] 2402.18668

Simple linear attention language models balance the recall-throughput tradeoff

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying the BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2 when generating 1024 tokens using 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.
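
A minimal numpy sketch of the two attention branches is given below: causal linear attention with a fixed-size recurrent state and causal sliding-window softmax attention. The feature map stands in for BASED's actual feature map, and summing the two branches is an assumption; the real architecture includes further components not shown.

```python
# Minimal causal "linear attention + sliding-window attention" sketch.
# The feature map (elu(x)+1) stands in for BASED's feature map, and summing
# the two branches is an assumption, not the paper's exact combination.
import numpy as np

def phi(x):                                   # simple positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Causal linear attention with O(N d^2) time and O(d^2) recurrent state."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[1]))              # running sum of phi(k) v^T
    z = np.zeros(d)                            # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(N):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

def sliding_window_attention(Q, K, V, w=8):
    """Causal softmax attention restricted to the last w tokens."""
    N = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(N):
        s = max(0, t - w + 1)
        logits = Q[t] @ K[s:t + 1].T / np.sqrt(Q.shape[1])
        p = np.exp(logits - logits.max()); p /= p.sum()
        out[t] = p @ V[s:t + 1]
    return out

N, d = 32, 16
Q, K, V = (np.random.randn(N, d) for _ in range(3))
y = linear_attention(Q, K, V) + sliding_window_attention(Q, K, V)   # assumed combination
print(y.shape)
```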


[37] 2402.18673

Trends, Applications, and Challenges in Human Attention Modelling

Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges. For a comprehensive overview on the ongoing research refer to our dedicated repository available at https://github.com/aimagelab/awesome-human-visual-attention.


[38] 2402.18675

Robot Body Schema Learning from Full-body Extero/Proprioception Sensors

For a robot, its body structure is a priori knowledge when it is designed. However, when such information is not available, can a robot recognize its body structure by itself? In this paper, we aim to grant a robot the ability to learn its body structure from exteroception and proprioception data collected by on-body sensors. Using a novel machine learning method, the robot can learn a binary Heterogeneous Dependency Matrix from its sensor readings. We show that such a matrix is equivalent to a heterogeneous out-tree structure that can uniquely represent the robot body topology. We explore the properties of this matrix and the out-tree, and propose a remedy to fix them when they are contaminated by partial observability or data noise. We ran our algorithm on six different robots with different body structures in simulation and on one real robot. Our algorithm correctly recognized their body structures using only on-body sensor readings and no prior knowledge of the topology.


[39] 2402.18677

Fault Tolerant Neural Control Barrier Functions for Robotic Systems under Sensor Faults and Attacks

Safety is a fundamental requirement of many robotic systems. Control barrier function (CBF)-based approaches have been proposed to guarantee the safety of robotic systems. However, the effectiveness of these approaches highly relies on the choice of CBFs. Inspired by the universal approximation power of neural networks, there is a growing trend toward representing CBFs using neural networks, leading to the notion of neural CBFs (NCBFs). Current NCBFs, however, are trained and deployed in benign environments, making them ineffective for scenarios where robotic systems experience sensor faults and attacks. In this paper, we study safety-critical control synthesis for robotic systems under sensor faults and attacks. Our main contribution is the development and synthesis of a new class of CBFs that we term fault tolerant neural control barrier function (FT-NCBF). We derive the necessary and sufficient conditions for FT-NCBFs to guarantee safety, and develop a data-driven method to learn FT-NCBFs by minimizing a loss function constructed using the derived conditions. Using the learned FT-NCBF, we synthesize a control input and formally prove the safety guarantee provided by our approach. We demonstrate our proposed approach using two case studies: obstacle avoidance problem for an autonomous mobile robot and spacecraft rendezvous problem, with code available via https://github.com/HongchaoZhang-HZ/FTNCBF.
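
For orientation, the sketch below shows a generic neural-CBF training loss, penalizing h < 0 on safe samples, h > 0 on unsafe samples, and violations of the descent condition grad h(x) · f(x) + alpha·h(x) >= 0; it is not the paper's fault-tolerant (FT-NCBF) conditions. The dynamics f, alpha, and the sampling scheme are placeholders.

```python
# Generic neural-CBF training-loss sketch (NOT the paper's fault-tolerant
# conditions). The dynamics f, alpha, and the sampling are placeholders.
import torch

h = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(h.parameters(), lr=1e-3)
alpha = 1.0

def f(x):                                   # placeholder closed-loop dynamics
    return -x

for step in range(200):
    x_safe = torch.rand(64, 2) * 2 - 1                        # ||x||_inf <= 1: safe set
    x_unsafe = torch.rand(64, 2) * 2 + 1.5                    # far region: unsafe set
    x_bnd = (torch.rand(64, 2) * 2 - 1).requires_grad_(True)  # points for the descent term

    h_bnd = h(x_bnd)
    grad_h = torch.autograd.grad(h_bnd.sum(), x_bnd, create_graph=True)[0]
    descent = (grad_h * f(x_bnd)).sum(dim=1, keepdim=True) + alpha * h_bnd

    loss = (torch.relu(-h(x_safe)).mean()          # h should be >= 0 on safe states
            + torch.relu(h(x_unsafe)).mean()       # h should be <= 0 on unsafe states
            + torch.relu(-descent).mean())         # descent condition violations
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", loss.item())
```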


[40] 2402.18678

RORA: Robust Free-Text Rationale Evaluation

Free-text rationales play a pivotal role in explainable NLP, bridging the knowledge and reasoning gaps behind a model's decision-making. However, due to the diversity of potential reasoning paths and a corresponding lack of definitive ground truth, their evaluation remains a challenge. Existing evaluation metrics rely on the degree to which a rationale supports a target label, but we find these fall short in evaluating rationales that inadvertently leak the labels. To address this problem, we propose RORA, a Robust free-text Rationale evaluation against label leakage. RORA quantifies the new information supplied by a rationale to justify the label. This is achieved by assessing the conditional V-information (Hewitt et al., 2021) with a predictive family robust against leaky features that can be exploited by a small model. RORA consistently outperforms existing approaches in evaluating human-written, synthetic, or model-generated rationales, particularly demonstrating robustness against label leakage. We also show that RORA aligns well with human judgment, providing a more reliable and accurate measurement across diverse free-text rationales.


[41] 2402.18679

Data Interpreter: An LLM Agent For Data Science

Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization owing to complex dependencies among tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve data science problems with code, emphasizing three pivotal techniques to augment problem-solving: 1) dynamic planning with hierarchical graph structures for real-time data adaptability; 2) dynamic tool integration to enhance code proficiency during execution, enriching the requisite expertise; 3) identification of logical inconsistencies in feedback, and efficiency enhancement through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrates superior performance, with significant improvements in machine learning tasks, increasing from 0.86 to 0.95. Additionally, it shows a 26% increase on the MATH dataset and a remarkable 112% improvement on open-ended tasks. The solution will be released at https://github.com/geekan/MetaGPT.


[42] 2402.18682

Acoustic tactile sensing for mobile robot wheels

Tactile sensing in mobile robots remains under-explored, mainly due to challenges related to sensor integration and the complexities of distributed sensing. In this work, we present a tactile sensing architecture for mobile robots based on wheel-mounted acoustic waveguides. Our sensor architecture enables tactile sensing along the entire circumference of a wheel with a single active component: an off-the-shelf acoustic rangefinder. We present findings showing that our sensor, mounted on the wheel of a mobile robot, is capable of discriminating between different terrains, detecting and classifying obstacles with different geometries, and performing collision detection via contact localization. We also present a comparison between our sensor and sensors traditionally used in mobile robots, and point to the potential for sensor fusion approaches that leverage the unique capabilities of our tactile sensing architecture. Our findings demonstrate that autonomous mobile robots can further leverage our sensor architecture for diverse mapping tasks requiring knowledge of terrain material, surface topology, and underlying structure.


[43] 2402.18683

Integrated Sensing and Communication Meets Smart Propagation Engineering: Opportunities and Challenges

Both smart propagation engineering as well as integrated sensing and communication (ISAC) constitute promising candidates for next-generation (NG) mobile networks. We provide a synergistic view of these technologies, and explore their mutual benefits. First, moving beyond just intelligent surfaces, we provide a holistic view of the engineering aspects of smart propagation environments. By delving into the fundamental characteristics of intelligent surfaces, fluid antennas, and unmanned aerial vehicles, we reveal that more efficient control of the pathloss and fading can be achieved, thus facilitating intrinsic integration and mutual assistance between sensing and communication functionalities. In turn, with the exploitation of the sensing capabilities of ISAC to orchestrate the efficient configuration of radio environments, both the computational effort and signaling overheads can be reduced. We present indicative simulation results, which verify that cooperative smart propagation environment design significantly enhances the ISAC performance. Finally, some promising directions are outlined for combining ISAC with smart propagation engineering.


[44] 2402.18688

Exploring AI Problem Formulation with Children via Teachable Machines

Emphasizing problem formulation in AI literacy activities with children is vital, yet we lack empirical studies on their structure and affordances. We propose that participatory design involving teachable machines facilitates problem formulation activities. To test this, we integrated problem reduction heuristics into storyboarding and invited a university-based intergenerational design team of 10 children (ages 8-13) and 9 adults to co-design a teachable machine. We find that children draw from personal experiences when formulating AI problems; they assume voice and video capabilities, explore diverse machine learning approaches, and plan for error handling. Their ideas promote human involvement in AI, though some are drawn to more autonomous systems. Their designs prioritize values like capability, logic, helpfulness, responsibility, and obedience, and a preference for a comfortable life, family security, inner harmony, and excitement as end-states. We conclude by discussing how these results can inform the design of future participatory AI activities.


[45] 2402.18689

The VOROS: Lifting ROC curves to 3D

The area under the ROC curve is a common measure often used to rank the relative performance of different binary classifiers. However, as has been noted previously, it can poorly capture the benefits of different classifiers when either the true class values or the misclassification costs are highly unbalanced between the two classes. We introduce a third dimension to capture these costs and lift the ROC curve to a ROC surface in a natural way. We study this surface and introduce the VOROS, the volume over the ROC surface, as a 3D generalization of the 2D area under the ROC curve. For problems where only bounds on the expected costs or class imbalances are known, we restrict consideration to the volume of the appropriate subregion of the ROC surface. We show how the VOROS can better capture the costs of different classifiers on both a classical and a modern example dataset.


[46] 2402.18695

Grounding Language Models for Visual Entity Recognition

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
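
A minimal sketch of retrieval-augmented constrained generation in the spirit described above (the `step_logprobs` model interface and the candidate tokenizations are hypothetical; AutoVER's retriever and multi-modal LLM are not reproduced): retrieved candidate answers are compiled into a token-level prefix trie, and decoding may only follow paths that stay inside the trie.

    def build_trie(candidates):
        """Build a token-level prefix trie from retrieved candidate answers."""
        trie = {}
        for tokens in candidates:
            node = trie
            for t in tokens:
                node = node.setdefault(t, {})
            node["<eos>"] = {}          # mark a complete candidate
        return trie

    def constrained_decode(step_logprobs, candidates):
        """Greedy decoding restricted to paths present in the candidate trie.

        step_logprobs(prefix) -> dict of token -> log-probability
        (a hypothetical interface to the underlying language model).
        """
        trie = build_trie(candidates)
        prefix, node = [], trie
        while "<eos>" not in node:
            scores = step_logprobs(prefix)
            # keep only tokens that continue some retrieved candidate
            allowed = {t: scores.get(t, float("-inf")) for t in node}
            best = max(allowed, key=allowed.get)
            prefix.append(best)
            node = node[best]
        return prefix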


[47] 2402.18698

Spatial Coherence Loss for Salient and Camouflaged Object Detection and Beyond

Generic object detection is a category-independent task that relies on accurate modeling of objectness. Most relevant CNN-based models of objectness utilize loss functions (e.g., binary cross entropy) that focus on the single-response, i.e., the loss response of a single pixel. Inspired by the human visual system, which first discerns the boundaries of ambiguous regions (i.e., hard regions) before delving into the semantic meaning, we propose a novel loss function, Spatial Coherence Loss (SCLoss), that uses the mutual response between adjacent pixels to suppress or emphasize the single-response of pixels. We demonstrate that the proposed SCLoss can gradually learn the hard regions by detecting and emphasizing their boundaries. Through comprehensive experiments, we demonstrate that replacing popular loss functions with SCLoss can improve the performance of current state-of-the-art (SOTA) salient or camouflaged object detection (SOD or COD) models. We also demonstrate that combining SCLoss with other loss functions can further improve performance and result in the SOTA outcomes for different applications. Finally, as a demonstrative example of the potential uses for other related tasks, we show an application of SCLoss for semantic segmentation.
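
A minimal numpy sketch of the general idea of combining a per-pixel (single-response) binary cross-entropy with a term on the mutual response of adjacent pixels; the exact form and weighting of SCLoss are defined in the paper and not reproduced here, and `lam` is a hypothetical trade-off weight.

    import numpy as np

    def bce(p, y, eps=1e-7):
        p = np.clip(p, eps, 1 - eps)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def coherence_loss(pred, target, lam=1.0):
        """Per-pixel BCE plus a penalty when adjacent predictions disagree
        with the coherence of the adjacent labels."""
        single = bce(pred, target).mean()
        # mutual response between horizontally and vertically adjacent pixels
        dh_p, dv_p = np.abs(np.diff(pred, axis=-1)), np.abs(np.diff(pred, axis=-2))
        dh_t, dv_t = np.abs(np.diff(target, axis=-1)), np.abs(np.diff(target, axis=-2))
        mutual = ((dh_p - dh_t) ** 2).mean() + ((dv_p - dv_t) ** 2).mean()
        return single + lam * mutual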


[48] 2402.18699

Articulated Object Manipulation with Coarse-to-fine Affordance for Mitigating the Effect of Point Cloud Noise

3D articulated objects are inherently challenging to manipulate due to their varied geometries and intricate functionalities. Point-level affordance, which predicts a per-point actionable score and thus proposes the best point to interact with, has demonstrated excellent performance and generalization capabilities in articulated object manipulation. However, a significant challenge remains: while previous works use perfect point clouds generated in simulation, these models cannot be directly applied to the noisy point clouds encountered in the real world. To tackle this challenge, we leverage a property of real-world scanned point clouds: the point cloud becomes less noisy as the camera moves closer to the object. We therefore propose a novel coarse-to-fine affordance learning pipeline that mitigates the effect of point cloud noise in two stages. In the first stage, we learn affordance on the noisy far point cloud, which covers the whole object, to propose an approximate region to manipulate. We then move the camera in front of this region, scan a less noisy point cloud containing precise local geometry, and learn affordance on that point cloud to propose fine-grained final actions. The proposed method is thoroughly evaluated both on large-scale simulated noisy point clouds mimicking real-world scans and in real-world scenarios, outperforming existing methods and demonstrating its effectiveness in tackling the noisy real-world point cloud problem.


[49] 2402.18700

Learning to Compress Prompt in Natural Language Formats

Large language models (LLMs) excel at a wide range of natural language processing tasks, but their usability is constrained by poor performance on long contexts, slow inference, and the high cost of computing results. Deploying LLMs with precise and informative context helps users process large-scale datasets more effectively and cost-efficiently. Existing works rely on compressing long prompt contexts into soft prompts. However, soft prompt compression has limited transferability across different LLMs, especially API-based LLMs. To this end, this work aims to compress lengthy prompts in natural language form while preserving LLM transferability. This poses two challenges: (i) natural language (NL) prompts are incompatible with back-propagation, and (ii) NL prompts lack flexibility in imposing length constraints. In this work, we propose a Natural Language Prompt Encapsulation (Nano-Capsulator) framework that compresses original prompts into NL-formatted Capsule Prompts while maintaining prompt utility and transferability. Specifically, to tackle the first challenge, the Nano-Capsulator is optimized by a reward function that interacts with the proposed semantics-preserving loss. To address the second challenge, the Nano-Capsulator is optimized by a reward function featuring length constraints. Experimental results demonstrate that the Capsule Prompt can reduce the original length by 81.4%, decrease inference latency by up to 4.5x, and save 80.1% of budget overheads while providing transferability across diverse LLMs and different datasets.


[50] 2402.18702

Characterizing Multimedia Information Environment through Multi-modal Clustering of YouTube Videos

This study aims to investigate the comprehensive characterization of information content in multimedia (videos), particularly on YouTube. The research presents a multi-method framework for characterizing multimedia content by clustering signals from various modalities, such as audio, video, and text. With a focus on South China Sea videos as a case study, this approach aims to enhance our understanding of online content, especially on YouTube. The dataset includes 160 videos, and our findings offer insights into content themes and patterns within different modalities of a video based on clusters. Text modality analysis revealed topical themes related to geopolitical countries, strategies, and global security, while video and audio modality analysis identified distinct patterns of signals related to diverse sets of videos, including news analysis/reporting, educational content, and interviews. Furthermore, our findings uncover instances of content repurposing within video clusters, which were identified using the barcode technique and audio similarity assessments. These findings indicate potential content amplification techniques. In conclusion, this study uniquely enhances our current understanding of multimedia content information based on modality clustering techniques.


[51] 2402.18705

How Platform Exchange and Safeguards Matter: The Case of Sexual Risk in Airbnb and Couchsurfing

Recent work in CHI and CSCW has devoted increasing attention to how the design of network hospitality platforms shapes user experiences and relational outcomes. In this article, I interrogate how different risk factors emerge based on the type of exchanges these platforms facilitate. To do so, I juxtapose two prominent network hospitality platforms: one facilitating negotiated exchange (i.e., Airbnb) with another facilitating reciprocal exchange (i.e., Couchsurfing) between users. Homing in on sexual risk, an underexplored form of platform danger, and drawing on interviews with 40 female dual-platform users, I argue that the provision of binding negotiated exchange and institutional safeguards by Airbnb reduces risk through three mechanisms: casting initial guest-host relation into a buyer-seller arrangement, stabilizing interactional scripts, and formalizing sexual violence recourse. Conversely, Couchsurfing's reciprocal exchange and lack of safeguards increase sexual precarity for users both on- and off-platform. This study demonstrates how platforms with strong prosocial motivations can jeopardize sociality and concludes with implications for designs that better protect vulnerable user populations.


[52] 2402.18707

Embodied Supervision: Haptic Display of Automation Command to Improve Supervisory Performance

A human operator using a manual control interface has ready access to their own command signal, both by efference copy and by proprioception. In contrast, a human supervisor typically relies on visual information alone. We propose supplying a supervisor with a copy of the operator's command signal, hypothesizing improved performance, especially when that copy is provided through haptic display. We experimentally compared haptic with visual access to the command signal, quantifying the performance of N = 10 participants attempting to determine which of three reference signals was being tracked by an operator. Results indicate improved accuracy in identifying the tracked target when haptic display was available relative to visual display alone. We conjecture that the benefit follows from the relationship of haptics to the supervisor's own experience, perhaps muscle memory, as an operator.


[53] 2402.18708

Bluebell: An Alliance of Relational Lifting and Independence For Probabilistic Reasoning

We present Bluebell, a program logic for reasoning about probabilistic programs where unary and relational styles of reasoning come together to create new reasoning tools. Unary-style reasoning is very expressive and is powered by foundational mechanisms to reason about probabilistic behaviour like independence and conditioning. The relational style of reasoning, on the other hand, naturally shines when the properties of interest compare the behaviour of similar programs (e.g. when proving differential privacy) managing to avoid having to characterize the output distributions of the individual programs. So far, the two styles of reasoning have largely remained separate in the many program logics designed for the deductive verification of probabilistic programs. In Bluebell, we unify these styles of reasoning through the introduction of a new modality called "joint conditioning" that can encode and illuminate the rich interaction between conditional independence and relational liftings; the two powerhouses from the two styles of reasoning.


[54] 2402.18709

Nonlinear identification algorithm for online and offline study of pulmonary mechanical ventilation

This work presents an algorithm for determining the parameters of a nonlinear dynamic model of the respiratory system in patients undergoing assisted ventilation. Using the pressure and flow signals measured at the mouth, the model's quadratic pressure-volume characteristic is fit to this data in each respiratory cycle by appropriate estimates of the model parameters. Parameter changes during ventilation can thus also be detected. The algorithm is first refined and assessed using data derived from simulated patients represented through a sigmoidal pressure-volume characteristic with hysteresis. As satisfactory results are achieved with the simulated data, the algorithm is evaluated with real data obtained from actual patients undergoing assisted ventilation. The proposed nonlinear dynamic model and associated parameter estimation algorithm yield closer fits than the static linear models computed by respiratory machines, with only a minor increase in computation. They also provide more information to the physician, such as the pressure-volume (P-V) curvature and the condition of the lung (whether normal, under-inflated, or over-inflated). This information can be used to provide safer ventilation for patients, for instance by ventilating them in the linear region of the respiratory system.
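
A minimal sketch of a per-cycle least-squares fit under one common quadratic single-compartment form, P = P0 + R*Q + E1*V + E2*V^2 (an assumed model form; the paper's exact model and online update scheme are not reproduced, and the signal names are placeholders).

    import numpy as np

    def fit_breath(pressure, flow, volume):
        """Fit P ~ P0 + R*Q + E1*V + E2*V^2 over one respiratory cycle
        by ordinary least squares; arrays are samples measured at the mouth."""
        A = np.column_stack([np.ones_like(volume), flow, volume, volume ** 2])
        coef, *_ = np.linalg.lstsq(A, pressure, rcond=None)
        p0, resistance, e1, e2 = coef
        return {"P0": p0, "R": resistance, "E1": e1, "E2": e2}

Repeating the fit on each detected breath gives a per-cycle parameter trace, so changes during ventilation (e.g., in the quadratic term E2 describing the P-V curvature) can be tracked over time.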


[55] 2402.18710

Hefty: A Modular Reconfigurable Robot for Advancing Robot Manipulation in Agriculture

This paper presents a modular, reconfigurable robot platform for robot manipulation in agriculture. While robot manipulation promises great advancements in automating challenging, complex tasks that are currently best left to humans, it is also an expensive capital investment for researchers and users because it demands significantly varying robot configurations depending on the task. Modular robots provide a way to obtain multiple configurations and reduce costs by enabling incremental acquisition of only the necessary modules. The robot we present, Hefty, is designed to be modular and reconfigurable. It is designed for both researchers and end-users as a means to improve technology transfer from research to real-world application. This paper provides a detailed design and integration process, outlining the critical design decisions that enable modularity in the mobility of the robot as well as its sensor payload, power systems, computing, and fixture mounting. We demonstrate the utility of the robot by presenting five configurations used in multiple real-world agricultural robotics applications.


[56] 2402.18715

Commonsense Ontology Micropatterns

The previously introduced Modular Ontology Modeling methodology (MOMo) attempts to mimic the human analogical process by using modular patterns to assemble more complex concepts. To support this, MOMo organizes ontology design patterns into design libraries, which are programmatically queryable, to support accelerated ontology development for both human and automated processes. However, a major bottleneck to large-scale deployment of MOMo is the (to date) limited availability of ready-to-use ontology design patterns. At the same time, Large Language Models have quickly become a source of common knowledge and, in some cases, a replacement for search engines. In this paper, we thus present a collection of 104 ontology design patterns representing frequently occurring nouns, curated from the common-sense knowledge available in LLMs and organized into a fully annotated modular ontology design library ready for use with MOMo.


[57] 2402.18718

Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks

Backdoor attacks allow an attacker to embed a specific vulnerability in a machine learning algorithm, activated when an attacker-chosen pattern is presented, causing a specific misprediction. The need to identify backdoors in biometric scenarios has led us to propose a novel technique with different trade-offs. In this paper we propose using model pairs on open-set classification tasks to detect backdoors. Using a simple linear operation to project embeddings from a probe model's embedding space to a reference model's embedding space, we can compare both embeddings and compute a similarity score. We show that this score can indicate the presence of a backdoor even when the models have different architectures, were trained independently, and were trained on different datasets. Additionally, we show that backdoors can be detected even when both models are backdoored. The source code is made available for reproducibility purposes.


[58] 2402.18719

MaxCUCL: Max-Consensus with Deterministic Convergence in Networks with Unreliable Communication

In this paper, we present a novel distributed algorithm (herein called MaxCUCL) designed to guarantee that max-consensus is reached in networks characterized by unreliable communication links (i.e., links suffering from packet drops). Our proposed algorithm is the first algorithm that achieves max-consensus in a deterministic manner (i.e., nodes always calculate the maximum of their states regardless of the nature of the probability distribution of the packet drops). Furthermore, it allows nodes to determine whether convergence has been achieved (enabling them to transition to subsequent tasks). The operation of MaxCUCL relies on the deployment of narrowband error-free feedback channels used for acknowledging whether a packet transmission between nodes was successful. We analyze the operation of our algorithm and show that it converges after a finite number of time steps. Finally, we demonstrate our algorithm's effectiveness and practical applicability by applying it to a sensor network deployed for environmental monitoring.
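
To illustrate the general mechanism (not the authors' exact protocol or termination rule), the toy simulation below runs max-consensus over lossy directed links and lets each node keep retransmitting its current value on a link until that value has been acknowledged over the error-free feedback channel.

    import random

    def max_consensus_with_acks(values, neighbors, drop_prob=0.3, seed=0):
        """Toy max-consensus over lossy links with per-link acknowledgements.
        values: dict node -> initial state; neighbors: dict node -> iterable of
        out-neighbours (assumed to form a connected communication graph)."""
        rng = random.Random(seed)
        state = dict(values)
        acked = {(i, j): float("-inf") for i in neighbors for j in neighbors[i]}
        rounds = 0
        while any(acked[(i, j)] < state[i] for (i, j) in acked):
            rounds += 1
            for (i, j) in acked:
                if acked[(i, j)] < state[i]:          # j has not ACKed i's latest value
                    if rng.random() > drop_prob:      # transmission succeeds
                        state[j] = max(state[j], state[i])
                        acked[(i, j)] = state[i]      # ACK on error-free feedback channel
        return state, rounds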


[59] 2402.18721

A collocation method for nonlinear tensor differential equations on low-rank manifolds

We present a new method to compute the solution to a nonlinear tensor differential equation with dynamical low-rank approximation. The idea of dynamical low-rank approximation is to project the differential equation onto the tangent space of a low-rank tensor manifold at each time. Traditionally, an orthogonal projection onto the tangent space is employed, which is challenging to compute for nonlinear differential equations. We introduce a novel interpolatory projection onto the tangent space that is easily computed for many nonlinear differential equations and satisfies the differential equation at a set of carefully selected indices. To select these indices, we devise a new algorithm based on the discrete empirical interpolation method (DEIM) that parameterizes any tensor train and its tangent space with tensor cross interpolants. We demonstrate the proposed method with applications to tensor differential equations arising from the discretization of partial differential equations.
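
The index-selection step can be sketched with the classical DEIM greedy procedure applied to a generic basis matrix (the paper's tensor-train cross variant and tangent-space parameterization are not reproduced here).

    import numpy as np

    def deim_indices(U):
        """Classical DEIM: greedily pick one interpolation index per basis vector.
        U is an (n, m) matrix whose columns are basis vectors."""
        n, m = U.shape
        idx = [int(np.argmax(np.abs(U[:, 0])))]
        for j in range(1, m):
            # coefficients that make the current interpolant match u_j at the chosen indices
            c = np.linalg.solve(U[np.ix_(idx, range(j))], U[idx, j])
            r = U[:, j] - U[:, :j] @ c        # residual of the interpolation
            idx.append(int(np.argmax(np.abs(r))))
        return np.array(idx)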


[60] 2402.18724

Learning Associative Memories with Gradient Descent

This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the ``classification margins.'' Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. In underparameterized regimes, we illustrate how the cross-entropy loss can lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.
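
A minimal numpy sketch of the kind of module studied: an associative memory matrix W that stores input-to-target token associations through sums of outer products of fixed embeddings and is trained by gradient descent on the cross-entropy of the induced logits (the dimensions, embeddings, and toy token map below are illustrative).

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 20, 32                                   # vocabulary size, embedding dimension
    E = rng.standard_normal((V, d)) / np.sqrt(d)    # fixed input embeddings
    U = rng.standard_normal((V, d)) / np.sqrt(d)    # fixed output (unembedding) vectors
    pairs = [(i, (i * 7) % V) for i in range(V)]    # toy input -> target token map

    W = np.zeros((d, d))                            # the associative memory
    lr = 1.0
    for step in range(500):
        grad = np.zeros_like(W)
        for x, y in pairs:
            logits = U @ (W @ E[x])
            p = np.exp(logits - logits.max())
            p /= p.sum()
            p[y] -= 1.0                             # d(cross-entropy)/d(logits)
            grad += np.outer(U.T @ p, E[x])         # the gradient is itself an outer product
        W -= lr * grad / len(pairs)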


[61] 2402.18726

Unveiling Privacy, Memorization, and Input Curvature Links

Deep Neural Nets (DNNs) have become a pervasive tool for solving many emerging problems. However, they tend to overfit to and memorize the training set. Memorization is of keen interest since it is closely related to several concepts such as generalization, noisy learning, and privacy. To study memorization, Feldman (2019) proposed a formal score; however, its computational requirements limit its practical use. Recent research has shown empirical evidence linking input loss curvature (measured by the trace of the loss Hessian w.r.t. inputs) and memorization, and this proxy was shown to be roughly 3 orders of magnitude more efficient than calculating the memorization score. However, there is a lack of theoretical understanding linking memorization with input loss curvature. In this paper, we not only investigate this connection but also extend our analysis to establish theoretical links between differential privacy, memorization, and input loss curvature. First, we derive an upper bound on memorization characterized by both differential privacy and input loss curvature. Second, we present a novel insight showing that input loss curvature is upper-bounded by the differential privacy parameter. Our theoretical findings are further empirically validated using deep models on CIFAR and ImageNet datasets, showing a strong correlation between our theoretical predictions and results observed in practice.
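
A rough sketch of measuring input loss curvature as used above, estimating the trace of the input Hessian with Hutchinson probes and a central finite difference of the input gradient; `grad_loss_wrt_input` is a placeholder for whatever autodiff backend is available, and the paper's exact estimator may differ.

    import numpy as np

    def input_curvature(grad_loss_wrt_input, x, num_probes=10, eps=1e-3, seed=0):
        """Hutchinson estimate of tr(d^2 L / dx^2) at input x.
        grad_loss_wrt_input(x) must return dL/dx with the same shape as x."""
        rng = np.random.default_rng(seed)
        est = 0.0
        for _ in range(num_probes):
            v = rng.choice([-1.0, 1.0], size=x.shape)       # Rademacher probe
            hvp = (grad_loss_wrt_input(x + eps * v)
                   - grad_loss_wrt_input(x - eps * v)) / (2 * eps)
            est += float(np.sum(v * hvp))                    # v^T H v
        return est / num_probes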


[62] 2402.18728

Not All the Same: Understanding and Informing Similarity Estimation in Tile-Based Video Games

Similarity estimation is essential for many game AI applications, from the procedural generation of distinct assets to automated exploration with game-playing agents. While similarity metrics often substitute human evaluation, their alignment with our judgement is unclear. Consequently, the result of their application can fail human expectations, leading to e.g. unappreciated content or unbelievable agent behaviour. We alleviate this gap through a multi-factorial study of two tile-based games in two representations, where participants (N=456) judged the similarity of level triplets. Based on this data, we construct domain-specific perceptual spaces, encoding similarity-relevant attributes. We compare 12 metrics to these spaces and evaluate their approximation quality through several quantitative lenses. Moreover, we conduct a qualitative labelling study to identify the features underlying the human similarity judgement in this popular genre. Our findings inform the selection of existing metrics and highlight requirements for the design of new similarity metrics benefiting game development and research.


[63] 2402.18732

GAIA: Categorical Foundations of Generative AI

In this paper, we propose GAIA, a generative AI architecture based on category theory. GAIA is based on a hierarchical model where modules are organized as a simplicial complex. Each simplicial complex updates its internal parameters based on information it receives from its superior simplices and in turn relays updates to its subordinate sub-simplices. Parameter updates are formulated in terms of lifting diagrams over simplicial sets, where inner and outer horn extensions correspond to different types of learning problems. Backpropagation is modeled as an endofunctor over the category of parameters, leading to a coalgebraic formulation of deep learning.


[64] 2402.18734

Priority Sampling of Large Language Models for Compilers

Large language models show great potential for generating and optimizing code. Widely used sampling methods such as Nucleus Sampling increase the diversity of generation but often produce repeated samples at low temperatures and incoherent samples at high temperatures. Furthermore, the temperature coefficient has to be tuned for each task, limiting its usability. We present Priority Sampling, a simple and deterministic sampling technique that produces unique samples ordered by the model's confidence. Each new sample expands the unexpanded token with the highest probability in the augmented search tree. Additionally, Priority Sampling supports generation constrained by a regular expression, providing a controllable and structured exploration process. Priority Sampling outperforms Nucleus Sampling for any number of samples, boosting the original model's improvement over -Oz from 2.87% to 5%. Moreover, it outperforms the autotuner used to generate labels for training the original model in just 30 samples.
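
One plausible reading of the expansion rule is sketched below: a max-priority queue of unexpanded continuations is maintained, and the continuation with the highest cumulative log-probability is always extended next. The `top_k_next` model interface is hypothetical, the priority definition may differ from the paper's, and the regular-expression constraint is omitted.

    import heapq

    def priority_sampling(top_k_next, num_samples, max_len=32):
        """top_k_next(prefix) -> list of (token, logprob); '<eos>' ends a sample."""
        heap = [(0.0, [])]                        # (negative cumulative logprob, prefix)
        samples = []
        while heap and len(samples) < num_samples:
            neg_lp, prefix = heapq.heappop(heap)  # most probable unexpanded node
            if (prefix and prefix[-1] == "<eos>") or len(prefix) >= max_len:
                samples.append((prefix, -neg_lp))
                continue
            for token, lp in top_k_next(prefix):
                heapq.heappush(heap, (neg_lp - lp, prefix + [token]))
        return samples                            # unique samples, ordered by confidence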


[65] 2402.18736

Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to significantly reduce or eliminate costly data movement between processing elements and main memory. Prior works experimentally demonstrate three-input MAJ (i.e., MAJ3) and two-input AND and OR operations in commercial off-the-shelf (COTS) DRAM chips. Yet, demonstrations on COTS DRAM chips do not provide a functionally complete set of operations (e.g., NAND or AND and NOT). We experimentally demonstrate that COTS DRAM chips are capable of performing 1) functionally-complete Boolean operations: NOT, NAND, and NOR and 2) many-input (i.e., more than two-input) AND and OR operations. We present an extensive characterization of new bulk bitwise operations in 256 off-the-shelf modern DDR4 DRAM chips. We evaluate the reliability of these operations using a metric called success rate: the fraction of correctly performed bitwise operations. Among our 19 new observations, we highlight four major results. First, we can perform the NOT operation on COTS DRAM chips with a 98.37% success rate on average. Second, we can perform up to 16-input NAND, NOR, AND, and OR operations on COTS DRAM chips with high reliability (e.g., 16-input NAND, NOR, AND, and OR with an average success rate of 94.94%, 95.87%, 94.94%, and 95.85%, respectively). Third, data pattern only slightly affects bitwise operations. Our results show that executing NAND, NOR, AND, and OR operations with random data patterns decreases the success rate compared to all logic-1/logic-0 patterns by 1.39%, 1.97%, 1.43%, and 1.98%, respectively. Fourth, bitwise operations are highly resilient to temperature changes, with small success rate fluctuations of at most 1.66% among all the tested operations when the temperature is increased from 50C to 95C.


[66] 2402.18740

Sixth-order parabolic equation on an interval: Eigenfunction expansion, Green's function, and intermediate asymptotics for a finite thin film with elastic resistance

A linear sixth-order partial differential equation (PDE) of ``parabolic'' type describes the dynamics of thin liquid films beneath surfaces with elastic bending resistance when deflections from the equilibrium film height are small. On a finite domain, the associated sixth-order Sturm--Liouville eigenvalue problem is self-adjoint for the boundary conditions corresponding to a thin film in a closed trough, and the eigenfunctions form a complete orthonormal set. Using these eigenfunctions, we derive the Green's function for the governing sixth-order PDE on a finite interval and compare it to the known infinite-line solution. Further, we propose a Galerkin spectral method based on the constructed sixth-order eigenfunctions and their derivative expansions. The system of ordinary differential equations for the time-dependent expansion coefficients is solved by standard numerical methods. The numerical approach is applied to versions of the governing PDE with a second-order derivative (in addition to the sixth-order one), which arises from gravity acting on the film. In the absence of gravity, we demonstrate the self-similar intermediate asymptotics of initially localized disturbances on the film surface, at least until the disturbances ``feel'' the finite boundaries, and show that the derived Green's function is the global attractor for such solutions. In the presence of gravity, we use the proposed spectral numerical method to demonstrate that self-similar behavior persists, albeit for shortened intervals of time, even for large values of the gravity-to-bending ratio.


[67] 2402.18742

Comparing Importance Sampling Based Methods for Mitigating the Effect of Class Imbalance

Most state-of-the-art computer vision models heavily depend on data. However, many datasets exhibit extreme class imbalance, which has been shown to negatively impact model performance. Among the training-time and data-generation solutions that have been explored, one subset that leverages existing data is importance sampling. A good deal of this work focuses primarily on the CIFAR-10 and CIFAR-100 datasets, which fail to be representative of the scale, composition, and complexity of current state-of-the-art datasets. In this work, we explore and compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling. Specifically, we compare the effect of these techniques on the performance of two encoders on an impactful satellite imagery dataset, Planet's Amazon Rainforest dataset, in preparation for another work. Furthermore, we perform supplemental experimentation on a scene classification dataset, ADE20K, to test on a contrasting domain and clarify our results. Across both types of encoders, we find that up-weighting the loss for, and undersampling in favor of, underrepresented classes has a negligible effect on performance for those classes. Additionally, our results suggest oversampling generally improves performance for the same underrepresented classes. Interestingly, our findings also indicate that there may exist some redundancy in the data in the Planet dataset. Our work aims to provide a foundation for further work on the Planet dataset and similar domain-specific datasets. We open-source our code at https://github.com/RichardZhu123/514-class-imbalance for future work on other satellite imagery datasets as well.
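
As a generic illustration of the common starting point for all three techniques (not the paper's exact weighting scheme; the example labels are hypothetical), per-class inverse-frequency weights can be computed as follows.

    import numpy as np

    def inverse_frequency_weights(labels, num_classes):
        """Per-class weights proportional to 1/frequency, normalized to mean 1."""
        counts = np.bincount(labels, minlength=num_classes).astype(float)
        w = counts.sum() / np.maximum(counts, 1) / num_classes
        return w / w.mean()

    labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
    w = inverse_frequency_weights(labels, num_classes=3)
    # loss reweighting: multiply each example's loss by w[label]
    # oversampling:     draw training examples with probability proportional to w[label]
    # undersampling:    drop majority-class examples until counts are roughly balanced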


[68] 2402.18743

A revision on Multi-Criteria Decision Making methods for Multi-UAV Mission Planning Support

Over the last decade, Unmanned Aerial Vehicles (UAVs) have been extensively used in many commercial applications due to their manageability and risk avoidance. One of the main problems considered is Mission Planning for multiple UAVs, where a solution plan must be found that satisfies the different constraints of the problem. This problem has multiple variables that must be optimized simultaneously, such as the makespan, the cost of the mission, or the risk. The problem therefore admits many possible optimal solutions, and the operator must select the final solution to be executed from among them. In order to reduce the workload of the operator in this decision process, a Decision Support System (DSS) becomes necessary. In this work, a DSS consisting of ranking and filtering systems, which order and reduce the optimal solutions, has been designed. With regard to the ranking system, a wide range of Multi-Criteria Decision Making (MCDM) methods, including some fuzzy MCDM methods, are compared on a multi-UAV mission planning scenario, in order to study which method fits best in a multi-UAV decision support system. Expert operators evaluated the solutions returned, and the results show, on the one hand, that fuzzy methods generally achieve better average scores, and on the other, that all of the tested methods perform better when the preferences of the operators are biased towards a specific variable, and worse when their preferences are balanced. For the filtering system, a similarity function based on the proximity of the solutions has been designed, and on top of that, a threshold is tuned empirically to decide how to filter solutions without losing much of the hypervolume of the space of solutions.


[69] 2402.18744

Timer-Based Coverage Control for Mobile Sensors

This work studies the coverage control problem over a static, bounded, and convex workspace and develops a hybrid extension of the continuous-time Lloyd algorithm. Each agent in a multi-agent system (MAS) is equipped with a timer that generates intermittent sampling events, which may occur asynchronously between agents. At each sampling event, the corresponding agents update their controllers, which are otherwise held constant. These controllers are shown to drive the MAS into a neighborhood of the configurations corresponding to a centroidal Voronoi tessellation, that is, a local minimizer of the standard locational cost. The result is a distributed control strategy that leverages intermittent and asynchronous position measurements to disperse the agents within the workspace. The combination of continuous-time dynamics with intermittently updated control inputs is modeled as a hybrid system. The coverage control objective is posed as a set stabilization problem for hybrid systems, where an invariance based convergence analysis yields sufficient conditions that ensure all maximal solutions of the hybrid system asymptotically converge to a desired set. A brief simulation example is included to showcase the result.
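
For reference, the continuous-time Lloyd algorithm mentioned above drives each agent toward the centroid of its Voronoi cell; a minimal synchronous, discretized sketch is shown below (the paper's contribution is the hybrid, timer-driven asynchronous version, which this does not reproduce).

    import numpy as np

    def lloyd_step(agents, grid, density=None):
        """One synchronous Lloyd update: assign grid points to the nearest agent,
        then move each agent to the (density-weighted) centroid of its cell.
        agents: (k, 2) float positions; grid: (m, 2) sample points of the workspace."""
        if density is None:
            density = np.ones(len(grid))
        d = np.linalg.norm(grid[:, None, :] - agents[None, :, :], axis=-1)
        owner = d.argmin(axis=1)                      # Voronoi assignment
        new_agents = agents.copy()
        for k in range(len(agents)):
            mask = owner == k
            if mask.any():
                w = density[mask]
                new_agents[k] = (grid[mask] * w[:, None]).sum(0) / w.sum()
        return new_agents

Iterating this step until the agents stop moving yields a centroidal Voronoi tessellation, i.e., a local minimizer of the standard locational cost.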


[70] 2402.18746

Accelerating Computer Architecture Simulation through Machine Learning

This paper presents our approach to accelerate computer architecture simulation by leveraging machine learning techniques. Traditional computer architecture simulations are time-consuming, making it challenging to explore different design choices efficiently. Our proposed model utilizes a combination of application features and micro-architectural features to predict the performance of an application. These features are derived from simulations of a small portion of the application. We demonstrate the effectiveness of our approach by building and evaluating a machine learning model that offers significant speedup in architectural exploration. This model demonstrates the ability to predict IPC values for the testing data with a root mean square error of less than 0.1.


[71] 2402.18747

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains

We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics which are fine-tuned on human-generated MT quality judgements are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to metrics that rely on the surface form, as well as pre-trained metrics which are not fine-tuned on MT quality judgments.


[72] 2402.18749

Weighted strategies to guide a multi-objective evolutionary algorithm for multi-UAV mission planning

Management and mission planning over a swarm of unmanned aerial vehicles (UAVs) remains a challenging research problem for this particular type of aircraft. These vehicles are controlled by a number of ground control stations (GCSs), from which they are commanded to cooperatively perform different tasks in specific geographic areas of interest. Mathematically, the problem of coordinating and assigning tasks to a swarm of UAVs can be modeled as a constraint satisfaction problem, whose complexity and multiple conflicting criteria have motivated the adoption of multi-objective solvers such as multi-objective evolutionary algorithms (MOEAs). The encoding approach consists of different alleles representing the decision variables, whereas the fitness function checks that all constraints are fulfilled while minimizing the optimization criteria of the problem. In highly complex problems involving several tasks, UAVs, and GCSs, where the search space is huge compared to the space of valid solutions, the time the algorithm needs to converge increases significantly. To overcome this issue, this work proposes a weighted random generator for the creation and mutation of new individuals. The main objective of this work is to reduce the convergence time of the MOEA solver for multi-UAV mission planning by using weighted random strategies that focus the search on potentially better regions of the solution space. Extensive experimental results over a diverse range of scenarios evince the benefits of the proposed approach, which notably improves convergence with respect to a na\"ive MOEA approach.


[73] 2402.18751

Multi-Sensor and Multi-temporal High-Throughput Phenotyping for Monitoring and Early Detection of Water-Limiting Stress in Soybean

Soybean production is susceptible to biotic and abiotic stresses, exacerbated by extreme weather events. Water limiting stress, i.e. drought, emerges as a significant risk for soybean production, underscoring the need for advancements in stress monitoring for crop breeding and production. This project combines multi-modal information to identify the most effective and efficient automated methods to investigate drought response. We investigated a set of diverse soybean accessions using multiple sensors in a time series high-throughput phenotyping manner to: (1) develop a pipeline for rapid classification of soybean drought stress symptoms, and (2) investigate methods for early detection of drought stress. We utilized high-throughput time-series phenotyping using UAVs and sensors in conjunction with machine learning (ML) analytics, which offered a swift and efficient means of phenotyping. The red-edge and green bands were most effective to classify canopy wilting stress. The Red-Edge Chlorophyll Vegetation Index (RECI) successfully differentiated susceptible and tolerant soybean accessions prior to visual symptom development. We report pre-visual detection of soybean wilting using a combination of different vegetation indices. These results can contribute to early stress detection methodologies and rapid classification of drought responses in screening nurseries for breeding and production applications.
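
Assuming the standard definition of the Red-Edge Chlorophyll Index (the paper's exact band processing pipeline is not reproduced here), RECI can be computed per pixel as follows.

    import numpy as np

    def reci(nir, red_edge, eps=1e-6):
        """Red-Edge Chlorophyll Index: NIR / RedEdge - 1, computed per pixel."""
        return nir / (red_edge + eps) - 1.0

    # nir and red_edge would be co-registered reflectance rasters from the UAV sensor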


[74] 2402.18752

Pre-training Differentially Private Models with Limited Public Data

The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private, and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10\% of public data, our strategy can achieve DP accuracy of 41.5\% on ImageNet-21k (with $\epsilon=8$), as well as non-DP accuracy of 55.7\% and 60.0\% on the downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models.


[75] 2402.18753

Like-minded, like-bodied: How users (18-26) trust online eating and health information

This paper investigates the relationship between social media and eating practices amongst 42 internet users aged 18-26. We conducted an ethnography in the US and India to observe how they navigated eating and health information online. We found that participants portrayed themselves online through a vocabulary we have labeled "the good life": performing holistic health by displaying a socially-ideal body. In doing so, participants unconsciously engaged in behaviors of disordered eating while actively eschewing them. They also valued personal testimonies, and readily tested tips from content creators who shared similar beliefs and bodies to them. In doing so, they discarded probabilistic thinking and opened themselves to harm. Our study found that their social media feeds did not unidirectionally influence participants - they also reflected participants' internalized views of health, in an intertwined, non-linear journey. Reducing the online spread of disordered eating practices requires addressing it within young people's social context.


[76] 2402.18754

Extending QGroundControl for Automated Mission Planning of UAVs

Unmanned Aerial Vehicles (UAVs) have become very popular in the last decade due to advantages such as strong terrain adaptation, low cost, zero casualties, and so on. One of the most interesting advances in this field is the automation of mission planning (task allocation) and real-time replanning, which are highly useful for increasing the autonomy of the vehicle and reducing the operator workload. These automated mission planning and replanning systems require a Human Computer Interface (HCI) that facilitates the visualization and selection of plans that will be executed by the vehicles. In addition, most missions should be assessed before their real-life execution. This paper extends QGroundControl, an open-source simulation environment for flight control of multiple vehicles, by adding a mission designer that permits the operator to build complex missions with tasks and other scenario items; an interface for automated mission planning and replanning, which works as a test bed for different algorithms; and a Decision Support System (DSS) that helps the operator in the selection of the plan. In this work, a complete guide to these systems and some practical use cases are provided.


[77] 2402.18755

On Defeating Graph Analysis of Anonymous Transactions

In a ring-signature-based anonymous cryptocurrency, the signers of a transaction are hidden among a set of potential signers, called a ring, whose size is much smaller than the number of all users. The ring-membership relations specified by the sets of transactions thus induce bipartite transaction graphs, whose distribution is in turn induced by the ring sampler underlying the cryptocurrency. Since efficient graph analysis could be performed on transaction graphs to potentially deanonymise signers, it is crucial to understand the resistance of (the transaction graphs induced by) a ring sampler against graph analysis. Of particular interest is the class of partitioning ring samplers. Although previous works showed that they provide almost optimal local anonymity, their resistance against global, e.g. graph-based, attacks was unclear. In this work, we analyse transaction graphs induced by partitioning ring samplers. Specifically, we show (partly analytically and partly empirically) that, somewhat surprisingly, by setting the ring size to be at least logarithmic in the number of users, a graph-analysing adversary is no better than one that performs random guessing in deanonymisation, up to a constant factor of 2.


[78] 2402.18756

How Much Annotation is Needed to Compare Summarization Models?

Modern instruction-tuned models have become highly capable in text generation tasks such as summarization, and are expected to be released at a steady pace. In practice one may now wish to choose confidently, but with minimal effort, the best performing summarization model when applied to a new domain or purpose. In this work, we empirically investigate the test sample size necessary to select a preferred model in the context of news summarization. Empirical results reveal that comparative evaluation converges quickly for both automatic and human evaluation, with clear preferences for a system emerging from under 100 examples. The human preference data allows us to quantify how well automatic scores can reproduce preference rankings across a variety of downstream summarization tasks. We find that, while automatic metrics are stable at smaller sample sizes, only some automatic metrics are able to moderately predict model win rates according to human preference.


[79] 2402.18759

Learning with Language-Guided State Abstractions

We describe a framework for using natural language to design state abstractions for imitation learning. Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations, which can surface important features of an environment and hide irrelevant ones. These state representations are typically manually specified, or derived from other labor-intensive labeling procedures. Our method, LGA (language-guided abstraction), uses a combination of natural language supervision and background knowledge from language models (LMs) to automatically build state representations tailored to unseen tasks. In LGA, a user first provides a (possibly incomplete) description of a target task in natural language; next, a pre-trained LM translates this task description into a state abstraction function that masks out irrelevant features; finally, an imitation policy is trained using a small number of demonstrations and LGA-generated abstract states. Experiments on simulated robotic tasks show that LGA yields state abstractions similar to those designed by humans, but in a fraction of the time, and that these abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. We illustrate the utility of the learned abstractions on mobile manipulation tasks with a Spot robot.


[80] 2402.18762

Disentangling the Causes of Plasticity Loss in Neural Networks

Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a \textit{stationary} data distribution. In settings where this assumption is violated, e.g.\ deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. One factor driving this instability is the loss of plasticity, meaning that updating the network's predictions in response to new information becomes more difficult as training progresses. While many recent works provide analyses and partial solutions to this phenomenon, a fundamental question remains unanswered: to what extent do known mechanisms of plasticity loss overlap, and how can mitigation strategies be combined to best maintain the trainability of a network? This paper addresses these questions, showing that loss of plasticity can be decomposed into multiple independent mechanisms and that, while intervening on any single mechanism is insufficient to avoid the loss of plasticity in all cases, intervening on multiple mechanisms in conjunction results in highly robust learning algorithms. We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks, and further demonstrate its effectiveness on naturally arising nonstationarities, including reinforcement learning in the Arcade Learning Environment.


[81] 2402.18766

Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*

To advance the neural decoding of Portuguese, in this paper we present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in this respect. To develop this decoder, which we named Gerv\'asio PT*, a strong LLaMA~2 7B model was used as a starting point, and its further improvement through additional training was done over language resources that include new instruction data sets of Portuguese prepared for this purpose, which are also contributed in this paper. All versions of Gerv\'asio are open source and distributed for free under an open license, including for either research or commercial usage, and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.


[82] 2402.18769

CoMeT: Count-Min-Sketch-based Row Tracking to Mitigate RowHammer at Low Cost

We propose a new RowHammer mitigation mechanism, CoMeT, that prevents RowHammer bitflips with low area, performance, and energy costs in DRAM-based systems at very low RowHammer thresholds. The key idea of CoMeT is to use low-cost and scalable hash-based counters to track DRAM row activations. CoMeT uses the Count-Min Sketch technique that maps each DRAM row to a group of counters, as uniquely as possible, using multiple hash functions. When a DRAM row is activated, CoMeT increments the counters mapped to that DRAM row. Because the mapping from DRAM rows to counters is not completely unique, activating one row can increment one or more counters mapped to another row. Thus, CoMeT may overestimate, but never underestimates, a DRAM row's activation count. This property of CoMeT allows it to securely prevent RowHammer bitflips while properly configuring its hash functions reduces overestimations. As a result, CoMeT 1) implements substantially fewer counters than the number of DRAM rows in a DRAM bank and 2) does not significantly overestimate a DRAM row's activation count. Our comprehensive evaluations show that CoMeT prevents RowHammer bitflips with an average performance overhead of only 4.01% across 61 benign single-core workloads for a very low RowHammer threshold of 125, normalized to a system with no RowHammer mitigation. CoMeT achieves a good trade-off between performance, energy, and area overheads. Compared to the best-performing state-of-the-art mitigation, CoMeT requires 74.2x less area overhead at the RowHammer threshold 125 and incurs a small performance overhead on average for all RowHammer thresholds. Compared to the best-performing low-area-cost mechanism, at a very low RowHammer threshold of 125, CoMeT improves performance by up to 39.1% while incurring a similar area overhead. CoMeT is openly and freely available at https://github.com/CMU-SAFARI/CoMeT.
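
The underlying counting structure can be sketched as a standard Count-Min Sketch over row activations (the hash functions and table sizes below are illustrative; CoMeT's counter configuration, thresholds, and preventive-refresh logic are not reproduced).

    import hashlib

    class CountMinSketch:
        """Track DRAM row activation counts with `width` counters per each of
        `depth` hash rows; estimates never underestimate the true count."""
        def __init__(self, depth=4, width=512):
            self.depth, self.width = depth, width
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, row, i):
            h = hashlib.blake2b(f"{i}:{row}".encode(), digest_size=8).digest()
            return int.from_bytes(h, "little") % self.width

        def activate(self, row):
            for i in range(self.depth):
                self.table[i][self._index(row, i)] += 1

        def estimate(self, row):
            return min(self.table[i][self._index(row, i)] for i in range(self.depth))

Because several rows may hash to the same counter, `estimate` can overestimate a row's activation count but never underestimates it, which is the property the mitigation relies on for security.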


[83] 2402.18771

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

We present NARUTO, a neural active reconstruction system that combines a hybrid neural representation with uncertainty learning, enabling high-fidelity surface reconstruction. Our approach leverages a multi-resolution hash-grid as the mapping backbone, chosen for its exceptional convergence speed and capacity to capture high-frequency local features. The centerpiece of our work is the incorporation of an uncertainty learning module that dynamically quantifies reconstruction uncertainty while actively reconstructing the environment. By harnessing learned uncertainty, we propose a novel uncertainty aggregation strategy for goal searching and efficient path planning. Our system autonomously explores by targeting uncertain observations and reconstructs environments with remarkable completeness and fidelity. We also demonstrate the utility of this uncertainty-aware approach by enhancing SOTA neural SLAM systems through an active ray sampling strategy. Extensive evaluations of NARUTO in various environments, using an indoor scene simulator, confirm its superior performance and state-of-the-art status in active reconstruction, as evidenced by its impressive results on benchmark datasets like Replica and MP3D.


[84] 2402.18774

The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder Early-stage Deliberations Around Public Sector AI Proposals

Public sector agencies are rapidly deploying AI systems to augment or automate critical decisions in real-world contexts like child welfare, criminal justice, and public health. A growing body of work documents how these AI systems often fail to improve services in practice. These failures can often be traced to decisions made during the early stages of AI ideation and design, such as problem formulation. However, today, we lack systematic processes to support effective, early-stage decision-making about whether and under what conditions to move forward with a proposed AI project. To understand how to scaffold such processes in real-world settings, we worked with public sector agency leaders, AI developers, frontline workers, and community advocates across four public sector agencies and three community advocacy groups in the United States. Through an iterative co-design process, we created the Situate AI Guidebook: a structured process centered around a set of deliberation questions to scaffold conversations around (1) goals and intended use of a proposed AI system, (2) societal and legal considerations, (3) data and modeling constraints, and (4) organizational governance factors. We discuss how the guidebook's design is informed by participants' challenges, needs, and desires for improved deliberation processes. We further elaborate on implications for designing responsible AI toolkits in collaboration with public sector agency stakeholders and opportunities for future work to expand upon the guidebook. This design approach can be more broadly adopted to support the co-creation of responsible AI toolkits that scaffold key decision-making processes surrounding the use of AI in the public sector and beyond.


[85] 2402.18775

How to Evaluate Human-likeness of Interaction-aware Driver Models

This study proposes a method for qualitatively evaluating and designing human-like driver models for autonomous vehicles. While most existing research on human-likeness has been focused on quantitative evaluation, it is crucial to consider qualitative measures to accurately capture human perception. To this end, we conducted surveys utilizing both video study and human experience-based study. The findings of this research can significantly contribute to the development of naturalistic and human-like driver models for autonomous vehicles, enabling them to safely and efficiently coexist with human-driven vehicles in diverse driving scenarios.


[86] 2402.18778

X-ResQ: Reverse Annealing for Quantum MIMO Detection with Flexible Parallelism

Quantum Annealing (QA)-accelerated MIMO detection is an emerging research approach in the context of NextG wireless networks. The opportunity is to enable large MIMO systems and thus improve wireless performance. The approach aims to leverage QA to expedite the computation required for theoretically optimal but computationally-demanding Maximum Likelihood detection to overcome the limitations of the currently deployed linear detectors. This paper presents \textbf{X-ResQ}, a QA-based MIMO detector system featuring fine-grained quantum task parallelism that is uniquely enabled by the Reverse Annealing (RA) protocol. Unlike prior designs, X-ResQ has many desirable system properties for a parallel QA detector and has effectively improved detection performance as more qubits are assigned. In our evaluations on a state-of-the-art quantum annealer, fully parallel X-ResQ achieves near-optimal throughput (over 10 bits/s/Hz) for $4\times6$ MIMO with 16-QAM using six levels of parallelism with 240 qubits and $220~\mu$s QA compute time, achieving 2.5--5$\times$ gains compared against other tested detectors. For more comprehensive evaluations, we implement and evaluate X-ResQ in the non-quantum digital setting. This non-quantum X-ResQ demonstration showcases the potential to realize ultra-large $1024\times1024$ MIMO, significantly outperforming other MIMO detectors, including the state-of-the-art RA detector classically implemented in the same way.


[87] 2402.18780

A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D

The development of generative models that create 3D content from a text prompt has made considerable strides thanks to the use of the score distillation sampling (SDS) method on pre-trained diffusion models for image generation. However, the SDS method is also the source of several artifacts, such as the Janus problem, the misalignment between the text prompt and the generated 3D model, and 3D model inaccuracies. While existing methods heavily rely on the qualitative assessment of these artifacts through visual inspection of a limited set of samples, in this work we propose more objective quantitative evaluation metrics, which we cross-validate via human ratings, and show analysis of the failure cases of the SDS technique. We demonstrate the effectiveness of this analysis by designing a novel computationally efficient baseline model that achieves state-of-the-art performance on the proposed metrics while addressing all the above-mentioned artifacts.


[88] 2402.18781

Conjectural Online Learning with First-order Beliefs in Asymmetric Information Stochastic Games

Stochastic games arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures, where information asymmetry presents challenges for decision-making entities (players). Existing computational methods for asymmetric information stochastic games (AISG) are primarily offline, targeting special classes of AISGs to avoid belief hierarchies, and lack online adaptability to deviations from equilibrium. To address this limitation, we propose conjectural online learning (COL), a learning scheme for generic AISGs. COL, structured as a forecaster-actor-critic (FAC) architecture, utilizes first-order beliefs over the hidden states and subjective forecasts of the opponent's strategies. Against the conjectured opponent, COL updates strategies in an actor-critic approach using online rollout and calibrates conjectures through Bayesian learning. We prove that the conjectures in COL are asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. The resulting empirical strategy profile converges to the Berk-Nash equilibrium, a solution concept characterizing rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's superiority over state-of-the-art reinforcement learning methods against nonstationary attacks.
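The Bayesian calibration step can be illustrated with a toy sketch: keep a categorical belief over a small set of conjectured opponent strategies and update it from observed opponent actions. The strategy set, likelihood table, and update rule below are illustrative assumptions, not the paper's FAC implementation.

```python
# Toy Bayesian calibration of a conjecture over opponent strategies.
# All numbers below are illustrative stand-ins.
import numpy as np

def calibrate_conjecture(prior, likelihoods, observed_action):
    """prior: (K,) belief over K conjectured strategies.
    likelihoods: (K, A) probability each strategy assigns to each action."""
    posterior = prior * likelihoods[:, observed_action]
    return posterior / posterior.sum()

prior = np.array([0.5, 0.3, 0.2])            # belief over 3 conjectures
likelihoods = np.array([[0.7, 0.3],
                        [0.2, 0.8],
                        [0.5, 0.5]])          # P(action | strategy)
belief = calibrate_conjecture(prior, likelihoods, observed_action=1)
print(belief)  # mass shifts toward strategies consistent with the observation
```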


[89] 2402.18784

Brain-inspired and Self-based Artificial Intelligence

The question "Can machines think?" and the Turing Test to assess whether machines could achieve human-level intelligence is one of the roots of AI. With the philosophical argument "I think, therefore I am", this paper challenge the idea of a "thinking machine" supported by current AIs since there is no sense of self in them. Current artificial intelligence is only seemingly intelligent information processing and does not truly understand or be subjectively aware of oneself and perceive the world with the self as human intelligence does. In this paper, we introduce a Brain-inspired and Self-based Artificial Intelligence (BriSe AI) paradigm. This BriSe AI paradigm is dedicated to coordinating various cognitive functions and learning strategies in a self-organized manner to build human-level AI models and robotic applications. Specifically, BriSe AI emphasizes the crucial role of the Self in shaping the future AI, rooted with a practical hierarchical Self framework, including Perception and Learning, Bodily Self, Autonomous Self, Social Self, and Conceptual Self. The hierarchical framework of the Self highlights self-based environment perception, self-bodily modeling, autonomous interaction with the environment, social interaction and collaboration with others, and even more abstract understanding of the Self. Furthermore, the positive mutual promotion and support among multiple levels of Self, as well as between Self and learning, enhance the BriSe AI's conscious understanding of information and flexible adaptation to complex environments, serving as a driving force propelling BriSe AI towards real Artificial General Intelligence.


[90] 2402.18786

OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition

Depression Recognition (DR) poses a considerable challenge, especially in the context of the growing concerns surrounding privacy. Traditional automatic DR diagnosis technology necessitates the use of facial images, which undoubtedly exposes patient identity features and poses privacy risks. In order to mitigate the potential risks associated with the inappropriate disclosure of patient facial images, we design a new imaging system to erase the identity information of captured facial images while retaining disease-relevant features. It is irreversible with respect to identity information recovery while preserving the essential disease-related characteristics necessary for accurate DR. More specifically, we record a de-identified facial image (erasing the identifiable features as much as possible) with a learnable lens, which is optimized in conjunction with the downstream DR task as well as a range of face-analysis-related auxiliary tasks in an end-to-end manner. These strategies form our final Optical deep Depression Recognition network (OpticalDR). Experiments on the CelebA, AVEC 2013, and AVEC 2014 datasets demonstrate that OpticalDR achieves state-of-the-art privacy protection performance with an average AUC of 0.51 on popular facial recognition models, and competitive results for DR with MAE/RMSE of 7.53/8.48 on AVEC 2013 and 7.89/8.82 on AVEC 2014, respectively.


[91] 2402.18787

Enhancing the "Immunity" of Mixture-of-Experts Networks for Adversarial Defense

Recent studies have revealed the vulnerability of Deep Neural Networks (DNNs) to adversarial examples, which can easily fool DNNs into making incorrect predictions. To mitigate this deficiency, we propose a novel adversarial defense method called "Immunity" (Innovative MoE with MUtual information \& positioN stabilITY) based on a modified Mixture-of-Experts (MoE) architecture. The key enhancements to the standard MoE are two-fold: 1) integrating Random Switch Gates (RSGs) to obtain diverse network structures via random permutation of RSG parameters at evaluation time, even though the RSGs are determined after one-time training; 2) devising innovative Mutual Information (MI)-based and Position Stability-based loss functions by capitalizing on Grad-CAM's explanatory power to increase the diversity and causality of the expert networks. Notably, our MI-based loss operates directly on the heatmaps, theoretically inducing subtler negative impacts on classification performance than other losses of the same type. Extensive evaluation validates the efficacy of the proposed approach in improving adversarial robustness against a wide range of attacks.


[92] 2402.18789

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.


[93] 2402.18792

MPAT: Building Robust Deep Neural Networks against Textual Adversarial Attacks

Deep neural networks have been proven to be vulnerable to adversarial examples and various methods have been proposed to defend against adversarial attacks for natural language processing tasks. However, previous defense methods have limitations in maintaining effective defense while ensuring the performance of the original task. In this paper, we propose a malicious perturbation based adversarial training method (MPAT) for building robust deep neural networks against textual adversarial attacks. Specifically, we construct a multi-level malicious example generation strategy to generate adversarial examples with malicious perturbations, which are used instead of original inputs for model training. Additionally, we employ a novel training objective function to ensure achieving the defense goal without compromising the performance on the original task. We conduct comprehensive experiments to evaluate our defense method by attacking five victim models on three benchmark datasets. The result demonstrates that our method is more effective against malicious adversarial attacks compared with previous defense methods while maintaining or further improving the performance on the original task.


[94] 2402.18793

An Adaptive Orthogonal Basis Method for Computing Multiple Solutions of Differential Equations with polynomial nonlinearities

This paper presents an innovative approach, the Adaptive Orthogonal Basis Method, tailored for computing multiple solutions to differential equations characterized by polynomial nonlinearities. Departing from conventional practices of predefining candidate basis pools, our novel method adaptively computes bases, considering the equation's nature and structural characteristics of the solution. It further leverages companion matrix techniques to generate initial guesses for subsequent computations. Thus this approach not only yields numerous initial guesses for solving such equations but also adapts orthogonal basis functions to effectively address discretized nonlinear systems. Through a series of numerical experiments, this paper demonstrates the method's effectiveness and robustness. By reducing computational costs in various applications, this novel approach opens new avenues for uncovering multiple solutions to differential equations with polynomial nonlinearities.
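To illustrate the companion-matrix idea for generating initial guesses, the sketch below builds the companion matrix of a scalar polynomial and takes its eigenvalues as candidate roots; it is a simplified stand-in under that assumption, not the paper's adaptive orthogonal basis construction.

```python
# Candidate initial guesses from the companion matrix of a polynomial
# nonlinearity (illustrative scalar example only).
import numpy as np

def companion_roots(coeffs):
    """coeffs: [a_n, ..., a_1, a_0] for a_n x^n + ... + a_0 with a_n != 0."""
    a = np.asarray(coeffs, dtype=float)
    a = a / a[0]                      # make the polynomial monic
    n = len(a) - 1
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)        # ones on the sub-diagonal
    C[:, -1] = -a[::-1][:-1]          # last column holds -a_0, ..., -a_{n-1}
    return np.linalg.eigvals(C)       # eigenvalues are the polynomial roots

# Example: u^3 - u = 0 has roots 0, 1, -1, each a candidate initial guess
# for a different solution branch of a discretized nonlinear system.
print(np.sort(companion_roots([1, 0, -1, 0]).real))
```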


[95] 2402.18796

MOSAIC: A Modular System for Assistive and Interactive Cooking

We present MOSAIC, a modular architecture for home robots to perform complex collaborative tasks, such as cooking with everyday users. MOSAIC tightly collaborates with humans, interacts with users using natural language, coordinates multiple robots, and manages an open vocabulary of everyday objects. At its core, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for general tasks like language and image recognition, while using streamlined modules designed for task-specific control. We extensively evaluate MOSAIC on 60 end-to-end trials where two robots collaborate with a human user to cook a combination of 6 recipes. We also extensively test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We show that MOSAIC is able to efficiently collaborate with humans by running the overall system end-to-end with a real human user, completing 68.3% (41/60) of collaborative cooking trials across 6 different recipes with a subtask completion rate of 91.6%. Finally, we discuss the limitations of the current system and exciting open challenges in this domain. The project's website is at https://portal-cornell.github.io/MOSAIC/


[96] 2402.18797

ARTiST: Automated Text Simplification for Task Guidance in Augmented Reality

Text presented in augmented reality provides in-situ, real-time information for users. However, this content can be challenging to apprehend quickly when engaging in cognitively demanding AR tasks, especially when it is presented on a head-mounted display. We propose ARTiST, an automatic text simplification system that uses a few-shot prompt and GPT-3 models to specifically optimize the text length and semantic content for augmented reality. Developed out of a formative study that included seven users and three experts, our system combines a customized error calibration model with a few-shot prompt to integrate the syntactic, lexical, elaborative, and content simplification techniques, and generate simplified AR text for head-worn displays. Results from a 16-user empirical study showed that ARTiST lightens the cognitive load and improves performance significantly over both unmodified text and text modified via traditional methods. Our work constitutes a step towards automating the optimization of batch text data for readability and performance in augmented reality.


[97] 2402.18800

BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data

Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appear less effective here, primarily due to their overreliance on neighboring elements for prediction. We systematically analyze the issue and propose a novel matrix completion method ``BlockEcho" for a more comprehensive solution. This method creatively integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. In addition, we incorporate an additional GAN discriminator that compares the generator's intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also contribute analyses that provide theoretical justification for the optimality and convergence of fusing MF and GAN for block-wise missing data.


[98] 2402.18803

To Pool or Not To Pool: Analyzing the Regularizing Effects of Group-Fair Training on Shared Models

In fair machine learning, one source of performance disparities between groups is over-fitting to groups with relatively few training samples. We derive group-specific bounds on the generalization error of welfare-centric fair machine learning that benefit from the larger sample size of the majority group. We do this by considering group-specific Rademacher averages over a restricted hypothesis class, which contains the family of models likely to perform well with respect to a fair learning objective (e.g., a power-mean). Our simulations demonstrate these bounds improve over a naive method, as expected by theory, with particularly significant improvement for smaller group sizes.


[99] 2402.18805

VEC-SBM: Optimal Community Detection with Vectorial Edges Covariates

Social networks are often associated with rich side information, such as texts and images. While numerous methods have been developed to identify communities from pairwise interactions, they usually ignore such side information. In this work, we study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection, that integrates vectorial edge covariates: the Vectorial Edges Covariates Stochastic Block Model (VEC-SBM). We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities under the VEC-SBM. Furthermore, we rigorously assess the added value of leveraging edge side information in the community detection process. We complement our theoretical results with numerical experiments on synthetic and semi-synthetic data.


[100] 2402.18807

On the Decision-Making Abilities in Role-Playing using Large Language Models

Large language models (LLMs) are now increasingly utilized for role-playing tasks, especially in impersonating domain-specific experts, primarily through role-playing prompts. When interacting in real-world scenarios, the decision-making abilities of a role significantly shape its behavioral patterns. In this paper, we concentrate on evaluating the decision-making abilities of LLMs post role-playing, thereby validating the efficacy of role-playing. Our goal is to provide metrics and guidance for enhancing the decision-making abilities of LLMs in role-playing tasks. Specifically, we first use LLMs to generate virtual role descriptions corresponding to the 16 personality types of the Myers-Briggs Type Indicator (abbreviated as MBTI), representing a segmentation of the population. Then we design specific quantitative operations to evaluate the decision-making abilities of LLMs post role-playing from four aspects: adaptability, exploration$\&$exploitation trade-off ability, reasoning ability, and safety. Finally, we analyze the association between the performance of decision-making and the corresponding MBTI types through GPT-4. Extensive experiments demonstrate stable differences in the four aspects of decision-making abilities across distinct roles, signifying a robust correlation between decision-making abilities and the roles emulated by LLMs. These results underscore that LLMs can effectively impersonate varied roles while embodying their genuine sociological characteristics.


[101] 2402.18811

BFRFormer: Transformer-based generator for Real-World Blind Face Restoration

Blind face restoration is a challenging task due to the unknown and complex degradation. Although face prior-based methods and reference-based methods have recently demonstrated high-quality results, the restored images tend to contain over-smoothed results and lose identity-preserved details when the degradation is severe. It is observed that this is attributed to short-range dependencies, the intrinsic limitation of convolutional neural networks. To model long-range dependencies, we propose a Transformer-based blind face restoration method, named BFRFormer, to reconstruct images with more identity-preserved details in an end-to-end manner. In BFRFormer, to remove blocking artifacts, the wavelet discriminator and aggregated attention module are developed, and spectral normalization and balanced consistency regulation are adaptively applied to address the training instability and over-fitting problem, respectively. Extensive experiments show that our method outperforms state-of-the-art methods on a synthetic dataset and four real-world datasets. The source code, Casia-Test dataset, and pre-trained models are released at https://github.com/s8Znk/BFRFormer.


[102] 2402.18813

Protein Multimer Structure Prediction via Prompt Learning

Understanding the 3D structures of protein multimers is crucial, as they play a vital role in regulating various cellular processes. It has been empirically confirmed that the multimer structure prediction~(MSP) can be well handled in a step-wise assembly fashion using provided dimer structures and predicted protein-protein interactions~(PPIs). However, due to the biological gap in the formation of dimers and larger multimers, directly applying PPI prediction techniques can often cause a \textit{poor generalization} to the MSP task. To address this challenge, we aim to extend the PPI knowledge to multimers of different scales~(i.e., chain numbers). Specifically, we propose \textbf{\textsc{PromptMSP}}, a pre-training and \textbf{Prompt} tuning framework for \textbf{M}ultimer \textbf{S}tructure \textbf{P}rediction. First, we tailor the source and target tasks for effective PPI knowledge learning and efficient inference, respectively. We design PPI-inspired prompt learning to narrow the gaps of two task formats and generalize the PPI knowledge to multimers of different scales. We provide a meta-learning strategy to learn a reliable initialization of the prompt model, enabling our prompting framework to effectively adapt to limited data for large-scale multimers. Empirically, we achieve both significant accuracy (RMSD and TM-Score) and efficiency improvements compared to advanced MSP models. The code, data and checkpoints are released at \url{https://github.com/zqgao22/PromptMSP}.


[103] 2402.18814

New topological subsystem codes from semi-regular tessellations

In this work, we present new constructions for topological subsystem codes using semi-regular Euclidean and hyperbolic tessellations. They give us new families of codes, and we also provide a new family of codes obtained through an already existing construction, due to Sarvepalli and Brown. We also prove new results that allow us to obtain the parameters of these new codes.


[104] 2402.18815

How do Large Language Models Handle Multilingualism?

Large language models (LLMs) demonstrate remarkable performance across a spectrum of languages. In this work, we delve into the question: How do LLMs handle multilingualism? We introduce a framework that depicts LLMs' processing of multilingual inputs: In the first several layers, LLMs understand the question, converting multilingual inputs into English to facilitate the task-solving phase. In the intermediate layers, LLMs engage in problem-solving by thinking in English and incorporating multilingual knowledge to obtain factual content, leveraging the self-attention and feed-forward structures, respectively. In the last several layers, LLMs generate responses that align with the original language of the query. In addition, we investigate the existence of language-specific neurons when processing a certain language. To detect neurons activated by the input language, even without labels, we innovatively design a Parallel Language specific Neuron Detection ($\texttt{PLND}$) method that effectively measures the significance of neurons when handling multilingual inputs. By comprehensive ablation analysis through deactivating neurons of different layers and structures, we verify the framework that we propose. Additionally, we demonstrate that we can utilize such a framework to effectively enhance the multilingual ability with much less training effort.


[105] 2402.18817

Gradient Alignment for Cross-Domain Face Anti-Spoofing

Recent advancements in domain generalization (DG) for face anti-spoofing (FAS) have garnered considerable attention. Traditional methods have focused on designing learning objectives and additional modules to isolate domain-specific features while retaining domain-invariant characteristics in their representations. However, such approaches often lack guarantees of consistent maintenance of domain-invariant features or the complete removal of domain-specific features. Furthermore, most prior works of DG for FAS do not ensure convergence to a local flat minimum, which has been shown to be advantageous for DG. In this paper, we introduce GAC-FAS, a novel learning objective that encourages the model to converge towards an optimal flat minimum without necessitating additional learning modules. Unlike conventional sharpness-aware minimizers, GAC-FAS identifies ascending points for each domain and regulates the generalization gradient updates at these points to align coherently with empirical risk minimization (ERM) gradient updates. This unique approach specifically guides the model to be robust against domain shifts. We demonstrate the efficacy of GAC-FAS through rigorous testing on challenging cross-domain FAS datasets, where it establishes state-of-the-art performance. The code is available at https://github.com/leminhbinh0209/CVPR24-FAS.


[106] 2402.18818

CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently; most are embedding-based, but they show limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overheads. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable ones) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale vulnerability benchmark, offering the first precise evaluation scheme to assess BCSD methods for the 1-day vulnerability detection task. CEBin can identify the similar function from millions of candidate functions in just a few seconds and achieves an impressive recall rate of $85.46\%$ on this more practical but challenging task, which is several orders of magnitude faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.
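A toy sketch of the fuse-then-rerank idea: an embedding model shortlists candidates by cosine similarity, and a more expensive pairwise comparison model reranks the shortlist. Both models below (embed, pairwise_score) are hypothetical stand-ins for CEBin's learned components, and the corpus is a few dummy strings.

```python
# Two-stage search: cheap embedding retrieval, then accurate pairwise reranking.
import numpy as np

def embed(funcs):                          # hypothetical embedding model
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(funcs), 128))

def pairwise_score(query_func, cand_func): # hypothetical comparison model
    return float(len(set(query_func) & set(cand_func)))

def search(query_func, corpus_funcs, k=50):
    E = embed(corpus_funcs)
    q = embed([query_func])[0]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-12)
    shortlist = np.argsort(-sims)[:k]                       # embedding-based stage
    return sorted(shortlist,
                  key=lambda i: pairwise_score(query_func, corpus_funcs[i]),
                  reverse=True)                             # comparison-based stage

corpus = ["push rbp; mov rbp, rsp; ret",
          "xor eax, eax; ret",
          "mov eax, [rdi]; add eax, 1; ret"]
print(search("mov eax, [rdi]; add eax, 2; ret", corpus, k=3))
```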


[107] 2402.18819

Dual Operating Modes of In-Context Learning

In-context learning (ICL) exhibits dual operating modes: task learning, i.e., acquiring a new skill from in-context samples, and task retrieval, i.e., locating and activating a relevant pretrained skill. Recent theoretical work investigates various mathematical models to analyze ICL, but existing models explain only one operating mode at a time. We introduce a probabilistic model, with which one can explain the dual operating modes of ICL simultaneously. Focusing on in-context learning of linear functions, we extend existing models for pretraining data by introducing multiple task groups and task-dependent input distributions. We then analyze the behavior of the optimally pretrained model under the squared loss, i.e., the MMSE estimator of the label given in-context examples. Regarding pretraining task distribution as prior and in-context examples as the observation, we derive the closed-form expression of the task posterior distribution. With the closed-form expression, we obtain a quantitative understanding of the two operating modes of ICL. Furthermore, we shed light on an unexplained phenomenon observed in practice: under certain settings, the ICL risk initially increases and then decreases with more in-context examples. Our model offers a plausible explanation for this "early ascent" phenomenon: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which will eventually diminish as task learning takes effect with more in-context samples. We also theoretically analyze ICL with biased labels, e.g., zero-shot ICL, where in-context examples are assigned random labels. Lastly, we validate our findings and predictions via experiments involving Transformers and large language models.


[108] 2402.18821

Debiased Novel Category Discovering and Localization

In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data, while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, which leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we suggest improving the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state-of-the-art.
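For the clustering step, a minimal sketch using scikit-learn's mini-batch K-means on synthetic region embeddings is shown below; the feature dimension and number of clusters are illustrative assumptions, not values from the paper.

```python
# Mini-batch K-means over unlabeled region embeddings to form candidate
# novel categories (synthetic features for illustration).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
region_features = rng.normal(size=(10_000, 256))   # embeddings of unlabeled proposals

kmeans = MiniBatchKMeans(n_clusters=20, batch_size=1024, random_state=0)
novel_class_ids = kmeans.fit_predict(region_features)
print(np.bincount(novel_class_ids))                # size of each discovered cluster
```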


[109] 2402.18824

Batch size invariant Adam

We propose a batch size invariant version of Adam, for use in large-scale, distributed settings, in which the mini-batch is divided into micro-batches which are distributed among worker nodes. For the v term, standard Adam first computes the average over micro-batch gradients, then squares, while in the batch size invariant Adam proposed here, we first square the micro-batch gradients, then average. Previous work (e.g. Malladi et al. 2022) used an alternative approach that involved a square-root scaling of the learning rate, but this approach requires strong assumptions to work; in particular that the gradient variance dominates the square of the expected gradient. In contrast, the approach proposed here gives batch size invariance without this assumption. We confirm that in practice our scheme gives batch size invariance in a much larger range of scenarios than the previous approach.
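The difference between the two second-moment updates can be written in a few lines; the sketch below contrasts average-then-square (standard Adam) with square-then-average (the batch size invariant variant), using illustrative micro-batch gradients and the usual Adam notation.

```python
# Contrast of the v-term updates: average-then-square vs square-then-average.
import numpy as np

def v_update_standard(v, micro_grads, beta2=0.999):
    g_bar = np.mean(micro_grads, axis=0)                 # average, then square
    return beta2 * v + (1 - beta2) * g_bar ** 2

def v_update_batch_invariant(v, micro_grads, beta2=0.999):
    g_sq_bar = np.mean(np.square(micro_grads), axis=0)   # square, then average
    return beta2 * v + (1 - beta2) * g_sq_bar

rng = np.random.default_rng(0)
micro_grads = rng.normal(size=(8, 4))                    # 8 micro-batches, 4 parameters
v = np.zeros(4)
print(v_update_standard(v, micro_grads))
print(v_update_batch_invariant(v, micro_grads))          # larger when gradients are noisy
```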


[110] 2402.18825

Utilizing Local Hierarchy with Adversarial Training for Hierarchical Text Classification

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification due to its complex taxonomic structure. Nearly all recent HTC works focus on how the labels are structured but ignore the sub-structure of ground-truth labels according to each input text which contains fruitful label co-occurrence information. In this work, we introduce this local hierarchy with an adversarial framework. We propose a HiAdv framework that can fit in nearly all HTC models and optimize them with the local hierarchy as auxiliary information. We test on two typical HTC models and find that HiAdv is effective in all scenarios and is adept at dealing with complex taxonomic hierarchies. Further experiments demonstrate that the promotion of our framework indeed comes from the local hierarchy and the local hierarchy is beneficial for rare classes which have insufficient training data.


[111] 2402.18826

The Machine Can't Replace the Human Heart

What is the true heart of mental healthcare -- innovation or humanity? Can virtual therapy ever replicate the profound human bonds where healing arises? As artificial intelligence and immersive technologies promise expanded access, safeguards must ensure technologies remain supplementary tools guided by providers' wisdom. Implementation requires nuance balancing efficiency and empathy. If conscious of ethical risks, perhaps AI could restore humanity by automating tasks, giving providers more time to listen. Yet no algorithm can replicate the seat of dignity within. We must ask ourselves: What future has people at its core? One where AI thoughtfully plays a collaborative role? Or where pursuit of progress leaves vulnerability behind? This commentary argues for a balanced approach thoughtfully integrating technology while retaining care's irreplaceable human essence, at the heart of this profoundly human profession. Ultimately, by nurturing innovation and humanity together, perhaps we reach new heights of empathy previously unimaginable.


[112] 2402.18835

Envisioning the Applications and Implications of Generative AI for News Media

This article considers the increasing use of algorithmic decision-support systems and synthetic media in the newsroom, and explores how generative models can help reporters and editors across a range of tasks from the conception of a news story to its distribution. Specifically, we draw from a taxonomy of tasks associated with news production, and discuss where generative models could appropriately support reporters, the journalistic and ethical values that must be preserved within these interactions, and the resulting implications for design contributions in this area in the future. Our essay is relevant to practitioners and researchers as they consider using generative AI systems to support different tasks and workflows.


[113] 2402.18836

A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations

This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.
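A minimal sketch of the augmented objective described above, assuming a hypothetical policy head and a hypothetical inverse of a forward dynamics model; the paper's automatic weight adjustment is not reproduced, and bc_weight is fixed here for illustration.

```python
# Augmented loss: RL objective plus a behavioral-cloning term whose targets
# come from expert observation pairs via a (hypothetical) dynamics inverse.
import numpy as np

rng = np.random.default_rng(0)

def policy_action(obs):                   # hypothetical deterministic policy head
    return 0.1 * obs

def infer_action(obs, next_obs):          # hypothetical inverse of a forward dynamics model
    return next_obs - obs

def augmented_loss(rl_loss, expert_obs, expert_next_obs, bc_weight):
    target = infer_action(expert_obs, expert_next_obs)
    bc_loss = np.mean((policy_action(expert_obs) - target) ** 2)
    return rl_loss + bc_weight * bc_loss

expert_obs = rng.normal(size=(32, 6))
expert_next_obs = expert_obs + rng.normal(scale=0.05, size=(32, 6))
print(augmented_loss(rl_loss=1.7, expert_obs=expert_obs,
                     expert_next_obs=expert_next_obs, bc_weight=0.5))
```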


[114] 2402.18838

When does word order matter and when doesn't it?

Language models (LMs) may appear insensitive to word order changes in natural language understanding (NLU) tasks. In this paper, we propose that linguistic redundancy can explain this phenomenon, whereby word order and other linguistic cues such as case markers provide overlapping and thus redundant information. Our hypothesis is that models exhibit insensitivity to word order when the order provides redundant information, and the degree of insensitivity varies across tasks. We quantify how informative word order is using mutual information (MI) between unscrambled and scrambled sentences. Our results show that the less informative word order is, the more consistent the model's predictions are between unscrambled and scrambled sentences. We also find that the effect varies across tasks: for some tasks, like SST-2, LMs' predictions are almost always consistent with the original ones even if the Pointwise MI (PMI) changes, while for others, like RTE, the consistency is near random when the PMI gets lower, i.e., when word order is really important.


[115] 2402.18839

Extended Flow Matching: a Method of Conditional Generation with Generalized Continuity Equation

The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated diffusion models, with the guidance-based classifier-free method taking the lead. However, the theory of the guidance-based method not only requires the user to fine-tune the "guidance strength," but its target vector field does not necessarily correspond to the conditional distribution used in training. In this paper, we develop the theory of conditional generation based on Flow Matching, a strong current contender to diffusion methods. Motivated by the interpretation of a probability path as a distribution on path space, we establish a novel theory of flow-based generation of conditional distributions by employing the mathematical framework of the generalized continuity equation instead of the continuity equation used in flow matching. This theory naturally derives a method that aims to match a matrix field as opposed to a vector field. Our framework ensures the continuity of the generated conditional distribution through the existence of flows between conditional distributions. We present our theory together with experiments and mathematical results.


[116] 2402.18842

ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet, the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this, we introduce ViewFusion, a novel, training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation, ensuring robust multi-view consistency during the novel-view generation process. Through a diffusion process that fuses known-view information via interpolated denoising, our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning. Extensive experimental results demonstrate the effectiveness of ViewFusion in generating consistent and detailed novel views.


[117] 2402.18844

Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey

3D human pose estimation and mesh recovery have attracted widespread research interest in many areas, such as computer vision, autonomous driving, and robotics. Deep learning on 3D human pose estimation and mesh recovery has recently thrived, with numerous methods proposed to address different problems in this area. In this paper, to stimulate future research, we present a comprehensive review of recent progress over the past five years in deep learning methods for this area by delving into over 200 references. To the best of our knowledge, this survey is arguably the first to comprehensively cover deep learning methods for 3D human pose estimation, including both single-person and multi-person approaches, as well as human mesh recovery, encompassing methods based on explicit models and implicit representations. We also present comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions. A regularly updated project page can be found at https://github.com/liuyangme/SOTA-3DHPE-HMR.


[118] 2402.18846

Multi-Fidelity Residual Neural Processes for Scalable Surrogate Modeling

Multi-fidelity surrogate modeling aims to learn an accurate surrogate at the highest fidelity level by combining data from multiple sources. Traditional methods relying on Gaussian processes can hardly scale to high-dimensional data. Deep learning approaches utilize neural network based encoders and decoders to improve scalability. These approaches share encoded representations across fidelities without including corresponding decoder parameters. At the highest fidelity, the representations are decoded with different parameters, making the shared information inherently inaccurate. This hinders inference performance, especially in out-of-distribution scenarios when the highest fidelity data has limited domain coverage. To address these limitations, we propose Multi-fidelity Residual Neural Processes (MFRNP), a novel multi-fidelity surrogate modeling framework. MFRNP optimizes lower fidelity decoders for accurate information sharing by aggregating lower fidelity surrogate outputs and modeling the residual between the aggregation and the ground truth at the highest fidelity. We show that MFRNP significantly outperforms the current state-of-the-art in learning partial differential equations and on a real-world climate modeling task.


[119] 2402.18847

Flexible Precoding for Multi-User Movable Antenna Communications

This letter rethinks traditional precoding in multi-user wireless communications with movable antennas (MAs). Utilizing MAs for optimal antenna positioning, we introduce a sparse optimization (SO)-based approach focusing on regularized zero-forcing (RZF). This framework targets the optimization of antenna positions and the precoding matrix to minimize inter-user interference and transmit power. We propose an off-grid regularized least squares-based orthogonal matching pursuit (RLS-OMP) method for this purpose. Moreover, we provide deeper insights into antenna position optimization using RLS-OMP, viewed from a subspace projection angle. Overall, our proposed flexible precoding scheme demonstrates a sum rate that exceeds more than twice that of fixed antenna positions.
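As background for the precoding baseline, a minimal regularized zero-forcing (RZF) sketch for a fixed channel is shown below; the antenna-position optimization and the RLS-OMP step are not reproduced, and the dimensions and regularizer are illustrative assumptions.

```python
# Regularized zero-forcing precoder W = H^H (H H^H + alpha I)^{-1},
# normalized to unit total transmit power (illustrative dimensions).
import numpy as np

rng = np.random.default_rng(0)
K, M = 4, 8                                        # users, transmit antennas
H = (rng.normal(size=(K, M)) + 1j * rng.normal(size=(K, M))) / np.sqrt(2)

alpha = 0.1                                        # regularization parameter
W = H.conj().T @ np.linalg.inv(H @ H.conj().T + alpha * np.eye(K))
W /= np.linalg.norm(W)                             # power normalization

effective = H @ W                                  # near-diagonal => low inter-user interference
print(np.round(np.abs(effective), 2))
```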


[120] 2402.18848

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model, we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore, to overcome the limitation of scarce high-quality lightstage data, we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism.
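For reference, the Cook-Torrance specular term that guides the architecture can be sketched with the common GGX distribution, Schlick Fresnel, and Smith-style geometry approximations; this is textbook shading math with illustrative inputs, not the paper's network.

```python
# Cook-Torrance specular BRDF: D * F * G / (4 (n.l)(n.v)), textbook form.
import numpy as np

def cook_torrance(n, v, l, roughness=0.3, f0=0.04):
    n, v, l = (x / np.linalg.norm(x) for x in (n, v, l))
    h = (v + l) / np.linalg.norm(v + l)                          # half vector
    nl = max(float(np.dot(n, l)), 1e-4)
    nv = max(float(np.dot(n, v)), 1e-4)
    nh = max(float(np.dot(n, h)), 1e-4)
    vh = max(float(np.dot(v, h)), 1e-4)
    a2 = roughness ** 4                                          # GGX alpha^2, alpha = roughness^2
    D = a2 / (np.pi * (nh * nh * (a2 - 1.0) + 1.0) ** 2)         # normal distribution term
    F = f0 + (1.0 - f0) * (1.0 - vh) ** 5                        # Schlick Fresnel
    k = (roughness + 1.0) ** 2 / 8.0
    G = (nl / (nl * (1.0 - k) + k)) * (nv / (nv * (1.0 - k) + k))  # geometry term
    return D * F * G / (4.0 * nl * nv)

print(cook_torrance(n=np.array([0.0, 0.0, 1.0]),
                    v=np.array([0.3, 0.0, 1.0]),
                    l=np.array([-0.2, 0.1, 1.0])))
```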


[121] 2402.18849

Enhancing Steganographic Text Extraction: Evaluating the Impact of NLP Models on Accuracy and Semantic Coherence

This study discusses a new method combining image steganography technology with Natural Language Processing (NLP) large models, aimed at improving the accuracy and robustness of extracting steganographic text. Traditional Least Significant Bit (LSB) steganography techniques face challenges in accuracy and robustness of information extraction when dealing with complex character encoding, such as Chinese characters. To address this issue, this study proposes an innovative LSB-NLP hybrid framework. This framework integrates the advanced capabilities of NLP large models, such as error detection, correction, and semantic consistency analysis, as well as information reconstruction techniques, thereby significantly enhancing the robustness of steganographic text extraction. Experimental results show that the LSB-NLP hybrid framework excels in improving the extraction accuracy of steganographic text, especially in handling Chinese characters. The findings of this study not only confirm the effectiveness of combining image steganography technology and NLP large models but also propose new ideas for research and application in the field of information hiding. The successful implementation of this interdisciplinary approach demonstrates the great potential of integrating image steganography technology with natural language processing technology in solving complex information processing problems.
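A minimal sketch of the classical LSB baseline (embedding and extracting UTF-8 bytes in the least significant bits of a cover array); the NLP-based error detection and correction stage is not shown, and the cover "image" is a random array for illustration.

```python
# Plain LSB embedding/extraction of a UTF-8 payload (including Chinese text).
import numpy as np

def lsb_embed(cover, payload: bytes):
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    stego = cover.flatten().copy()
    stego[:bits.size] = (stego[:bits.size] & 0xFE) | bits   # overwrite least significant bit
    return stego.reshape(cover.shape)

def lsb_extract(stego, num_bytes):
    bits = stego.flatten()[:num_bytes * 8] & 1
    return np.packbits(bits).tobytes()

payload = "隐写测试 hidden text".encode("utf-8")
cover = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
stego = lsb_embed(cover, payload)
print(lsb_extract(stego, num_bytes=len(payload)).decode("utf-8"))
```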


[122] 2402.18851

Applications of 0-1 Neural Networks in Prescription and Prediction

A key challenge in medical decision making is learning treatment policies for patients with limited observational data. This challenge is particularly evident in personalized healthcare decision-making, where models need to take into account the intricate relationships between patient characteristics, treatment options, and health outcomes. To address this, we introduce prescriptive networks (PNNs), shallow 0-1 neural networks trained with mixed integer programming that can be used with counterfactual estimation to optimize policies in medium data settings. These models offer greater interpretability than deep neural networks and can encode more complex policies than common models such as decision trees. We show that PNNs can outperform existing methods in both synthetic data experiments and in a case study of assigning treatments for postpartum hypertension. In particular, PNNs are shown to produce policies that could reduce peak blood pressure by 5.47 mm Hg (p=0.02) over existing clinical practice, and by 2 mm Hg (p=0.01) over the next best prescriptive modeling technique. Moreover, PNNs were more likely than all other models to correctly identify clinically significant features, while existing models relied on potentially dangerous features such as patient insurance information and race that could lead to bias in treatment.


[123] 2402.18853

Rethinking Multi-domain Generalization with A General Learning Objective

Multi-domain generalization (mDG) is universally aimed at minimizing the discrepancy between training and testing distributions to enhance marginal-to-label distribution mapping. However, the existing mDG literature lacks a general learning objective paradigm and often imposes constraints on static target marginal distributions. In this paper, we propose to leverage a $Y$-mapping to relax the constraint. We rethink the learning objective for mDG and design a new \textbf{general learning objective} to interpret and analyze most existing mDG wisdom. This general objective is bifurcated into two synergistic aims: learning domain-independent conditional features and maximizing a posterior. Explorations also extend to two effective regularization terms that incorporate prior information and suppress invalid causality, alleviating the issues that come with relaxed constraints. We theoretically contribute an upper bound for the domain alignment of domain-independent conditional features, disclosing that many previous mDG endeavors actually \textbf{only partially optimize the objective} and thus lead to limited performance. As such, our study distills a general learning objective into four practical components, providing a general, robust, and flexible mechanism to handle complex domain shifts. Extensive empirical results indicate that the proposed objective with $Y$-mapping leads to substantially better mDG performance in various downstream tasks, including regression, segmentation, and classification.


[124] 2402.18859

Taking Second-life Batteries from Exhausted to Empowered using Experiments, Data Analysis, and Health Estimation

The reuse of retired electric vehicle (EV) batteries in electric grid energy storage emerges as a promising strategy to address environmental concerns and boost economic value. This study concentrates on devising health monitoring algorithms for retired batteries (BMS$_2$) deployed in grid storage applications. Over 15 months of testing, we compile, analyze, and publicly share a dataset of second-life (SL) batteries, implementing a cycling protocol simulating grid energy storage load profiles within a 3 V-4 V voltage window. Four machine learning-based health estimation models, relying on BMS$_2$ features and initial capacity, are developed and compared, with the selected model achieving a Mean Absolute Percentage Error (MAPE) below 2.3% on test data. Additionally, an adaptive online health estimation algorithm is proposed by integrating a clustering-based method, limiting estimation errors during online deployment. These results constitute an initial proof of concept, showcasing the feasibility of repurposing retired batteries for second-life applications. Based on obtained data and representative power demand, these SL batteries exhibit the potential, under specific conditions, for over a decade of grid energy storage use.


[125] 2402.18860

Error estimation for finite element method on meshes that contain thin elements

In an error estimation of finite element solutions to the Poisson equation, we usually impose the shape regularity assumption on the meshes to be used. In this paper, we show that even if the shape regularity condition is violated, the standard error estimation can be obtained if "bad" elements (elements that violate the shape regularity or maximum angle condition) are covered virtually by "good" simplices. A numerical experiment confirms the theoretical result.


[126] 2402.18863

Probabilistic Lipschitzness and the Stable Rank for Comparing Explanation Models

Explainability models are now prevalent within machine learning to address the black-box nature of neural networks. The question now is which explainability model is most effective. Probabilistic Lipschitzness has demonstrated that the smoothness of a neural network is fundamentally linked to the quality of post hoc explanations. In this work, we prove theoretical lower bounds on the probabilistic Lipschitzness of Integrated Gradients, LIME and SmoothGrad. We propose a novel metric using probabilistic Lipschitzness, normalised astuteness, to compare the robustness of explainability models. Further, we prove a link between the local Lipschitz constant of a neural network and its stable rank. We then demonstrate that the stable rank of a neural network provides a heuristic for the robustness of explainability models.
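The stable rank referred to above is the squared Frobenius norm divided by the squared spectral norm; a short sketch computing it for a random matrix (a stand-in for a layer's weight matrix) is given below.

```python
# Stable rank: ||A||_F^2 / ||A||_2^2, a smooth lower bound on the rank.
import numpy as np

def stable_rank(A):
    singular_values = np.linalg.svd(A, compute_uv=False)   # sorted descending
    return float(np.sum(singular_values ** 2) / singular_values[0] ** 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 256))
print(stable_rank(A), np.linalg.matrix_rank(A))            # stable rank <= rank
```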


[127] 2402.18865

Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning

Existing research has shown that large language models (LLMs) exhibit remarkable performance in language understanding and generation. However, when LLMs are continuously fine-tuned on complex and diverse domain-specific downstream tasks, the inference performance on historical tasks decreases dramatically, which is known as the catastrophic forgetting problem. A trade-off needs to be kept between learning plasticity and memory stability. Plenty of existing works have explored strategies like memory replay, regularization, and parameter isolation, but little is known about the geometric connections of various adjacent minima in the continual LLM fine-tuning scenario. In this work, we investigate the geometric connections of different minima through the lens of mode connectivity, which means different minima can be connected by a low-loss valley. Through extensive experiments, we uncover the mode connectivity phenomenon in the LLM continual learning scenario and find that it can strike a balance between plasticity and stability. Building upon these findings, we propose a simple yet effective method called Interpolation-based LoRA (I-LoRA), which constructs a dual-memory experience replay framework based on LoRA parameter interpolations. Extensive experiments and analysis on eight domain-specific CL benchmarks demonstrate that I-LoRA consistently shows significant improvements over the previous state-of-the-art approaches, with up to $11\%$ performance gains, providing a strong baseline and insights for future research on the large language model continual learning problem. Our code is available at \url{https://github.com/which47/LLMCL}.
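A toy sketch of LoRA parameter interpolation between a plastic "fast" adapter and a stable "slow" adapter; the shapes, update, and interpolation factor are illustrative assumptions, not the authors' I-LoRA code.

```python
# EMA-style interpolation between two LoRA adapters to balance plasticity
# (fast memory) and stability (slow memory). Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                                    # hidden size, LoRA rank

fast = {"A": rng.normal(size=(r, d)), "B": np.zeros((d, r))}   # plastic memory
slow = {k: v.copy() for k, v in fast.items()}                  # stable memory

def interpolate(slow, fast, lam=0.9):
    # slow <- lam * slow + (1 - lam) * fast, applied to every LoRA matrix
    return {k: lam * slow[k] + (1 - lam) * fast[k] for k in slow}

# After each optimization step on the current task's data:
fast["A"] += 0.01 * rng.normal(size=(r, d))      # stand-in for a gradient update
slow = interpolate(slow, fast)
print(np.linalg.norm(slow["A"] - fast["A"]))     # slow memory lags the fast one
```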


[128] 2402.18866

Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming

Model-based reinforcement learning (MBRL) has been a primary approach to ameliorating the sample efficiency issue as well as to make a generalist agent. However, there has not been much effort toward enhancing the strategy of dreaming itself. Therefore, it is a question whether and how an agent can "dream better" in a more structured and strategic way. In this paper, inspired by the observation from cognitive science suggesting that humans use a spatial divide-and-conquer strategy in planning, we propose a new MBRL agent, called Dr. Strategy, which is equipped with a novel Dreaming Strategy. The proposed agent realizes a version of divide-and-conquer-like strategy in dreaming. This is achieved by learning a set of latent landmarks and then utilizing these to learn a landmark-conditioned highway policy. With the highway policy, the agent can first learn in the dream to move to a landmark, and from there it tackles the exploration and achievement task in a more focused way. In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods in various visually complex and partially observable navigation tasks. The source code will be available at https://github.com/ahn-ml/drstrategy


[129] 2402.18869

Evaluating the Gilbert-Varshamov Bound for Constrained Systems

We revisit the well-known Gilbert-Varshamov (GV) bound for constrained systems. In 1991, Kolesnik and Krachkovsky showed that the GV bound can be determined via the solution of some optimization problem. Later, Marcus and Roth (1992) modified the optimization problem and improved the GV bound in many instances. In this work, we provide explicit numerical procedures to solve these two optimization problems and hence compute the bounds. We then show that the procedures can be further simplified when we plot the respective curves. In the case where the graph presentation comprises a single state, we provide explicit formulas for both bounds.


[130] 2402.18873

Reducing Hallucinations in Entity Abstract Summarization with Facts-Template Decomposition

Entity abstract summarization aims to generate a coherent description of a given entity based on a set of relevant Internet documents. Pretrained language models (PLMs) have achieved significant success in this task, but they may suffer from hallucinations, i.e. generating non-factual information about the entity. To address this issue, we decompose the summary into two components: Facts that represent the factual information about the given entity, which PLMs are prone to fabricate; and Template that comprises generic content with designated slots for facts, which PLMs can generate competently. Based on the facts-template decomposition, we propose SlotSum, an explainable framework for entity abstract summarization. SlotSum first creates the template and then predicts the fact for each template slot based on the input documents. Benefiting from our facts-template decomposition, SlotSum can easily locate errors and further rectify hallucinated predictions with external knowledge. We construct a new dataset WikiFactSum to evaluate the performance of SlotSum. Experimental results demonstrate that SlotSum could generate summaries that are significantly more factual with credible external knowledge.


[131] 2402.18875

Loss-aware Curriculum Learning for Heterogeneous Graph Neural Networks

Heterogeneous Graph Neural Networks (HGNNs) are a class of deep learning models designed specifically for heterogeneous graphs, which are graphs that contain different types of nodes and edges. This paper investigates the application of curriculum learning techniques to improve the performance and robustness of HGNNs. To better assess the quality of the data, we design a loss-aware training schedule, named LTS, that measures the quality of every node in the data and incorporates the training dataset into the model in a progressive, step-by-step manner of increasing difficulty. LTS can be seamlessly integrated into various frameworks, effectively reducing bias and variance, mitigating the impact of noisy data, and enhancing overall accuracy. Our findings demonstrate the efficacy of curriculum learning in enhancing HGNNs' capabilities for analyzing complex graph-structured data. The code is public at https://github.com/LARS-research/CLGNN/.
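
The loss-aware, easy-to-hard scheduling idea can be sketched as follows; the staging rule and the use of per-node losses as the difficulty signal are assumptions for illustration, not the exact LTS schedule.

```python
import numpy as np

def loss_aware_stages(node_losses, num_stages=3):
    """Release training nodes in easy-to-hard stages ordered by their current loss."""
    order = np.argsort(node_losses)              # lowest loss = easiest first
    chunks = np.array_split(order, num_stages)
    released = []
    for stage, chunk in enumerate(chunks):
        released.extend(chunk.tolist())
        yield stage, list(released)              # train on everything released so far

losses = np.random.default_rng(0).random(10)     # stand-in for per-node losses
for stage, nodes in loss_aware_stages(losses):
    print(f"stage {stage}: {len(nodes)} nodes in the training pool")
```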


[132] 2402.18877

Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction

Bayesian approaches to reconstructing the evolutionary history of languages rely on the tree model, which assumes that these languages descended from a common ancestor and underwent modifications over time. However, this assumption can be violated to different extents due to contact and other factors. Understanding the degree to which this assumption is violated is crucial for validating the accuracy of phylolinguistic inference. In this paper, we propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis. By using both synthetic and real data, we demonstrate that our method effectively visualizes anomalies, particularly in the form of jogging.
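
The projection step itself is a principal component analysis of the language-by-trait matrix; a minimal sketch with synthetic binary data (the matrix size and trait encoding are assumptions) is given below.

```python
import numpy as np

# Hypothetical cognate matrix: rows = languages, columns = binary traits
# (1 = trait present); real data would come from a lexical database.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 200)).astype(float)

# PCA via SVD of the centered matrix; keep the first two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T       # 2-D embedding of the 20 languages

# A reconstructed tree can then be drawn over these coordinates (edges between
# parent and child nodes) to visually spot anomalies in the inferred history.
print(coords.shape)
```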


[133] 2402.18879

Dose Prediction Driven Radiotherapy Parameters Regression via Intra- and Inter-Relation Modeling

Deep learning has facilitated the automation of radiotherapy by predicting accurate dose distribution maps. However, existing methods fail to derive the desirable radiotherapy parameters that can be directly input into the treatment planning system (TPS), impeding the full automation of radiotherapy. To enable more thorough automatic radiotherapy, in this paper, we propose a novel two-stage framework to directly regress the radiotherapy parameters, including a dose map prediction stage and a radiotherapy parameters regression stage. In stage one, we combine transformer and convolutional neural network (CNN) to predict realistic dose maps with rich global and local information, providing accurate dosimetric knowledge for the subsequent parameters regression. In stage two, two elaborate modules, i.e., an intra-relation modeling (Intra-RM) module and an inter-relation modeling (Inter-RM) module, are designed to exploit the organ-specific and organ-shared features for precise parameters regression. Experimental results on a rectal cancer dataset demonstrate the effectiveness of our method.


[134] 2402.18883

Efficient Processing of Subsequent Densest Subgraph Query

Dense subgraph extraction is a fundamental problem in graph analysis and data mining, aimed at identifying cohesive and densely connected substructures within a given graph. It plays a crucial role in various domains, including social network analysis, biological network analysis, recommendation systems, and community detection. However, extracting a subgraph with the highest node similarity remains underexplored. To address this problem, we studied the Member Selection Problem and extended it with a dynamic constraint variant. By incorporating dynamic constraints, our algorithm can adapt to changing conditions or requirements, allowing for more flexible and personalized subgraph extraction. This approach enables the algorithm to provide tailored solutions that meet specific needs, even in scenarios where constraints may vary over time. We also provide a theoretical analysis showing that our algorithm achieves a 1/3-approximation. Eventually, the experiments show that our algorithm is effective and efficient in tackling the member selection problem with dynamic constraints.


[135] 2402.18884

Supervised Contrastive Representation Learning: Landscape Analysis with Unconstrained Features

Recent findings reveal that over-parameterized deep neural networks, trained beyond zero training error, exhibit a distinctive structural pattern at the final layer, termed Neural Collapse (NC). These results indicate that the final hidden-layer outputs in such networks display minimal within-class variations over the training set. While existing research extensively investigates this phenomenon under cross-entropy loss, there are fewer studies focusing on its contrastive counterpart, supervised contrastive (SC) loss. Through the lens of NC, this paper employs an analytical approach to study the solutions derived from optimizing the SC loss. We adopt the unconstrained features model (UFM) as a representative proxy for unveiling NC-related phenomena in sufficiently over-parameterized deep networks. We show that, despite the non-convexity of SC loss minimization, all local minima are global minima. Furthermore, the minimizer is unique (up to a rotation). We prove our results by formalizing a tight convex relaxation of the UFM. Finally, through this convex formulation, we delve deeper into characterizing the properties of global solutions under label-imbalanced training data.
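
For reference, a commonly used form of the supervised contrastive loss, with anchor embedding $\mathbf{z}_i$, positive set $P(i)$ (samples sharing the anchor's label), comparison set $A(i)$, and temperature $\tau$, is written below; the precise variant analyzed in the paper may differ.

```latex
\mathcal{L}_{\mathrm{SC}}
  = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp\!\left(\mathbf{z}_i^{\top}\mathbf{z}_p / \tau\right)}
              {\sum_{a \in A(i)} \exp\!\left(\mathbf{z}_i^{\top}\mathbf{z}_a / \tau\right)}
```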


[136] 2402.18886

BP-DeepONet: A new method for cuffless blood pressure estimation using the physics-informed DeepONet

Cardiovascular diseases (CVDs) are the leading cause of death worldwide, with blood pressure serving as a crucial indicator. Arterial blood pressure (ABP) waveforms provide continuous pressure measurements throughout the cardiac cycle and offer valuable diagnostic insights. Consequently, there is a significant demand for non-invasive and cuff-less methods to measure ABP waveforms continuously. Accurate prediction of ABP waveforms can also improve the estimation of mean blood pressure, an essential cardiovascular health characteristic. This study proposes a novel framework based on the physics-informed DeepONet approach to predict ABP waveforms. Unlike previous methods, our approach requires the predicted ABP waveforms to satisfy the Navier-Stokes equation with a time-periodic condition and a Windkessel boundary condition. Notably, our framework is the first to predict ABP waveforms continuously, both with location and time, within the part of the artery that is being simulated. Furthermore, our method only requires ground truth data at the outlet boundary and can handle periodic conditions with varying periods. Incorporating the Windkessel boundary condition in our solution allows for generating natural physical reflection waves, which closely resemble measurements observed in real-world cases. Moreover, accurately estimating the hyper-parameters in the Navier-Stokes equation for our simulations poses a significant challenge. To overcome this obstacle, we introduce the concept of meta-learning, enabling the neural networks to learn these parameters during the training process.


[137] 2402.18888

Uncertainty-Based Extensible Codebook for Discrete Federated Learning in Heterogeneous Data Silos

Federated learning (FL), aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. While previous studies have explored discrete representations to enhance model generalization across minor distributional shifts, these approaches often struggle to adapt to new data silos with significantly divergent distributions. In response, we have identified that models derived from FL exhibit markedly increased uncertainty when applied to data silos with unfamiliar distributions. Consequently, we propose an innovative yet straightforward iterative framework, termed Uncertainty-Based Extensible-Codebook Federated Learning (UEFL). This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty. Our approach aims to simultaneously enhance accuracy and reduce uncertainty by explicitly addressing the diversity of data distributions, all while maintaining minimal computational overhead in environments characterized by heterogeneous data silos. Through experiments conducted on five datasets, our method has demonstrated its superiority, achieving significant improvements in accuracy (by 3%--22.1%) and uncertainty reduction (by 38.83%--96.24%), thereby outperforming contemporary state-of-the-art methods. The source code is available at https://github.com/destiny301/uefl.


[138] 2402.18892

Aligning Knowledge Graph with Visual Perception for Object-goal Navigation

Object-goal navigation is a challenging task that requires guiding an agent to specific objects based on first-person visual observations. The ability of the agent to comprehend its surroundings plays a crucial role in achieving successful object finding. However, existing knowledge-graph-based navigators often rely on discrete categorical one-hot vectors and a vote-counting strategy to construct a graph representation of the scenes, which results in misalignment with visual images. To provide more accurate and coherent scene descriptions and address this misalignment issue, we propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation. Technically, our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception. The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability. We extensively evaluate our method using the AI2-THOR simulator and conduct a series of experiments to demonstrate the effectiveness and efficiency of our navigator. Code available: https://github.com/nuoxu/AKGVP.


[139] 2402.18896

On the maximum size of variable-length non-overlapping codes

Non-overlapping codes are a set of codewords such that the prefix of each codeword is not a suffix of any codeword in the set, including itself. If the lengths of the codewords are variable, it is additionally required that every codeword is not contained in any other codeword as a subword. Let $C(n,q)$ be the maximum size of $q$-ary fixed-length non-overlapping codes of length $n$. The upper bound on $C(n,q)$ has been well studied. However, the nontrivial upper bound on the maximum size of variable-length non-overlapping codes of length at most $n$ remains open. In this paper, by establishing a link between variable-length non-overlapping codes and fixed-length ones, we are able to show that the size of a $q$-ary variable-length non-overlapping code is upper bounded by $C(n,q)$. Furthermore, we prove that the average length of the codewords in a $q$-ary variable-length non-overlapping code is lower bounded by $\lceil \log_q \tilde{C} \rceil$, and is asymptotically no shorter than $n-2$ as $q$ approaches $\infty$, where $\tilde{C}$ denotes the cardinality of $q$-ary variable-length non-overlapping codes of length up to $n$.
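
The defining property translates directly into a brute-force membership check; the sketch below follows the definition given above, with tiny hypothetical binary codes as examples.

```python
def is_non_overlapping(code):
    """Check that no proper prefix of a codeword is a suffix of any codeword
    (including itself) and, for variable lengths, that no codeword occurs
    inside another codeword as a subword."""
    for u in code:
        for v in code:
            for k in range(1, len(u)):          # proper, non-empty prefixes of u
                if u[:k] == v[-k:]:
                    return False
            if u != v and u in v:               # subword containment
                return False
    return True

print(is_non_overlapping(["011"]))          # True: the word is bifix-free on its own
print(is_non_overlapping(["011", "001"]))   # False: prefix "01" is a suffix of "001"
```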


[140] 2402.18897

Contact-Implicit Model Predictive Control for Dexterous In-hand Manipulation: A Long-Horizon and Robust Approach

Dexterous in-hand manipulation is an essential skill in production and daily life. Nevertheless, the highly stiff and mutable features of contacts cause limitations to real-time contact discovery and inference, which degrades the performance of model-based methods. Inspired by recent advancements in contact-rich locomotion and manipulation, this paper proposes a novel model-based approach to control dexterous in-hand manipulation and overcome the current limitations. The proposed approach has an attractive feature: it allows the robot to robustly execute long-horizon in-hand manipulation without pre-defined contact sequences or separate planning procedures. Specifically, we design a contact-implicit model predictive controller at the high level to generate real-time contact plans, which are executed by the low-level tracking controller. Compared with other model-based methods, such a long-horizon feature enables replanning and robust execution of contact-rich motions to achieve large-displacement in-hand tasks more efficiently; compared with existing learning-based methods, the proposed approach achieves comparable dexterity and also generalizes to different objects without any pre-training. Detailed simulations and ablation studies demonstrate the efficiency and effectiveness of our method. It runs at 20 Hz on the 23-degree-of-freedom long-horizon in-hand object rotation task.


[141] 2402.18899

Aligning Language Models for Versatile Text-based Item Retrieval

This paper addresses the gap between general-purpose text embeddings and the specific demands of item retrieval tasks. We demonstrate the shortcomings of existing models in capturing the nuances necessary for zero-shot performance on item retrieval tasks. To overcome these limitations, we propose to generate an in-domain dataset from ten tasks tailored to unlocking models' representation ability for item retrieval. Our empirical studies demonstrate that fine-tuning embedding models on this dataset leads to remarkable improvements in a variety of retrieval tasks. We also illustrate the practical application of our refined model in a conversational setting, where it enhances the capabilities of LLM-based Recommender Agents like Chat-Rec. Our code is available at https://github.com/microsoft/RecAI.


[142] 2402.18905

On the Convergence of Differentially-Private Fine-tuning: To Linearly Probe or to Fully Fine-tune?

Differentially private (DP) machine learning pipelines typically involve a two-phase process: non-private pre-training on a public dataset, followed by fine-tuning on private data using DP optimization techniques. In the DP setting, it has been observed that full fine-tuning may not always yield the best test accuracy, even for in-distribution data. This paper (1) analyzes the training dynamics of DP linear probing (LP) and full fine-tuning (FT), and (2) explores the phenomenon of sequential fine-tuning, starting with linear probing and transitioning to full fine-tuning (LP-FT), and its impact on test loss. We provide theoretical insights into the convergence of DP fine-tuning within an overparameterized neural network and establish a utility curve that determines the allocation of privacy budget between linear probing and full fine-tuning. The theoretical results are supported by empirical evaluations on various benchmarks and models. The findings reveal the complex nature of DP fine-tuning methods. These results contribute to a deeper understanding of DP machine learning and highlight the importance of considering the allocation of privacy budget in the fine-tuning process.


[143] 2402.18908

Facility Location Games with Scaling Effects

We take the classic facility location problem and consider a variation, in which each agent's individual cost function is equal to their distance from the facility multiplied by a scaling factor which is determined by the facility placement. In addition to the general class of continuous scaling functions, we also provide results for piecewise linear scaling functions which can effectively approximate or model the scaling of many real world scenarios. We focus on the objectives of total and maximum cost, describing the computation of the optimal solution. We then move to the approximate mechanism design setting, observing that the agents' preferences may no longer be single-peaked. Consequently, we characterize the conditions on scaling functions which ensure that agents have single-peaked preferences. Under these conditions, we find results on the total and maximum cost approximation ratios achievable by strategyproof and anonymous mechanisms.
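
A minimal sketch of this cost model on a line, with a hypothetical piecewise-linear scaling function and a brute-force search over candidate facility locations (all values are illustrative), is shown below.

```python
import numpy as np

def agent_costs(agents, y, scale):
    """Each agent's cost: distance to the facility at y times the
    placement-dependent scaling factor scale(y)."""
    return np.abs(np.asarray(agents) - y) * scale(y)

# Hypothetical piecewise-linear scaling: cheapest at the centre of [0, 1].
scale = lambda y: 1.0 + 2.0 * abs(y - 0.5)

agents = [0.1, 0.4, 0.9]
candidates = np.linspace(0.0, 1.0, 101)
total = [agent_costs(agents, y, scale).sum() for y in candidates]
maxim = [agent_costs(agents, y, scale).max() for y in candidates]

print("location minimizing total cost:", candidates[int(np.argmin(total))])
print("location minimizing maximum cost:", candidates[int(np.argmin(maxim))])
```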


[144] 2402.18909

Updating Language Models with Unstructured Facts: Towards Practical Knowledge Editing

Knowledge editing aims to inject knowledge updates into language models to keep them correct and up-to-date. However, its current evaluation strategies are notably impractical: they solely update with well-curated structured facts (triplets with subjects, relations, and objects), whereas real-world knowledge updates commonly emerge in unstructured texts like news articles. In this paper, we propose a new benchmark, Unstructured Knowledge Editing (UKE). It evaluates editing performance directly using unstructured texts as knowledge updates, termed unstructured facts. Hence UKE avoids the laborious construction of structured facts and enables efficient and responsive knowledge editing, becoming a more practical benchmark. We conduct extensive experiments on newly built datasets and demonstrate that UKE poses a significant challenge to state-of-the-art knowledge editing methods, resulting in their critical performance declines. We further show that this challenge persists even if we extract triplets as structured facts. Our analysis discloses key insights to motivate future research in UKE for more practical knowledge editing.


[145] 2402.18910

DIGIC: Domain Generalizable Imitation Learning by Causal Discovery

Causality has been combined with machine learning to produce robust representations for domain generalization. Most existing methods of this type require massive data from multiple domains to identify causal features by cross-domain variations, which can be expensive or even infeasible and may lead to misidentification in some cases. In this work, we make a different attempt by leveraging the demonstration data distribution to discover the causal features for a domain generalizable policy. We design a novel framework, called DIGIC, to identify the causal features by finding the direct cause of the expert action from the demonstration data distribution via causal discovery. Our framework can achieve domain generalizable imitation learning with only single-domain data and serve as a complement for cross-domain variation-based methods under non-structural assumptions on the underlying causal models. Our empirical study in various control tasks shows that the proposed framework evidently improves the domain generalization performance and has comparable performance to the expert in the original domain simultaneously.


[146] 2402.18913

AdaMergeX: Cross-Lingual Transfer with Large Language Models via Adaptive Adapter Merging

As an effective alternative to direct fine-tuning on target tasks in specific languages, cross-lingual transfer addresses the challenges of limited training data by decoupling ''task ability'' and ''language ability'': fine-tuning on the target task in the source language and on another selected task in the target language, respectively. However, existing methods fail to fully separate the task ability from the source language or the language ability from the chosen task. In this paper, we acknowledge the mutual reliance between task ability and language ability and direct our attention toward the gap between the target language and the source language on tasks. As the gap removes the impact of tasks, we assume that it remains consistent across tasks. Based on this assumption, we propose a new cross-lingual transfer method called $\texttt{AdaMergeX}$ that utilizes adaptive adapter merging. By introducing a reference task, we can determine that the divergence of adapters fine-tuned on the reference task in both languages follows the same distribution as the divergence of adapters fine-tuned on the target task in both languages. Hence, we can obtain target adapters by combining the other three adapters. Furthermore, we propose a structure-adaptive adapter merging method. Our empirical results demonstrate that our approach yields new and effective cross-lingual transfer, outperforming existing methods across all settings.
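
Under the simplest, purely additive reading of this merging idea, the missing adapter is recovered from the other three by transporting the language gap measured on the reference task; the sketch below uses flat weight arrays and is an assumption for illustration, not the paper's structure-adaptive merging rule.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (16, 768)                               # hypothetical adapter weight shape

tgt_task_src_lang = rng.normal(size=shape)      # target task, source language
ref_task_src_lang = rng.normal(size=shape)      # reference task, source language
ref_task_tgt_lang = rng.normal(size=shape)      # reference task, target language

# Language gap estimated on the reference task, assumed consistent across tasks.
language_gap = ref_task_tgt_lang - ref_task_src_lang

# Approximate the missing adapter: target task in the target language.
tgt_task_tgt_lang = tgt_task_src_lang + language_gap
print(tgt_task_tgt_lang.shape)
```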


[147] 2402.18917

Stop Relying on No-Choice and Do not Repeat the Moves: Optimal, Efficient and Practical Algorithms for Assortment Optimization

We address the problem of active online assortment optimization with preference feedback, which is a framework for modeling user choices and subsetwise utility maximization. The framework is useful in various real-world applications including ad placement, online retail, recommender systems, and fine-tuning language models, amongst many others. Although the problem has been studied in the past, it lacks an intuitive and practical solution approach with a simultaneously efficient algorithm and an optimal regret guarantee. For example, popularly used assortment selection algorithms often require the presence of a `strong reference' which is always included in the choice sets; further, they are also designed to offer the same assortments repeatedly until the reference item gets selected -- all such requirements are quite unrealistic for practical applications. In this paper, we design efficient algorithms for the problem of regret minimization in assortment selection with \emph{Plackett Luce} (PL) based user choices. We design a novel concentration guarantee for estimating the score parameters of the PL model using `\emph{Pairwise Rank-Breaking}', which builds the foundation of our proposed algorithms. Moreover, our methods are practical, provably optimal, and devoid of the aforementioned limitations of the existing methods. Empirical evaluations corroborate our findings and outperform the existing baselines.
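
For reference, the Plackett-Luce choice probability of item $i$ from an offered assortment $S$ is $P(i \mid S) = e^{\theta_i} / \sum_{j \in S} e^{\theta_j}$; a small numerical sketch with hypothetical item scores follows.

```python
import numpy as np

def pl_choice_probs(theta, assortment):
    """Plackett-Luce choice probabilities over the offered assortment."""
    s = np.array([theta[i] for i in assortment])
    w = np.exp(s - s.max())                 # numerically stable softmax
    return dict(zip(assortment, w / w.sum()))

theta = {"a": 0.2, "b": 1.0, "c": -0.5}     # hypothetical PL score parameters
print(pl_choice_probs(theta, ["a", "b"]))
print(pl_choice_probs(theta, ["a", "b", "c"]))
```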


[148] 2402.18918

SNE-RoadSegV2: Advancing Heterogeneous Feature Fusion and Fallibility Awareness for Freespace Detection

Feature-fusion networks with duplex encoders have proven to be an effective technique to solve the freespace detection problem. However, despite the compelling results achieved by previous research efforts, the exploration of adequate and discriminative heterogeneous feature fusion, as well as the development of fallibility-aware loss functions, remains relatively scarce. This paper makes several significant contributions to address these limitations: (1) It presents a novel heterogeneous feature fusion block, comprising a holistic attention module, a heterogeneous feature contrast descriptor, and an affinity-weighted feature recalibrator, enabling a more in-depth exploitation of the inherent characteristics of the extracted features, (2) it incorporates both inter-scale and intra-scale skip connections into the decoder architecture while eliminating redundant ones, leading to both improved accuracy and computational efficiency, and (3) it introduces two fallibility-aware loss functions that separately focus on semantic-transition and depth-inconsistent regions, collectively contributing to greater supervision during model training. Our proposed heterogeneous feature fusion network (SNE-RoadSegV2), which incorporates all these innovative components, demonstrates superior performance in comparison to all other freespace detection algorithms across multiple public datasets. Notably, it ranks 1st on the official KITTI Road benchmark.


[149] 2402.18919

Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation

While standard Empirical Risk Minimization (ERM) training is proven effective for image classification on in-distribution data, it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift for image classification is the compositional nature of images. Specifically, in addition to the main object or component(s) determining the label, some other image components usually exist, which may lead to the shift of input distribution between train and test environments. More importantly, these components may have spurious correlations with the label. To address this issue, we propose Decompose-and-Compose (DaC), which improves robustness to correlation shift by a compositional approach based on combining elements of images. Based on our observations, models trained with ERM usually highly attend to either the causal components or the components having a high spurious correlation with the label (especially in datapoints on which models have a high confidence). In fact, according to the amount of spurious correlation and the easiness of classification based on the causal or non-causal components, the model usually attends to one of these more (on samples with high confidence). Following this, we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward, we intervene on images by combining them and retraining the model on the augmented data, including the counterfactual ones. Along with its high interpretability, this work proposes a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training. The method has an overall better worst group accuracy compared to previous methods with the same amount of supervision on the group labels in correlation shift.


[150] 2402.18920

Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation

Although 3D shape matching and interpolation are highly interrelated, they are often studied separately and applied sequentially to relate different 3D shapes, thus resulting in sub-optimal performance. In this work we present a unified framework to predict both point-wise correspondences and shape interpolation between 3D shapes. To this end, we combine the deep functional map framework with classical surface deformation models to map shapes in both spectral and spatial domains. On the one hand, by incorporating spatial maps, our method obtains more accurate and smooth point-wise correspondences compared to previous functional map methods for shape matching. On the other hand, by introducing spectral maps, our method gets rid of commonly used but computationally expensive geodesic distance constraints that are only valid for near-isometric shape deformations. Furthermore, we propose a novel test-time adaptation scheme to capture both pose-dominant and shape-dominant deformations. Using different challenging datasets, we demonstrate that our method outperforms previous state-of-the-art methods for both shape matching and interpolation, even compared to supervised approaches.


[151] 2402.18922

A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection

Camouflaged object detection (COD) and salient object detection (SOD) are two distinct yet closely-related computer vision tasks widely studied during the past decades. Though sharing the same purpose of segmenting an image into binary foreground and background regions, their distinction lies in the fact that COD focuses on concealed objects hidden in the image, while SOD concentrates on the most prominent objects in the image. Previous works achieved good performance by stacking various hand-designed modules and multi-scale features. However, these carefully-designed complex networks often performed well on one task but not on another. In this work, we propose a simple yet effective network (SENet) based on vision Transformer (ViT), by employing a simple design of an asymmetric ViT-based encoder-decoder structure, we yield competitive results on both tasks, exhibiting greater versatility than meticulously crafted ones. Furthermore, to enhance the Transformer's ability to model local information, which is important for pixel-level binary segmentation tasks, we propose a local information capture module (LICM). We also propose a dynamic weighted loss (DW loss) based on Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss, which guides the network to pay more attention to those smaller and more difficult-to-find target objects according to their size. Moreover, we explore the issue of joint training of SOD and COD, and propose a preliminary solution to the conflict in joint training, further improving the performance of SOD. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method. The code is available at https://github.com/linuxsino/SENet.


[152] 2402.18923

Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition

Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. According to the newly designed task, we label pause locations at the text level and their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independent of ASR performance. Our experiments show that the proposed method better detects inappropriate pauses in dysarthric speech than baselines. (Inappropriate Pause Error Rate: 14.47%)


[153] 2402.18925

PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds

Event cameras can record scene dynamics with high temporal resolution, providing rich scene details for monocular depth estimation (MDE) even at low-level illumination. Therefore, existing complementary learning approaches for MDE fuse intensity information from images and scene details from event data for better scene understanding. However, most methods directly fuse two modalities at pixel level, ignoring that the attractive complementarity mainly impacts high-level patterns that only occupy a few pixels. For example, event data is likely to complement contours of scene objects. In this paper, we discretize the scene into a set of high-level patterns to explore the complementarity and propose a Pattern-based Complementary learning architecture for monocular Depth estimation (PCDepth). Concretely, PCDepth comprises two primary components: a complementary visual representation learning module for discretizing the scene into high-level patterns and integrating complementary patterns across modalities and a refined depth estimator aimed at scene reconstruction and depth prediction while maintaining an efficiency-accuracy balance. Through pattern-based complementary learning, PCDepth fully exploits two modalities and achieves more accurate predictions than existing methods, especially in challenging nighttime scenarios. Extensive experiments on MVSEC and DSEC datasets verify the effectiveness and superiority of our PCDepth. Remarkably, compared with state-of-the-art, PCDepth achieves a 37.9% improvement in accuracy in MVSEC nighttime scenarios.


[154] 2402.18927

Edge Computing Enabled Real-Time Video Analysis via Adaptive Spatial-Temporal Semantic Filtering

This paper proposes a novel edge computing enabled real-time video analysis system for intelligent visual devices. The proposed system consists of a tracking-assisted object detection module (TAODM) and a region of interest module (ROIM). TAODM adaptively determines the offloading decision to process each video frame locally with a tracking algorithm or to offload it to the edge server inferred by an object detection model. ROIM determines each offloading frame's resolution and detection model configuration to ensure that the analysis results can return in time. TAODM and ROIM interact jointly to filter the repetitive spatial-temporal semantic information to maximize the processing rate while ensuring high video analysis accuracy. Unlike most existing works, this paper investigates the real-time video analysis systems where the intelligent visual device connects to the edge server through a wireless network with fluctuating network conditions. We decompose the real-time video analysis problem into the offloading decision and configurations selection sub-problems. To solve these two sub-problems, we introduce a double deep Q network (DDQN) based offloading approach and a contextual multi-armed bandit (CMAB) based adaptive configurations selection approach, respectively. A DDQN-CMAB reinforcement learning (DCRL) training framework is further developed to integrate these two approaches to improve the overall video analyzing performance. Extensive simulations are conducted to evaluate the performance of the proposed solution, and demonstrate its superiority over counterparts.


[155] 2402.18929

Navigating Beyond Dropout: An Intriguing Solution Towards Generalizable Image Super Resolution

Deep learning has led to a dramatic leap in Single Image Super-Resolution (SISR) performance in recent years. While most existing work assumes a simple and fixed degradation model (e.g., bicubic downsampling), the research of Blind SR seeks to improve model generalization ability with unknown degradation. Recently, Kong et al. pioneered the investigation of a more suitable training strategy for Blind SR using Dropout. Although this method indeed brings substantial generalization improvements via mitigating overfitting, we argue that Dropout simultaneously introduces an undesirable side effect that compromises the model's capacity to faithfully reconstruct fine details. We present both theoretical and experimental analyses in our paper, and furthermore, we present another easy yet effective training strategy that enhances the generalization ability of the model by simply modulating its first- and second-order feature statistics. Experimental results show that our method can serve as a model-agnostic regularization and outperforms Dropout on seven benchmark datasets including both synthetic and real-world scenarios.


[156] 2402.18933

Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration

Establishing dense anatomical correspondence across distinct imaging modalities is a foundational yet challenging procedure for numerous medical image analysis studies and image-guided radiotherapy. Existing multi-modality image registration algorithms rely on statistical-based similarity measures or local structural image representations. However, the former is sensitive to locally varying noise, while the latter is not discriminative enough to cope with complex anatomical structures in multimodal scans, causing ambiguity in determining the anatomical correspondence across scans with different modalities. In this paper, we propose a modality-agnostic structural representation learning method, which leverages Deep Neighbourhood Self-similarity (DNS) and anatomy-aware contrastive learning to learn discriminative and contrast-invariant deep structural image representations (DSIR) without the need for anatomical delineations or pre-aligned training images. We evaluate our method on multiphase CT, abdomen MR-CT, and brain MR T1w-T2w registration. Comprehensive results demonstrate that our method is superior to the conventional local structural representation and statistical-based similarity measures in terms of discriminability and accuracy.


[157] 2402.18934

RELEAD: Resilient Localization with Enhanced LiDAR Odometry in Adverse Environments

LiDAR-based localization is valuable for applications like mining surveys and underground facility maintenance. However, existing methods can struggle when dealing with uninformative geometric structures in challenging scenarios. This paper presents RELEAD, a LiDAR-centric solution designed to address scan-matching degradation. Our method enables degeneracy-free point cloud registration by solving constrained ESIKF updates in the front end and incorporates multisensor constraints, even when dealing with outlier measurements, through graph optimization based on Graduated Non-Convexity (GNC). Additionally, we propose a robust Incremental Fixed Lag Smoother (rIFL) for efficient GNC-based optimization. RELEAD has undergone extensive evaluation in degenerate scenarios and has outperformed existing state-of-the-art LiDAR-Inertial odometry and LiDAR-Visual-Inertial odometry methods.


[158] 2402.18936

Energy-Efficient UAV Swarm Assisted MEC with Dynamic Clustering and Scheduling

In this paper, the energy-efficient unmanned aerial vehicle (UAV) swarm assisted mobile edge computing (MEC) with dynamic clustering and scheduling is studied. In the considered system model, UAVs are divided into multiple swarms, with each swarm consisting of a leader UAV and several follower UAVs to provide computing services to end-users. Unlike existing work, we allow UAVs to dynamically cluster into different swarms, i.e., each follower UAV can change its leader based on the time-varying spatial positions, updated application placement, etc. in a dynamic manner. Meanwhile, UAVs are required to dynamically schedule their energy replenishment, application placement, trajectory planning and task delegation. With the aim of maximizing the long-term energy efficiency of the UAV swarm assisted MEC system, a joint optimization problem of dynamic clustering and scheduling is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we further reformulate this optimization problem as a combination of a series of strongly coupled multi-agent stochastic games, and then propose a novel reinforcement learning-based UAV swarm dynamic coordination (RLDC) algorithm for obtaining the equilibrium. Simulations are conducted to evaluate the performance of the RLDC algorithm and demonstrate its superiority over counterparts.


[159] 2402.18937

Equivalence of ADER and Lax-Wendroff in DG / FR framework for linear problems

ADER (Arbitrary high order by DERivatives) and Lax-Wendroff (LW) schemes are two high order single stage methods for solving time dependent partial differential equations. ADER is based on solving a locally implicit equation to obtain a space-time predictor solution while LW is based on an explicit Taylor's expansion in time. We cast the corrector step of ADER Discontinuous Galerkin (DG) scheme into an equivalent quadrature free Flux Reconstruction (FR) framework and then show that the obtained ADER-FR scheme is equivalent to the LWFR scheme with D2 dissipation numerical flux for linear problems. This also implies that the two schemes have the same Fourier stability limit for time step size. The equivalence is verified by numerical experiments.


[160] 2402.18944

SemEval 2024 -- Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)

We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Participating systems were tasked to automatically execute one or more of these subtasks. The datasets for these tasks comprise manually annotated conversations focusing on emotions and triggers for emotion shifts (The task data is available at https://github.com/LCS2-IIITD/EDiReF-SemEval2024.git). A total of 84 participants engaged in this task, with the most adept systems attaining F1-scores of 0.70, 0.79, and 0.76 for the respective subtasks. This paper summarises the results and findings from 24 teams alongside their system descriptions.


[161] 2402.18945

Syntactic Ghost: An Imperceptible General-purpose Backdoor Attack on Pre-trained Language Models

Pre-trained language models (PLMs) have been found susceptible to backdoor attacks, which can transfer vulnerabilities to various downstream tasks. However, existing PLM backdoors are implanted with explicit triggers under manually aligned settings, thus failing to satisfy the expected goals simultaneously in terms of effectiveness, stealthiness, and universality. In this paper, we propose a novel approach to achieve invisible and general backdoor implantation, called \textbf{Syntactic Ghost} (synGhost for short). Specifically, the method hostilely manipulates poisoned samples with different predefined syntactic structures as stealth triggers and then implants the backdoor into the pre-trained representation space without disturbing the primitive knowledge. The output representations of poisoned samples are distributed as uniformly as possible in the feature space via contrastive learning, forming a wide range of backdoors. Additionally, in light of the unique properties of syntactic triggers, we introduce an auxiliary module to drive the PLMs to learn this knowledge first, which can alleviate the interference between different syntactic structures. Experiments show that our method outperforms previous methods and achieves the predefined objectives. It poses severe threats not only to various natural language understanding (NLU) tasks under two tuning paradigms but also to multiple PLMs. Meanwhile, synGhost is imperceptible against three countermeasures based on perplexity, fine-pruning, and the proposed maxEntropy.


[162] 2402.18946

Real-Time Adaptive Safety-Critical Control with Gaussian Processes in High-Order Uncertain Models

This paper presents an adaptive online learning framework for systems with uncertain parameters to ensure safety-critical control in non-stationary environments. Our approach consists of two phases. The initial phase is centered on a novel sparse Gaussian process (GP) framework. We first integrate a forgetting factor to refine a variational sparse GP algorithm, thus enhancing its adaptability. Subsequently, the hyperparameters of the Gaussian model are trained with a specially compound kernel, and the Gaussian model's online inferential capability and computational efficiency are strengthened by updating a solitary inducing point derived from new samples, in conjunction with the learned hyperparameters. In the second phase, we propose a safety filter based on high-order control barrier functions (HOCBFs), synergized with the previously trained learning model. By leveraging the compound kernel from the first phase, we effectively address the inherent limitations of GPs in handling high-dimensional problems for real-time applications. The derived controller ensures a rigorous lower bound on the probability of satisfying the safety specification. Finally, the efficacy of our proposed algorithm is demonstrated through real-time obstacle avoidance experiments executed using both a simulation platform and a real-world 7-DOF robot.


[163] 2402.18949

Improving Group Connectivity for Generalization of Federated Deep Learning

Federated learning (FL) involves multiple heterogeneous clients collaboratively training a global model via iterative local updates and model fusion. The generalization of FL's global model has a large gap compared with centralized training, which is its bottleneck for broader applications. In this paper, we study and improve FL's generalization through a fundamental ``connectivity'' perspective, which means how the local models are connected in the parameter region and fused into a generalized global model. The term ``connectivity'' is derived from linear mode connectivity (LMC), studying the interpolated loss landscape of two different solutions (e.g., modes) of neural networks. Bridging the gap between LMC and FL, in this paper, we leverage fixed anchor models to empirically and theoretically study the transitivity property of connectivity from two models (LMC) to a group of models (model fusion in FL). Based on the findings, we propose FedGuCci and FedGuCci+, improving group connectivity for better generalization. It is shown that our methods can boost the generalization of FL under client heterogeneity across various tasks (4 CV datasets and 6 NLP datasets), models (both convolutional and transformer-based), and training paradigms (both from-scratch and pretrain-finetune).
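
The underlying linear mode connectivity probe, interpolating between two solutions and measuring the loss barrier along the path, can be sketched as follows; the toy quadratic "loss" stands in for a real validation loss and is an assumption for illustration.

```python
import numpy as np

def interpolate(params_a, params_b, lam):
    """Linear interpolation between two models' parameters (per tensor)."""
    return {k: (1.0 - lam) * params_a[k] + lam * params_b[k] for k in params_a}

def loss(params):
    """Stand-in for a real validation loss evaluated at a parameter point."""
    return float(sum((v ** 2).sum() for v in params.values()))

rng = np.random.default_rng(0)
model_a = {"w": rng.normal(size=(4, 4))}
model_b = {"w": rng.normal(size=(4, 4))}

path = [loss(interpolate(model_a, model_b, lam)) for lam in np.linspace(0, 1, 11)]
barrier = max(path) - max(path[0], path[-1])   # low barrier = well connected
print("loss barrier along the interpolation path:", barrier)
```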


[164] 2402.18950

PopALM: Popularity-Aligned Language Models for Social Media Trendy Response Prediction

Social media platforms are daily exhibiting millions of events. To preliminarily predict the mainstream public reaction to these events, we study trendy response prediction to automatically generate top-liked user replies to social media events. While previous works focus on generating responses without factoring in popularity, we propose Popularity-Aligned Language Models (PopALM) to distinguish responses liked by a larger audience through reinforcement learning. Recognizing the noisy labels from user "likes", we tailor-make curriculum learning in proximal policy optimization (PPO) to help models capture the essential samples for easy-to-hard training. In experiments, we build a large-scale Weibo dataset for trendy response prediction, and its results show that PopALM can help boost the performance of advanced language models.


[165] 2402.18951

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

Open-world video recognition is challenging since traditional networks do not generalize well to complex environment variations. Alternatively, foundation models with rich knowledge have recently shown their generalization power. However, how to apply such knowledge has not been fully explored for open-world video recognition. To this end, we propose a generic knowledge transfer pipeline, which progressively exploits and integrates external multimodal knowledge from foundation models to boost open-world video recognition. We name it PCA, based on three stages of Percept, Chat, and Adapt. First, we perform the Percept process to reduce the video domain gap and obtain external visual knowledge. Second, we generate rich linguistic semantics as external textual knowledge in the Chat stage. Finally, we blend external multimodal knowledge in the Adapt stage, by inserting multimodal knowledge adaptation modules into networks. We conduct extensive experiments on three challenging open-world video benchmarks, i.e., TinyVIRAT, ARID, and QV-Pipe. Our approach achieves state-of-the-art performance on all three datasets.


[166] 2402.18954

Getting Saturated with Induction

Induction in saturation-based first-order theorem proving is a new exciting direction in the automation of inductive reasoning. In this paper we survey our work on integrating induction directly into the saturation-based proof search framework of first-order theorem proving. We describe our induction inference rules proving properties with inductively defined datatypes and integers. We also present additional reasoning heuristics for strengthening inductive reasoning, as well as for using induction hypotheses and recursive function definitions for guiding induction. We present exhaustive experimental results demonstrating the practical impact of our approach as implemented within Vampire. This is an extended version of a Principles of Systems Design 2022 paper with the same title and the same authors.


[167] 2402.18956

WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts

Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes, the "black box" problem still remains. Addressing this, we propose a novel framework, WWW, that offers the 'what', 'where', and 'why' of the neural network decisions in human-understandable terms. Specifically, WWW utilizes adaptive selection for concept discovery, employing adaptive cosine similarity and thresholding techniques to effectively explain 'what'. To address the 'where' and 'why', we propose a novel combination of neuron activation maps (NAMs) with Shapley values, generating localized concept maps and heatmaps for individual inputs. Furthermore, WWW introduces a method for predicting uncertainty, leveraging heatmap similarities to estimate 'how' reliable the prediction is. Experimental evaluations of WWW demonstrate superior performance in both quantitative and qualitative metrics, outperforming existing methods in interpretability. WWW provides a unified solution for explaining 'what', 'where', and 'why', introducing a method for localized explanations from global interpretations and offering a plug-and-play solution adaptable to various architectures.


[168] 2402.18958

Boosting Semi-Supervised Object Detection in Remote Sensing Images With Active Teaching

The lack of object-level annotations poses a significant challenge for object detection in remote sensing images (RSIs). To address this issue, active learning (AL) and semi-supervised learning (SSL) techniques have been proposed to enhance the quality and quantity of annotations. AL focuses on selecting the most informative samples for annotation, while SSL leverages the knowledge from unlabeled samples. In this letter, we propose a novel AL method to boost semi-supervised object detection (SSOD) for remote sensing images with a teacher-student network, called SSOD-AT. The proposed method incorporates an RoI comparison module (RoICM) to generate high-confidence pseudo-labels for regions of interest (RoIs). Meanwhile, the RoICM is utilized to identify the top-K uncertain images. To reduce redundancy in the top-K uncertain images for human labeling, a diversity criterion is introduced based on object-level prototypes of different categories using both labeled and pseudo-labeled images. Extensive experiments on DOTA and DIOR, two popular datasets, demonstrate that our proposed method outperforms state-of-the-art methods for object detection in RSIs. Compared with the best-performing SOTA methods, the proposed method achieves a 1 percent improvement in most cases over the whole AL process.


[169] 2402.18959

MambaStock: Selective state space model for stock prediction

The stock market plays a pivotal role in economic development, yet its intricate volatility poses challenges for investors. Consequently, research and accurate predictions of stock price movements are crucial for mitigating risks. Traditional time series models fall short in capturing nonlinearity, leading to unsatisfactory stock predictions. This limitation has spurred the widespread adoption of neural networks for stock prediction, owing to their robust nonlinear generalization capabilities. Recently, Mamba, a structured state space sequence model with a selection mechanism and scan module (S6), has emerged as a powerful tool in sequence modeling tasks. Leveraging this framework, this paper proposes a novel Mamba-based model for stock price prediction, named MambaStock. The proposed MambaStock model effectively mines historical stock market data to predict future stock prices without handcrafted features or extensive preprocessing procedures. Empirical studies on several stocks indicate that the MambaStock model outperforms previous methods, delivering highly accurate predictions. This enhanced accuracy can assist investors and institutions in making informed decisions, aiming to maximize returns while minimizing risks. This work underscores the value of Mamba in time-series forecasting. Source code is available at https://github.com/zshicode/MambaStock.


[170] 2402.18960

Towards Out-of-Distribution Detection for breast cancer classification in Point-of-Care Ultrasound Imaging

Deep learning has been shown to have great potential in medical applications. In such critical domains, it is of high interest to have trustworthy algorithms which are able to tell when reliable assessments cannot be guaranteed. Detecting out-of-distribution (OOD) samples is a crucial step towards building a safe classifier. Following a previous study showing that it is possible to classify breast cancer in point-of-care ultrasound images, this study investigates OOD detection using three different methods: softmax, energy score and deep ensembles. All methods are tested on three different OOD data sets. The results show that the energy score method outperforms the softmax method, performing well on two of the data sets. The ensemble method is the most robust, performing the best at detecting OOD samples for all three OOD data sets.
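
For reference, the energy score is typically computed from the classifier logits as $E(x) = -T \log \sum_k \exp(f_k(x)/T)$; a small, numerically stable sketch with hypothetical logits follows.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Energy-based OOD score: -T * logsumexp(logits / T).
    Higher energy typically indicates a more likely out-of-distribution input."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.exp(z - m).sum()))

print(energy_score([4.0, 1.0, 0.5]))    # confident prediction -> low energy
print(energy_score([0.1, 0.0, 0.2]))    # flat logits -> higher energy
```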


[171] 2402.18962

Program Synthesis in Saturation

We present an automated reasoning framework for synthesizing recursion-free programs using saturation-based theorem proving. Given a functional specification encoded as a first-order logical formula, we use a first-order theorem prover to both establish validity of this formula and discover program fragments satisfying the specification. As a result, when deriving a proof of program correctness, we also synthesize a program that is correct with respect to the given specification. We describe properties of the calculus that a saturation-based prover capable of synthesis should employ, and extend the superposition calculus in a corresponding way. We implemented our work in the first-order prover Vampire, extending the successful applicability of first-order proving to program synthesis. This is an extended version of an Automated Deduction -- CADE 29 paper with the same title and the same authors.


[172] 2402.18969

OHTA: One-shot Hand Avatar via Data-driven Implicit Priors

In this paper, we delve into the creation of one-shot hand avatars, attaining high-fidelity and drivable hand representations swiftly from a single image. With the burgeoning domains of the digital human, the need for quick and personalized hand avatar creation has become increasingly critical. Existing techniques typically require extensive input data and may prove cumbersome or even impractical in certain scenarios. To enhance accessibility, we present a novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed hand avatars from merely one image. OHTA tackles the inherent difficulties of this data-limited problem by learning and utilizing data-driven hand priors. Specifically, we design a hand prior model initially employed for 1) learning various hand priors with available data and subsequently for 2) the inversion and fitting of the target identity with prior knowledge. OHTA demonstrates the capability to create high-fidelity hand avatars with consistent animatable quality, solely relying on a single image. Furthermore, we illustrate the versatility of OHTA through diverse applications, encompassing text-to-avatar conversion, hand editing, and identity latent space manipulation.


[173] 2402.18970

PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators' updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.


[174] 2402.18973

Privacy Management and Interface Design for a Smart House

In today's life, more and more people tend to opt for a smart house, and the idea of embedding technology in the home has become popular worldwide. Despite this concept's many benefits, managing security remains an essential problem due to the shared activities. The Internet of Things system behind a smart house is based on several sensors to measure temperature, humidity, air quality, and movement. Because they are supervised every day through sensors and control their house with only a simple click, many people can be afraid of this new approach in terms of their privacy, and this fact can constrain them from following their habits. The security aspects should be constantly analyzed to keep the data confidential and make people feel safe in their own houses. In this context, the current paper sheds light on an alternative design of a platform in which the safety of homeowners is the primary purpose and they maintain complete control over the data generated by smart devices. The current research highlights the role of security and interface design in controlling a smart house. The study underscores the importance of providing an interface that can be used easily by any person to manage data and live activities in a modern residence in an era dominated by continuously developing technology.


[175] 2402.18974

Graph Generation via Spectral Diffusion

In this paper, we present GRASP, a novel graph generative model based on 1) the spectral decomposition of the graph Laplacian matrix and 2) a diffusion process. Specifically, we propose to use a denoising model to sample eigenvectors and eigenvalues from which we can reconstruct the graph Laplacian and adjacency matrix. Our permutation invariant model can also handle node features by concatenating them to the eigenvectors of each node. Using the Laplacian spectrum allows us to naturally capture the structural characteristics of the graph and work directly in the node space while avoiding the quadratic complexity bottleneck that limits the applicability of other methods. This is achieved by truncating the spectrum, which as we show in our experiments results in a faster yet accurate generative process. An extensive set of experiments on both synthetic and real world graphs demonstrates the strengths of our model against state-of-the-art alternatives.
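
As a concrete illustration of the reconstruction step mentioned in the abstract (building the Laplacian and adjacency matrix back from eigenpairs), the sketch below rebuilds a small graph from a truncated set of Laplacian eigenvalues and eigenvectors. The truncation and thresholding choices are illustrative assumptions, not GRASP's actual sampling or decoding procedure.

```python
# Sketch: rebuilding an adjacency matrix from (truncated) Laplacian eigenpairs,
# i.e., the kind of reconstruction a spectral generative model relies on.
# Keeping the largest eigenvalues and thresholding at 0.5 are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                    # symmetric, no self-loops
L = np.diag(A.sum(1)) - A                         # combinatorial Laplacian

evals, evecs = np.linalg.eigh(L)
k = 6                                             # truncated spectrum: keep k eigenpairs
idx = np.argsort(evals)[-k:]                      # best rank-k approximation of L
L_hat = evecs[:, idx] @ np.diag(evals[idx]) @ evecs[:, idx].T

A_hat = np.diag(np.diag(L_hat)) - L_hat           # invert L = D - A
A_bin = (A_hat > 0.5).astype(float)
np.fill_diagonal(A_bin, 0)

print("max reconstruction error:", np.max(np.abs(A_bin - A)))
```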


[176] 2402.18975

Theoretically Achieving Continuous Representation of Oriented Bounding Boxes

Considerable efforts have been devoted to Oriented Object Detection (OOD). However, one lasting issue regarding the discontinuity in Oriented Bounding Box (OBB) representation remains unresolved, which is an inherent bottleneck for extant OOD methods. This paper endeavors to completely solve this issue in a theoretically guaranteed manner and puts an end to the ad-hoc efforts in this direction. Prior studies typically can only address one of the two cases of discontinuity, rotation or aspect ratio, and often inadvertently introduce decoding discontinuity, e.g., Decoding Incompleteness (DI) and Decoding Ambiguity (DA) as discussed in the literature. Specifically, we propose a novel representation method called Continuous OBB (COBB), which can be readily integrated into existing detectors, e.g., Faster-RCNN, as a plugin. It can theoretically ensure continuity in bounding box regression, which, to the best of our knowledge, has not been achieved in the literature for rectangle-based object representation. For fairness and transparency of experiments, we have developed a modularized benchmark based on the open-source deep learning framework Jittor's detection toolbox JDet for OOD evaluation. On the popular DOTA dataset, by integrating Faster-RCNN as the same baseline model, our new method outperforms the peer method Gliding Vertex by 1.13% mAP50 (relative improvement 1.54%) and 2.46% mAP75 (relative improvement 5.91%), without any tricks.


[177] 2402.18980

Helper Data Schemes for Coded Modulation and Shaping in Physical Unclonable Functions

In this paper, we consider the generation and utilization of helper data for physical unclonable functions (PUFs) that provide real-valued readout symbols. Compared to classical binary PUFs, more entropy can be extracted from each basic building block (PUF node), resulting in longer keys/fingerprints and/or a higher reliability. To this end, a coded modulation and signal shaping scheme that matches the (approximately) Gaussian distribution of the readout has to be employed. A new helper data scheme is proposed that works with any type of coded modulation/shaping scheme. Compared to the permutation scheme from the literature, less helper data has to be generated and a higher reliability is achieved. Moreover, the recently proposed idea of a two-metric helper data scheme is generalized to coded modulation and a general S-metric scheme. It is shown how extra helper data can be generated to improve decodability. The proposed schemes are assessed by numerical simulations and by evaluation of measurement data. We compare multi-level codes using a new rate design strategy with bit-interleaved coded modulation and trellis shaping with a distribution matcher. By selecting a suitable design, the rate per PUF node that can be reliably extracted can be as high as 2 bit/node.


[178] 2402.18982

Splitting integrators for linear Vlasov equations with stochastic perturbations

We consider a class of linear Vlasov partial differential equations driven by Wiener noise. Different types of stochastic perturbations are treated: additive noise, multiplicative It\^o and Stratonovich noise, and transport noise. We propose to employ splitting integrators for the temporal discretization of these stochastic partial differential equations. These integrators are designed in order to preserve qualitative properties of the exact solutions depending on the stochastic perturbation, such as preservation of norms or positivity of the solutions. We provide numerical experiments in order to illustrate the properties of the proposed integrators and investigate mean-square rates of convergence.


[179] 2402.18986

Always be Pre-Training: Representation Learning for Network Intrusion Detection with GNNs

Graph neural network-based network intrusion detection systems have recently demonstrated state-of-the-art performance on benchmark datasets. Nevertheless, these methods suffer from a reliance on target encoding for data pre-processing, limiting widespread adoption due to the associated need for annotated labels--a cost-prohibitive requirement. In this work, we propose a solution involving in-context pre-training and the utilization of dense representations for categorical features to jointly overcome the label-dependency limitation. Our approach exhibits remarkable data efficiency, achieving over 98% of the performance of the supervised state-of-the-art with less than 4% labeled data on the NF-UQ-NIDS-V2 dataset.


[180] 2402.18994

Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks

As the role of artificial intelligence becomes increasingly pivotal in modern society, the efficient training and deployment of deep neural networks have emerged as critical areas of focus. Recent advancements in attention-based large neural architectures have spurred the development of AI accelerators, facilitating the training of extensive, multi-billion parameter models. Despite their effectiveness, these powerful networks often incur high execution costs in production environments. Neuromorphic computing, inspired by biological neural processes, offers a promising alternative. By utilizing temporally-sparse computations, Spiking Neural Networks (SNNs) promise to enhance energy efficiency through a reduced and low-power hardware footprint. However, the training of SNNs can be challenging due to their recurrent nature, which cannot as easily leverage the massive parallelism of modern AI accelerators. To facilitate the investigation of SNN architectures and dynamics, researchers have sought to bridge Python-based deep learning frameworks such as PyTorch or TensorFlow with custom-implemented compute kernels. This paper introduces Spyx, a new and lightweight SNN simulation and optimization library designed in JAX. By pre-staging data in the expansive vRAM of contemporary accelerators and employing extensive JIT compilation, Spyx allows for SNN optimization to be executed as a unified, low-level program on NVIDIA GPUs or Google TPUs. This approach achieves optimal hardware utilization, surpassing the performance of many existing SNN training frameworks while maintaining considerable flexibility.
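
The Spyx API itself is not shown in the abstract. The following minimal sketch only illustrates the general style such a library builds on: expressing a spiking-neuron recurrence with jax.lax.scan and compiling the whole simulation with jax.jit. The LIF dynamics, constants, and sizes are illustrative assumptions, not Spyx code.

```python
# Minimal illustration (not the Spyx API) of simulating a leaky integrate-and-fire
# layer with jax.lax.scan under jax.jit, so the unrolled recurrence compiles to a
# single low-level program on the accelerator.
import jax
import jax.numpy as jnp

def lif_step(v, x, beta=0.9, v_th=1.0):
    """One LIF timestep: leak, integrate input, spike, soft reset."""
    v = beta * v + x
    spikes = (v >= v_th).astype(jnp.float32)
    v = v - spikes * v_th
    return v, spikes

@jax.jit
def run_lif(inputs):                       # inputs: (T, n_neurons)
    v0 = jnp.zeros(inputs.shape[1])
    _, spike_train = jax.lax.scan(lambda v, x: lif_step(v, x), v0, inputs)
    return spike_train                     # (T, n_neurons)

key = jax.random.PRNGKey(0)
currents = 0.5 * jax.random.uniform(key, (100, 16))   # toy input currents
spikes = run_lif(currents)
print("mean firing rate:", float(spikes.mean()))
```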


[181] 2402.18995

Negative-Binomial Randomized Gamma Markov Processes for Heterogeneous Overdispersed Count Time Series

Modeling count-valued time series has been receiving increasing attention since count time series naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are newly-developed methods, which can well capture the expressive latent transition structure and bursty dynamics behind count sequences. In particular, PGDSs demonstrate superior performance in terms of data imputation and prediction, compared with canonical linear dynamical system (LDS) based methods. Despite these advantages, PGDS cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates the fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure, compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method, and show its superior performance in imputing missing data and forecasting future observations, compared with the related models.


[182] 2402.18998

COFT-AD: COntrastive Fine-Tuning for Few-Shot Anomaly Detection

Existing approaches towards anomaly detection (AD) often rely on a substantial amount of anomaly-free data to train representation and density models. However, large anomaly-free datasets may not always be available before the inference stage, in which case an anomaly detection model must be trained with only a handful of normal samples, a.k.a. few-shot anomaly detection (FSAD). In this paper, we propose a novel methodology to address the challenge of FSAD which incorporates two important techniques. Firstly, we employ a model pre-trained on a large source dataset to initialize model weights. Secondly, to ameliorate the covariate shift between source and target domains, we adopt contrastive training to fine-tune on the few-shot target domain data. To learn suitable representations for the downstream AD task, we additionally incorporate cross-instance positive pairs to encourage a tight cluster of the normal samples, and negative pairs for better separation between normal and synthesized negative samples. We evaluate few-shot anomaly detection on 3 controlled AD tasks and 4 real-world AD tasks to demonstrate the effectiveness of the proposed method.


[183] 2402.19001

Analysis of the Two-Step Heterogeneous Transfer Learning for Laryngeal Blood Vessel Classification: Issue and Improvement

Transferring features learned from natural to medical images for classification is common. However, challenges arise due to the scarcity of certain medical image types and the feature disparities between natural and medical images. Two-step transfer learning has been recognized as a promising solution for this issue. However, choosing an appropriate intermediate domain is critical to further improving the classification performance. In this work, we explore the effectiveness of using color fundus photographs of the diabetic retina dataset as an intermediate domain for two-step heterogeneous transfer learning (THTL) to classify laryngeal vascular images with nine deep-learning models. Experimental results confirm that although the images in both the intermediate and target domains share vascularized characteristics, the accuracy is drastically reduced compared to one-step transfer learning, where only the last layer is fine-tuned (e.g., ResNet18 drops 14.7%, ResNet50 drops 14.8%). By analyzing the Layer Class Activation Maps (LayerCAM), we uncover a novel finding that the prevalent radial vascular pattern in the intermediate domain prevents learning the features of twisted and tangled vessels that distinguish the malignant class in the target domain. To address the performance drop, we propose the Step-Wise Fine-Tuning (SWFT) method on ResNet in the second step of THTL, resulting in substantial accuracy improvements. Compared to THTL's second step, where only the last layer is fine-tuned, accuracy increases by 26.1% for ResNet18 and 20.4% for ResNet50. Additionally, compared to training from scratch, using ImageNet as the source domain could slightly improve classification performance for laryngeal vascular images, but the differences are insignificant.
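
The abstract does not give the exact SWFT schedule. The sketch below shows a generic step-wise fine-tuning loop on a torchvision ResNet-18, unfreezing one block at a time starting from the classifier, which is the general idea the name suggests; the stage order, learning rate, binary head, and the omitted training loop are assumptions.

```python
# Generic sketch of step-wise fine-tuning on a ResNet (the exact SWFT schedule is
# not given in the abstract): fine-tune only the classifier first, then
# progressively unfreeze deeper blocks, re-creating the optimizer at each stage.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)            # intermediate-domain weights in practice
model.fc = nn.Linear(model.fc.in_features, 2)    # e.g., benign vs. malignant vessels

stages = [["fc"], ["layer4"], ["layer3"], ["layer2"], ["layer1"]]

def set_trainable(model, unfrozen_prefixes):
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pref) for pref in unfrozen_prefixes)

unfrozen = []
for stage in stages:                             # one fine-tuning "step" per stage
    unfrozen += stage
    set_trainable(model, unfrozen)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
    # ... run a few epochs of training here before unfreezing the next block ...
    print(f"training with unfrozen: {unfrozen}, trainable tensors: {len(params)}")
```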


[184] 2402.19002

GoalNet: Goal Areas Oriented Pedestrian Trajectory Prediction

Predicting the future trajectories of pedestrians on the road is an important task for autonomous driving. The pedestrian trajectory prediction is affected by scene paths, pedestrian's intentions and decision-making, which is a multi-modal problem. Most recent studies use past trajectories to predict a variety of potential future trajectory distributions, which do not account for the scene context and pedestrian targets. Instead of predicting the future trajectory directly, we propose to use scene context and observed trajectory to predict the goal points first, and then reuse the goal points to predict the future trajectories. By leveraging the information from scene context and observed trajectory, the uncertainty can be limited to a few target areas, which represent the "goals" of the pedestrians. In this paper, we propose GoalNet, a new trajectory prediction neural network based on the goal areas of a pedestrian. Our network can predict both pedestrian's trajectories and bounding boxes. The overall model is efficient and modular, and its outputs can be changed according to the usage scenario. Experimental results show that GoalNet significantly improves the previous state-of-the-art performance by 48.7% on the JAAD and 40.8% on the PIE dataset.


[185] 2402.19004

RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

The development of high-resolution remote sensing satellites has provided great convenience for research work related to remote sensing. Segmentation and extraction of specific targets are essential tasks when facing the vast and complex remote sensing images. Recently, the introduction of the Segment Anything Model (SAM) has provided a universal pre-training model for image segmentation tasks. Since the direct application of SAM to remote sensing image segmentation tasks does not yield satisfactory results, we propose RSAM-Seg, which stands for Remote Sensing SAM with Semantic Segmentation, a tailored modification of SAM for the remote sensing field that eliminates the need for manual intervention to provide prompts. Adapter-Scale, a set of supplementary scaling modules, is proposed in the multi-head attention blocks of the encoder part of SAM. Furthermore, Adapter-Feature modules are inserted between the Vision Transformer (ViT) blocks. These modules aim to incorporate high-frequency image information and image embedding features to generate image-informed prompts. Experiments are conducted on four distinct remote sensing scenarios, encompassing cloud detection, field monitoring, building detection, and road mapping tasks. The experimental results not only showcase the improvement over the original SAM and U-Net across cloud, building, field, and road scenarios, but also highlight the capacity of RSAM-Seg to discern absent areas within the ground truth of certain datasets, affirming its potential as an auxiliary annotation method. In addition, the performance in few-shot scenarios is commendable, underscoring its potential in dealing with limited datasets.
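
The precise Adapter-Scale and Adapter-Feature designs are not detailed in the abstract. As background, the following is a generic bottleneck adapter of the kind commonly inserted into frozen transformer blocks; the dimensions, initialization, and placement are illustrative assumptions rather than the paper's modules.

```python
# Generic bottleneck adapter of the kind typically added inside frozen ViT blocks;
# not the actual Adapter-Scale / Adapter-Feature design, which the abstract does
# not specify. Zero-initializing the up-projection makes the adapter start as identity.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # residual branch starts at zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, tokens, dim)
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 196, 768)        # toy ViT token embeddings
adapter = Adapter(dim=768)
print(adapter(tokens).shape)             # torch.Size([2, 196, 768])
```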


[186] 2402.19007

DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments

Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, thus exhibiting noticeable discrepancy from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Besides, different from existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating capabilities for detecting collisions between the agent and moving obstacles. This novel functionality enables evaluation of the agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset could be found at https://DOZE-Dataset.github.io/.


[187] 2402.19009

Generating, Reconstructing, and Representing Discrete and Continuous Data: Generalized Diffusion with Learnable Encoding-Decoding

The vast applications of deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations -- across various data types, such as discrete text/protein sequences and continuous images. Existing model families, like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), autoregressive models, and diffusion models, generally excel in specific capabilities and data types but fall short in others. We introduce generalized diffusion with learnable encoder-decoder (DiLED), that seamlessly integrates the core capabilities for broad applicability and enhanced performance. DiLED generalizes the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Crucially, DiLED is compatible with the well-established diffusion model objective and training recipes, allowing effective learning of the encoder-decoder parameters jointly with diffusion. By choosing appropriate encoder/decoder (e.g., large language models), DiLED naturally applies to different data types. Extensive experiments on text, proteins, and images demonstrate DiLED's flexibility to handle diverse data and tasks and its strong improvement over various existing models.


[188] 2402.19011

Ruledger: Ensuring Execution Integrity in Trigger-Action IoT Platforms

Smart home IoT systems utilize trigger-action platforms, e.g., IFTTT, to manage devices from various vendors. However, they may be abused by triggering malicious rule execution with forged IoT devices or events, violating the execution integrity and the intentions of the users. To address this issue, we propose a ledger-based IoT platform called Ruledger, which ensures the correct execution of rules by verifying the authenticity of the corresponding information. Ruledger utilizes smart contracts to enforce verification of the information associated with rule executions, e.g., the user and configuration information from users, device events, and triggers in the trigger-action platforms. In particular, we develop three algorithms to enable ledger-wallet based applications for Ruledger and guarantee that the records used for verification are stateful and correct. Thus, the execution integrity of rules is ensured even if devices and platforms in the smart home systems are compromised. We prototype Ruledger in a real IoT platform, i.e., IFTTT, and evaluate the performance with various settings. The experimental results demonstrate that Ruledger incurs an average of 12.53% delay, which is acceptable for smart home systems.


[189] 2402.19012

Algorithmically Expressive, Always-Terminating Model for Reversible Computation

Concerning classical computational models able to express all the Primitive Recursive Functions (PRF), there are interesting results regarding limits on their algorithmic expressiveness or, equivalently, efficiency, namely the ability to express algorithms with minimal computational cost. By introducing the reversible programming model Forest, we provide, to our knowledge, a first study of analogous properties, adapted to the context of reversible computational models that can represent all the functions in PRF. Firstly, we show that Forest extends Matos' linear reversible computational model MSRL, the very extension being a guaranteed terminating iteration that can be halted by means of logical predicates. The consequence is that Forest is PRF complete, because MSRL is. Secondly, we show that Forest is strictly algorithmically more expressive than MSRL: it can encode a reversible algorithm for the minimum between two integers in optimal time, while MSRL cannot.


[190] 2402.19013

Ultraviolet Positioning via TDOA: Error Analysis and System Prototype

This work performs the design, real-time hardware realization, and experimental evaluation of a positioning system by ultra-violet (UV) communication under photon-level signal detection. The positioning is based on time-difference of arrival (TDOA) principle. Time division-based transmission of synchronization sequence from three transmitters with known positions is applied. We investigate the positioning error via decomposing it into two parts, the transmitter-side timing error and the receiver-side synchronization error. The theoretical average error matches well with the simulation results, which indicates that theoretical fitting can provide reliable guidance and prediction for hardware experiments. We also conduct real-time hardware realization of the TDOA-based positioning system using Field Programmable Gate Array (FPGA), which is experimentally evaluated via outdoor experiments. Experimental results match well with the theoretical and simulation results.
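
Independent of the paper's hardware realization and error analysis, the underlying TDOA principle can be sketched in a few lines: arrival-time differences relative to a reference transmitter constrain range differences, and the receiver position is estimated by nonlinear least squares. The transmitter layout, noise level, and solver choice below are toy assumptions, not the paper's experimental setup.

```python
# Sketch of the TDOA principle only: given arrival-time differences relative to a
# reference transmitter, estimate the receiver position by nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

c = 3e8                                                  # propagation speed (m/s)
tx = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 50.0]])    # known transmitter positions
true_rx = np.array([18.0, 27.0])

dist = np.linalg.norm(tx - true_rx, axis=1)
tdoa = (dist[1:] - dist[0]) / c                          # differences w.r.t. transmitter 0
tdoa += np.random.default_rng(0).normal(scale=1e-10, size=2)   # toy timing noise

def residuals(p):
    d = np.linalg.norm(tx - p, axis=1)
    return (d[1:] - d[0]) - c * tdoa

est = least_squares(residuals, x0=np.array([25.0, 25.0])).x
print("estimated position:", est, " error (m):", np.linalg.norm(est - true_rx))
```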


[191] 2402.19014

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Recently, the advent of Large Visual-Language Models (LVLMs) has received increasing attention across various domains, particularly in the field of visual document understanding (VDU). Different from conventional vision-language tasks, VDU is specifically concerned with text-rich scenarios containing abundant document elements. Nevertheless, the importance of fine-grained features remains largely unexplored within the community of LVLMs, leading to suboptimal performance in text-rich scenarios. In this paper, we refer to this limitation as the fine-grained feature collapse issue. With the aim of filling this gap, we propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo), specifically tailored for the downstream tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of the LVLM, which enhances visual representation in text-rich scenarios. The intuition is that contrastive learning between the holistic visual representations and the multimodal fine-grained features of document objects can assist the vision encoder in acquiring more effective visual cues, thereby enhancing the comprehension of text-rich documents in LVLMs. We also demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process. Extensive experimental results on multiple benchmarks of VDU reveal that LVLMs equipped with our proposed DoCo can achieve superior performance and mitigate the gap between VDU and generic vision-language tasks.


[192] 2402.19015

Fractional material derivative: pointwise representation and a finite volume numerical scheme

The fractional material derivative appears as the fractional operator that governs the dynamics of the scaling limits of L\'evy walks - a stochastic process that originates from the famous continuous-time random walks. It is usually defined as a Fourier-Laplace multiplier; therefore, it can be thought of as a pseudo-differential operator. In this paper, we show that there exists a local representation in time and space, pointwise, of the fractional material derivative. This allows us to define it on a space of locally integrable functions which is larger than the original one in which the Fourier and Laplace transforms exist as functions. We consider several typical differential equations involving the fractional material derivative and provide conditions for their solutions to exist. In some cases, the analytical solution can be found. For the general initial value problem, we devise a finite volume method and prove its stability, convergence, and conservation of probability. Numerical illustrations verify our analytical findings. Moreover, our numerical experiments show that the proposed numerical scheme is superior in computation time to a Monte Carlo method applied to deriving the probability density function.


[193] 2402.19016

SPriFed-OMP: A Differentially Private Federated Learning Algorithm for Sparse Basis Recovery

Sparse basis recovery is a classical and important statistical learning problem when the number of model dimensions $p$ is much larger than the number of samples $n$. However, there has been little work that studies sparse basis recovery in the Federated Learning (FL) setting, where the client data's differential privacy (DP) must also be simultaneously protected. In particular, the performance guarantees of existing DP-FL algorithms (such as DP-SGD) will degrade significantly when $p \gg n$, and thus, they will fail to learn the true underlying sparse model accurately. In this work, we develop a new differentially private sparse basis recovery algorithm for the FL setting, called SPriFed-OMP. SPriFed-OMP converts OMP (Orthogonal Matching Pursuit) to the FL setting. Further, it combines SMPC (secure multi-party computation) and DP to ensure that only a small amount of noise needs to be added in order to achieve differential privacy. As a result, SPriFed-OMP can efficiently recover the true sparse basis for a linear model with only $n = O(\sqrt{p})$ samples. We further present an enhanced version of our approach, SPriFed-OMP-GRAD based on gradient privatization, that improves the performance of SPriFed-OMP. Our theoretical analysis and empirical results demonstrate that both SPriFed-OMP and SPriFed-OMP-GRAD terminate in a small number of steps, and they significantly outperform the previous state-of-the-art DP-FL solutions in terms of the accuracy-privacy trade-off.
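
For readers unfamiliar with the baseline being federated, here is the classical, centralized OMP loop that SPriFed-OMP builds on; the FL, SMPC, and DP components that constitute the paper's contribution are not reproduced here, and the problem sizes are toy values rather than the paper's regime.

```python
# Classical, centralized Orthogonal Matching Pursuit: greedily pick the column most
# correlated with the residual, refit by least squares, repeat k times.
import numpy as np

def omp(X, y, k):
    """Recover a k-sparse coefficient vector for y ~ X @ beta."""
    n, p = X.shape
    residual = y.copy()
    support = []
    beta = np.zeros(p)
    for _ in range(k):
        correlations = np.abs(X.T @ residual)
        correlations[support] = -np.inf          # do not reselect chosen columns
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    beta[support] = coef
    return beta

rng = np.random.default_rng(0)
n, p, k = 100, 400, 4                            # toy dimensions
X = rng.normal(size=(n, p)) / np.sqrt(n)
true_beta = np.zeros(p)
true_beta[rng.choice(p, k, replace=False)] = rng.normal(size=k)
y = X @ true_beta
recovered = omp(X, y, k)
print("support recovered:",
      set(np.flatnonzero(recovered)) == set(np.flatnonzero(true_beta)))
```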


[194] 2402.19025

Combination of Weak Learners eXplanations to Improve Random Forest eXplicability Robustness

The notion of robustness in XAI refers to the observed variations in the explanation of the prediction of a learned model with respect to changes in the input leading to that prediction. Intuitively, if the input being explained is modified subtly enough so as not to change the prediction of the model too much, then we would expect that the explanation provided for that new input does not change much either. We argue that combining the explanations of an ensemble's weak learners through discriminative averaging can improve the robustness of explanations in ensemble methods. This approach has been implemented and tested with the post-hoc SHAP method and a Random Forest ensemble with successful results. The improvements obtained have been measured quantitatively, and some insights into explicability robustness in ensemble methods are presented.


[195] 2402.19026

Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified people in infrared images to visible images without annotation, and vice versa. USVI-ReID is a challenging yet under-explored task. Most existing methods address the USVI-ReID problem using cluster-based contrastive learning, which simply employs the cluster center as a representation of a person. However, the cluster center primarily focuses on shared information, overlooking disparity. To address the problem, we propose a Progressive Contrastive Learning with Multi-Prototype (PCLMP) method for USVI-ReID. In brief, we first generate the hard prototype by selecting the sample with the maximum distance from the cluster center. This hard prototype is used in the contrastive loss to emphasize disparity. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. This dynamic prototype is used to retain the natural variety of features while reducing instability in the simultaneous learning of both common and disparate information. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards hard samples, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method. PCLMP outperforms the existing state-of-the-art method with an average mAP improvement of 3.9%. The source codes will be released.
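
The two prototype choices described in the abstract can be illustrated directly: the hard prototype is the cluster member farthest from the centroid, and the dynamic prototype is a randomly drawn member. The sketch below shows only this selection step; the contrastive losses and the progressive learning schedule are omitted, and the feature dimensions are toy values.

```python
# Selecting the cluster-center, "hard", and "dynamic" prototypes for one identity
# cluster, as described in the abstract (losses and training schedule omitted).
import numpy as np

rng = np.random.default_rng(0)
cluster_feats = rng.normal(size=(32, 128))        # features of one person cluster

center = cluster_feats.mean(axis=0)               # standard cluster-center prototype
dists = np.linalg.norm(cluster_feats - center, axis=1)

hard_prototype = cluster_feats[np.argmax(dists)]  # farthest member: emphasizes disparity
dynamic_prototype = cluster_feats[rng.integers(len(cluster_feats))]  # re-drawn each step

print("max distance to center:", dists.max())
print("hard prototype distance:", np.linalg.norm(hard_prototype - center))
```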


[196] 2402.19027

How to Train your Antivirus: RL-based Hardening through the Problem-Space

ML-based malware detection on dynamic analysis reports is vulnerable to both evasion and spurious correlations. In this work, we investigate a specific ML architecture employed in the pipeline of a widely-known commercial antivirus company, with the goal of hardening it against adversarial malware. Adversarial training, the sole defensive technique that can confer empirical robustness, is not applicable out of the box in this domain, for the principal reason that gradient-based perturbations rarely map back to feasible problem-space programs. We introduce a novel Reinforcement Learning approach for constructing adversarial examples, a constituent part of adversarially training a model against evasion. Our approach comes with multiple advantages. It performs modifications that are feasible in the problem-space, and only those; thus it circumvents the inverse mapping problem. It also makes it possible to provide theoretical guarantees on the robustness of the model against a particular set of adversarial capabilities. Our empirical exploration validates our theoretical insights: we can consistently reach a 0\% Attack Success Rate after a few adversarial retraining iterations.


[197] 2402.19028

Invariant Checking for SMT-based Systems with Quantifiers

This paper addresses the problem of checking invariant properties for a large class of symbolic transition systems, defined by a combination of SMT theories and quantifiers. State variables can be functions from an uninterpreted sort (finite, but unbounded) to an interpreted sort, such as the integers under the theory of linear arithmetic. This formalism is very expressive and can be used for modeling parameterized systems, array-manipulating programs, and more. We propose two algorithms for finding universal inductive invariants for such systems. The first algorithm combines an IC3-style loop with a form of implicit predicate abstraction to construct an invariant in an incremental manner. The second algorithm constructs an under-approximation of the original problem, and searches for a formula which is an inductive invariant for this case; then, the invariant is generalized to the original case, and checked with a portfolio of techniques. We have implemented the two algorithms and conducted an extensive experimental evaluation, considering various benchmarks and different tools from the literature. As far as we know, our method is the first capable of handling such a large class of systems in a uniform way. The experiments show that both algorithms are competitive with the state of the art.


[198] 2402.19033

High-Speed Motion Planning for Aerial Swarms in Unknown and Cluttered Environments

Coordinated flight of multiple drones allows tasks such as search and rescue and infrastructure inspection to be achieved faster. Thus, pushing the state of the art of aerial swarms in navigation speed and robustness is of tremendous benefit. In particular, being able to account for unexplored/unknown environments when planning trajectories allows for safer flight. In this work, we propose the first high-speed, decentralized, and synchronous motion planning framework (HDSM) for an aerial swarm that explicitly takes into account the unknown/undiscovered parts of the environment. The proposed approach generates an optimized trajectory for each planning agent that avoids obstacles and other planning agents while moving and exploring the environment. The only global information that each agent has is the target location. The generated trajectory is high-speed, safe from unexplored spaces, and brings the agent closer to its goal. The proposed method outperforms four recent state-of-the-art methods in success rate (100% success in reaching the target location), flight speed (67% faster), and flight time (42% lower). Finally, the method is validated on a set of Crazyflie nano-drones as a proof of concept.


[199] 2402.19037

A Deep-Learning Technique to Locate Cryptographic Operations in Side-Channel Traces

Side-channel attacks allow extracting secret information from the execution of cryptographic primitives by correlating the partially known computed data and the measured side-channel signal. However, to set up a successful side-channel attack, the attacker has to perform i) the challenging task of locating the time instant in which the target cryptographic primitive is executed inside a side-channel trace and then ii) the time-alignment of the measured data on that time instant. This paper presents a novel deep-learning technique to locate the time instant in which the target computed cryptographic operations are executed in the side-channel trace. In contrast to state-of-the-art solutions, the proposed methodology works even in the presence of trace deformations obtained through random delay insertion techniques. We validated our proposal through a successful attack against a variety of unprotected and protected cryptographic primitives that have been executed on an FPGA-implemented system-on-chip featuring a RISC-V CPU.


[200] 2402.19038

Understanding Fairness in Software Engineering: Insights from Stack Exchange

Software practitioners discuss problems at work with peers, in person and online. These discussions can be technical (e.g., how to fix a bug?) and social (e.g., how to assign work fairly?). While there is a growing body of knowledge exploring fairness problems and solutions in the human and social factors of software engineering, most focus has been on specific problems. This study examines fairness discussions among software practitioners on Stack Exchange sites. We present an exploratory study of the fairness experience of software practitioners and fairness expectations in software teams. We also want to identify the fairness aspects software practitioners talk about the most. For example, do they care more about fairness in income or how they are treated in the workplace? Our investigation of fairness discussions on eight Stack Exchange sites resulted in a list of 136 posts (28 questions and 108 answers) manually curated from 4,178 candidate posts. The study reveals that the majority of fairness discussions (24 posts) revolve around the topic of income, suggesting that many software practitioners are highly interested in matters related to their pay and how fairly it is distributed. Further, we noted that while not discussed as often, discussions on fairness in recruitment tend to receive the highest number of views and scores. Interestingly, the study shows that unfairness experiences extend beyond the protected attributes. In this study, only 25 out of 136 posts mention protected attributes, with gender mainly being discussed.


[201] 2402.19041

Atmospheric Turbulence Removal with Video Sequence Deep Visual Priors

Atmospheric turbulence poses a challenge for the interpretation and perception of visual imagery due to its distortion effects. Model-based approaches have been used to address this, but such methods often suffer from artefacts associated with moving content. Conversely, deep learning based methods are dependent on large and diverse datasets that may not effectively represent any specific content. In this paper, we address these problems with a self-supervised learning method that does not require ground truth. The proposed method is not dependent on any dataset outside of the single data sequence being processed, yet it is able to improve the quality of any input raw sequences or pre-processed sequences. Specifically, our method is based on an accelerated Deep Image Prior (DIP), but integrates temporal information using pixel shuffling and a temporal sliding window. This efficiently learns spatio-temporal priors, leading to a system that effectively mitigates atmospheric turbulence distortions. The experiments show that our method improves visual quality both qualitatively and quantitatively.


[202] 2402.19044

DMSA -- Dense Multi Scan Adjustment for LiDAR Inertial Odometry and Global Optimization

We propose a new method for the fine registration of multiple point clouds simultaneously. The approach is characterized by being dense; therefore, point clouds are not reduced to pre-selected features in advance. Furthermore, the approach is robust against small overlaps and dynamic objects, since no direct correspondences are assumed between point clouds. Instead, all points are merged into a global point cloud, whose scattering is then iteratively reduced. This is achieved by dividing the global point cloud into uniform grid cells whose contents are subsequently modeled by normal distributions. We show that the proposed approach can be used in a sliding window continuous trajectory optimization combined with IMU measurements to obtain a highly accurate and robust LiDAR inertial odometry estimation. Furthermore, we show that the proposed approach is also suitable for large scale keyframe optimization to increase accuracy. We provide the source code and some experimental data on https://github.com/davidskdds/DMSA_LiDAR_SLAM.git.
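
A minimal sketch of the cost the abstract describes: merge the scans into one cloud, bucket points into uniform grid cells, model each cell with a normal distribution, and measure scattering as the summed covariance trace. The cell size, the trace-based measure, and the omitted pose optimization are simplifying assumptions, not the paper's exact formulation.

```python
# Toy "scattering" measure over a voxel grid: aligned scans concentrate the points
# inside each cell, misaligned scans spread them out and raise the cost.
import numpy as np
from collections import defaultdict

def scattering(points, cell_size=1.0):
    cells = defaultdict(list)
    for p in points:
        cells[tuple(np.floor(p / cell_size).astype(int))].append(p)
    total = 0.0
    for pts in cells.values():
        pts = np.asarray(pts)
        if len(pts) >= 3:                          # need a few points for a covariance
            total += np.trace(np.cov(pts.T))       # per-cell normal-distribution spread
    return total

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(500, 3))
cloud_b = cloud_a + np.array([0.3, 0.0, 0.0])      # misaligned copy of the same scan
print("misaligned:", scattering(np.vstack([cloud_a, cloud_b])))
print("aligned:   ", scattering(np.vstack([cloud_a, cloud_a])))
```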


[203] 2402.19047

Theoretical Foundations of Deep Selective State-Space Models

Structured state-space models (SSMs) such as S4, stemming from the seminal work of Gu et al., are gaining popularity as effective approaches for modeling sequential data. Deep SSMs demonstrate outstanding performance across a diverse set of domains, at a reduced training and inference cost compared to attention-based transformers. Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states (e.g. GateLoop, Mamba, GLA), then the resulting architecture can surpass attention-powered foundation models trained on text in both accuracy and efficiency, at billion-parameter scales. In this paper, we give theoretical grounding to this recent finding using tools from Rough Path Theory: we show that when random linear recurrences are equipped with simple input-controlled transitions (selectivity mechanism), then the hidden state is provably a low-dimensional projection of a powerful mathematical object called the signature of the input -- capturing non-linear interactions between tokens at distinct timescales. Our theory not only motivates the success of modern selective state-space models such as Mamba but also provides a solid framework to understand the expressive power of future SSM variants.
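
To make the "multiplicative interaction between inputs and hidden states" concrete, here is a stripped-down input-gated linear recurrence of the kind such theory covers. It is a toy illustration of a selectivity mechanism, not an implementation of Mamba, GateLoop, or GLA; all dimensions, the diagonal transition, and the sigmoid gate are assumptions.

```python
# Toy "selective" linear recurrence: the state transition is modulated by a gate
# computed from the current input, so inputs and hidden states interact multiplicatively.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, T = 4, 8, 20
A = rng.normal(scale=0.3, size=(d_state,))         # diagonal (stable) base transition
B = rng.normal(scale=0.5, size=(d_state, d_in))
W_gate = rng.normal(scale=0.5, size=(d_state, d_in))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(T, d_in))
h = np.zeros(d_state)
states = []
for t in range(T):
    gate = sigmoid(W_gate @ x[t])                   # input-controlled transition
    h = (A * gate) * h + B @ x[t]                   # multiplicative input-state interaction
    states.append(h.copy())
print(np.array(states).shape)                       # (20, 8)
```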


[204] 2402.19052

Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study

Comprehensive summaries of sessions enable an effective continuity in mental health counseling, facilitating informed therapy planning. Yet, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance. We introduce MentalCLOUDS, a counseling-component guided summarization dataset consisting of 191 counseling sessions with summaries focused on three distinct counseling components (aka counseling aspects). Additionally, we assess the capabilities of 11 state-of-the-art LLMs in addressing the task of component-guided summarization in counseling. The generated summaries are evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals. Our findings demonstrate the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART in terms of standard quantitative metrics such as Rouge-1, Rouge-2, Rouge-L, and BERTScore across all aspects of counseling components. Further, expert evaluation reveals that Mistral supersedes both MentalLlama and MentalBART based on six parameters -- affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models share the same weakness by demonstrating a potential for improvement in the opportunity costs and perceived effectiveness metrics.


[205] 2402.19054

RobWE: Robust Watermark Embedding for Personalized Federated Learning Model Ownership Protection

Embedding watermarks into models has been widely used to protect model ownership in federated learning (FL). However, existing methods are inadequate for protecting the ownership of personalized models acquired by clients in personalized FL (PFL). This is due to the aggregation of the global model in PFL, resulting in conflicts over clients' private watermarks. Moreover, malicious clients may tamper with embedded watermarks to facilitate model leakage and evade accountability. This paper presents a robust watermark embedding scheme, named RobWE, to protect the ownership of personalized models in PFL. We first decouple the watermark embedding of personalized models into two parts: head layer embedding and representation layer embedding. The head layer belongs to clients' private part without participating in model aggregation, while the representation layer is the shared part for aggregation. For representation layer embedding, we employ a watermark slice embedding operation, which avoids watermark embedding conflicts. Furthermore, we design a malicious watermark detection scheme enabling the server to verify the correctness of watermarks before aggregating local models. We conduct an exhaustive experimental evaluation of RobWE. The results demonstrate that RobWE significantly outperforms the state-of-the-art watermark embedding schemes in FL in terms of fidelity, reliability, and robustness.


[206] 2402.19056

Recovering the Polytropic Exponent in the Porous Medium Equation: Asymptotic Approach

In this paper we consider the time dependent Porous Medium Equation, $u_t = \Delta u^\gamma$ with real polytropic exponent $\gamma>1$, subject to a homogeneous Dirichlet boundary condition. We are interested in recovering $\gamma$ from the knowledge of the solution $u$ at a given large time $T$. Based on an asymptotic inequality satisfied by the solution $u(T)$, we propose a numerical algorithm allowing us to recover $\gamma$. An upper bound for the error between the exact and recovered $\gamma$ is then shown. Finally, numerical investigations are carried out in two dimensions.


[207] 2402.19058

On the Design of Human-Robot Collaboration Gestures

Effective communication between humans and collaborative robots is essential for seamless Human-Robot Collaboration (HRC). In noisy industrial settings, nonverbal communication, such as gestures, plays a key role in conveying commands and information to robots efficiently. While existing literature has thoroughly examined gesture recognition and robots' responses to these gestures, there is a notable gap in exploring the design of these gestures. The criteria for creating efficient HRC gestures are scattered across numerous studies. This paper surveys the design principles of HRC gestures, as contained in the literature, aiming to consolidate a set of criteria for HRC gesture design. It also examines the methods used for designing and evaluating HRC gestures to highlight research gaps and present directions for future research in this area.


[208] 2402.19059

VEnvision3D: A Synthetic Perception Dataset for 3D Multi-Task Model Research

Developing a unified multi-task foundation model has become a critical challenge in computer vision research. In the current field of 3D computer vision, most datasets solely focus on a relatively limited set of tasks, which complicates the concurrent training requirements of various downstream tasks. This makes it difficult to train multi-objective networks, which further hinders the development of foundation models in the 3D vision field. In this paper, we introduce VEnvision3D, a large 3D synthetic perception dataset for multi-task learning, including depth completion, segmentation, upsampling, place recognition, and 3D reconstruction. Since the data for each task was collected in the same scenarios, tasks are inherently aligned in terms of the utilized data. Therefore, such a unique attribute can assist in exploring the potential for the multi-task model and even the foundation model without separate training methods. Several new benchmarks based on the characteristics of the proposed dataset were presented. Extensive studies were performed on end-to-end models, revealing new observations, challenges, and opportunities for future research. In addition, we designed a straightforward multi-task network to demonstrate the potential that VEnvision3D offers for foundation models. Our dataset and code will be open-sourced upon acceptance.


[209] 2402.19061

Optimal ANN-SNN Conversion with Group Neurons

Spiking Neural Networks (SNNs) have emerged as a promising third generation of neural networks, offering unique characteristics such as binary outputs, high sparsity, and biological plausibility. However, the lack of effective learning algorithms remains a challenge for SNNs. For instance, while converting artificial neural networks (ANNs) to SNNs circumvents the need for direct training of SNNs, it encounters issues related to conversion errors and high inference time delays. In order to reduce or even eliminate conversion errors while decreasing inference time-steps, we have introduced a novel type of neuron called Group Neurons (GNs). One GN is composed of multiple Integrate-and-Fire (IF) neurons as members, and its neural dynamics are meticulously designed. Based on GNs, we have optimized the traditional ANN-SNN conversion framework. Specifically, we replace the IF neurons in the SNNs obtained by the traditional conversion framework with GNs. The resulting SNNs, which utilize GNs, are capable of achieving accuracy levels comparable to ANNs even within extremely short inference time-steps. The experiments on CIFAR10, CIFAR100, and ImageNet datasets demonstrate the superiority of the proposed methods in terms of both inference accuracy and latency. Code is available at https://github.com/Lyu6PosHao/ANN2SNN_GN.
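
The Group Neuron design itself is not detailed in the abstract. As background on why conversion error depends on the number of inference time-steps, the sketch below shows the standard intuition that a single integrate-and-fire neuron's firing rate approximates a clipped ReLU activation, with the approximation error shrinking as T grows; the soft-reset variant and the constants are illustrative choices.

```python
# Standard ANN-SNN conversion intuition (not the paper's Group Neuron design):
# an IF neuron driven by a constant input approximates a clipped ReLU with its
# firing rate, and the rate-coding error decreases with more time-steps T.
import numpy as np

def if_rate(activation, T, v_th=1.0):
    """Firing rate of an IF neuron with soft reset, fed a constant input for T steps."""
    v, spikes = 0.0, 0
    for _ in range(T):
        v += activation
        if v >= v_th:
            spikes += 1
            v -= v_th
    return spikes / T

acts = np.linspace(-0.5, 1.0, 10)
for T in (4, 32, 256):
    rates = np.array([if_rate(a, T) for a in acts])
    relu = np.clip(acts, 0.0, 1.0)                  # the rate also saturates at 1 spike/step
    print(f"T={T:3d}  max |rate - clipped ReLU| = {np.max(np.abs(rates - relu)):.3f}")
```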


[210] 2402.19064

The Influence of Color Stimuli on Adolescents' Emotion Playing Mobile Games

Video games elicit emotions which can be influenced by color stimuli as shown by previous studies. However, little research has been conducted on whether this applies to mobile games played by adolescents. Therefore, we examined the influence of the color stimuli hue and saturation on mobile game play. Adolescents (n=21) played a mobile platformer game with varying hue and saturation per level for about 25 minutes. We gathered data on emotional states after each level using the Self-Assessment Manikin questionnaire, recorded time spent in each level, and collected participant self-reports on their video game experience. We performed statistical tests, such as ANOVA, which show no significant influence of hue and/or saturation on the emotional state of our players. We conclude that it is possible that color alone is not an effective means of eliciting emotion in mobile games, and further research is needed to consider measures such as time spent in the game and screen size, as these are unique to mobile games. There was a noticeable variance in emotional response between male and female players, with a significant interaction of hue and saturation among male players for valence ratings. This may be an indication that color preference influences perceived pleasantness.


[211] 2402.19071

FATE in MMLA: A Student-Centred Exploration of Fairness, Accountability, Transparency, and Ethics in Multimodal Learning Analytics

Multimodal Learning Analytics (MMLA) integrates novel sensing technologies and artificial intelligence algorithms, providing opportunities to enhance student reflection during complex, collaborative learning experiences. Although recent advancements in MMLA have shown its capability to generate insights into diverse learning behaviours across various learning settings, little research has been conducted to evaluate these systems in authentic learning contexts, particularly regarding students' perceived fairness, accountability, transparency, and ethics (FATE). Understanding these perceptions is essential to using MMLA effectively without introducing ethical complications or negatively affecting how students learn. This study aimed to address this gap by assessing the FATE of MMLA in an authentic, collaborative learning context. We conducted semi-structured interviews with 14 undergraduate students who used MMLA visualisations for post-activity reflection. The findings highlighted the significance of accurate and comprehensive data representation to ensure visualisation fairness, the need for different levels of data access to foster accountability, the imperative of measuring and cultivating transparency with students, and the necessity of transforming informed consent from dichotomous to continuous and measurable scales. While students value the benefits of MMLA, they also emphasise the importance of ethical considerations, highlighting a pressing need for the LA and MMLA community to investigate and address FATE issues actively.


[212] 2402.19072

TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables

Recent studies have demonstrated remarkable performance in time series forecasting. However, due to the partially-observed nature of real-world applications, solely focusing on the target of interest, so-called endogenous variables, is usually insufficient to guarantee accurate forecasting. Notably, a system is often recorded into multiple variables, where the exogenous series can provide valuable external information for endogenous variables. Thus, unlike prior well-established multivariate or univariate forecasting that either treats all the variables equally or overlooks exogenous information, this paper focuses on a practical setting, which is time series forecasting with exogenous variables. We propose a novel framework, TimeXer, to utilize external information to enhance the forecasting of endogenous variables. With a deftly designed embedding layer, TimeXer empowers the canonical Transformer architecture with the ability to reconcile endogenous and exogenous information, where patch-wise self-attention and variate-wise cross-attention are employed. Moreover, a global endogenous variate token is adopted to effectively bridge the exogenous series into endogenous temporal patches. Experimentally, TimeXer significantly improves time series forecasting with exogenous variables and achieves consistent state-of-the-art performance in twelve real-world forecasting benchmarks.


[213] 2402.19076

Pointing out the Shortcomings of Relation Extraction Models with Semantically Motivated Adversarials

In recent years, large language models have achieved state-of-the-art performance across various NLP tasks. However, investigations have shown that these models tend to rely on shortcut features, leading to inaccurate predictions and causing the models to be unreliable at generalization to out-of-distribution (OOD) samples. For instance, in the context of relation extraction (RE), we would expect a model to identify the same relation independently of the entities involved in it. For example, consider the sentence "Leonardo da Vinci painted the Mona Lisa" expressing the created(Leonardo_da_Vinci, Mona_Lisa) relation. If we substitute "Leonardo da Vinci" with "Barack Obama", then the sentence still expresses the created relation. A robust model is supposed to detect the same relation in both cases. In this work, we describe several semantically-motivated strategies to generate adversarial examples by replacing entity mentions and investigate how state-of-the-art RE models perform under pressure. Our analyses show that the performance of these models significantly deteriorates on the modified datasets (avg. of -48.5% in F1), which indicates that these models rely to a great extent on shortcuts, such as surface forms (or patterns therein) of entities, without making full use of the information present in the sentences.
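
A toy version of the entity-substitution strategy, reusing the abstract's own example; the substitute lists and type labels below are illustrative placeholders, not the paper's actual resources or generation pipeline.

```python
# Toy adversarial-example generator for relation extraction: swap an entity mention
# for another of the same type while keeping the gold relation label, so a robust
# model should still predict the same relation on the perturbed sentence.
SUBSTITUTES = {
    "PERSON": ["Barack Obama", "Marie Curie", "Ada Lovelace"],
    "WORK_OF_ART": ["Guernica", "The Starry Night"],
}

def perturb(sentence, mention, mention_type, replacement_idx=0):
    """Replace one entity mention with a same-type substitute."""
    new_mention = SUBSTITUTES[mention_type][replacement_idx]
    return sentence.replace(mention, new_mention), new_mention

sent = "Leonardo da Vinci painted the Mona Lisa"
adv_sent, new_head = perturb(sent, "Leonardo da Vinci", "PERSON")
print(adv_sent)                                   # "Barack Obama painted the Mona Lisa"
print("gold relation stays: created(%s, Mona Lisa)" % new_head)
```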


[214] 2402.19078

Smooth Tchebycheff Scalarization for Multi-Objective Optimization

Multi-objective optimization problems can be found in many real-world applications, where the objectives often conflict with each other and cannot be optimized by a single solution. In the past few decades, numerous methods have been proposed to find Pareto solutions that represent different optimal trade-offs among the objectives for a given problem. However, these existing methods could have high computational complexity or may not have good theoretical properties for solving a general differentiable multi-objective optimization problem. In this work, by leveraging the smooth optimization technique, we propose a novel and lightweight smooth Tchebycheff scalarization approach for gradient-based multi-objective optimization. It has good theoretical properties for finding all Pareto solutions with valid trade-off preferences, while enjoying significantly lower computational complexity compared to other methods. Experimental results on various real-world application problems fully demonstrate the effectiveness of our proposed method.
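
For reference, the classical weighted Tchebycheff scalarization and one standard log-sum-exp smoothing of its max are written out below; this is background on the general idea, and the paper's exact smooth formulation and its guarantees may differ.

```latex
% Classical weighted Tchebycheff scalarization of objectives f_1,...,f_m with
% preference weights \lambda_i > 0 and ideal point z^*:
\[
  g_{\mathrm{tch}}(x) = \max_{1 \le i \le m} \lambda_i \bigl( f_i(x) - z_i^* \bigr).
\]
% One standard way to smooth the non-differentiable max is a log-sum-exp surrogate
% with smoothing parameter \mu > 0 (the paper's exact formulation may differ):
\[
  g_{\mu}(x) = \mu \log \sum_{i=1}^{m} \exp\!\left( \frac{\lambda_i \bigl( f_i(x) - z_i^* \bigr)}{\mu} \right),
  \qquad
  g_{\mathrm{tch}}(x) \le g_{\mu}(x) \le g_{\mathrm{tch}}(x) + \mu \log m .
\]
```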


[215] 2402.19080

MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing

Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, the large and rigid granularity of DRAM rows limits the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD parallelism, PUD execution often leads to underutilization, throughput loss, and energy waste. Second, most PUD architectures are limited to the execution of parallel map operations. Third, the need to feed the wide DRAM row with tens of thousands of data elements, combined with the lack of adequate compiler support for PUD systems, creates a programmability barrier. Our goal is to design a flexible PUD system that overcomes the limitations caused by the large and rigid granularity of PUD. To this end, we propose MIMDRAM, a hardware/software co-designed PUD system that introduces new mechanisms to allocate and control only the necessary resources for a given PUD operation. The key idea of MIMDRAM is to leverage fine-grained DRAM (i.e., the ability to independently access smaller segments of a large DRAM row) for PUD computation. MIMDRAM exploits this key idea to enable a multiple-instruction multiple-data (MIMD) execution model in each DRAM subarray. We evaluate MIMDRAM using twelve real-world applications and 495 multi-programmed application mixes. Our evaluation shows that MIMDRAM provides 34x the performance, 14.3x the energy efficiency, 1.7x the throughput, and 1.3x the fairness of a state-of-the-art PUD framework, along with 30.6x and 6.8x the energy efficiency of a high-end CPU and GPU, respectively. MIMDRAM adds a small area cost to a DRAM chip (1.11%) and CPU die (0.6%).


[216] 2402.19082

VideoMAC: Video Masked Autoencoders Meet ConvNets

Recently, the advancement of self-supervised learning techniques, like masked autoencoders (MAE), has greatly influenced visual representation learning for images and videos. Nevertheless, it is worth noting that the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed as \textbf{VideoMAC}, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets which are implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach, a dual encoder architecture comprising an online encoder and an exponential moving average target encoder, aimed to facilitate inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+\textbf{5.2\%} / \textbf{6.4\%} $\mathcal{J}\&\mathcal{F}$), body part propagation (+\textbf{6.3\%} / \textbf{3.1\%} mIoU), and human pose tracking (+\textbf{10.2\%} / \textbf{11.1\%} PCK@0.1).


[217] 2402.19085

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

Alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax", a compromise where enhancements in alignment within one objective (e.g., harmlessness) can diminish performance in others (e.g., helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue for the importance of grounding LLMs with explicit preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving Pareto improvements in multi-objective alignment.


[218] 2402.19088

Survey in Characterization of Semantic Change

Live languages continuously evolve to integrate the cultural change of human societies. This evolution manifests through neologisms (new words) or \textbf{semantic changes} of words (new meaning to existing words). Understanding the meaning of words is vital for interpreting texts coming from different cultures (regionalism or slang), domains (e.g., technical terms), or periods. In computer science, these words are relevant to computational linguistics algorithms such as translation, information retrieval, question answering, etc. Semantic changes can potentially impact the quality of the outcomes of these algorithms. Therefore, it is important to understand and characterize these changes formally. The study of this impact is a recent problem that has attracted the attention of the computational linguistics community. Several approaches propose methods to detect semantic changes with good precision, but more effort is needed to characterize how the meaning of words changes and to reason about how to reduce the impact of semantic change. This survey provides an understandable overview of existing approaches to the \textit{characterization of semantic changes} and also formally defines three classes of characterizations: whether the meaning of a word becomes more general or more narrow (change in dimension); whether the word is used in a more pejorative or more positive/ameliorated sense (change in orientation); and whether there is a trend to use the word in, for instance, a metaphoric or metonymic context (change in relation). We summarize the main aspects of the selected publications in a table and discuss the needs and trends in the research activities on semantic change characterization.


[219] 2402.19089

Around Don's conjecture for binary completely reachable automata

A word $w$ is called a reaching word of a subset $S$ of states in a deterministic finite automaton (DFA) if $S$ is the image of the state set $Q$ under the action of $w$. A DFA is called completely reachable if every non-empty subset of the state set has a reaching word. A conjecture states that in every $n$-state completely reachable DFA, for every $k$-element subset of states, there exists a reaching word of length at most $n(n-k)$. We present infinitely many completely reachable DFAs with two letters that violate this conjecture. A subfamily of completely reachable DFAs with two letters, called standardized DFAs, was introduced by Casas and Volkov (2023). We prove that every $k$-element subset of states in an $n$-state standardized DFA has a reaching word of length $\le n(n-k) + n - 1$. Finally, we confirm the conjecture for standardized DFAs with additional properties, thus generalizing a result of Casas and Volkov (2023).


[220] 2402.19090

Best Arm Identification with Resource Constraints

Motivated by the cost heterogeneity in experimentation across different alternatives, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem. The agent aims to identify the best arm under resource constraints, where resources are consumed for each arm pull. We make two novel contributions. We design and analyze the Successive Halving with Resource Rationing algorithm (SH-RR), which achieves a near-optimal non-asymptotic rate of convergence in terms of the probability of successfully identifying an optimal arm. Interestingly, we identify a difference in convergence rates between the cases of deterministic and stochastic resource consumption.
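
For readers unfamiliar with the successive-halving template the algorithm builds on, here is a plain successive-halving sketch under a total pull budget; the equal per-round budget split and the Bernoulli arms are placeholder assumptions, and the paper's resource-rationing rule is not reproduced.

    import math
    import random

    def successive_halving(arms, total_budget, pull):
        # arms: list of arm identifiers; pull(arm) returns one stochastic reward.
        # The per-round budget split below is a simple equal-rationing heuristic,
        # not the SH-RR rule from the paper.
        active = list(arms)
        rounds = max(1, math.ceil(math.log2(len(active))))
        budget_per_round = total_budget // rounds
        means = {a: 0.0 for a in active}
        while len(active) > 1 and budget_per_round > 0:
            pulls_each = max(1, budget_per_round // len(active))
            for a in active:
                rewards = [pull(a) for _ in range(pulls_each)]
                means[a] = sum(rewards) / len(rewards)
            active.sort(key=lambda a: means[a], reverse=True)
            active = active[: max(1, len(active) // 2)]   # keep the better half
        return active[0]

    # Illustrative usage with Bernoulli arms of unknown means.
    true_means = [0.3, 0.5, 0.45, 0.7, 0.65]
    best = successive_halving(range(5), total_budget=2000,
                              pull=lambda a: 1.0 if random.random() < true_means[a] else 0.0)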


[221] 2402.19091

Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection

The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. On the contrary, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best performing models require just a single epoch for training (~8 minutes). Code available at https://github.com/mever-team/rine.
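
A minimal sketch of the general recipe the abstract describes (pool one feature per intermediate Transformer block, weight the blocks with a learnable importance, and map the result to a forgery logit); how the features are hooked out of CLIP's image encoder, and all layer sizes, are assumptions rather than the paper's architecture.

    import torch
    import torch.nn as nn

    class IntermediateFeatureHead(nn.Module):
        # block_feats is assumed to be shaped (num_blocks, batch, dim), holding one
        # pooled feature per intermediate Transformer block; extracting these from
        # CLIP's image encoder (e.g., via forward hooks) is omitted here.
        def __init__(self, num_blocks, dim, hidden=128):
            super().__init__()
            self.block_logits = nn.Parameter(torch.zeros(num_blocks))  # learnable block importance
            self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, block_feats):
            w = torch.softmax(self.block_logits, dim=0)           # (num_blocks,)
            fused = (w[:, None, None] * block_feats).sum(dim=0)   # importance-weighted sum over blocks
            return self.proj(fused).squeeze(-1)                   # one real/synthetic logit per image

    # Illustrative usage with random placeholder features for a batch of 4 images.
    head = IntermediateFeatureHead(num_blocks=12, dim=768)
    logits = head(torch.randn(12, 4, 768))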


[222] 2402.19097

TEncDM: Understanding the Properties of Diffusion Model in the Space of Language Model Encodings

Drawing inspiration from the success of diffusion models in various domains, numerous research papers proposed methods for adapting them to text data. Despite these efforts, none of them has managed to achieve the quality of the large language models. In this paper, we conduct a comprehensive analysis of key components of the text diffusion models and introduce a novel approach named Text Encoding Diffusion Model (TEncDM). Instead of the commonly used token embedding space, we train our model in the space of the language model encodings. Additionally, we propose to use a Transformer-based decoder that utilizes contextual information for text reconstruction. We also analyse self-conditioning and find that it increases the magnitude of the model outputs, allowing the reduction of the number of denoising steps at the inference stage. Evaluation of TEncDM on two downstream text generation tasks, QQP and XSum, demonstrates its superiority over existing non-autoregressive models.


[223] 2402.19101

Effective Two-Stage Knowledge Transfer for Multi-Entity Cross-Domain Recommendation

In recent years, the recommendation content on e-commerce platforms has become increasingly rich -- a single user feed may contain multiple entities, such as selling products, short videos, and content posts. To deal with the multi-entity recommendation problem, an intuitive solution is to adopt the shared-network-based architecture for joint training. The idea is to transfer the extracted knowledge from one type of entity (source entity) to another (target entity). However, different from the conventional same-entity cross-domain recommendation, multi-entity knowledge transfer encounters several important issues: (1) data distributions of the source entity and target entity are naturally different, making the shared-network-based joint training susceptible to the negative transfer issue, (2) more importantly, the corresponding feature schema of each entity is not exactly aligned (e.g., price is an essential feature for selling product while missing for content posts), making the existing methods no longer appropriate. Recent researchers have also experimented with the pre-training and fine-tuning paradigm. Again, they only consider the scenarios with the same entity type and feature systems, which is inappropriate in our case. To this end, we design a pre-training & fine-tuning based Multi-entity Knowledge Transfer framework called MKT. MKT utilizes a multi-entity pre-training module to extract transferable knowledge across different entities. In particular, a feature alignment module is first applied to scale and align different feature schemas. Afterward, a couple of knowledge extractors are employed to extract the common and entity-specific knowledge. In the end, the extracted common knowledge is adopted for target entity model training. Through extensive offline and online experiments, we demonstrated the superiority of MKT over multiple State-Of-The-Art methods.


[224] 2402.19102

FlatNAS: optimizing Flatness in Neural Architecture Search for Out-of-Distribution Robustness

Neural Architecture Search (NAS) paves the way for the automatic definition of Neural Network (NN) architectures, attracting increasing research attention and offering solutions in various scenarios. This study introduces a novel NAS solution, called Flat Neural Architecture Search (FlatNAS), which explores the interplay between a novel figure of merit based on robustness to weight perturbations and single NN optimization with Sharpness-Aware Minimization (SAM). FlatNAS is the first work in the literature to systematically explore flat regions in the loss landscape of NNs in a NAS procedure, while jointly optimizing their performance on in-distribution data, their out-of-distribution (OOD) robustness, and constraining the number of parameters in their architecture. Differently from current studies primarily concentrating on OOD algorithms, FlatNAS successfully evaluates the impact of NN architectures on OOD robustness, a crucial aspect in real-world applications of machine and deep learning. FlatNAS achieves a good trade-off between performance, OOD generalization, and the number of parameters, by using only in-distribution data in the NAS exploration. The OOD robustness of the NAS-designed models is evaluated by focusing on robustness to input data corruptions, using popular benchmark datasets in the literature.
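
Since FlatNAS relies on Sharpness-Aware Minimization for single-network optimization, a minimal single-step sketch of standard SAM (two forward/backward passes per update) is given below; it illustrates the flatness-seeking update itself, not the NAS search procedure.

    import torch

    def sam_step(model, loss_fn, data, target, optimizer, rho=0.05):
        # First pass: gradient at the current weights, used to build the ascent direction.
        optimizer.zero_grad()
        loss_fn(model(data), target).backward()
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        eps = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    eps.append(None)
                    continue
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)              # move to the locally worst-case point
                eps.append(e)
        # Second pass: gradient at the perturbed weights.
        optimizer.zero_grad()
        loss_fn(model(data), target).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                if e is not None:
                    p.sub_(e)          # restore the original weights
        optimizer.step()               # update with the sharpness-aware gradient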


[225] 2402.19103

Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models

Large Language Models (LLMs) have shown impressive capabilities but still suffer from the issue of hallucinations. A significant type of this issue is the false premise hallucination, which we define as the phenomenon when LLMs generate hallucinated text when confronted with false premise questions. In this paper, we perform a comprehensive analysis of the false premise hallucination and elucidate its internal working mechanism: a small subset of attention heads (which we designate as false premise heads) disturb the knowledge extraction process, leading to the occurrence of false premise hallucination. Based on our analysis, we propose \textbf{FAITH} (\textbf{F}alse premise \textbf{A}ttention head constra\textbf{I}ning for mi\textbf{T}igating \textbf{H}allucinations), a novel and effective method to mitigate false premise hallucinations. It constrains the false premise attention heads during the model inference process. Impressively, extensive experiments demonstrate that constraining only approximately $1\%$ of the attention heads in the model yields a notable increase of nearly $20\%$ in model performance.
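
A generic sketch of what constraining a set of attention heads at inference time could look like: the per-head output slices of a hidden-state tensor are zeroed out. The contiguous-per-head layout, the head indices, and the constraint-by-zeroing choice are assumptions made for illustration, not the paper's exact FAITH mechanism.

    import torch

    def constrain_heads(hidden_states, head_indices, num_heads):
        # Assumes hidden_states has shape (batch, seq, hidden) and that the hidden
        # dimension is laid out as num_heads contiguous blocks of size head_dim,
        # which holds for many Transformer implementations but not all.
        batch, seq, hidden = hidden_states.shape
        head_dim = hidden // num_heads
        out = hidden_states.clone()
        for h in head_indices:
            out[:, :, h * head_dim:(h + 1) * head_dim] = 0.0
        return out

    # Illustrative usage: suppress two hypothetical "false premise heads".
    x = torch.randn(2, 16, 768)   # (batch, seq, hidden)
    x_constrained = constrain_heads(x, head_indices=[3, 7], num_heads=12)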


[226] 2402.19105

CollaFuse: Navigating Limited Resources and Privacy in Collaborative Generative AI

In the landscape of generative artificial intelligence, diffusion-based models present challenges for socio-technical systems in data requirements and privacy. Traditional approaches like federated learning distribute the learning process but strain individual clients, especially with constrained resources (e.g., edge devices). In response to these challenges, we introduce CollaFuse, a novel framework inspired by split learning. Tailored for efficient and collaborative use of denoising diffusion probabilistic models, CollaFuse enables shared server training and inference, alleviating client computational burdens. This is achieved by retaining data and computationally inexpensive GPU processes locally at each client while outsourcing the computationally expensive processes to the shared server. Demonstrated in a healthcare context, CollaFuse enhances privacy by greatly reducing the need to share sensitive information. These capabilities hold the potential to impact various application areas, such as the design of edge computing solutions, healthcare research, or autonomous driving. In essence, our work advances distributed machine learning, shaping the future of collaborative GenAI networks.


[227] 2402.19107

Rahmani Sort: A Novel Variant of Insertion Sort Algorithm with O(nlogn) Complexity

Various decision support systems implement Data Mining and Data Warehousing techniques to mine large volumes of data for useful patterns of knowledge. Classification, regression, clustering, and many other algorithms are used to enhance the precision and accuracy of the decision process, so there is scope for improving the response time of that process, especially in mission-critical operations. If data are ordered with a suitable and efficient sorting operation, the response time of the decision process can be minimized. Insertion sort is well suited to such applications due to its simple, straightforward logic and its dynamic nature, which makes it suitable for list implementations. However, it is slower than merge sort and quick sort for two main reasons: first, a sequential search is used to find the position of the next key element in the sorted left subarray, and second, elements must be shifted one position to the right to accommodate the newly inserted element. Therefore, I propose a new algorithm that uses a binary search mechanism to find the sorted location of the next key item in the previously sorted left subarray much more quickly than the conventional insertion sort algorithm. The actual running time of the new algorithm has been compared with those of other conventional sorting algorithms in addition to insertion sort. The results obtained on various sample data show that the new algorithm performs better than the conventional insertion sort and merge sort algorithms.
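
The core idea, replacing the sequential scan of classical insertion sort with a binary search for the insertion point, can be sketched as follows; this is a generic array-based binary insertion sort, not necessarily the paper's exact procedure.

    from bisect import bisect_right

    def binary_insertion_sort(a):
        # Locate each insertion point by binary search over the sorted prefix,
        # then shift the tail of the prefix to make room for the key.
        for i in range(1, len(a)):
            key = a[i]
            pos = bisect_right(a, key, 0, i)   # insertion point within a[:i]
            a[pos + 1:i + 1] = a[pos:i]        # shift one slot to the right
            a[pos] = key
        return a

    print(binary_insertion_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]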


[228] 2402.19108

DeepEraser: Deep Iterative Context Mining for Generic Text Eraser

In this work, we present DeepEraser, an effective deep network for generic text removal. DeepEraser utilizes a recurrent architecture that erases the text in an image via iterative operations. Our idea comes from the process of erasing pencil script, where the text area designated for removal is subject to continuous monitoring and the text is attenuated progressively, ensuring a thorough and clean erasure. Technically, at each iteration, an innovative erasing module is deployed, which not only explicitly aggregates the previous erasing progress but also mines additional semantic context to erase the target text. Through iterative refinements, the text regions are progressively replaced with more appropriate content and finally converge to a relatively accurate status. Furthermore, a custom mask generation strategy is introduced to improve the capability of DeepEraser for adaptive text removal, as opposed to indiscriminately removing all the text in an image. Our DeepEraser is notably compact with only 1.4M parameters and trained in an end-to-end manner. To verify its effectiveness, extensive experiments are conducted on several prevalent benchmarks, including SCUT-Syn, SCUT-EnsText, and Oxford Synthetic text dataset. The quantitative and qualitative results demonstrate the effectiveness of our DeepEraser over the state-of-the-art methods, as well as its strong generalization ability in custom mask text removal. The codes and pre-trained models are available at https://github.com/fh2019ustc/DeepEraser


[229] 2402.19110

Temporal-Aware Deep Reinforcement Learning for Energy Storage Bidding in Energy and Contingency Reserve Markets

The battery energy storage system (BESS) has immense potential for enhancing grid reliability and security through its participation in the electricity market. BESS often seeks various revenue streams by taking part in multiple markets to unlock its full potential, but effective algorithms for joint-market participation under price uncertainties are insufficiently explored in the existing research. To bridge this gap, we develop a novel BESS joint bidding strategy that utilizes deep reinforcement learning (DRL) to bid in the spot and contingency frequency control ancillary services (FCAS) markets. Our approach leverages a transformer-based temporal feature extractor to effectively respond to price fluctuations in seven markets simultaneously and helps DRL learn the best BESS bidding strategy in joint-market participation. Additionally, unlike conventional "black-box" DRL model, our approach is more interpretable and provides valuable insights into the temporal bidding behavior of BESS in the dynamic electricity market. We validate our method using realistic market prices from the Australian National Electricity Market. The results show that our strategy outperforms benchmarks, including both optimization-based and other DRL-based strategies, by substantial margins. Our findings further suggest that effective temporal-aware bidding can significantly increase profits in the spot and contingency FCAS markets compared to individual market participation.


[230] 2402.19116

How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding

Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring the fine-grained phrase-region matching, while merely leveraging the coarse-grained sentence-image pairs for training. However, existing studies on WPG largely ignore the implicit phrase-region matching relations, which are crucial for evaluating the capability of models in understanding the deep multimodal semantics. To this end, this paper proposes an Implicit-Enhanced Causal Inference (IECI) approach to address the challenges of modeling the implicit relations and highlighting them beyond the explicit. Specifically, this approach leverages both the intervention and counterfactual techniques to tackle the above two challenges respectively. Furthermore, a high-quality implicit-enhanced dataset is annotated to evaluate IECI and detailed evaluations show the great advantages of IECI over the state-of-the-art baselines. Particularly, we observe an interesting finding that IECI outperforms the advanced multimodal LLMs by a large margin on this implicit-enhanced dataset, which may facilitate more research to evaluate the multimodal LLMs in this direction.


[231] 2402.19118

Continuous Sign Language Recognition Based on Motor attention mechanism and frame-level Self-distillation

Changes in facial expression, head movement, body movement and gesture movement are remarkable cues in sign language recognition, and most current continuous sign language recognition (CSLR) methods mainly focus on static images in video sequences at the frame-level feature extraction stage, while ignoring the dynamic changes in the images. In this paper, we propose a novel motor attention mechanism to capture the distorted changes in local motion regions during sign language expression, and obtain a dynamic representation of image changes. For the first time, we apply the self-distillation method to frame-level feature extraction for continuous sign language, which improves the feature expression without increasing the computational resources by self-distilling the features of adjacent stages and using the higher-order features as teachers to guide the lower-order features. The combination of the two constitutes our proposed holistic model of CSLR based on a motor attention mechanism and frame-level self-distillation (MAM-FSD), which improves the inference ability and robustness of the model. We conduct experiments on three publicly available datasets, and the experimental results show that our proposed method can effectively extract the sign language motion information in videos, improve the accuracy of CSLR and reach the state-of-the-art level.


[232] 2402.19119

VIXEN: Visual Text Comparison Network for Image Difference Captioning

We present VIXEN - a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the challenge of low volume of training data and lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset, generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at this http URL


[233] 2402.19120

A Naive Approach for Automatic Line-level Code Completion

Coding is an integral aspect of programming. A programmer can automatically complete a code fragment after writing a few tokens, and the process of automatic completion is known as code completion. Several research studies on code completion have previously been conducted for method body completion and method parameter completion. However, this fundamental study explores the automatic completion of any program statement that might not even be part of a method. The goal is to provide suggestions to the programmer for completing code throughout the codebase by identifying and analyzing code similarities. The proposed methodology can be regarded as a fundamental framework for automated code completion. From the investigation of hundreds of revisions of four subject systems written in C and Java, it is observed that the proposed method can automatically complete around 22% of the code statements that a programmer writes during development, with an average accuracy of 87%, thereby accelerating software development. The empirical analysis further demonstrates that the approach is programming-language neutral. The study concludes by illustrating that taking 10 characters as the prefix before invoking completion provides maximum precision.
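
A deliberately naive toy version of prefix-based line completion, in the spirit of the study: suggest the most frequent previously seen statement whose first characters match the typed prefix (the abstract reports 10 characters working best). The ranking and similarity handling here are simplifications, not the paper's methodology.

    from collections import Counter

    def suggest_completion(codebase_lines, typed_prefix, prefix_len=10):
        # Suggest the most frequent full line whose first prefix_len characters
        # match what the programmer has typed so far.
        key = typed_prefix[:prefix_len]
        candidates = Counter(
            line.strip() for line in codebase_lines
            if line.strip().startswith(key) and line.strip() != typed_prefix
        )
        return candidates.most_common(1)[0][0] if candidates else None

    # Illustrative usage on a tiny "codebase".
    lines = [
        "for (int i = 0; i < n; i++) {",
        "for (int i = 0; i < m; i++) {",
        "for (int i = 0; i < n; i++) {",
        "return result;",
    ]
    print(suggest_completion(lines, "for (int i"))   # most frequent matching statement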


[234] 2402.19122

BigGait: Learning Gait Representation You Want by Large Vision Models

Gait recognition stands as one of the most pivotal remote identification technologies and progressively expands across research and industrial communities. However, existing gait recognition methods heavily rely on task-specific upstream models driven by supervised learning to provide explicit gait representations, which inevitably introduce expensive annotation costs and potentially cause cumulative errors. Escaping from this trend, this work explores effective gait representations based on the all-purpose knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a simple yet efficient gait framework, termed BigGait. Specifically, the Gait Representation Extractor (GRE) in BigGait effectively transforms all-purpose knowledge into implicit gait features in an unsupervised manner, drawing from design principles of established gait representation construction approaches. Experimental results on CCPG, CASIA-B* and SUSTech1K indicate that BigGait significantly outperforms the previous methods in both self-domain and cross-domain tasks in most cases, and provides a more practical paradigm for learning the next-generation gait representation. Eventually, we delve into prospective challenges and promising directions in LVMs-based gait recognition, aiming to inspire future work in this emerging topic. The source code will be available at https://github.com/ShiqiYu/OpenGait.


[235] 2402.19125

Highly efficient Gauss's law-preserving spectral algorithms for Maxwell's double-curl source and eigenvalue problems based on eigen-decomposition

In this paper, we present Gauss's law-preserving spectral methods and their efficient solution algorithms for curl-curl source and eigenvalue problems in two and three dimensions arising from Maxwell's equations. Arbitrary order $H(curl)$-conforming spectral basis functions in two and three dimensions are first proposed using a compact combination of Legendre polynomials. A mixed formulation involving a Lagrange multiplier is then adopted to preserve Gauss's law in the weak sense. To overcome the bottleneck of computational efficiency caused by the saddle-point nature of the mixed scheme, we present highly efficient solution algorithms based on reordering and decoupling of the resultant linear algebraic system and numerical eigen-decomposition of the one-dimensional mass matrix. The proposed solution algorithms are direct methods requiring only several matrix-matrix or matrix-tensor products of $N$-by-$N$ matrices, where $N$ is the highest polynomial order in each direction. Compared with other direct methods, the computational complexities are reduced from $O(N^6)$ and $O(N^9)$ to $O(N^3)$ and $O(N^4)$ with small and constant pre-factors for 2D and 3D cases, respectively, and can further be accelerated to $O(N^{2.807})$ and $O(N^{3.807})$ when boosted with Strassen's matrix multiplication algorithm. Moreover, these algorithms strictly obey the Helmholtz-Hodge decomposition, thus completely eliminating the spurious eigen-modes with non-physical zero eigenvalues. Extensions of the proposed methods and algorithms to problems in complex geometries with variable coefficients and inhomogeneous boundary conditions are discussed to deal with more general situations. Ample numerical examples for solving Maxwell's source and eigenvalue problems are presented to demonstrate the accuracy and efficiency of the proposed methods.


[236] 2402.19128

ARMCHAIR: integrated inverse reinforcement learning and model predictive control for human-robot collaboration

One of the key issues in human-robot collaboration is the development of computational models that allow robots to predict and adapt to human behavior. Much progress has been achieved in developing such models, as well as control techniques that address the autonomy problems of motion planning and decision-making in robotics. However, the integration of computational models of human behavior with such control techniques still poses a major challenge, resulting in a bottleneck for efficient collaborative human-robot teams. In this context, we present a novel architecture for human-robot collaboration: Adaptive Robot Motion for Collaboration with Humans using Adversarial Inverse Reinforcement learning (ARMCHAIR). Our solution leverages adversarial inverse reinforcement learning and model predictive control to compute optimal trajectories and decisions for a mobile multi-robot system that collaborates with a human in an exploration task. During the mission, ARMCHAIR operates without human intervention, autonomously identifying the necessity to support and acting accordingly. Our approach also explicitly addresses the network connectivity requirement of the human-robot team. Extensive simulation-based evaluations demonstrate that ARMCHAIR allows a group of robots to safely support a simulated human in an exploration scenario, preventing collisions and network disconnections, and improving the overall performance of the task.


[237] 2402.19132

Weighted least $\ell_p$ approximation on compact Riemannian manifolds

Given a sequence of Marcinkiewicz-Zygmund inequalities in $L_2$ on a compact space, Gr\"ochenig in \cite{G} discussed weighted least squares approximation and least squares quadrature. Inspired by this work, for all $1\le p\le\infty$, we develop weighted least $\ell_p$ approximation induced by a sequence of Marcinkiewicz-Zygmund inequalities in $L_p$ on a compact smooth Riemannian manifold $\Bbb M$ with normalized Riemannian measure (typical examples are the torus and the sphere). In this paper we derive corresponding approximation theorems with the error measured in $L_q,\,1\le q\le\infty$, and least quadrature errors for both Sobolev spaces $H_p^r(\Bbb M), \, r>d/p$ generated by eigenfunctions associated with the Laplace-Beltrami operator and Besov spaces $B_{p,\tau}^r(\Bbb M),\, 0<\tau\le \infty, r>d/p $ defined by best polynomial approximation. Finally, we discuss the optimality of the obtained results by giving sharp estimates of sampling numbers and optimal quadrature errors for the aforementioned spaces.


[238] 2402.19133

Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations

Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming and often biased by the annotation process. In this paper, we debate whether human gaze, in the form of webcam-based eye-tracking recordings, poses a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention and explainability-based importance scores for 4 different multilingual Transformer-based language models (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty and further show a comparable ranking of explainability methods to that of human rationales.


[239] 2402.19135

Think Fast, Think Slow, Think Critical: Designing an Automated Propaganda Detection Tool

In today's digital age, characterized by rapid news consumption and increasing vulnerability to propaganda, fostering citizens' critical thinking is crucial for stable democracies. This paper introduces the design of ClarifAI, a novel automated propaganda detection tool designed to nudge readers towards more critical news consumption by activating the analytical mode of thinking, following Kahneman's dual-system theory of cognition. Using Large Language Models, ClarifAI detects propaganda in news articles and provides context-rich explanations, enhancing users' understanding and critical thinking. Our contribution is threefold: first, we propose the design of ClarifAI; second, in an online experiment, we demonstrate that this design effectively encourages news readers to engage in more critical reading; and third, we emphasize the value of explanations for fostering critical thinking. The study thus offers both a practical tool and useful design knowledge for mitigating propaganda in digital news.


[240] 2402.19142

ProtoP-OD: Explainable Object Detection with Prototypical Parts

Interpretation and visualization of the behavior of detection transformers tends to highlight the locations in the image that the model attends to, but it provides limited insight into the \emph{semantics} that the model is focusing on. This paper introduces an extension to detection transformers that constructs prototypical local features and uses them in object detection. These custom features, which we call prototypical parts, are designed to be mutually exclusive and align with the classifications of the model. The proposed extension consists of a bottleneck module, the prototype neck, that computes a discretized representation of prototype activations and a new loss term that matches prototypes to object classes. This setup leads to interpretable representations in the prototype neck, allowing visual inspection of the image content perceived by the model and a better understanding of the model's reliability. We show experimentally that our method incurs only a limited performance penalty, and we provide examples that demonstrate the quality of the explanations provided by our method, which we argue outweighs the performance penalty.


[241] 2402.19144

Weakly Supervised Monocular 3D Detection with a Single-View Image

Monocular 3D detection (M3D) aims for precise 3D object localization from a single-view image which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging many existing 2D annotations, but it often requires extra training data such as LiDAR point clouds or multi-view images which greatly degrades its applicability and usability in various applications. We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D with a single-view image exclusively without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework, which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the inherent depth ambiguity in monocular scenarios with little computational overhead in inference. In addition, we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy which facilitate knowledge acquisition and knowledge transfer, respectively. Extensive experiments show that SKD-WM3D surpasses the state-of-the-art clearly and is even on par with many fully supervised methods.


[242] 2402.19145

A SAM-guided Two-stream Lightweight Model for Anomaly Detection

In industrial anomaly detection, model efficiency and mobile-friendliness become the primary concerns in real-world applications. Simultaneously, the impressive generalization capabilities of Segment Anything (SAM) have garnered broad academic attention, making it an ideal choice for localizing unseen anomalies and diverse real-world patterns. In this paper, considering these two critical factors, we propose a SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection (STLM) that not only aligns with the two practical application requirements but also harnesses the robust generalization capabilities of SAM. We employ two lightweight image encoders, i.e., our two-stream lightweight module, guided by SAM's knowledge. To be specific, one stream is trained to generate discriminative and general feature representations in both normal and anomalous regions, while the other stream reconstructs the same images without anomalies, which effectively enhances the differentiation of two-stream representations when facing anomalous regions. Furthermore, we employ a shared mask decoder and a feature aggregation module to generate anomaly maps. Our experiments conducted on the MVTec AD benchmark show that STLM, with about 16M parameters and an inference time of 20 ms, competes effectively with state-of-the-art methods in terms of performance, achieving 98.26% pixel-level AUC and 94.92% PRO. We further experiment on more difficult datasets, e.g., VisA and DAGM, to demonstrate the effectiveness and generalizability of STLM.


[243] 2402.19146

Computing Longest Common Subsequence under Cartesian-Tree Matching Model

Two strings of the same length are said to Cartesian-tree match (CT-match) if their Cartesian-trees are isomorphic [Park et al., TCS 2020]. Cartesian-tree matching is a natural model that allows for capturing similarities of numerical sequences. Oizumi et al. [CPM 2022] showed that subsequence pattern matching under the CT-matching model can be solved in polynomial time. This article follows and extends this line of research: We present the first polynomial-time algorithm that finds the longest common subsequence under CT-matching of two given strings $S$ and $T$ of length $n$, in $O(n^6)$ time and $O(n^4)$ space for general ordered alphabets. We then show that the problem has a faster solution in the binary case, by presenting an $O(n^2 / \log n)$-time and space algorithm.
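
As background for CT-matching, two equal-length sequences Cartesian-tree match exactly when a parent-distance-style encoding of the two sequences coincides (following Park et al.); the stack-based sketch below computes such an encoding and checks the match, and is unrelated to the paper's LCS algorithm itself.

    def parent_distance(seq):
        # pd[i] is the distance to the nearest previous position holding a value
        # <= seq[i], or 0 if no such position exists.
        pd, stack = [], []   # stack keeps indices with non-decreasing values
        for i, v in enumerate(seq):
            while stack and seq[stack[-1]] > v:
                stack.pop()
            pd.append(i - stack[-1] if stack else 0)
            stack.append(i)
        return pd

    def ct_match(s, t):
        return len(s) == len(t) and parent_distance(s) == parent_distance(t)

    print(ct_match([3, 1, 4, 1, 5], [30, 10, 40, 10, 50]))   # True: same relative structure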


[244] 2402.19147

Efficient quaternion CUR method for low-rank approximation to quaternion matrix

The low-rank quaternion matrix approximation has been successfully applied in many applications involving signal processing and color image processing. However, the cost of quaternion models for generating low-rank quaternion matrix approximation is sometimes considerable due to the computation of the quaternion singular value decomposition (QSVD), which limits their application to real large-scale data. To address this deficiency, an efficient quaternion matrix CUR (QMCUR) method for low-rank approximation is suggested, which provides significant acceleration in color image processing. We first explore the QMCUR approximation method, which uses actual columns and rows of the given quaternion matrix, instead of the costly QSVD. Additionally, two different sampling strategies are used to sample the above-selected columns and rows. Then, the perturbation analysis is performed on the QMCUR approximation of noisy versions of low-rank quaternion matrices. Extensive experiments on both synthetic and real data further reveal the superiority of the proposed algorithm compared with other algorithms for getting low-rank approximation, in terms of both efficiency and accuracy.
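
To make the CUR idea concrete, here is a real-valued sketch of the generic construction (approximate A by selected columns C, selected rows R, and a least-squares core U = pinv(C) A pinv(R)); the quaternion arithmetic, the paper's specific sampling strategies, and its perturbation analysis are not reproduced.

    import numpy as np

    def cur_approximation(A, col_idx, row_idx):
        # C collects the selected columns, R the selected rows, and the core
        # U = pinv(C) @ A @ pinv(R) is the standard least-squares choice.
        C = A[:, col_idx]
        R = A[row_idx, :]
        U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
        return C @ U @ R

    # Illustrative usage on a synthetic low-rank matrix with uniform sampling.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))   # rank-5 matrix
    cols = rng.choice(80, size=10, replace=False)
    rows = rng.choice(100, size=10, replace=False)
    A_hat = cur_approximation(A, cols, rows)
    print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))   # relative approximation error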


[245] 2402.19150

Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts

Large Multimodal Models (LMMs) rely on pre-trained Vision Language Models (VLMs) and Large Language Models (LLMs) to exhibit impressive emergent abilities on various multimodal tasks in the joint space of vision and language. However, the Typographic Attack, which is known to disrupt VLMs, has also been shown to be a security vulnerability of LMMs. In this work, we first comprehensively investigate the distractibility of LMMs by typography. In particular, we introduce the Typographic Dataset designed to evaluate distractibility across various multi-modal subtasks, such as object recognition, visual attribute detection, enumeration, arithmetic computation, and commonsense reasoning. To further study the effect of typographic patterns on performance, we also scrutinize the effect of tuning various typographic factors, encompassing font size, color, opacity, and spatial positioning of typos. We discover that LMMs can partially distinguish visual contents and typos when confronting typographic attacks, which suggests that embeddings from vision encoders contain enough information to distinguish visual contents and typos in images. Inspired by such phenomena, we demonstrate that CLIP's performance of zero-shot classification on typo-ridden images can be significantly improved by providing more informative texts to match images. Furthermore, we also demonstrate that LMMs can utilize more informative prompts to leverage information in embeddings to differentiate between visual content and typos. Finally, we propose a prompt information enhancement method that can effectively mitigate the effects of typography.


[246] 2402.19155

Beyond Language Models: Byte Models are Digital World Simulators

Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next token prediction in natural language processing, we introduce bGPT, a model with next byte prediction to simulate the digital world. bGPT matches specialized models in performance across various modalities, including text, audio, and images, and offers new possibilities for predicting, simulating, and diagnosing algorithm or hardware behaviour. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte in converting ABC notation to MIDI format. In addition, bGPT demonstrates exceptional capabilities in simulating CPU behaviour, with an accuracy exceeding 99.99% in executing various operations. Leveraging next byte prediction, models like bGPT can directly learn from vast binary data, effectively simulating the intricate patterns of the digital world.


[247] 2402.19159

Trajectory Consistency Distillation

Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis. However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.


[248] 2402.19160

Effective Message Hiding with Order-Preserving Mechanisms

Message hiding, a technique that conceals secret message bits within a cover image, aims to achieve an optimal balance among message capacity, recovery accuracy, and imperceptibility. While convolutional neural networks have notably improved message capacity and imperceptibility, achieving high recovery accuracy remains challenging. This challenge arises because convolutional operations struggle to preserve the sequential order of message bits and effectively address the discrepancy between these two modalities. To address this, we propose StegaFormer, an innovative MLP-based framework designed to preserve bit order and enable global fusion between modalities. Specifically, StegaFormer incorporates three crucial components: Order-Preserving Message Encoder (OPME), Decoder (OPMD) and Global Message-Image Fusion (GMIF). OPME and OPMD aim to preserve the order of message bits by segmenting the entire sequence into equal-length segments and incorporating sequential information during encoding and decoding. Meanwhile, GMIF employs a cross-modality fusion mechanism to effectively fuse the features from the two uncorrelated modalities. Experimental results on the COCO and DIV2K datasets demonstrate that StegaFormer surpasses existing state-of-the-art methods in terms of recovery accuracy, message capacity, and imperceptibility. We will make our code publicly available.


[249] 2402.19161

MemoNav: Working Memory Model for Visual Navigation

Image-goal navigation is a challenging task that requires an agent to navigate to a goal indicated by an image in unfamiliar environments. Existing methods utilizing diverse scene memories suffer from inefficient exploration since they use all historical observations for decision-making without considering the goal-relevant fraction. To address this limitation, we present MemoNav, a novel memory model for image-goal navigation, which utilizes a working memory-inspired pipeline to improve navigation performance. Specifically, we employ three types of navigation memory. The node features on a map are stored in the short-term memory (STM), as these features are dynamically updated. A forgetting module then retains the informative STM fraction to increase efficiency. We also introduce long-term memory (LTM) to learn global scene representations by progressively aggregating STM features. Subsequently, a graph attention module encodes the retained STM and the LTM to generate working memory (WM) which contains the scene features essential for efficient navigation. The synergy among these three memory types boosts navigation performance by enabling the agent to learn and leverage goal-relevant scene features within a topological map. Our evaluation on multi-goal tasks demonstrates that MemoNav significantly outperforms previous methods across all difficulty levels in both Gibson and Matterport3D scenes. Qualitative results further illustrate that MemoNav plans more efficient routes.


[250] 2402.19163

FedStruct: Federated Decoupled Learning over Interconnected Graphs

We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where inter-connections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.


[251] 2402.19166

Conversational Language Models for Human-in-the-Loop Multi-Robot Coordination

With the increasing prevalence and diversity of robots interacting in the real world, there is a need for flexible, on-the-fly planning and cooperation. Large Language Models are starting to be explored in a multimodal setup for communication, coordination, and planning in robotics. Existing approaches generally use a single agent building a plan, or have multiple homogeneous agents coordinating for a simple task. We present a decentralised, dialogical approach in which a team of agents with different abilities plans solutions through peer-to-peer and human-robot discussion. We suggest that argument-style dialogues are an effective way to facilitate adaptive use of each agent's abilities within a cooperative team. Two robots discuss how to solve a cleaning problem set by a human, define roles, and agree on paths they each take. Each step can be interrupted by a human advisor and agents check their plans with the human. Agents then execute this plan in the real world, collecting rubbish from people in each room. Our implementation uses text at every step, maintaining transparency and effective human-multi-robot interaction.


[252] 2402.19167

Teaching Large Language Models an Unseen Language on the Fly

Existing large language models struggle to support numerous low-resource languages, particularly the extremely low-resource ones where there is minimal training data available for effective parameter updating. We thus investigate whether LLMs can learn a new language on the fly solely through prompting. To study this question, we collect a research suite for Zhuang, a language supported by no LLMs currently. We introduce \textsc{DiPMT++}, a framework for adapting LLMs to unseen languages by in-context learning. Using a dictionary and only 5K parallel sentences, \textsc{DiPMT++} significantly enhances the performance of GPT-4 from 0 to 16 BLEU for Chinese-to-Zhuang translation and achieves 32 BLEU for Zhuang-to-Chinese translation. Furthermore, we demonstrate the practical utility of this framework in aiding humans to translate completely unseen languages, which could contribute to the preservation of linguistic diversity.


[253] 2402.19168

Disturbance Decoupling Problem for $n$-link chain pendulum on a cart system

A disturbance decoupling problem for an $n$-link chain pendulum on a cart is considered. A model of the system, developed in a coordinate-free framework, and its linearized equations are taken from [1]. It is shown that it is possible to design a suitable state feedback such that the angular position or velocity of the $n^{th}$ link can always be decoupled from the disturbance acting on the cart.


[254] 2402.19170

Improving Legal Judgement Prediction in Romanian with Long Text Encoders

In recent years, the entire field of Natural Language Processing (NLP) has enjoyed amazing novel results, achieving almost human-like performance on a variety of tasks. The legal NLP domain has also been part of this process, as it has seen impressive growth. However, general-purpose models are not readily applicable to the legal domain. Due to the nature of the domain (e.g. specialized vocabulary, long documents), specific models and methods are often needed for legal NLP. In this work we investigate both specialized and general models for predicting the final ruling of a legal case, a task known as Legal Judgment Prediction (LJP). We particularly focus on methods to extend the sequence length of Transformer-based models to better understand the long documents present in legal corpora. Extensive experiments on 4 LJP datasets in Romanian, originating from 2 sources with significantly different sizes and document lengths, show that specialized models and handling long texts are critical for good performance.


[255] 2402.19171

Towards Assessing Spread in Sets of Software Architecture Designs

Several approaches have recently used automated techniques to generate architecture design alternatives by means of optimization techniques. These approaches aim at improving an initial architecture with respect to quality aspects, such as performance, reliability, or maintainability. In this context, each optimization experiment usually produces a different set of architecture alternatives that is characterized by specific settings. As a consequence, the designer is left with the task of comparing such sets to identify the settings that lead to better solution sets for the problem. To assess the quality of solution sets, multi-objective optimization commonly relies on quality indicators. Among these, the quality indicator for the maximum spread estimates the diversity of the generated alternatives, providing a measure of how much of the solution space has been explored. However, the maximum spread indicator is computed only on the objective space and does not consider architectural information (e.g., components structure, design decisions) from the architectural space. In this paper, we propose a quality indicator for the spread that assesses the diversity of alternatives by taking into account architectural features. To compute the spread, we rely on a notion of distance between alternatives according to the way they were generated during the optimization. We demonstrate how our architectural quality indicator can be applied to a dataset from the literature.
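
For reference, the sketch below computes the conventional objective-space maximum spread together with a simple pairwise-distance-based spread into which an architecture-aware distance could be plugged; it is only an illustration of the kind of indicator discussed, not the indicator proposed in the paper.

    import math

    def maximum_spread(objective_vectors):
        # Classical objective-space maximum spread: the diagonal of the bounding
        # box spanned by the solution set in objective space.
        dims = len(objective_vectors[0])
        extent_sq = 0.0
        for d in range(dims):
            values = [v[d] for v in objective_vectors]
            extent_sq += (max(values) - min(values)) ** 2
        return math.sqrt(extent_sq)

    def spread_with_distance(solutions, dist):
        # Spread as the largest pairwise distance under a pluggable distance
        # function; an architecture-aware dist could be substituted here.
        return max(dist(a, b) for i, a in enumerate(solutions)
                   for b in solutions[i + 1:])

    # Illustrative usage with three solutions evaluated on two objectives.
    front = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.1)]
    print(maximum_spread(front))
    print(spread_with_distance(front, lambda a, b: math.dist(a, b)))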


[256] 2402.19173

StarCoder 2 and The Stack v2: The Next Generation

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.


[257] 2402.19180

ModZoo: A Large-Scale Study of Modded Android Apps and their Markets

We present the results of the first large-scale study into Android markets that offer modified or modded apps: apps whose features and functionality have been altered by a third-party. We analyse over 146k (thousand) apps obtained from 13 of the most popular modded app markets. Around 90% of apps we collect are altered in some way when compared to the official counterparts on Google Play. Modifications include games cheats, such as infinite coins or lives; mainstream apps with premium features provided for free; and apps with modified advertising identifiers or excluded ads. We find the original app developers lose significant potential revenue due to: the provision of paid for apps for free (around 5% of the apps across all markets); the free availability of premium features that require payment in the official app; and modified advertising identifiers. While some modded apps have all trackers and ads removed (3%), in general, the installation of these apps is significantly more risky for the user than the official version: modded apps are ten times more likely to be marked as malicious and often request additional permissions.


[258] 2402.19184

Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR

As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators for linear algebra problems. By leveraging specific compiler optimizations, we can further increase accelerator utilization. In this work we offer two key observations through a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%, and second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host code optimizations to improve the under-utilization and the latency bottleneck. Therefore, we propose three key host-code data-movement-related optimizations, extending AXI4MLIR. The optimizations provide DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages.


[259] 2402.19186

Disentangling representations of retinal images with generative models

Retinal fundus images play a crucial role in the early detection of eye diseases and, using deep learning approaches, recent studies have even demonstrated their potential for detecting cardiovascular risk factors and neurological disorders. However, the impact of technical factors on these images can pose challenges for reliable AI applications in ophthalmology. For example, large fundus cohorts are often confounded by factors like camera type, image quality or illumination level, bearing the risk of learning shortcuts rather than the causal relationships behind the image generation process. Here, we introduce a novel population model for retinal fundus images that effectively disentangles patient attributes from camera effects, thus enabling controllable and highly realistic image generation. To achieve this, we propose a novel disentanglement loss based on distance correlation. Through qualitative and quantitative analyses, we demonstrate the effectiveness of this novel loss function in disentangling the learned subspaces. Our results show that our model provides a new perspective on the complex relationship between patient attributes and technical confounders in retinal fundus image generation.
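
A minimal sketch of a distance-correlation penalty between two latent subspaces, following the standard Székely-Rizzo sample estimator; batch sizes, subspace dimensions, and the way the penalty enters the full training objective are assumptions, not the paper's exact loss.

    import torch

    def distance_correlation(x, y, eps=1e-9):
        # x: (n, p), y: (n, q); returns a scalar in [0, 1] that vanishes (in the
        # population limit) only for independent samples, making it a natural
        # penalty for decorrelating two latent subspaces.
        def centered_dist(z):
            d = torch.cdist(z, z)   # pairwise Euclidean distances
            return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
        a, b = centered_dist(x), centered_dist(y)
        dcov2 = (a * b).mean()
        dvar_x = (a * a).mean()
        dvar_y = (b * b).mean()
        return torch.sqrt(dcov2.clamp(min=0) / (torch.sqrt(dvar_x * dvar_y) + eps))

    # Illustrative usage: penalize dependence between two latent subspaces.
    z_patient = torch.randn(32, 16)
    z_camera = torch.randn(32, 8)
    penalty = distance_correlation(z_patient, z_camera)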


[260] 2402.19189

Link Recommendation to Augment Influence Diffusion with Provable Guarantees

Link recommendation systems in online social networks (OSNs), such as Facebook's ``People You May Know'', Twitter's ``Who to Follow'', and Instagram's ``Suggested Accounts'', facilitate the formation of new connections among users. This paper addresses the challenge of link recommendation for the purpose of social influence maximization. In particular, given a graph $G$ and the seed set $S$, our objective is to select $k$ edges that connect seed nodes and ordinary nodes to optimize the influence dissemination of the seed set. This problem, referred to as influence maximization with augmentation (IMA), has been proven to be NP-hard. In this paper, we propose an algorithm, namely \textsf{AIS}, consisting of an efficient estimator for augmented influence estimation and an accelerated sampling approach. \textsf{AIS} provides a $(1-1/\mathrm{e}-\varepsilon)$-approximate solution with a high probability of $1-\delta$, and runs in $O(k^2 (m+n) \log (n / \delta) / \varepsilon^2 + k \left|E_{\mathcal{C}}\right|)$ time assuming that the influence of any singleton node is smaller than that of the seed set. To the best of our knowledge, this is the first algorithm that can be implemented on large graphs containing millions of nodes while preserving strong theoretical guarantees. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed algorithm.


[261] 2402.19191

An asymptotic-preserving method for the three-temperature radiative transfer model

We present an asymptotic-preserving (AP) numerical method for solving the three-temperature radiative transfer model, which holds significant importance in inertial confinement fusion. A carefully designed splitting method is developed that provides a general framework for extending AP schemes for the gray radiative transport equation to the more complex three-temperature radiative transfer model. The proposed scheme captures two important limiting models: the three-temperature radiation diffusion equation (3TRDE) when opacity approaches infinity and the two-temperature limit when the ion-electron coupling coefficient goes to infinity. We rigorously demonstrate the AP property and energy conservation characteristics of the proposed scheme, and its efficiency has been validated through a series of numerical benchmark tests.


[262] 2402.19194

High Expectations: An Observational Study of Programming and Cannabis Intoxication

Anecdotal evidence of cannabis use by professional programmers abounds. Recent studies have found that some professionals regularly use cannabis while programming even for work-related tasks. However, accounts of the impacts of cannabis on programming vary widely and are often contradictory. For example, some programmers claim that it impairs their ability to generate correct solutions while others claim it enhances creativity and focus. There remains a need for an empirical understanding of the true impacts of cannabis on programming. This paper presents the first controlled observational study of the effects of cannabis on programming ability. Based on a within-subjects design with over 70 participants, we find that at ecologically valid dosages, cannabis significantly impairs programming performance. Programs implemented while high contain more bugs and take longer to write (p < 0.05), a small to medium effect (0.22 <= d <= 0.44). We also did not find any evidence that high programmers generate more divergent solutions. However, programmers can accurately assess differences in their programming performance (r = 0.59), even when under the influence of cannabis. We hope that this research will facilitate evidence-based policies and help developers make informed decisions regarding cannabis use while programming.


[263] 2402.19195

Negative Sampling in Knowledge Graph Representation Learning: A Review

Knowledge graph representation learning (KGRL) or knowledge graph embedding (KGE) plays a crucial role in AI applications for knowledge construction and information exploration. These models aim to encode entities and relations present in a knowledge graph into a lower-dimensional vector space. During the training process of KGE models, using positive and negative samples becomes essential for discrimination purposes. However, obtaining negative samples directly from existing knowledge graphs poses a challenge, emphasizing the need for effective generation techniques. The quality of these negative samples greatly impacts the accuracy of the learned embeddings, making their generation a critical aspect of KGRL. This comprehensive survey paper systematically reviews various negative sampling (NS) methods and their contributions to the success of KGRL. Their respective advantages and disadvantages are outlined by categorizing existing NS methods into five distinct categories. Moreover, this survey identifies open research questions that serve as potential directions for future investigations. By offering a generalization and alignment of fundamental NS concepts, this survey provides valuable insights for designing effective NS methods in the context of KGRL and serves as a motivating force for further advancements in the field.
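
As a concrete illustration of the simplest NS strategy covered by such surveys, the sketch below performs uniform negative sampling by corrupting the head or tail of each training triple and filtering out accidental positives. It is a generic example, not code from the survey, and the helper name is ours.

    import random

    def corrupt_triples(triples, num_entities, n_neg=1, seed=0):
        """Uniform negative sampling: corrupt the head or tail of each (h, r, t) triple."""
        rng = random.Random(seed)
        positives = set(triples)          # used to filter false negatives
        negatives = []
        for h, r, t in triples:
            for _ in range(n_neg):
                while True:
                    if rng.random() < 0.5:
                        cand = (rng.randrange(num_entities), r, t)   # corrupt head
                    else:
                        cand = (h, r, rng.randrange(num_entities))   # corrupt tail
                    if cand not in positives:
                        negatives.append(cand)
                        break
        return negatives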


[264] 2402.19196

Generative models struggle with kirigami metamaterials

Generative machine learning models have shown notable success in identifying architectures for metamaterials - materials whose behavior is determined primarily by their internal organization - that match specific target properties. By examining kirigami metamaterials, in which dependencies between cuts yield complex design restrictions, we demonstrate that this perceived success in the employment of generative models for metamaterials might be akin to survivorship bias. We assess the performance of the four most popular generative models - the Variational Autoencoder (VAE), the Generative Adversarial Network (GAN), the Wasserstein GAN (WGAN), and the Denoising Diffusion Probabilistic Model (DDPM) - in generating kirigami structures. Prohibiting cut intersections can prevent the identification of an appropriate similarity measure for kirigami metamaterials, significantly impacting the effectiveness of VAE and WGAN, which rely on the Euclidean distance - a metric shown to be unsuitable for considered geometries. This imposes significant limitations on employing modern generative models for the creation of diverse metamaterials.


[265] 2402.19197

Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction

Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structure-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized on in the training process to improve results. Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS.


[266] 2402.19199

Rewriting and Inductive Reasoning

Rewriting techniques based on reduction orderings generate "just enough" consequences to retain first-order completeness. This is ideal for superposition-based first-order theorem proving, but for at least one approach to inductive reasoning we show that we are missing crucial consequences. We therefore extend the superposition calculus with rewriting-based techniques to generate sufficient consequences for automating induction in saturation. When applying our work within the unit-equational fragment, our experiments with the theorem prover Vampire show significant improvements for inductive reasoning.


[267] 2402.19200

PRSA: Prompt Reverse Stealing Attacks against Large Language Models

Prompts, recognized as crucial intellectual property, enable large language models (LLMs) to perform specific tasks without the need for fine-tuning, underscoring their escalating importance. With the rise of prompt-based services, such as prompt marketplaces and LLM applications, providers often display prompts' capabilities through input-output examples to attract users. However, this paradigm raises a pivotal security concern: does the exposure of input-output pairs pose the risk of potential prompt leakage, infringing on the intellectual property rights of the developers? To our knowledge, this problem has not yet been comprehensively explored. To remedy this gap, in this paper we perform the first in-depth exploration and propose a novel attack framework for reverse-stealing prompts against commercial LLMs, namely PRSA. The main idea of PRSA is that, by analyzing the critical features of the input-output pairs, we mimic and gradually infer (steal) the target prompts. In detail, PRSA mainly consists of two key phases: prompt mutation and prompt pruning. In the mutation phase, we propose a prompt attention algorithm based on differential feedback to capture these critical features for effectively inferring the target prompts. In the prompt pruning phase, we identify and mask the words dependent on specific inputs, enabling the prompts to accommodate diverse inputs for generalization. Through extensive evaluation, we verify that PRSA poses a severe threat in real-world scenarios. We have reported these findings to prompt service providers and actively collaborate with them to take protective measures for prompt copyright.


[268] 2402.19204

PeLLE: Encoder-based language models for Brazilian Portuguese based on open data

In this paper we present PeLLE, a family of large language models based on the RoBERTa architecture, for Brazilian Portuguese, trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe details of the pretraining of the models. We also evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting the performance of large versus smaller-but-curated pretrained models in several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller-but-curated data in their pretraining.


[269] 2402.19218

Memory-Augmented Generative Adversarial Transformers

Conversational AI systems that rely on Large Language Models, like Transformers, have difficulty interweaving external data (like facts) with the language they generate. Vanilla Transformer architectures are not designed for answering factual questions with high accuracy. This paper investigates a possible route for addressing this problem. We propose to extend the standard Transformer architecture with an additional memory bank holding extra information (such as facts drawn from a knowledge base), and an extra attention layer for addressing this memory. We add this augmented memory to a Generative Adversarial Network-inspired Transformer architecture. This setup allows for implementing arbitrary felicity conditions on the generated language of the Transformer. We first demonstrate how this machinery can be deployed for handling factual questions in goal-oriented dialogues. Secondly, we demonstrate that our approach can be useful for applications like {\it style adaptation} as well: the adaptation of utterances according to certain stylistic (external) constraints, like social properties of human interlocutors in dialogues.
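
A minimal PyTorch sketch of the kind of extra attention layer described here, attending from the generator's hidden states to an external memory bank of fact embeddings, is shown below. The module name, dimensions, and residual fusion are our assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class MemoryAttention(nn.Module):
        """Attend from decoder hidden states to an external memory bank of fact embeddings."""
        def __init__(self, d_model, d_mem):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_mem, d_model)
            self.v = nn.Linear(d_mem, d_model)

        def forward(self, hidden, memory):
            # hidden: (batch, seq, d_model); memory: (batch, slots, d_mem)
            q, k, v = self.q(hidden), self.k(memory), self.v(memory)
            scores = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
            return hidden + scores @ v   # residual fusion of retrieved facts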


[270] 2402.19222

Airport take-off and landing optimization through genetic algorithms

This research addresses the crucial issue of pollution from aircraft operations, focusing on optimizing both gate allocation and runway scheduling simultaneously, a novel approach not previously explored. The study presents an innovative genetic algorithm-based method for minimizing pollution from fuel combustion during aircraft take-off and landing at airports. This algorithm uniquely integrates the optimization of both landing gates and take-off/landing runways, considering the correlation between engine operation time and pollutant levels. The approach employs advanced constraint handling techniques to manage the intricate time and resource limitations inherent in airport operations. Additionally, the study conducts a thorough sensitivity analysis of the model, with a particular emphasis on the mutation factor and the type of penalty function, to fine-tune the optimization process. This dual-focus optimization strategy represents a significant advancement in reducing environmental impact in the aviation sector, establishing a new standard for comprehensive and efficient airport operation management.


[271] 2402.19223

Edit and Alphabet-Ordering Sensitivity of Lex-parse

We investigate the compression sensitivity [Akagi et al., 2023] of lex-parse [Navarro et al., 2021] for two operations: (1) single character edit and (2) modification of the alphabet ordering, and give tight upper and lower bounds for both operations. For both lower bounds, we use the family of Fibonacci words. For the bounds on edit operations, our analysis makes heavy use of properties of the Lyndon factorization of Fibonacci words to characterize the structure of lex-parse.


[272] 2402.19226

Investigating Gender Fairness in Machine Learning-driven Personalized Care for Chronic Pain

This study investigates gender fairness in personalized pain care recommendations using machine learning algorithms. Leveraging a contextual bandits framework, personalized recommendations are formulated and evaluated using the LinUCB algorithm on a dataset comprising interactions with $164$ patients across $10$ sessions each. Results indicate that while adjustments to algorithm parameters influence the quality of pain care recommendations, this impact remains consistent across genders. However, when certain patient information, such as self-reported pain measurements, is absent, the quality of pain care recommendations for women is notably inferior to that for men.
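
For reference, a minimal NumPy sketch of the disjoint LinUCB algorithm named here follows. The class interface and hyperparameter names are ours; in this setting the contexts and rewards would come from the patient interaction data.

    import numpy as np

    class LinUCB:
        """Disjoint LinUCB: one ridge-regression model per arm plus a UCB exploration bonus."""
        def __init__(self, n_arms, dim, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(dim) for _ in range(n_arms)]      # per-arm Gram matrices
            self.b = [np.zeros(dim) for _ in range(n_arms)]    # per-arm reward vectors

        def select(self, x):
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b                              # ridge estimate for this arm
                scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
            return int(np.argmax(scores))

        def update(self, arm, x, reward):
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x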


[273] 2402.19229

CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition

Existing activity tracker datasets for human activity recognition are typically obtained by having participants perform predefined activities in an enclosed environment under supervision. This results in small datasets with a limited number of activities and heterogeneity, lacking the mixed and nuanced movements normally found in free-living scenarios. As such, models trained on laboratory-style datasets may not generalise out of sample. To address this problem, we introduce a new dataset involving wrist-worn accelerometers, wearable cameras, and sleep diaries, enabling data collection for over 24 hours in a free-living setting. The result is CAPTURE-24, a large activity tracker dataset collected in the wild from 151 participants, amounting to 3883 hours of accelerometer data, of which 2562 hours are annotated. CAPTURE-24 is two to three orders of magnitude larger than existing publicly available datasets, which is critical to developing accurate human activity recognition models.


[274] 2402.19231

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the self-attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. Our method achieves 94.5% R@1 on Pitts30k using 512-dim global features. The code is released at https://github.com/Lu-Feng/CricaVPR.


[275] 2402.19232

Trained Random Forests Completely Reveal your Dataset

We introduce an optimization-based reconstruction attack capable of completely or near-completely reconstructing a dataset utilized for training a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. To achieve this, we formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We demonstrate that this problem is NP-hard, though solvable at scale using constraint programming -- an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. This holds true even with a small number of trees. Even with bootstrap aggregation, the majority of the data can also be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, warranting attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.


[276] 2402.19233

Shared lightweight autonomous vehicles for urban food deliveries: A simulation study

In recent years, the rapid growth of on-demand deliveries, especially in food deliveries, has spurred the exploration of innovative mobility solutions. In this context, lightweight autonomous vehicles have emerged as a potential alternative. However, their fleet-level behavior remains largely unexplored. To address this gap, we have developed an agent-based model and an environmental impact study assessing the fleet performance of lightweight autonomous food delivery vehicles. This model explores critical factors such as fleet sizing, service level, operational strategies, and environmental impacts. We have applied this model to a case study in Cambridge, MA, USA, where results indicate that there could be environmental benefits in replacing traditional car-based deliveries with shared lightweight autonomous vehicle fleets. Lastly, we introduce an interactive platform that offers a user-friendly means of comprehending the model's performance and potential trade-offs, which can help inform decision-makers in the evolving landscape of food delivery innovation.


[277] 2402.19237

Context-based Interpretable Spatio-Temporal Graph Convolutional Network for Human Motion Forecasting

Human motion prediction is still an open problem that is extremely important for autonomous driving and safety applications. Due to the complex spatiotemporal relations of motion sequences, this remains a challenging problem not only for movement prediction but also for performing a preliminary interpretation of the joint connections. In this work, we present a Context-based Interpretable Spatio-Temporal Graph Convolutional Network (CIST-GCN), an efficient 3D human pose forecasting model based on GCNs that encompasses specific layers, aiding model interpretability and providing information that might be useful when analyzing motion distribution and body behavior. Our architecture extracts meaningful information from pose sequences, aggregates displacements and accelerations into the input model, and finally predicts the output displacements. Extensive experiments on the Human 3.6M, AMASS, 3DPW, and ExPI datasets demonstrate that CIST-GCN outperforms previous methods in human motion prediction and robustness. Since the idea of enhancing interpretability for motion prediction has its merits, we showcase experiments towards it and provide preliminary evaluations of such insights here. Code available at: https://github.com/QualityMinds/cistgcn


[278] 2402.19242

Derivative-enhanced Deep Operator Network

Deep operator networks (DeepONets), a class of neural operators that learn mappings between function spaces, have recently been developed as surrogate models for parametric partial differential equations (PDEs). In this work we propose a derivative-enhanced deep operator network (DE-DeepONet), which leverages the derivative information to enhance the prediction accuracy, and provide a more accurate approximation of the derivatives, especially when the training data are limited. DE-DeepONet incorporates dimension reduction of input into DeepONet and includes two types of derivative labels in the loss function for training, that is, the directional derivatives of the output function with respect to the input function and the gradient of the output function with respect to the physical domain variables. We test DE-DeepONet on three different equations with increasing complexity to demonstrate its effectiveness compared to the vanilla DeepONet.


[279] 2402.19248

Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark

How to better evaluate the capabilities of Large Language Models (LLMs) is a focal point and hot topic in current LLM research. Previous work has noted that, due to the extremely high cost of iterative updates of LLMs, they are often unable to answer the latest dynamic questions well. To promote the improvement of Chinese LLMs' ability to answer dynamic questions, in this paper we introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet. We obtain high-quality data through a pipeline that combines humans and models, and carefully classify the samples according to the frequency of answer changes to facilitate a more fine-grained observation of LLMs' capabilities. We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA. Extensive experiments and valuable insights suggest that our proposed CDQA is challenging and worthy of further study. We believe that the benchmark we provide will become a key data resource for improving LLMs' Chinese question-answering ability in the future.


[280] 2402.19249

Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting

The ability to reuse collected data and transfer trained policies between robots could alleviate the burden of additional data collection and training. While existing approaches such as pretraining plus finetuning and co-training show promise, they do not generalize to robots unseen in training. Focusing on common robot arms with similar workspaces and 2-jaw grippers, we investigate the feasibility of zero-shot transfer. Through simulation studies on 8 manipulation tasks, we find that state-based Cartesian control policies can successfully zero-shot transfer to a target robot after accounting for forward dynamics. To address robot visual disparities for vision-based policies, we introduce Mirage, which uses "cross-painting"--masking out the unseen target robot and inpainting the seen source robot--during execution in real time so that it appears to the policy as if the trained source robot were performing the task. Despite its simplicity, our extensive simulation and physical experiments provide strong evidence that Mirage can successfully zero-shot transfer between different robot arms and grippers with only minimal performance degradation on a variety of manipulation tasks such as picking, stacking, and assembly, significantly outperforming a generalist policy. Project website: https://robot-mirage.github.io/


[281] 2402.19250

Feature boosting with efficient attention for scene parsing

The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relation between scene elements while succeeding in identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
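
The paper's channel attention module is not specified in this abstract; below is a generic squeeze-and-excitation style channel-gating sketch in PyTorch that illustrates the general idea of boosting relevant feature channels while attenuating others. Treat it as an assumption-laden stand-in, not the proposed module.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze-and-excitation style gating: per-channel weights from globally pooled context."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):                      # x: (batch, channels, h, w)
            w = self.fc(x.mean(dim=(2, 3)))        # squeeze spatial dims, excite channels
            return x * w[:, :, None, None]         # boost relevant channels, attenuate others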


[282] 2402.19251

A Cognitive-Based Trajectory Prediction Approach for Autonomous Driving

In autonomous vehicle (AV) technology, the ability to accurately predict the movements of surrounding vehicles is paramount for ensuring safety and operational efficiency. Incorporating human decision-making insights enables AVs to more effectively anticipate the potential actions of other vehicles, significantly improving prediction accuracy and responsiveness in dynamic environments. This paper introduces the Human-Like Trajectory Prediction (HLTP) model, which adopts a teacher-student knowledge distillation framework inspired by human cognitive processes. The "teacher" model, equipped with an adaptive visual sector, mimics the visual processing of the human brain, particularly the functions of the occipital and temporal lobes. The "student" model focuses on real-time interaction and decision-making, drawing parallels to prefrontal and parietal cortex functions. This approach allows for dynamic adaptation to changing driving scenarios, capturing essential perceptual cues for accurate prediction. Evaluated using the Macao Connected and Autonomous Driving (MoCAD) dataset, along with the NGSIM and HighD benchmarks, HLTP demonstrates superior performance compared to existing models, particularly in challenging environments with incomplete data. The project page is available on GitHub.


[283] 2402.19254

Machine learning for modular multiplication

Motivated by cryptographic applications, we investigate two machine learning approaches to modular multiplication: namely circular regression and a sequence-to-sequence transformer model. The limited success of both methods demonstrated in our results gives evidence for the hardness of tasks involving modular multiplication upon which cryptosystems are based.
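
A small sketch of the target encoding a circular-regression approach might use: the residue (a * s) mod N, for a hypothetical fixed multiplier s, is mapped to a point on the unit circle so that a regressor can be trained on continuous targets, and predictions are decoded back to residues. The function names and the fixed-multiplier setup are our illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def circular_targets(a, s, N):
        """Encode (a * s) mod N as a point on the unit circle (the circular-regression target)."""
        theta = 2 * np.pi * ((a * s) % N) / N
        return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

    def decode_residue(pred, N):
        """Map a predicted (cos, sin) pair back to an integer residue mod N."""
        theta = np.arctan2(pred[..., 1], pred[..., 0]) % (2 * np.pi)
        return np.rint(theta * N / (2 * np.pi)).astype(int) % N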


[284] 2402.19255

GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there are increasing debates regarding whether these models truly understand and apply mathematical knowledge or merely rely on shortcuts for mathematical reasoning. One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations. We introduce the adversarial grade school math (GSM-Plus) dataset, an extension of GSM8K augmented with various mathematical perturbations. Our experiments on 25 LLMs and 4 prompting techniques show that while LLMs exhibit different levels of math reasoning abilities, their performances are far from robust. In particular, even for problems that have been solved in GSM8K, LLMs can make mistakes when new statements are added or the question targets are altered. We also explore whether more robust performance can be achieved by composing existing prompting methods, in which we try an iterative method that generates and verifies each intermediate thought based on its reasoning goal and calculation result. Code and data are available at \url{https://github.com/qtli/GSM-Plus}.


[285] 2402.19257

More algorithmic results for problems of spread of influence in edge-weighted graphs with and without incentives

Many phenomena in real-world social networks are interpreted as the spread of influence between activated and non-activated network elements. These phenomena are modeled by combinatorial graphs, where vertices represent the elements and edges represent social ties between elements. A main problem is to study important subsets of elements (target sets or dynamic monopolies) such that their activation spreads to the entire network. In edge-weighted networks the influence between two adjacent vertices depends on the weight of their edge. In models with incentives, the main problem is to minimize the total amount of incentives (called optimal target vectors) which can be offered to vertices such that some vertices are activated and their activation spreads to the whole network. The algorithmic study of target sets and vectors is an active research field. We prove an inapproximability result for optimal target sets in edge-weighted networks, even for complete graphs. Some other hardness and polynomial-time results are presented for optimal target vectors and degenerate threshold assignments in edge-weighted networks.


[286] 2402.19258

MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition

Human activity recognition (HAR) has been playing an increasingly important role in various domains such as healthcare, security monitoring, and metaverse gaming. Though numerous HAR methods based on computer vision have been developed and show prominent performance, they still suffer from poor robustness in adverse visual conditions, in particular low illumination, which motivates WiFi-based HAR to serve as a good complementary modality. Existing solutions using WiFi and vision modalities rely on massive labeled data that are very cumbersome to collect. In this paper, we propose a novel unsupervised multimodal HAR solution, MaskFi, that leverages only unlabeled video and WiFi activity data for model training. We propose a new algorithm, masked WiFi-vision modeling (MI2M), that enables the model to learn cross-modal and single-modal features by predicting the masked sections in representation learning. Benefiting from our unsupervised learning procedure, the network requires only a small amount of annotated data for finetuning and can adapt to new environments with better performance. We conduct extensive experiments on two WiFi-vision datasets collected in-house, and our method achieves robust and accurate human activity recognition and human identification.


[287] 2402.19259

Total Completion Time Scheduling Under Scenarios

Scheduling jobs with given processing times on identical parallel machines so as to minimize their total completion time is one of the most basic scheduling problems. We study interesting generalizations of this classical problem involving scenarios. In our model, a scenario is defined as a subset of a predefined and fully specified set of jobs. The aim is to find an assignment of the whole set of jobs to identical parallel machines such that the schedule, obtained for the given scenarios by simply skipping the jobs not in the scenario, optimizes a function of the total completion times over all scenarios. While the underlying scheduling problem without scenarios can be solved efficiently by a simple greedy procedure (SPT rule), scenarios, in general, make the problem NP-hard. We paint an almost complete picture of the evolving complexity landscape, drawing the line between easy and hard. One of our main algorithmic contributions relies on a deep structural result on the maximum imbalance of an optimal schedule, based on a subtle connection to Hilbert bases of a related convex cone.
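
For the scenario-free base problem mentioned here, the SPT rule is a short greedy procedure: sort jobs by processing time and always assign the next job to the currently least-loaded machine. A minimal Python sketch (our naming) is given below.

    def spt_total_completion_time(processing_times, num_machines):
        """Shortest Processing Time rule: assign jobs in nondecreasing order to the least-loaded machine."""
        loads = [0.0] * num_machines          # current completion time on each machine
        total_completion = 0.0
        for p in sorted(processing_times):
            m = min(range(num_machines), key=loads.__getitem__)
            loads[m] += p                     # the job completes at the machine's new load
            total_completion += loads[m]
        return total_completion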


[288] 2402.19262

Masks, Signs, And Learning Rate Rewinding

Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it not only more effective in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.
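
A compressed sketch of one pruning round that such comparisons rely on: global magnitude pruning in PyTorch, after which LRR keeps the trained weights and only restarts the learning-rate schedule, whereas IMP additionally rewinds the weights to their early-training values. The helper below is our illustration, not the paper's code.

    import torch

    def global_magnitude_mask(model, sparsity):
        """Mask the smallest-magnitude parameters across all layers (one pruning round)."""
        flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
        k = int(sparsity * flat.numel())
        threshold = torch.kthvalue(flat, k).values if k > 0 else torch.tensor(0.0)
        return {name: (p.detach().abs() > threshold).float()
                for name, p in model.named_parameters()}

    # One LRR round (sketch): train -> mask = global_magnitude_mask(model, s) -> keep the
    # trained weights, multiply gradients/weights by the mask, and restart the LR schedule.
    # IMP would instead also reset the surviving weights to a stored early checkpoint.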


[289] 2402.19263

Spinal Osteophyte Detection via Robust Patch Extraction on minimally annotated X-rays

The development and progression of arthritis is strongly associated with osteophytes, which are small and elusive bone growths. This paper presents one of the first efforts towards automated osteophyte detection in spinal X-rays. A novel automated patch extraction process, called SegPatch, is proposed based on deep learning-driven vertebrae segmentation and the enlargement of mask contours. A final patch classification accuracy of 84.5% is secured, surpassing a baseline tiling-based patch generation technique by 9.5%. This demonstrates that even with limited annotations, SegPatch can deliver superior performance for the detection of tiny structures such as osteophytes. The proposed approach has the potential to assist clinicians in expediting the process of manually identifying osteophytes in spinal X-rays.


[290] 2402.19264

T3DNet: Compressing Point Cloud Models for Lightweight 3D Recognition

3D point clouds have been widely used in many mobile application scenarios, including autonomous driving and 3D sensing on mobile devices. However, existing 3D point cloud models tend to be large and cumbersome, making them hard to deploy on edge devices due to their high memory requirements and non-real-time latency. There has been a lack of research on how to compress 3D point cloud models into lightweight models. In this paper, we propose a method called T3DNet (Tiny 3D Network with augmEntation and disTillation) to address this issue. We find that the tiny model after network augmentation is much easier for a teacher to distill. Instead of gradually reducing the parameters through techniques such as pruning or quantization, we pre-define a tiny model and improve its performance through auxiliary supervision from augmented networks and the original model. We evaluate our method on several public datasets, including ModelNet40, ShapeNet, and ScanObjectNN. Our method can achieve high compression rates without significant accuracy sacrifice, achieving state-of-the-art performance on the three datasets compared to existing methods. Notably, our T3DNet is 58 times smaller and 54 times faster than the original model, yet with only a 1.4% accuracy drop on the ModelNet40 dataset.


[291] 2402.19265

Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach

Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They allow modeling state uncertainty as a belief probability distribution. Approximate solvers based on Monte Carlo sampling show great success in relaxing the computational demand and performing online planning. However, scaling to complex realistic domains with many actions and long planning horizons is still a major challenge, and a key point to achieve good performance is guiding the action-selection process with domain-dependent policy heuristics which are tailored for the specific application domain. We propose to learn high-quality heuristics from POMDP traces of executions generated by any solver. We convert the belief-action pairs to a logical semantics, and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We thoroughly evaluate our methodology on two notoriously challenging POMDP problems, involving large action spaces and long planning horizons, namely, rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time. Moreover, they generalize well to more challenging scenarios not experienced in the training phase (e.g., increasing rocks and grid size in rocksample, increasing the size of the map and the aggressiveness of ghosts in pocman).


[292] 2402.19267

Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation

Employing extensive datasets enables the training of multilingual machine translation models; however, these models often fail to accurately translate sentences within specialized domains. Although obtaining and translating domain-specific data incurs high costs, it is necessary for high-quality translations. Hence, finding the most 'effective' data in an unsupervised setting becomes a practical strategy for reducing labeling costs. Recent research indicates that such effective data can be found by selecting 'properly difficult data' based on its volume. This means the data should not be excessively challenging or overly simplistic, especially if the amount of data is limited. However, we found that establishing a criterion for unsupervised data selection remains challenging, as the 'proper difficulty' might vary depending on the data domain being trained on. We introduce a novel unsupervised data selection method, 'Capturing Perplexing Named Entities', which adopts the maximum inference entropy in translated named entities as a selection measure. The motivation is that named entities in domain-specific data are considered the most complex portion of the data and should be predicted with high confidence. When verified with the 'Korean-English Parallel Corpus of Specialized Domains', our method served as robust guidance for unsupervised data selection, in contrast to existing methods.


[293] 2402.19270

Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching

Geometric knowledge has been shown to be beneficial for the stereo matching task. However, prior attempts to integrate geometric insights into stereo matching algorithms have largely focused on geometric knowledge from single images while crucial cross-view factors such as occlusion and matching uniqueness have been overlooked. To address this gap, we propose a novel Intra-view and Cross-view Geometric knowledge learning Network (ICGNet), specifically crafted to assimilate both intra-view and cross-view geometric knowledge. ICGNet harnesses the power of interest points to serve as a channel for intra-view geometric understanding. Simultaneously, it employs the correspondences among these points to capture cross-view geometric relationships. This dual incorporation empowers the proposed ICGNet to leverage both intra-view and cross-view geometric knowledge in its learning process, substantially improving its ability to estimate disparities. Our extensive experiments demonstrate the superiority of the ICGNet over contemporary leading models.


[294] 2402.19273

PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval

In the field of urban planning, general-purpose large language models often struggle to meet the specific needs of planners. Tasks like generating urban planning texts, retrieving related information, and evaluating planning documents pose unique challenges. To enhance the efficiency of urban professionals and overcome these obstacles, we introduce PlanGPT, the first specialized Large Language Model tailored for urban and spatial planning. Developed through collaborative efforts with institutions like the Chinese Academy of Urban Planning, PlanGPT leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities. Empirical tests demonstrate that PlanGPT has achieved advanced performance, delivering responses of superior quality precisely tailored to the intricacies of urban planning.


[295] 2402.19275

Adaptive Testing Environment Generation for Connected and Automated Vehicles with Dense Reinforcement Learning

The assessment of safety performance plays a pivotal role in the development and deployment of connected and automated vehicles (CAVs). A common approach involves designing testing scenarios based on prior knowledge of CAVs (e.g., surrogate models), conducting tests in these scenarios, and subsequently evaluating CAVs' safety performances. However, substantial differences between CAVs and the prior knowledge can significantly diminish the evaluation efficiency. In response to this issue, existing studies predominantly concentrate on the adaptive design of testing scenarios during the CAV testing process. Yet, these methods have limitations in their applicability to high-dimensional scenarios. To overcome this challenge, we develop an adaptive testing environment that bolsters evaluation robustness by incorporating multiple surrogate models and optimizing the combination coefficients of these surrogate models to enhance evaluation efficiency. We formulate the optimization problem as a regression task utilizing quadratic programming. To efficiently obtain the regression target via reinforcement learning, we propose the dense reinforcement learning method and devise a new adaptive policy with high sample efficiency. Essentially, our approach centers on learning the values of critical scenes displaying substantial surrogate-to-real gaps. The effectiveness of our method is validated in high-dimensional overtaking scenarios, demonstrating that our approach achieves notable evaluation efficiency.


[296] 2402.19279

SIFT-Aided Rectified 2D-DIC for Displacement and Strain Measurements in Asphalt Concrete Testing

Two-dimensional digital image correlation (2D-DIC) is a widely used optical technique to measure displacement and strain during asphalt concrete (AC) testing. An accurate 2D-DIC measurement can only be achieved when the camera's principal axis is perpendicular to the planar specimen surface. However, this requirement may not be met during testing due to device constraints. This paper proposes a simple and reliable method to correct errors induced by non-perpendicularity. The method is based on image feature matching and rectification. No additional equipment is needed. A theoretical error analysis was conducted to quantify the effect of a non-perpendicular camera alignment on measurement accuracy. The proposed method was validated numerically using synthetic images and experimentally in an AC fracture test. It achieved relatively high accuracy, even under considerable camera rotation angles and large deformations. As a pre-processing technique, the proposed method showed promising performance in assisting the recently developed CrackPropNet for automated crack propagation measurement under a non-perpendicular camera alignment.
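
A minimal OpenCV sketch of feature-matching-based rectification in the spirit described here (SIFT matches, RANSAC homography, perspective warp) follows. The thresholds, function names, and the assumption of 8-bit grayscale inputs are ours, and the paper's exact pipeline may differ.

    import cv2
    import numpy as np

    def rectify_to_reference(img, ref, min_matches=10):
        """Warp a non-perpendicular view onto a reference view using SIFT matches and a homography."""
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img, None)   # img, ref: 8-bit grayscale arrays
        kp2, des2 = sift.detectAndCompute(ref, None)
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test
        if len(good) < min_matches:
            raise RuntimeError("not enough reliable matches for rectification")
        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return cv2.warpPerspective(img, H, (ref.shape[1], ref.shape[0]))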


[297] 2402.19280

Mobile Health Text Misinformation Identification Using Mobile Data Mining

More than six million people had died of COVID-19 by April 2022. The heavy casualties have put people on great and urgent alert, and people try to find all kinds of information to keep themselves from being infected by the coronavirus. This research investigates whether the mobile health text information sent to people's devices is correct, as smartphones have become the major information source for many people. The proposed method uses various mobile information retrieval and data mining technologies, including lexical analysis, stopword elimination, stemming, and decision trees, to classify mobile health text information into one of the following classes: (i) true, (ii) fake, (iii) misinformative, (iv) disinformative, and (v) neutral. Experimental results show the accuracy of the proposed method is above the threshold value of 50 percent but is not optimal, because the problem of mobile text misinformation identification is intrinsically difficult.


[298] 2402.19282

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T Tokens of safe data and selected 1.0T Tokens of high-quality data as part of WanJuan-CC. We have open-sourced 300B Tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.
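
As an illustration of the fuzzy deduplication step in such pipelines, the sketch below uses MinHash with locality-sensitive hashing via the third-party datasketch package; the shingle size and similarity threshold are arbitrary choices of ours and not the WanJuan-CC configuration.

    from datasketch import MinHash, MinHashLSH

    def fuzzy_dedup(docs, threshold=0.8, num_perm=128):
        """Drop documents whose character-shingle sets are near-duplicates of an earlier document."""
        lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        kept = []
        for i, text in enumerate(docs):
            m = MinHash(num_perm=num_perm)
            for shingle in {text[j:j + 5] for j in range(max(len(text) - 4, 1))}:
                m.update(shingle.encode("utf-8"))
            if not lsh.query(m):          # no sufficiently similar document seen so far
                lsh.insert(str(i), m)
                kept.append(text)
        return kept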


[299] 2402.19287

StiefelGen: A Simple, Model Agnostic Approach for Time Series Data Augmentation over Riemannian Manifolds

Data augmentation is an area of research which has seen active development in many machine learning fields, such as in image-based learning models, reinforcement learning for self-driving vehicles, and general noise injection for point cloud data. However, convincing methods for general time series data augmentation still leave much to be desired, especially since the methods developed for these models do not readily cross over. Three common approaches for time series data augmentation include: (i) constructing a physics-based model and then imbuing uncertainty over the coefficient space (for example), (ii) adding noise to the observed data set(s), and (iii) having access to ample amounts of time series data sets from which a robust generative neural network model can be trained. However, for many practical problems that work with time series data in industry: (i) one usually does not have access to a robust physical model, (ii) the addition of noise can in and of itself require large or difficult assumptions (for example, what probability distribution should be used? Or, how large should the noise variance be?), and (iii) in practice, it can be difficult to source a large representative time series database with which to train the neural network model for the underlying problem. In this paper, we propose a methodology which attempts to simultaneously tackle all three of these limitations to a large extent. The method relies upon the well-studied matrix differential geometry of the Stiefel manifold, as it proposes a simple way in which time series signals can be placed on, and then smoothly perturbed over, the manifold. We attempt to clarify how this method works by showcasing several potential use cases which in particular take advantage of the unique properties of this underlying manifold.
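
One plausible way to place a signal matrix on the Stiefel manifold and perturb it smoothly is to perturb its orthonormal singular-vector factor within the tangent space and then retract with a QR factorization. The NumPy sketch below illustrates this general idea under our own assumptions; it is not the StiefelGen algorithm itself.

    import numpy as np

    def perturb_on_stiefel(signal_matrix, scale=0.05, seed=0):
        """Perturb the left singular subspace of a signal matrix and re-orthonormalize via QR."""
        rng = np.random.default_rng(seed)
        U, s, Vt = np.linalg.svd(signal_matrix, full_matrices=False)   # U lies on a Stiefel manifold
        noise = rng.normal(scale=scale, size=U.shape)
        sym = (U.T @ noise + noise.T @ U) / 2
        tangent = noise - U @ sym                  # project noise onto the tangent space at U
        Q, R = np.linalg.qr(U + tangent)           # QR retraction back onto the manifold
        signs = np.sign(np.diag(R))
        signs[signs == 0] = 1.0
        Q = Q * signs                              # fix the sign ambiguity of QR
        return Q @ np.diag(s) @ Vt                 # reassemble a perturbed version of the signal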


[300] 2402.19290

Estimation and Deconvolution of Second Order Cyclostationary Signals

This method solves the dual problem of blind deconvolution and estimation of the time waveform of noisy second-order cyclostationary (CS2) signals that traverse a transfer function (TF) en route to a sensor. We have proven that the deconvolution filter exists and eliminates the TF effect from signals whose statistics vary over time. This method is blind, meaning it does not require prior knowledge about the signals or the TF. Simulations demonstrate the algorithm's high precision across various signal types, TFs, and signal-to-noise ratios (SNRs). In this study, the family of CS2 signals is restricted to the product of a deterministic periodic function and white noise. Furthermore, this method has the potential to improve the training of machine learning models where the aggregation of signals from identical systems but with different TFs is required.


[301] 2402.19292

Fundamental Limits of Throughput and Availability: Applications to prophet inequalities & transaction fee mechanism design

This paper studies the fundamental limits of availability and throughput for independent and heterogeneous demands of a limited resource. Availability is the probability that the demands are below the capacity of the resource. Throughput is the expected fraction of the resource that is utilized by the demands. We offer a concentration inequality generator that gives lower bounds on feasible availability and throughput pairs with a given capacity and independent but not necessarily identical distributions of up-to-unit demands. We show that availability and throughput cannot both be poor. These bounds are analogous to tail inequalities on sums of independent random variables, but hold throughout the support of the demand distribution. This analysis gives analytically tractable bounds supporting the unit-demand characterization of Chawla, Devanur, and Lykouris (2023) and generalizes to up-to-unit demands. Our bounds also provide an approach towards improved multi-unit prophet inequalities (Hajiaghayi, Kleinberg, and Sandholm, 2007). They have applications to transaction fee mechanism design (for blockchains) where high availability limits the probability of profitable user-miner coalitions (Chung and Shi, 2023).


[302] 2402.19294

Degradation Modeling and Prognostic Analysis Under Unknown Failure Modes

Operating units often experience various failure modes in complex systems, leading to distinct degradation paths. Relying on a prognostic model trained on a single failure mode may lead to poor generalization performance across multiple failure modes. Therefore, accurately identifying the failure mode is of critical importance. Current prognostic approaches either ignore failure modes during degradation or assume known failure mode labels, which can be challenging to acquire in practice. Moreover, the high dimensionality and complex relations of sensor signals make it challenging to identify the failure modes accurately. To address these issues, we propose a novel failure mode diagnosis method that leverages a dimension reduction technique called UMAP (Uniform Manifold Approximation and Projection) to project and visualize each unit's degradation trajectory into a lower dimension. Then, using these degradation trajectories, we develop a time series-based clustering method to identify the training units' failure modes. Finally, we introduce a monotonically constrained prognostic model to predict the failure mode labels and remaining useful life (RUL) of the test units simultaneously using the obtained failure modes of the training units. The proposed prognostic model provides failure mode-specific RUL predictions while preserving the monotonic property of the RUL predictions across consecutive time steps. We evaluate the proposed model using a case study with the aircraft gas turbine engine dataset.
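
A minimal sketch of the projection-and-clustering step, assuming the umap-learn package and substituting a plain k-means on the embeddings for the paper's time-series-based clustering, is given below; the names and parameters are illustrative.

    import numpy as np
    import umap                               # provided by the umap-learn package
    from sklearn.cluster import KMeans

    def cluster_degradation_trajectories(trajectories, n_modes=2, seed=0):
        """Embed per-unit degradation trajectories with UMAP and cluster the embeddings as failure-mode labels."""
        X = np.asarray(trajectories)          # shape: (units, time_steps * sensors), pre-flattened
        emb = umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
        labels = KMeans(n_clusters=n_modes, random_state=seed, n_init=10).fit_predict(emb)
        return emb, labels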


[303] 2402.19295

Anomaly Detection in Offshore Wind Turbine Structures using Hierarchical Bayesian Modelling

Population-based structural health monitoring (PBSHM) aims to share information between members of a population. An offshore wind (OW) farm could be considered a population of nominally-identical wind-turbine structures. However, benign variations exist among members, such as geometry, sea-bed conditions and temperature differences. These factors could influence structural properties and therefore the dynamic response, making it more difficult to detect structural problems via traditional SHM techniques. This paper explores the use of a hierarchical Bayesian model to infer expected soil stiffness distributions at both population and local levels, as a basis to perform anomaly detection, in the form of scour, for new and existing turbines. To do this, observations of natural frequency will be generated as though they are from a small population of wind turbines. Differences between individual observations will be introduced by postulating distributions over the soil stiffness and measurement noise, as well as reducing soil depth (to represent scour), in the case of anomaly detection.


[304] 2402.19296

An AI based Digital Score of Tumour-Immune Microenvironment Predicts Benefit to Maintenance Immunotherapy in Advanced Oesophagogastric Adenocarcinoma

Gastric and oesophageal (OG) cancers are among the leading causes of cancer mortality worldwide. In OG cancers, recent studies have shown that PDL1 immune checkpoint inhibitors (ICI) in combination with chemotherapy improve patient survival. However, our understanding of the tumour immune microenvironment in OG cancers remains limited. In this study, we interrogate multiplex immunofluorescence (mIF) images taken from patients with advanced Oesophagogastric Adenocarcinoma (OGA) who received first-line fluoropyrimidine and platinum-based chemotherapy in the PLATFORM trial (NCT02678182) to predict the efficacy of the treatment and to explore the biological basis of patients responding to maintenance durvalumab (a PDL1 inhibitor). Our proposed Artificial Intelligence (AI) based marker successfully identified responders from non-responders (p < 0.05) as well as those who could potentially benefit from ICI with statistical significance (p < 0.05) for both progression-free and overall survival. Our findings suggest that T cells that express FOXP3 seem to heavily influence the patient treatment response and survival outcome. We also observed that higher levels of CD8+PD1+ cells are consistently linked to poor prognosis for both OS and PFS, regardless of ICI.


[305] 2402.19298

Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing

Face Anti-Spoofing (FAS) is crucial for securing face recognition systems against presentation attacks. With advancements in sensor manufacturing and multi-modal learning techniques, many multi-modal FAS approaches have emerged. However, they face challenges in generalizing to unseen attacks and deployment conditions. These challenges arise from (1) modality unreliability, where some modality sensors, like depth and infrared, undergo significant domain shifts in varying environments, leading to the spread of unreliable information during cross-modal feature fusion, and (2) modality imbalance, where training that overly relies on a dominant modality hinders the convergence of others, reducing effectiveness against attack types that are indistinguishable using solely the dominant modality. To address modality unreliability, we propose the Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected regions within each modality and suppress the impact of unreliable regions on other modalities. For modality imbalance, we propose a Rebalanced Modality Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all modalities by adaptively adjusting their gradients. Besides, we provide the first large-scale benchmark for evaluating multi-modal FAS performance under domain generalization scenarios. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. Source code and protocols will be released at https://github.com/OMGGGGG/mmdg.


[306] 2402.19299

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Large Language Models (LLMs) have demonstrated proficiency in utilizing various tools by coding, yet they face limitations in handling intricate logic and precise control. In embodied tasks, high-level planning is amenable to direct coding, while low-level actions often necessitate task-specific refinement, such as Reinforcement Learning (RL). To seamlessly integrate both modalities, we introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline. Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves SOTA performance across all designated MineDojo tasks.


[307] 2402.19302

DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly

Reassembly tasks play a fundamental role in many fields and multiple approaches exist to solve specific reassembly problems. In this context, we posit that a general unified model can effectively address them all, irrespective of the input data type (images, 3D, etc.). We introduce DiffAssemble, a Graph Neural Network (GNN)-based architecture that learns to solve reassembly tasks using a diffusion model formulation. Our method treats the elements of a set, whether 2D patches or 3D object fragments, as nodes of a spatial graph. Training is performed by introducing noise into the position and rotation of the elements and iteratively denoising them to reconstruct the coherent initial pose. DiffAssemble achieves state-of-the-art (SOTA) results in most 2D and 3D reassembly tasks and is the first learning-based approach that solves 2D puzzles for both rotation and translation. Furthermore, we highlight its remarkable reduction in run-time, performing 11 times faster than the quickest optimization-based method for puzzle solving. Code available at https://github.com/IIT-PAVIS/DiffAssemble
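
The following is a heavily simplified, illustrative training step for the noising-and-denoising idea on element poses; a plain MLP stands in for the paper's GNN, and every hyperparameter and shape is a placeholder rather than the authors' configuration:

    import torch
    import torch.nn as nn

    n_pieces, T = 8, 100                          # pieces per puzzle, diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    # Placeholder denoiser: predicts the noise added to each piece's (x, y, theta)
    denoiser = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 3))
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

    poses = torch.rand(n_pieces, 3)               # ground-truth poses, toy data
    for step in range(100):
        t = torch.randint(0, T, (n_pieces, 1))
        noise = torch.randn_like(poses)
        a = alphas_bar[t]                         # (n_pieces, 1) cumulative alphas
        noisy = a.sqrt() * poses + (1 - a).sqrt() * noise
        pred = denoiser(torch.cat([noisy, t.float() / T], dim=1))
        loss = ((pred - noise) ** 2).mean()       # predict the injected noise
        opt.zero_grad(); loss.backward(); opt.step()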


[308] 2402.19303

Learnability Gaps of Strategic Classification

In contrast with standard classification tasks, strategic classification involves agents strategically modifying their features in an effort to receive favorable predictions. For instance, given a classifier determining loan approval based on credit scores, applicants may open or close their credit cards to fool the classifier. The learning goal is to find a classifier robust against strategic manipulations. Various settings, based on what and when information is known, have been explored in strategic classification. In this work, we focus on addressing a fundamental question: the learnability gaps between strategic classification and standard learning. We essentially show that any learnable class is also strategically learnable: we first consider a fully informative setting, where the manipulation structure (which is modeled by a manipulation graph $G^\star$) is known and during training time the learner has access to both the pre-manipulation data and post-manipulation data. We provide nearly tight sample complexity and regret bounds, offering significant improvements over prior results. Then, we relax the fully informative setting by introducing two natural types of uncertainty. First, following Ahmadi et al. (2023), we consider the setting in which the learner only has access to the post-manipulation data. We improve the results of Ahmadi et al. (2023) and close the gap between mistake upper bound and lower bound raised by them. Our second relaxation of the fully informative setting introduces uncertainty to the manipulation structure. That is, we assume that the manipulation graph is unknown but belongs to a known class of graphs. We provide nearly tight bounds on the learning complexity in various unknown manipulation graph settings. Notably, our algorithm in this setting is of independent interest and can be applied to other problems such as multi-label learning.


[309] 2402.19305

HyenaPixel: Global Image Context with Convolutions

In vision tasks, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, convolution requires multiple stacked layers and a hierarchical structure for large context. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to the non-causal two-dimensional image space. We scale the Hyena convolution kernels beyond the feature map size up to 191$\times$191 to maximize the ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 83.0% and 83.5%, respectively, while outperforming other large-kernel networks. Combining HyenaPixel with attention further increases accuracy to 83.6%. We attribute the success of attention to the lack of spatial bias in later stages and support this finding with bidirectional Hyena.


[310] 2402.19308

Loss-Free Machine Unlearning

We present a machine unlearning approach that is both retraining- and label-free. Most existing machine unlearning approaches require a model to be fine-tuned to remove information while preserving performance. This is computationally expensive and necessitates the storage of the whole dataset for the lifetime of the model. Retraining-free approaches often utilise Fisher information, which is derived from the loss and requires labelled data which may not be available. Thus, we present an extension to the Selective Synaptic Dampening algorithm, replacing the diagonal of the Fisher information matrix with the gradient of the l2 norm of the model output to approximate sensitivity. We evaluate our method in a range of experiments using ResNet18 and Vision Transformer. Results show our label-free method is competitive with existing state-of-the-art approaches.
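
A rough PyTorch sketch of the label-free sensitivity idea: accumulate squared gradients of the l2 norm of the model output instead of the Fisher diagonal, then dampen parameters that are far more sensitive to the forget set than to the retained data. The dampening rule and thresholds below are illustrative, not the exact Selective Synaptic Dampening update:

    import torch

    def output_norm_sensitivity(model, loader, device="cpu"):
        """Accumulate per-parameter squared gradients of ||f(x)||^2 (no labels needed)."""
        sens = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        model.to(device)
        for batch in loader:
            x = batch[0] if isinstance(batch, (tuple, list)) else batch
            model.zero_grad()
            model(x.to(device)).pow(2).sum().backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    sens[n] += p.grad.detach().pow(2)
        return sens

    def dampen(model, sens_forget, sens_retain, alpha=10.0, lam=0.5):
        """Shrink parameters that are far more sensitive to the forget set."""
        with torch.no_grad():
            for n, p in model.named_parameters():
                mask = sens_forget[n] > alpha * sens_retain[n]
                scale = torch.where(mask,
                                    lam * sens_retain[n] / (sens_forget[n] + 1e-12),
                                    torch.ones_like(p))
                p.mul_(scale.clamp(max=1.0))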


[311] 2402.19309

Closed-loop training of static output feedback neural network controllers for large systems: A distillation case study

The online implementation of model predictive control for constrained multivariate systems has two main disadvantages: it requires an estimate of the entire model state and an optimisation problem must be solved online. These issues have typically been treated separately. This work proposes an integrated approach for the offline training of an output feedback neural network controller in closed loop. Online, this neural network controller computes the plant inputs cheaply using noisy measurements. In addition, the controller can be trained to only make use of certain predefined measurements. Further, a heuristic approach is proposed to perform the automatic selection of important measurements. The proposed method is demonstrated by extensive simulations using a non-linear distillation column model of 50 states.


[312] 2402.19315

On the Existence of Static Equilibria of a Cable-Suspended Load with Non-stopping Flying Carriers

Aerial cooperative robotic manipulation of cable-suspended objects has been largely studied as it allows handling large and heavy objects, and cables offer multiple advantages, such as their low weight and cost efficiency. Multirotors have been typically considered, which, however, can be unsuitable for long-lasting manipulation tasks due to their limited endurance. Hence, this work investigates whether non-stop flights are possible while keeping the pose of cable-suspended objects constant. First, we show that one or two flying carriers alone cannot perform non-stop flights while maintaining a constant pose of the suspended object. Instead, we demonstrate that \emph{three} flying carriers can achieve this task provided that the orientation of the load at the equilibrium is such that the components of the cable forces that balance the external force (typically gravity) do not belong to the plane of the cable anchoring points on the load. Numerical tests are presented in support of the analytical results.


[313] 2402.19318

DISCERN: Designing Decision Support Interfaces to Investigate the Complexities of Workplace Social Decision-Making With Line Managers

Line managers form the first level of management in organizations, and must make complex decisions, while maintaining relationships with those impacted by their decisions. Amidst growing interest in technology-supported decision-making at work, their needs remain understudied. Further, most existing design knowledge for supporting social decision-making comes from domains where decision-makers are more socially detached from those they decide for. We conducted iterative design research with line managers within a technology organization, investigating decision-making practices, and opportunities for technological support. Through formative research, development of a decision-representation tool -- DISCERN -- and user enactments, we identify their communication and analysis needs that lack adequate support. We found they preferred tools for externalizing reasoning rather than tools that replace interpersonal interactions, and they wanted tools to support a range of intuitive and calculative decision-making. We discuss how design of social decision-making supports, especially in the workplace, can more explicitly support highly interactional social decision-making.


[314] 2402.19319

Attacks Against Mobility Prediction in 5G Networks

The $5^{th}$ generation of mobile networks introduces a new Network Function (NF) that was not present in previous generations, namely the Network Data Analytics Function (NWDAF). Its primary objective is to provide advanced analytics services to various entities within the network and also towards external application services in the 5G ecosystem. One of the key use cases of NWDAF is mobility trajectory prediction, which aims to accurately support efficient mobility management of User Equipment (UE) in the network by allocating the necessary network resources ``just in time''. In this paper, we show that there are potential mobility attacks that can compromise the accuracy of these predictions. In a semi-realistic scenario with 10,000 subscribers, we demonstrate that an adversary equipped with the ability to hijack cellular mobile devices and clone them can significantly reduce the prediction accuracy from 75\% to 40\% using just 100 adversarial UEs. While a defense mechanism largely depends on the attack and the mobility types in a particular area, we show that basic KMeans clustering is effective in distinguishing legitimate from adversarial UEs.
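
A minimal sketch of the clustering-based separation of legitimate and adversarial UEs on synthetic mobility features; the feature choice and distributions below are placeholders for illustration only:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Toy per-UE mobility features: (mean cell dwell time in min, handovers per hour)
    legit = rng.normal([30.0, 2.0], [5.0, 0.5], size=(1000, 2))
    adv = rng.normal([5.0, 12.0], [2.0, 2.0], size=(100, 2))    # erratic, cloned movers
    X = np.vstack([legit, adv])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    labels = km.labels_

    # In this toy setup the smaller cluster is treated as the adversarial one
    adv_cluster = np.argmin(np.bincount(labels))
    print("flagged UEs:", int(np.sum(labels == adv_cluster)))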


[315] 2402.19322

Verification of Neural Networks' Global Robustness

Neural networks are successful in various applications but are also susceptible to adversarial attacks. To show the safety of network classifiers, many verifiers have been introduced to reason about the local robustness of a given input to a given perturbation. While successful, local robustness cannot generalize to unseen inputs. Several works analyze global robustness properties, however, none can provide a precise guarantee about the cases where a network classifier does not change its classification. In this work, we propose a new global robustness property for classifiers aiming at finding the minimal globally robust bound, which naturally extends the popular local robustness property for classifiers. We introduce VHAGaR, an anytime verifier for computing this bound. VHAGaR relies on three main ideas: encoding the problem as a mixed-integer program, pruning the search space by identifying dependencies stemming from the perturbation or the network computation, and generalizing adversarial attacks to unknown inputs. We evaluate VHAGaR on several datasets and classifiers and show that, given a three hour timeout, the average gap between the lower and upper bound on the minimal globally robust bound computed by VHAGaR is 1.9, while the gap of an existing global robustness verifier is 154.7. Moreover, VHAGaR is 130.6x faster than this verifier. Our results further indicate that leveraging dependencies and adversarial attacks makes VHAGaR 78.6x faster.


[316] 2402.19325

Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes vector representations of the speakers in a conversation - attractors. Our analysis shows that attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom, allowing them to encode some extra (possibly speaker-specific) information, leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution to guide the community to make more informed decisions when designing new systems.


[317] 2402.19326

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However, existing methods leverage coarse-grained pathological descriptions for visual representation supervision, which are insufficient to capture the complex visual appearance of pathological images, hindering the generalizability of models on diverse downstream tasks. Additionally, processing high-resolution WSIs can be computationally expensive. In this paper, we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interplay between localized visual patterns and fine-grained pathological semantics. Specifically, with meticulously designed queries, we start by utilizing a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module, we enable prompts to capture crucial visual information in WSIs, which enhances representation learning and augments generalization capabilities significantly. Furthermore, given that pathological visual patterns are redundantly distributed across tissue slices, we sample a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability, substantially outperforming its counterparts on the TCGA Lung Cancer dataset with at least 9.19% higher accuracy in few-shot experiments.


[318] 2402.19328

Seeking Soulmate via Voice: Understanding Promises and Challenges of Online Synchronized Voice-Based Mobile Dating

Online dating has become a popular way for individuals to connect with potential romantic partners. Many dating apps use personal profiles that include a headshot and self-description, allowing users to present themselves and search for compatible matches. However, this traditional model often has limitations. In this study, we explore a non-traditional voice-based dating app called "Soul". Unlike traditional platforms that rely heavily on profile information, Soul facilitates user interactions through voice-based communication. We conducted semi-structured interviews with 18 dedicated Soul users to investigate how they engage with the platform and perceive themselves and others in this unique dating environment. Our findings indicate that the role of voice as a moderator influences impression management and shapes perceptions between the sender and the receiver of the voice. Additionally, the synchronous voice-based and community-based dating model offers benefits to users in the Chinese cultural context. Our study contributes to understanding the affordances introduced by voice-based interactions in online dating in China.


[319] 2402.19330

A Novel Approach to Industrial Defect Generation through Blended Latent Diffusion Model with Online Adaptation

Effectively addressing the challenge of industrial Anomaly Detection (AD) necessitates an ample supply of defective samples, a constraint often hindered by their scarcity in industrial contexts. This paper introduces a novel algorithm designed to augment defective samples, thereby enhancing AD performance. The proposed method tailors the blended latent diffusion model for defect sample generation, employing a diffusion model to generate defective samples in the latent space. A feature editing process, controlled by a "trimap" mask and text prompts, refines the generated samples. The image generation inference process is structured into three stages: a free diffusion stage, an editing diffusion stage, and an online decoder adaptation stage. This sophisticated inference strategy yields high-quality synthetic defective samples with diverse pattern variations, leading to significantly improved AD accuracies based on the augmented training set. Specifically, on the widely recognized MVTec AD dataset, the proposed method elevates the state-of-the-art (SOTA) performance of AD with augmented data by 1.5%, 1.9%, and 3.1% for AD metrics AP, IAP, and IAP90, respectively. The implementation code of this work can be found at the GitHub repository https://github.com/GrandpaXun242/AdaBLDM.git


[320] 2402.19333

Compact Speech Translation Models via Discrete Speech Units Pretraining

Using Self-Supervised Learning (SSL) as model initialization is now common to obtain strong results in Speech Translation (ST). However, SSL models also impose a large memory footprint, hindering on-device deployment. In this paper, we leverage the SSL models by pretraining smaller models on their Discrete Speech Units (DSU). We pretrain encoder-decoder models on 1) Filterbank-to-DSU and 2) DSU-to-Translation data, and take the encoder from 1) and the decoder from 2) to initialise a new model, finetuning this on limited speech-translation data. The final model becomes compact by using the DSU pretraining to distil the knowledge of the SSL model. Our method has several benefits over using DSU as model inputs, such as a shorter inference pipeline and robustness over (DSU) tokenization. In contrast to ASR pretraining, it does not require transcripts, making it applicable to low-resource settings. Evaluation on CoVoST-2 X-En shows that our method is >$0.5$ BLEU better than an ST model that directly finetunes the SSL model, while using only half the model size, and is on a par with ASR pretraining.


[321] 2402.19334

Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we explore various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks without additional resources or specific knowledge. Our approach consistently outperforms the other advanced baselines, leading to an average of 75% reduction in the attack success rate. Since model merging has been an established approach for improving model performance, the extra advantage it provides regarding defense can be seen as a cost-free bonus.
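
A minimal sketch of the basic operation behind this inference-stage defense: averaging the parameters of architecture-identical models. The helper below and the model names in the usage comment are placeholders, not the paper's exact merging recipe:

    import copy
    import torch

    def merge_models(models, weights=None):
        """Average the floating-point parameters of architecture-identical models."""
        weights = weights or [1.0 / len(models)] * len(models)
        merged = copy.deepcopy(models[0])
        state = merged.state_dict()
        with torch.no_grad():
            for key in state:
                if state[key].is_floating_point():
                    state[key] = sum(w * m.state_dict()[key]
                                     for w, m in zip(weights, models))
        merged.load_state_dict(state)
        return merged

    # Usage (placeholder names):
    # sanitized = merge_models([possibly_backdoored_model, clean_model_a, clean_model_b])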


[322] 2402.19339

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute KG embeddings and experiment with relative representations and hybrid approaches that fuse these embeddings with visual transformer embeddings. Finally, for interpretability, we conduct posthoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The posthoc interpretability analyses reveal the visual transformer's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between KGE embeddings' situated perceptual knowledge and deep visual model's sensory-perceptual understanding for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available online.


[323] 2402.19340

One model to use them all: Training a segmentation model with complementary datasets

Understanding a surgical scene is crucial for computer-assisted surgery systems to provide any intelligent assistance functionality. One way of achieving this scene understanding is via scene segmentation, where every pixel of a frame is classified and therefore identifies the visible structures and tissues. Progress on fully segmenting surgical scenes has been made using machine learning. However, such models require large amounts of annotated training data, containing examples of all relevant object classes. Such fully annotated datasets are hard to create, as every pixel in a frame needs to be annotated by medical experts and, therefore, are rarely available. In this work, we propose a method to combine multiple partially annotated datasets, which provide complementary annotations, into one model, enabling better scene segmentation and the use of multiple readily available datasets. Our method aims to combine available data with complementary labels by leveraging mutually exclusive properties to maximize information. Specifically, we propose to use positive annotations of other classes as negative samples and to exclude background pixels of binary annotations, as we cannot tell if they contain a class not annotated but predicted by the model. We evaluate our method by training a DeepLabV3 on the publicly available Dresden Surgical Anatomy Dataset, which provides multiple subsets of binary segmented anatomical structures. Our approach successfully combines 6 classes into one model, increasing the overall Dice Score by 4.4% compared to an ensemble of models trained on the classes individually. By including information on multiple classes, we were able to reduce confusion between stomach and colon by 24%. Our results demonstrate the feasibility of training a model on multiple datasets. This paves the way for future work further alleviating the need for one large, fully segmented dataset.
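
A simplified PyTorch sketch of the loss-masking idea: positive pixels of the single annotated class act as positives for that class and negatives for every other class, while unlabelled background pixels of the binary annotation are excluded from the loss. Shapes and the function name are illustrative, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def complementary_bce_loss(logits, ann, ann_class):
        """
        logits:    (B, C, H, W) per-class logits.
        ann:       (B, H, W) binary mask of the single annotated class per image.
        ann_class: (B,) index of the annotated class per image.
        """
        target = torch.zeros_like(logits)
        mask = torch.zeros_like(logits)          # 1 = pixel contributes to the loss
        for b in range(logits.shape[0]):
            pos = ann[b].bool()                  # annotated positive pixels
            c = int(ann_class[b])
            target[b, c][pos] = 1.0              # positives for the annotated class
            mask[b][:, pos] = 1.0                # ... and negatives for all others
        loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
        return (loss * mask).sum() / mask.sum().clamp(min=1.0)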


[324] 2402.19341

RoadRunner -- Learning Traversability Estimation for Autonomous Off-road Driving

Autonomous navigation at high speeds in off-road environments necessitates robots to comprehensively understand their surroundings using onboard sensing only. The extreme conditions posed by the off-road setting can cause degraded camera image quality due to poor lighting and motion blur, as well as limited sparse geometric information available from LiDAR sensing when driving at high speeds. In this work, we present RoadRunner, a novel framework capable of predicting terrain traversability and an elevation map directly from camera and LiDAR sensor inputs. RoadRunner enables reliable autonomous navigation, by fusing sensory information, handling of uncertainty, and generation of contextually informed predictions about the geometry and traversability of the terrain while operating at low latency. In contrast to existing methods relying on classifying handcrafted semantic classes and using heuristics to predict traversability costs, our method is trained end-to-end in a self-supervised fashion. The RoadRunner network architecture builds upon popular sensor fusion network architectures from the autonomous driving domain, which embed LiDAR and camera information into a common Bird's Eye View perspective. Training is enabled by utilizing an existing traversability estimation stack to generate training data in hindsight in a scalable manner from real-world off-road driving datasets. Furthermore, RoadRunner improves the system latency by a factor of roughly 4, from 500 ms to 140 ms, while improving the accuracy for traversability costs and elevation map predictions. We demonstrate the effectiveness of RoadRunner in enabling safe and reliable off-road navigation at high speeds in multiple real-world driving scenarios through unstructured desert environments.


[325] 2402.19344

The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition

This paper describes the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition, which is part of the respective Workshop held in conjunction with IEEE CVPR 2024. The 6th ABAW Competition addresses contemporary challenges in understanding human emotions and behaviors, crucial for the development of human-centered technologies. In more detail, the Competition focuses on affect-related benchmarking tasks and comprises five sub-challenges: i) Valence-Arousal Estimation (the target is to estimate two continuous affect dimensions, valence and arousal), ii) Expression Recognition (the target is to recognise between the mutually exclusive classes of the 7 basic expressions and 'other'), iii) Action Unit Detection (the target is to detect 12 action units), iv) Compound Expression Recognition (the target is to recognise between the 7 mutually exclusive compound expression classes), and v) Emotional Mimicry Intensity Estimation (the target is to estimate six continuous emotion dimensions). In the paper, we present these Challenges, describe their respective datasets and challenge protocols (we outline the evaluation metrics) and present the baseline systems as well as their obtained performance. More information for the Competition can be found in: \url{https://affective-behavior-analysis-in-the-wild.github.io/6th}.


[326] 2402.19347

#PoetsOfInstagram: Navigating The Practices And Challenges Of Novice Poets On Instagram

Commencing as a photo-sharing platform, Instagram has since become multifaceted, accommodating diverse art forms, with poetry emerging as a prominent one. However, the academic understanding of Instagram's poetry community is limited, yet its significance emerges from its distinctive utilization of a primarily visual social media platform guided by recommendation algorithms for disseminating poetry, further characterized by a predominantly novice creative population. We employ qualitative analysis to explore motivations, experiences, and algorithmic influence within Instagram's poetry community. We demonstrate that participants prioritize conforming to algorithmic constraints for visibility, yet maintain their community's values of integrity and originality, illustrating the tension between algorithmic growth and participant authenticity. We introduce the concept of Algorithmically Mediated Creative Labor, a phenomenon specific to non-monetizing creative users who are impacted by the prioritization of professional creators and continually adapt their creative endeavors to align with platform logic, thereby affecting their motivation and creative outputs.


[327] 2402.19348

Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for sustainable development by harnessing the power of cross-domain data fusion from diverse sources (e.g., geographical, traffic, social media, and environmental data) and modalities (e.g., spatio-temporal, visual, and textual modalities). Recently, we have been witnessing a rising trend that utilizes various deep-learning methods to facilitate cross-domain data fusion in smart cities. To this end, we propose the first survey that systematically reviews the latest advancements in deep learning-based data fusion methods tailored for urban computing. Specifically, we first delve into the data perspective to comprehend the role of each modality and data source. Secondly, we classify the methodology into four primary categories: feature-based, alignment-based, contrast-based, and generation-based fusion methods. Thirdly, we further categorize multi-modal urban applications into seven types: urban planning, transportation, economy, public safety, society, environment, and energy. Compared with previous surveys, we focus more on the synergy of deep learning methods with urban computing applications. Furthermore, we shed light on the interplay between Large Language Models (LLMs) and urban computing, postulating future research directions that could revolutionize the field. We firmly believe that the taxonomy, progress, and prospects delineated in our survey stand poised to significantly enrich the research community. The summary of the comprehensive and up-to-date paper list can be found at https://github.com/yoshall/Awesome-Multimodal-Urban-Computing.


[328] 2402.19350

Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process

Pre-trained language models (PLMs) leverage chains-of-thought (CoT) to simulate human reasoning and inference processes, achieving proficient performance in multi-hop QA. However, a gap persists between PLMs' reasoning abilities and those of humans when tackling complex problems. Psychological studies suggest a vital connection between explicit information in passages and human prior knowledge during reading. Nevertheless, current research has given insufficient attention to linking input passages and PLMs' pre-training-based knowledge from the perspective of human cognition studies. In this study, we introduce a \textbf{P}rompting \textbf{E}xplicit and \textbf{I}mplicit knowledge (PEI) framework, which uses prompts to connect explicit and implicit knowledge, aligning with human reading process for multi-hop QA. We consider the input passages as explicit knowledge, employing them to elicit implicit knowledge through unified prompt reasoning. Furthermore, our model incorporates type-specific reasoning via prompts, a form of implicit knowledge. Experimental results show that PEI performs comparably to the state-of-the-art on HotpotQA. Ablation studies confirm the efficacy of our model in bridging and integrating explicit and implicit knowledge.


[329] 2402.19355

Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectures. Additionally, we introduce a method for identifying the victim model on which the adversarial attack is carried out. To achieve this, we generate a new dataset containing multiple attacks performed against various victim models. We achieve an AUC of 0.982 for attack detection, with no more than a 0.03 drop in performance for unknown attacks. Our attack classification accuracy (excluding benign) reaches 86.48% across eight attack types using our LightResNet34 architecture, while our victim model classification accuracy reaches 72.28% across four victim models.


[330] 2402.19361

Watermark Stealing in Large Language Models

LLM watermarking has attracted attention as a promising way to detect AI-generated content, with some works suggesting that current schemes may already be fit for deployment. In this work we dispute this claim, identifying watermark stealing (WS) as a fundamental vulnerability of these schemes. We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks, as suggested in prior work, but also greatly boosts scrubbing attacks, which was previously unnoticed. We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings. We show that for under $50 an attacker can both spoof and scrub state-of-the-art schemes previously considered safe, with average success rate of over 80%. Our findings challenge common beliefs about LLM watermarking, stressing the need for more robust schemes. We make all our code and additional examples available at https://watermark-stealing.org.


[331] 2402.19364

Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
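
As a toy illustration of why arrow structure helps, the snippet below multiplies an arrow-shaped matrix (nonzeros confined to the first block-row, first block-column and the diagonal) by a dense matrix using only its three structured parts; the actual decomposition into permuted arrow matrices and the distributed-memory algorithm in the paper are considerably more involved, and the block width here is arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    n, b, k = 1000, 16, 8                 # matrix size, arrow width, dense columns

    top = rng.random((b, n))              # first block-row
    left = rng.random((n, b))             # first block-column
    diag = rng.random(n)                  # diagonal
    X = rng.random((n, k))                # dense right-hand side

    # Y = A @ X computed from the three structured parts, without forming A
    Y = diag[:, None] * X                 # diagonal contribution
    Y[:b] += top @ X                      # block-row contribution
    Y += left @ X[:b]                     # block-column contribution

    # Sanity check against the explicit dense matrix
    A = np.zeros((n, n))
    A[:b] += top
    A[:, :b] += left
    A[np.arange(n), np.arange(n)] += diag
    assert np.allclose(Y, A @ X)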


[332] 2402.19365

On Efficient Computation of DiRe Committees

Consider a committee election consisting of (i) a set of candidates who are divided into arbitrary groups each of size \emph{at most} two and a diversity constraint that stipulates the selection of \emph{at least} one candidate from each group and (ii) a set of voters who are divided into arbitrary populations each approving \emph{at most} two candidates and a representation constraint that stipulates the selection of \emph{at least} one candidate from each population who has a non-null set of approved candidates. The DiRe (Diverse + Representative) committee feasibility problem (a.k.a. the minimum vertex cover problem on unweighted undirected graphs) concerns the determination of the smallest size committee that satisfies the given constraints. Here, for this problem, we discover an unconditional deterministic polynomial-time algorithm that is an amalgamation of maximum matching, breadth-first search, maximal matching, and local minimization.


[333] 2402.19366

SoK: Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency

The growing number of cases requiring digital forensic analysis raises concerns about law enforcement's ability to conduct investigations promptly. Consequently, this systemisation of knowledge paper delves into the potential and effectiveness of integrating Large Language Models (LLMs) into digital forensic investigation to address these challenges. A thorough literature review is undertaken, encompassing existing digital forensic models, tools, LLMs, deep learning techniques, and the utilisation of LLMs in investigations. The review identifies current challenges within existing digital forensic processes and explores both the obstacles and possibilities of incorporating LLMs. In conclusion, the study asserts that the adoption of LLMs in digital forensics, with appropriate constraints, holds the potential to enhance investigation efficiency, improve traceability, and alleviate technical and judicial barriers faced by law enforcement entities.


[334] 2402.19369

Structure Preserving Diffusion Models

Diffusion models have become the leading distribution-learning method in recent years. Herein, we introduce structure-preserving diffusion processes, a family of diffusion processes for learning distributions that possess additional structure, such as group symmetries, by developing theoretical conditions under which the diffusion transition steps preserve said symmetry. While also enabling equivariant data sampling trajectories, we exemplify these results by developing a collection of different symmetry equivariant diffusion models capable of learning distributions that are inherently symmetric. Empirical studies, over both synthetic and real-world datasets, are used to validate that the developed models adhere to the proposed theory and are capable of achieving improved performance over existing methods in terms of sample quality. We also show how the proposed models can be used to achieve theoretically guaranteed equivariant image noise reduction without prior knowledge of the image orientation.


[335] 2402.19371

OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models

LLMs have become increasingly capable of accomplishing a range of specialized tasks and can be utilized to expand equitable access to medical knowledge. Most medical LLMs have involved extensive fine-tuning, leveraging specialized medical data and significant, thus costly, amounts of computational power. Many of the top performing LLMs are proprietary and their access is limited to very few research groups. However, open-source (OS) models represent a key area of growth for medical LLMs due to significant improvements in performance and an inherent ability to provide the transparency and compliance required in healthcare. We present OpenMedLM, a prompting platform which delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks. We evaluated a range of OS foundation LLMs (7B-70B) on four medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU medical-subset). We employed a series of prompting strategies, including zero-shot, few-shot, chain-of-thought (random selection and kNN selection), and ensemble/self-consistency voting. We found that OpenMedLM delivers OS SOTA results on three common medical LLM benchmarks, surpassing the previous best performing OS models that leveraged computationally costly extensive fine-tuning. The model delivers a 72.6% accuracy on the MedQA benchmark, outperforming the previous SOTA by 2.4%, and achieves 81.7% accuracy on the MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. Our results highlight medical-specific emergent properties in OS LLMs which have not yet been documented to date elsewhere, and showcase the benefits of further leveraging prompt engineering to improve the performance of accessible LLMs for medical applications.
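
A minimal sketch of ensemble/self-consistency voting, one of the prompting strategies listed above: sample several reasoning paths and return the majority-vote answer. Both `generate` (an OS LLM call) and `extract_answer` (a benchmark-specific parser) are placeholder callables, not part of OpenMedLM:

    from collections import Counter

    def self_consistency_answer(question, generate, extract_answer,
                                n_samples=11, temperature=0.7):
        """Sample several chain-of-thought completions and majority-vote the answer."""
        votes = Counter()
        for _ in range(n_samples):
            completion = generate(question, temperature=temperature)  # placeholder LLM call
            answer = extract_answer(completion)                       # e.g. "A"/"B"/"C"/"D"
            if answer is not None:
                votes[answer] += 1
        return votes.most_common(1)[0][0] if votes else None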


[336] 2402.19375

Unveiling Internet Censorship: Analysing the Impact of Nation States' Content Control Efforts on Internet Architecture and Routing Patterns

Heightened interest from nation states in performing content censorship makes it ever more critical to identify the impact of censorship efforts on the Internet. We undertake a study of Internet architecture, capturing the state of Internet topology with greater completeness than the existing state-of-the-art. We describe our methodology for this, including the tooling we create to collect and process data from a wide range of sources. We analyse this data to find key patterns in nation states with higher censorship, discovering a funnelling effect wherein higher Internet censorship effort is reflected in a constraining effect on a state's Internet routing architecture. However, there are a small number of nation states that do not follow this trend, for which we provide an analysis and explanation, demonstrating a relationship with geographical factors in addition to geopolitics. In summary, our work provides a deeper understanding of how these censorship measures impact the overall functioning and dynamics of the Internet.


[337] 2402.19376

OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference

General Matrix Multiply (GEMM) hardware, employing large arrays of multiply-accumulate (MAC) units, performs the bulk of the computation in deep learning (DL). Recent trends have established 8-bit integer (INT8) as the most widely used precision for DL inference. This paper proposes a novel MAC design capable of dynamically exploiting bit sparsity (i.e., the number of `0' bits within a binary value) in input data to achieve significant improvements in area, power and energy. The proposed architecture, called OzMAC (Omit-zero-MAC), skips over zeros within a binary input value and performs simple shift-and-add-based compute in place of expensive multipliers. We implement OzMAC in SystemVerilog and present post-synthesis performance-power-area (PPA) results using a commercial TSMC N5 (5nm) process node. Using eight pretrained INT8 deep neural networks (DNNs) as benchmarks, we demonstrate the existence of high bit sparsity in real DNN workloads and show that 8-bit OzMAC improves all three metrics of area, power, and energy significantly by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). When the 8-bit OzMAC's frequency is scaled to normalize its throughput relative to a conventional MAC, it still achieves a 30% improvement in both power and energy.
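
A pure-Python behavioural model of the omit-zero idea (not the RTL): only the set bits of the multiplier trigger shift-and-add work, so values with few '1' bits finish in fewer effective cycles; signed INT8 handling is omitted for brevity:

    def ozmac(multiplicand: int, multiplier: int, acc: int = 0):
        """Shift-and-add MAC over an 8-bit multiplier, skipping its zero bits."""
        cycles = 0
        for bit in range(8):
            if (multiplier >> bit) & 1:       # zero bits cost nothing
                acc += multiplicand << bit
                cycles += 1
        return acc, cycles

    acc, cycles = ozmac(93, 0b00010010)       # only two set bits -> two add cycles
    assert acc == 93 * 0b00010010 and cycles == 2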


[338] 2402.19379

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is statistically equivalent to the human crowd. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
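
A minimal sketch of the crowd-style aggregation: take the median of the individual models' probability forecasts per binary question and compare its Brier score with a no-information 0.5 baseline. The numbers below are random placeholders, not the study's data:

    import numpy as np

    rng = np.random.default_rng(0)
    n_models, n_questions = 12, 31

    # Placeholder per-model probability forecasts and eventual binary outcomes
    forecasts = rng.uniform(0.2, 0.8, size=(n_models, n_questions))
    outcomes = rng.integers(0, 2, size=n_questions)

    crowd = np.median(forecasts, axis=0)      # aggregate "LLM crowd" forecast

    def brier(p, y):
        return float(np.mean((p - y) ** 2))

    print("LLM crowd Brier score:", round(brier(crowd, outcomes), 3))
    print("no-information Brier score:", round(brier(np.full(n_questions, 0.5), outcomes), 3))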


[339] 2402.19381

Optimized Bayesian Framework for Inverse Heat Transfer Problems Using Reduced Order Methods

A stochastic inverse heat transfer problem is formulated to infer the transient heat flux, treated as an unknown Neumann boundary condition. Therefore, an Ensemble-based Simultaneous Input and State Filtering as a Data Assimilation technique is utilized for simultaneous temperature distribution prediction and heat flux estimation. This approach is incorporated with Radial Basis Functions not only to lessen the size of unknown inputs but also to mitigate the computational burden of this technique. The procedure applies to the specific case of a mold used in Continuous Casting machinery, and it is based on the sequential availability of temperature measurements provided by thermocouples inside the mold. Our research represents a significant contribution to achieving probabilistic boundary condition estimation in real time, handling noisy measurements and errors in the model. We additionally demonstrate the procedure's dependence on some hyperparameters that are not documented in the existing literature. Accurate real-time prediction of the heat flux is imperative for the smooth operation of Continuous Casting machinery at the boundary region where the Continuous Casting mold and the molten steel meet, which is also not physically measurable. Thus, this paves the way for efficient real-time monitoring and control, which is critical for preventing caster shutdowns.


[340] 2402.19385

Towards Safe and Reliable Autonomous Driving: Dynamic Occupancy Set Prediction

In the rapidly evolving field of autonomous driving, accurate trajectory prediction is pivotal for vehicular safety. However, trajectory predictions often deviate from actual paths, particularly in complex and challenging environments, leading to significant errors. To address this issue, our study introduces a novel method for Dynamic Occupancy Set (DOS) prediction, enhancing trajectory prediction capabilities. This method effectively combines advanced trajectory prediction networks with a DOS prediction module, overcoming the shortcomings of existing models. It provides a comprehensive and adaptable framework for predicting the potential occupancy sets of traffic participants. The main contributions of this research include: 1) A novel DOS prediction model tailored for complex scenarios, augmenting traditional trajectory prediction; 2) The development of unique DOS representations and evaluation metrics; 3) Extensive validation through experiments, demonstrating enhanced performance and adaptability. This research contributes to the advancement of safer and more efficient intelligent vehicle and transportation systems.


[341] 2402.19401

Assessing Visually-Continuous Corruption Robustness of Neural Networks Relative to Human Performance

While Neural Networks (NNs) have surpassed human accuracy in image classification on ImageNet, they often lack robustness against image corruption, i.e., corruption robustness. Yet such robustness is seemingly effortless for human perception. In this paper, we propose visually-continuous corruption robustness (VCR) -- an extension of corruption robustness to allow assessing it over the wide and continuous range of changes that correspond to the human perceptive quality (i.e., from the original image to the full distortion of all perceived visual information), along with two novel human-aware metrics for NN evaluation. To compare VCR of NNs with human perception, we conducted extensive experiments on 14 commonly used image corruptions with 7,718 human participants and state-of-the-art robust NN models with different training objectives (e.g., standard, adversarial, corruption robustness), different architectures (e.g., convolution NNs, vision transformers), and different amounts of training data augmentation. Our study showed that: 1) assessing robustness against continuous corruption can reveal insufficient robustness undetected by existing benchmarks; as a result, 2) the gap between NN and human robustness is larger than previously known; and finally, 3) some image corruptions have a similar impact on human perception, offering opportunities for more cost-effective robustness assessments. Our validation set with 14 image corruptions, human robustness data, and the evaluation code is provided as a toolbox and a benchmark.


[342] 2402.19402

A Scalable and Transferable Time Series Prediction Framework for Demand Forecasting

Time series forecasting is one of the most essential and ubiquitous tasks in many business problems, including demand forecasting and logistics optimization. Traditional time series forecasting methods, however, have resulted in small models with limited expressive power because they have difficulty in scaling their model size up while maintaining high accuracy. In this paper, we propose Forecasting orchestra (Forchestra), a simple but powerful framework capable of accurately predicting future demand for a diverse range of items. We empirically demonstrate that the model size is scalable to up to 0.8 billion parameters. The proposed method not only outperforms existing forecasting models with a significant margin, but it could generalize well to unseen data points when evaluated in a zero-shot fashion on downstream datasets. Last but not least, we present extensive qualitative and quantitative studies to analyze how the proposed model outperforms baseline models and differs from conventional approaches. The original paper was presented as a full paper at ICDM 2022 and is available at: https://ieeexplore.ieee.org/document/10027662.


[343] 2402.19404

Entity-Aware Multimodal Alignment Framework for News Image Captioning

The news image captioning task is a variant of the image captioning task which requires a model to generate a more informative caption from a news image and the associated news article. Multimodal Large Language Models (MLLMs) have developed rapidly in recent years and are promising for the news image captioning task. However, according to our experiments, common MLLMs are not good at generating the entities in a zero-shot setting. Their ability to handle entity information remains limited after simple fine-tuning on a news image captioning dataset. To obtain a more powerful model to handle the multimodal entity information, we design two multimodal entity-aware alignment tasks and an alignment framework to align the model and generate the news image captions. Our method achieves better results than previous state-of-the-art models in CIDEr score (72.33 -> 86.29) on the GoodNews dataset and (70.83 -> 85.61) on the NYTimes800k dataset.


[344] 2402.19405

Navigating Hallucinations for Reasoning of Unintentional Activities

In this work we present a novel task of understanding unintentional human activities in videos. We formalize this problem as a reasoning task under a zero-shot scenario, where given a video of an unintentional activity we want to know why it transitioned from intentional to unintentional. We first evaluate the effectiveness of current state-of-the-art Large Multimodal Models on this reasoning task and observe that they suffer from hallucination. We further propose a novel prompting technique, termed Dream of Thoughts (DoT), which allows the model to navigate through hallucinated thoughts to achieve better reasoning. To evaluate the performance on this task, we also introduce three different specialized metrics designed to quantify the model's reasoning capability. We perform our experiments on two different datasets, OOPs and UCF-Crimes, and our findings show that the DoT prompting technique is able to outperform standard prompting, while minimizing hallucinations.


[345] 2402.19406

On the Scaling Laws of Geographical Representation in Language Models

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.
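
A minimal sketch of the standard probing setup used in this line of work: fit a linear probe from hidden representations of place names to their coordinates and track its error as model size grows. The hidden states below are random placeholders standing in for real model activations:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_places, hidden_dim = 500, 256

    # Placeholder hidden states and (latitude, longitude) targets
    H = rng.normal(size=(n_places, hidden_dim))
    coords = np.column_stack([rng.uniform(-90, 90, n_places),
                              rng.uniform(-180, 180, n_places)])

    H_tr, H_te, y_tr, y_te = train_test_split(H, coords, test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(H_tr, y_tr)
    err = np.abs(probe.predict(H_te) - y_te).mean()
    print("mean absolute coordinate error (degrees):", round(float(err), 2))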


[346] 2402.19407

MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

With the increasing multimedia information, multimodal recommendation has received extensive attention. It utilizes multimodal information to alleviate the data sparsity problem in recommendation systems, thus improving recommendation accuracy. However, the reliance on labeled data severely limits the performance of multimodal recommendation models. Recently, self-supervised learning has been used in multimodal recommendations to mitigate the label sparsity problem. Nevertheless, the state-of-the-art methods cannot avoid the modality noise when aligning multimodal information due to the large differences in the distributions of different modalities. To this end, we propose a Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR) method to address the label sparsity problem and the modality alignment problem. Specifically, MENTOR first enhances the specific features of each modality using the graph convolutional network (GCN) and fuses the visual and textual modalities. It then enhances the item representation via the item semantic graph for all modalities, including the fused modality. Then, it introduces two multilevel self-supervised tasks: the multilevel cross-modal alignment task and the general feature enhancement task. The multilevel cross-modal alignment task aligns each modality under the guidance of the ID embedding from multiple levels while maintaining the historical interaction information. The general feature enhancement task enhances the general feature from both the graph and feature perspectives to improve the robustness of our model. Extensive experiments on three publicly available datasets demonstrate the effectiveness of our method. Our code is publicly available at https://github.com/Jinfeng-Xu/MENTOR.


[347] 2402.19410

Genie: Smart ROS-based Caching for Connected Autonomous Robots

Despite the promising future of autonomous robots, several key issues currently remain that can lead to compromised performance and safety. One such issue is latency, where we find that even the latest embedded platforms from NVIDIA fail to execute intelligence tasks (e.g., object detection) of autonomous vehicles in a real-time fashion. One remedy to this problem is the promising paradigm of edge computing. Through collaboration with our industry partner, we identify key prohibitive limitations of the current edge mindset: (1) servers are not distributed enough and thus, are not close enough to vehicles, (2) current proposed edge solutions do not provide substantially better performance and extra information specific to autonomous vehicles to warrant their cost to the user, and (3) the state-of-the-art solutions are not compatible with popular frameworks used in autonomous systems, particularly the Robot Operating System (ROS). To remedy these issues, we provide Genie, an encapsulation technique that can enable transparent caching in ROS in a non-intrusive way (i.e., without modifying the source code), can build the cache in a distributed manner (in contrast to traditional central caching methods), and can construct a collective three-dimensional object map to provide substantially better latency (even on low-power edge servers) and higher quality data to all vehicles in a certain locality. We fully implement our design on state-of-the-art industry-adopted embedded and edge platforms, using the prominent autonomous driving software Autoware, and find that Genie can enhance the latency of Autoware Vision Detector by 82% on average, enable object reusability 31% of the time on average and as much as 67% for the incoming requests, and boost the confidence in its object map considerably over time.


[348] 2402.19411

PaECTER: Patent-level Representation Learning using Citation-informed Transformers

PaECTER is a publicly available, open-source document-level encoder specific to patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent-specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.
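Since the model is released on Hugging Face, a prior-art style similarity search can be sketched with the sentence-transformers library. The model id below is an assumption (the abstract only says the model is on Hugging Face) and the patent snippets are invented; treat this purely as a usage sketch.

```python
# Illustrative patent similarity search; the Hub id below is a guess, not confirmed above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")  # hypothetical model id
query = "A battery electrode comprising a silicon-carbon composite ..."
candidates = [
    "Lithium-ion cell with silicon anode material ...",
    "Method for brewing coffee with a thermal carafe ...",
]
q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]               # cosine similarity to each candidate
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```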


[349] 2402.19414

Higher-Order Networks Representation and Learning: A Survey

Network data has become widespread, larger, and more complex over the years. Traditional network data is dyadic, capturing the relations among pairs of entities. With the need to model interactions among more than two entities, significant research has focused on higher-order networks and ways to represent, analyze, and learn from them. There are two main directions to studying higher-order networks. One direction has focused on capturing higher-order patterns in traditional (dyadic) graphs by changing the basic unit of study from nodes to small frequently observed subgraphs, called motifs. As most existing network data comes in the form of pairwise dyadic relationships, studying higher-order structures within such graphs may uncover new insights. The second direction aims to directly model higher-order interactions using new and more complex representations such as simplicial complexes or hypergraphs. Some of these models have long been proposed, but improvements in computational power and the advent of new computational techniques have increased their popularity. Our goal in this paper is to provide a succinct yet comprehensive summary of advanced higher-order network analysis techniques. We provide a systematic review of their foundations and algorithms, along with use cases and applications of higher-order networks in various scientific domains.


[350] 2402.19416

Vision-Radio Experimental Infrastructure Architecture Towards 6G

Telecommunications and computer vision have evolved separately so far. Yet, with the shift to sub-terahertz (sub-THz) and terahertz (THz) radio communications, there is an opportunity to explore computer vision technologies together with radio communications, considering the dependency of both technologies on Line of Sight. The combination of radio sensing and computer vision can address challenges such as obstructions and poor lighting. Also, machine learning algorithms, capable of processing multimodal data, play a crucial role in deriving insights from raw and low-level sensing data, offering a new level of abstraction that can enhance various applications and use cases such as beamforming and terminal handovers. This paper introduces CONVERGE, a pioneering vision-radio paradigm that bridges this gap by leveraging Integrated Sensing and Communication (ISAC) to facilitate a dual "View-to-Communicate, Communicate-to-View" approach. CONVERGE offers tools that merge wireless communications and computer vision, establishing a novel Research Infrastructure (RI) that will be open to the scientific community and capable of providing open datasets. This new infrastructure will support future research in 6G and beyond concerning multiple verticals, such as telecommunications, automotive, manufacturing, media, and health.


[351] 2402.19420

Understanding Iterative Combinatorial Auction Designs via Multi-Agent Reinforcement Learning

Iterative combinatorial auctions are widely used in high stakes settings such as spectrum auctions. Such auctions can be hard to understand analytically, making it difficult for bidders to determine how to behave and for designers to optimize auction rules to ensure desirable outcomes such as high revenue or welfare. In this paper, we investigate whether multi-agent reinforcement learning (MARL) algorithms can be used to understand iterative combinatorial auctions, given that these algorithms have recently shown empirical success in several other domains. We find that MARL can indeed benefit auction analysis, but that deploying it effectively is nontrivial. We begin by describing modelling decisions that keep the resulting game tractable without sacrificing important features such as imperfect information or asymmetry between bidders. We also discuss how to navigate pitfalls of various MARL algorithms, how to overcome challenges in verifying convergence, and how to generate and interpret multiple equilibria. We illustrate the promise of our resulting approach by using it to evaluate a specific rule change to a clock auction, finding substantially different auction outcomes due to complex changes in bidders' behavior.


[352] 2402.19421

Crafting Knowledge: Exploring the Creative Mechanisms of Chat-Based Search Engines

In the domain of digital information dissemination, search engines act as pivotal conduits linking information seekers with providers. The advent of chat-based search engines utilizing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), exemplified by Bing Chat, marks an evolutionary leap in the search ecosystem. They demonstrate metacognitive abilities in interpreting web information and crafting responses with human-like understanding and creativity. Nonetheless, the intricate nature of LLMs renders their "cognitive" processes opaque, challenging even their designers' understanding. This research aims to dissect the mechanisms through which an LLM-powered chat-based search engine, specifically Bing Chat, selects information sources for its responses. To this end, an extensive dataset has been compiled through engagements with New Bing, documenting the websites it cites alongside those listed by the conventional search engine. Employing natural language processing (NLP) techniques, the research reveals that Bing Chat exhibits a preference for content that is not only readable and formally structured, but also demonstrates lower perplexity levels, indicating a unique inclination towards text that is predictable by the underlying LLM. Further enriching our analysis, we procure an additional dataset through interactions with the GPT-4 based knowledge retrieval API, unveiling a congruent text preference between the RAG API and Bing Chat. This consensus suggests that these text preferences intrinsically emerge from the underlying language models, rather than being explicitly crafted by Bing Chat's developers. Moreover, our investigation documents a greater similarity among websites cited by RAG technologies compared to those ranked highest by conventional search engines.
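The perplexity signal described above can be approximated with any open causal language model. The snippet below scores two sentences with GPT-2 as a stand-in scorer; the study's underlying model is not public, so this is only a methodological illustration.

```python
# Sketch of the kind of perplexity measurement described above, using GPT-2 as a proxy scorer.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss          # mean token-level cross-entropy
    return torch.exp(loss).item()

print(perplexity("The capital of France is Paris."))
print(perplexity("Colorless green ideas sleep furiously in the spreadsheet."))
```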


[353] 2402.19422

PEM: Prototype-based Efficient MaskFormer for Image Segmentation

Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple segmentation tasks, such as semantic and panoptic, under a single unified framework. To achieve such impressive performance, these architectures employ intensive operations and require substantial computational resources, which are often not available, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM proposes a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition, PEM introduces an efficient multi-scale feature pyramid network, capable of extracting features that have high semantic content in an efficient way, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, evaluated on two different datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, and even better than, computationally expensive baselines.


[354] 2402.19423

Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?

Interactive segmentation, an integration of AI algorithms and human expertise, promises to improve the accuracy and efficiency of curating large-scale, detailed-annotated datasets in healthcare. Human experts revise the annotations predicted by AI, and in turn, AI improves its predictions by learning from these revised annotations. This interactive process continues to enhance the quality of annotations until no major revision is needed from experts. The key challenge is how to leverage AI predicted and expert revised annotations to iteratively improve the AI. Two problems arise: (1) The risk of catastrophic forgetting--the AI tends to forget the previously learned classes if it is only retrained using the expert revised classes. (2) Computational inefficiency when retraining the AI using both AI predicted and expert revised annotations; moreover, given the dominant AI predicted annotations in the dataset, the contribution of newly revised annotations--which often account for only a small fraction--to the AI training remains marginal. This paper proposes Continual Tuning to address these problems from two perspectives: network design and data reuse. Firstly, we design a shared network for all classes followed by class-specific networks dedicated to individual classes. To mitigate forgetting, we freeze the shared network for previously learned classes and only update the class-specific network for revised classes. Secondly, we reuse a small fraction of data with previous annotations to avoid over-computing. The selection of such data relies on the importance estimate of each data point. The importance score is computed by combining the uncertainty and consistency of AI predictions. Our experiments demonstrate that Continual Tuning achieves a speed 16x greater than repeatedly training AI from scratch without compromising the performance.
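A minimal sketch of the network-design side of this idea follows: freeze a shared trunk, update only the class-specific head for the revised class, and score data importance from uncertainty plus consistency. The architecture, dimensions, and the exact form of the importance score are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedPlusHeads(nn.Module):
    """Shared trunk plus one lightweight head per class (dimensions are made up)."""
    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_classes)])

    def forward(self, volume: torch.Tensor, class_idx: int) -> torch.Tensor:
        return self.heads[class_idx](self.shared(volume))

def importance_score(probs_now: torch.Tensor, probs_prev: torch.Tensor) -> torch.Tensor:
    """Assumed form only: combine prediction uncertainty with inconsistency across rounds."""
    uncertainty = -(probs_now * torch.log(probs_now + 1e-8)).sum(-1)
    inconsistency = (probs_now - probs_prev).abs().sum(-1)
    return uncertainty + inconsistency

model = SharedPlusHeads(num_classes=5)
for p in model.shared.parameters():          # freeze the shared trunk to protect old classes
    p.requires_grad = False
revised_class = 3                            # only the revised class's head gets updated
optimizer = torch.optim.Adam(model.heads[revised_class].parameters(), lr=1e-3)
```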


[355] 2402.19427

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
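For readers unfamiliar with the building block named above, a toy sequential reference implementation of a gated linear recurrence looks roughly like the following; the actual Hawk/Griffin recurrence, gating, and parallel scan are more elaborate, so this only conveys the shape of the computation.

```python
# Toy gated linear recurrence (illustrative only, not the Hawk/Griffin recurrence).
import torch

def gated_linear_recurrence(x, a, b):
    """h_t = a_t * h_{t-1} + b_t * x_t, applied element-wise over the feature dimension.

    x, a, b: tensors of shape (T, D); a_t in (0, 1) acts as a learned forget gate.
    """
    h = torch.zeros(x.size(1))
    outs = []
    for t in range(x.size(0)):
        h = a[t] * h + b[t] * x[t]
        outs.append(h)
    return torch.stack(outs)

T, D = 16, 8
x = torch.randn(T, D)
a = torch.sigmoid(torch.randn(T, D))   # input-dependent gates in (0, 1)
b = torch.sigmoid(torch.randn(T, D))
y = gated_linear_recurrence(x, a, b)
print(y.shape)  # torch.Size([16, 8])
```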


[356] 2402.19431

Compositional API Recommendation for Library-Oriented Code Generation

Large language models (LLMs) have achieved exceptional performance in code generation. However, the performance remains unsatisfactory in generating library-oriented code, especially for the libraries not present in the training data of LLMs. Previous work utilizes API recommendation technology to help LLMs use libraries: it retrieves APIs related to the user requirements, then leverages them as context to prompt LLMs. However, developmental requirements can be coarse-grained, requiring a combination of multiple fine-grained APIs. This granularity inconsistency makes API recommendation a challenging task. To address this, we propose CAPIR (Compositional API Recommendation), which adopts a "divide-and-conquer" strategy to recommend APIs for coarse-grained requirements. Specifically, CAPIR employs an LLM-based Decomposer to break down a coarse-grained task description into several detailed subtasks. Then, CAPIR applies an embedding-based Retriever to identify relevant APIs corresponding to each subtask. Moreover, CAPIR leverages an LLM-based Reranker to filter out redundant APIs and provides the final recommendation. To facilitate the evaluation of API recommendation methods on coarse-grained requirements, we present two challenging benchmarks, RAPID (Recommend APIs based on Documentation) and LOCG (Library-Oriented Code Generation). Experimental results on these benchmarks demonstrate the effectiveness of CAPIR in comparison to existing baselines. Specifically, on RAPID's Torchdata-AR dataset, compared to the state-of-the-art API recommendation approach, CAPIR improves recall@5 from 18.7% to 43.2% and precision@5 from 15.5% to 37.1%. On LOCG's Torchdata-Code dataset, compared to code generation without API recommendation, CAPIR improves pass@100 from 16.0% to 28.0%.
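The Decomposer/Retriever/Reranker flow can be summarized as a short pipeline skeleton. `call_llm`, the embedding function, and the API corpus below are placeholders, not CAPIR's actual components.

```python
# Skeleton of a divide-and-conquer API recommendation pipeline (placeholders throughout).
from typing import List
import numpy as np

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completions client here")

def decompose(requirement: str) -> List[str]:
    out = call_llm("Break this task into fine-grained subtasks, one per line:\n" + requirement)
    return [line.strip() for line in out.splitlines() if line.strip()]

def retrieve(subtask: str, api_docs: List[str], embed, k: int = 5) -> List[str]:
    q = embed(subtask)
    scores = [float(np.dot(q, embed(doc))) for doc in api_docs]
    top = sorted(range(len(api_docs)), key=lambda i: -scores[i])[:k]
    return [api_docs[i] for i in top]

def rerank(requirement: str, candidate_apis: List[str]) -> List[str]:
    out = call_llm("Keep only the APIs needed for the task and drop redundant ones:\n"
                   "Task: " + requirement + "\nAPIs:\n" + "\n".join(candidate_apis))
    return [line.strip() for line in out.splitlines() if line.strip()]

def recommend(requirement: str, api_docs: List[str], embed) -> List[str]:
    candidates: List[str] = []
    for sub in decompose(requirement):                      # LLM-based decomposition
        candidates.extend(retrieve(sub, api_docs, embed))   # embedding-based retrieval per subtask
    return rerank(requirement, candidates)                  # LLM-based reranking
```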


[357] 2402.19432

Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation

Recent years in robotics and imitation learning have shown remarkable progress in training large-scale foundation models by leveraging data across a multitude of embodiments. The success of such policies might lead us to wonder: just how diverse can the robots in the training set be while still facilitating positive transfer? In this work, we study this question in the context of heterogeneous embodiments, examining how even seemingly very different domains, such as robotic navigation and manipulation, can provide benefits when included in the training data for the same model. We train a single goal-conditioned policy that is capable of controlling robotic arms, quadcopters, quadrupeds, and mobile bases. We then investigate the extent to which transfer can occur across navigation and manipulation on these embodiments by framing them as a single goal-reaching task. We find that co-training with navigation data can enhance robustness and performance in goal-conditioned manipulation with a wrist-mounted camera. We then deploy a policy trained only on navigation and static manipulation data on a mobile manipulator, showing that it can control a novel embodiment in a zero-shot manner. These results provide evidence that large-scale robotic policies can benefit from data collected across various embodiments. Further information and robot videos can be found on our project website this http URL


[358] 2402.19434

Digital Twin Aided Massive MIMO: CSI Compression and Feedback

Deep learning (DL) approaches have demonstrated high performance in compressing and reconstructing the channel state information (CSI) and reducing the CSI feedback overhead in massive MIMO systems. One key challenge, however, with the DL approaches is the demand for extensive training data. Collecting this real-world CSI data incurs significant overhead that hinders the DL approaches from scaling to a large number of communication sites. To address this challenge, we propose a novel direction that utilizes site-specific \textit{digital twins} to aid the training of DL models. The proposed digital twin approach generates site-specific synthetic CSI data from the EM 3D model and ray tracing, which can then be used to train the DL model without real-world data collection. To further improve the performance, we adopt online data selection to refine the DL model training with a small real-world CSI dataset. Results show that a DL model trained solely on the digital twin data can achieve high performance when tested in a real-world deployment. Further, leveraging domain adaptation techniques, the proposed approach requires orders of magnitude less real-world data to approach the same performance of the model trained completely on a real-world CSI dataset.
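For readers unfamiliar with DL-based CSI feedback, a generic autoencoder baseline looks roughly like the sketch below; the architecture and dimensions are illustrative assumptions rather than the model used in the paper, and the random tensor stands in for digital-twin (ray-traced) CSI samples.

```python
# Generic CSI compression/feedback autoencoder sketch (illustrative baseline only).
import torch
import torch.nn as nn

class CSIAutoencoder(nn.Module):
    def __init__(self, n_ant=32, n_sub=32, code_dim=64):
        super().__init__()
        in_dim = 2 * n_ant * n_sub                       # real + imaginary parts
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256),
                                     nn.ReLU(), nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.shape = (2, n_ant, n_sub)

    def forward(self, csi):
        code = self.encoder(csi)                         # compressed feedback vector
        recon = self.decoder(code).view(-1, *self.shape)
        return recon, code

model = CSIAutoencoder()
synthetic_csi = torch.randn(8, 2, 32, 32)                # stand-in for ray-traced CSI samples
recon, code = model(synthetic_csi)
loss = nn.functional.mse_loss(recon, synthetic_csi)      # NMSE-style reconstruction objective
print(code.shape, loss.item())
```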


[359] 2402.19437

Differentially Private Worst-group Risk Minimization

We initiate a systematic study of worst-group risk minimization under $(\epsilon, \delta)$-differential privacy (DP). The goal is to privately find a model that approximately minimizes the maximal risk across $p$ sub-populations (groups) with different distributions, where each group distribution is accessed via a sample oracle. We first present a new algorithm that achieves excess worst-group population risk of $\tilde{O}(\frac{p\sqrt{d}}{K\epsilon} + \sqrt{\frac{p}{K}})$, where $K$ is the total number of samples drawn from all groups and $d$ is the problem dimension. Our rate is nearly optimal when each distribution is observed via a fixed-size dataset of size $K/p$. Our result is based on a new stability-based analysis for the generalization error. In particular, we show that $\Delta$-uniform argument stability implies $\tilde{O}(\Delta + \frac{1}{\sqrt{n}})$ generalization error w.r.t. the worst-group risk, where $n$ is the number of samples drawn from each sample oracle. Next, we propose an algorithmic framework for worst-group population risk minimization using any DP online convex optimization algorithm as a subroutine. Hence, we give another excess risk bound of $\tilde{O}\left( \sqrt{\frac{d^{1/2}}{\epsilon K}} +\sqrt{\frac{p}{K\epsilon^2}} \right)$. Assuming the typical setting of $\epsilon=\Theta(1)$, this bound is more favorable than our first bound in a certain range of $p$ as a function of $K$ and $d$. Finally, we study differentially private worst-group empirical risk minimization in the offline setting, where each group distribution is observed by a fixed-size dataset. We present a new algorithm with nearly optimal excess risk of $\tilde{O}(\frac{p\sqrt{d}}{K\epsilon})$.


[360] 2402.19441

3D Gaussian Model for Animation and Texturing

3D Gaussian Splatting has made a marked impact on neural rendering by achieving impressive fidelity and performance. Despite this achievement, however, it is not readily applicable to developing interactive applications. Real-time applications like XR apps and games require functions such as animation, UV-mapping, and model editing simultaneously manipulated through the usage of a 3D model. We propose a modeling approach analogous to typical 3D models, which we call the 3D Gaussian Model (3DGM); it provides a manipulatable proxy for novel animation and texture transfer. By binding the 3D Gaussians in texture space and re-projecting them back to world space through implicit shell mapping, we show how our 3D modeling can serve as a valid rendering methodology for interactive applications. We further note that recent 3D mesh reconstruction works have been able to produce high-quality meshes for rendering. Our work, on the other hand, only requires an approximated geometry for rendering an object in high fidelity. In terms of applications, we show that our proxy-based 3DGM is capable of driving novel animation without animated training data and transferring textures via UV mapping of the 3D Gaussians. We believe the result indicates the potential of our work for enabling interactive applications for 3D Gaussian Splatting.


[361] 2402.19442

Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases -- a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique for our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determine task allocation. To the best of our knowledge, our work provides the first convergence result for the multi-head softmax attention model.


[362] 2402.19443

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for acoustic modeling, now integrating deep neural network architectures. However, these performance gains have translated into increased complexity regarding the information learned and conveyed through these black-box architectures. Following much research on neural network interpretability, we propose in this article a protocol that aims to determine which information is located where in an ASR acoustic model (AM). To do so, we propose to evaluate AM performance on a determined set of tasks using intermediate representations (here, at different layer levels). Regarding the performance variation and targeted tasks, we can form hypotheses about which information is enhanced or perturbed at different architecture steps. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification. The analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment, or speaker identity. The low-level hidden layers globally appear useful for structuring information, while the upper ones tend to discard information that is useless for phoneme recognition.
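The probing protocol generalizes to a simple recipe: freeze the acoustic model, pool an intermediate layer's activations, and fit a linear classifier per downstream property. The sketch below uses random features as stand-ins for those activations; the feature extraction details are assumptions.

```python
# Generic layer-probing recipe (random features stand in for frozen AM activations).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for pooled hidden activations from layer k of a frozen acoustic model, plus labels.
features = rng.normal(size=(1000, 256))
labels = rng.integers(0, 2, size=1000)       # e.g. gender or sentiment classes

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy at layer k:", probe.score(X_te, y_te))
# Repeating this per layer and per task shows where (and how strongly) a property is encoded.
```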


[363] 2402.19446

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate completions for a given prompt, but rather make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions -- all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function to train a token policy within each utterance or turn. Our hierarchical framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining a sample efficiency of about 100x over existing methods, while also improving with larger model capacity (up to the 7 billion scale that we tested on).


[364] 2402.19449

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why this happens. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics. When training with gradient descent, the loss associated with infrequent words decreases more slowly than the loss associated with frequent ones. As most samples come from relatively infrequent words, the average loss decreases slowly with gradient descent. On the other hand, Adam and sign-based methods do not suffer from this problem and improve predictions on all classes. To establish that this behavior is indeed caused by class imbalance, we show empirically that it persists across different architectures and data types, on language transformers, vision CNNs, and linear models. We further study this phenomenon on linear classification with cross-entropy loss, showing that heavy-tailed class imbalance leads to ill-conditioning, and that the normalization used by Adam can counteract it.
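The qualitative claim can be examined on a toy problem: a linear classifier on Zipf-distributed classes, trained with full-batch gradient descent versus Adam, then compared on the loss of the rare half of the classes. Data, hyperparameters, and step counts below are purely illustrative assumptions, not the paper's setup.

```python
# Toy illustration: rare-class loss under GD vs Adam with heavy-tailed class frequencies.
import torch

torch.manual_seed(0)
n_classes = 100
dim = n_classes
freq = 1.0 / torch.arange(1, n_classes + 1).float()          # Zipf-like class frequencies
counts = (freq / freq.sum() * 5000).long().clamp(min=1)
y = torch.cat([torch.full((c,), k, dtype=torch.long)
               for k, c in enumerate(counts.tolist())])
x = torch.randn(len(y), dim) + 0.5 * torch.nn.functional.one_hot(y, n_classes).float()

def rare_class_loss(opt_name, steps=200):
    w = torch.zeros(dim, n_classes, requires_grad=True)
    opt = (torch.optim.SGD([w], lr=0.5) if opt_name == "gd"
           else torch.optim.Adam([w], lr=0.05))
    for _ in range(steps):                                    # full-batch training
        loss = torch.nn.functional.cross_entropy(x @ w, y)
        opt.zero_grad(); loss.backward(); opt.step()
    per_sample = torch.nn.functional.cross_entropy(x @ w, y, reduction="none")
    return per_sample[y >= n_classes // 2].mean().item()      # loss on the rare half of classes

print("rare-class loss, GD  :", rare_class_loss("gd"))
print("rare-class loss, Adam:", rare_class_loss("adam"))
```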


[365] 2402.19450

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

We propose a framework for robust evaluation of reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance over the static version of a problem compared to a snapshot of the functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with functionalization of other benchmarks to follow. When evaluating current state-of-the-art models over snapshots of MATH(), we find a reasoning gap -- the percentage difference between the static and functional accuracies. We find reasoning gaps from 58.35% to 80.31% among the state-of-the-art closed and open weights models that perform well on static benchmarks, with the caveat that the gaps are likely to be smaller with more sophisticated prompting strategies. Here we show that models which anecdotally have good reasoning performance on real-world tasks have quantifiably lower gaps, motivating the open problem of building "gap 0" models. Code for evaluation and new evaluation datasets, three MATH() snapshots, are publicly available at https://github.com/consequentai/fneval/.
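The two core notions (a functional variant as a generator of equivalent problem instances, and the gap as the relative drop from static to functional accuracy) can be illustrated as follows; the actual MATH() snapshot machinery is more involved, and the numbers below are invented.

```python
# Conceptual illustration of functional problem variants and the reasoning gap.
import random

def static_problem():
    return ("If a train travels 60 miles in 2 hours, what is its average speed?", 30)

def functional_problem(rng: random.Random):
    miles, hours = rng.choice([60, 90, 120]), rng.choice([2, 3, 4])
    q = f"If a train travels {miles} miles in {hours} hours, what is its average speed?"
    return q, miles / hours

def reasoning_gap(static_acc: float, functional_acc: float) -> float:
    """Percentage difference between static and functional accuracy."""
    return 100.0 * (static_acc - functional_acc) / static_acc

print(functional_problem(random.Random(0)))
print(reasoning_gap(0.90, 0.35))   # e.g. a 61.1% gap for hypothetical accuracies
```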


[366] 2402.19457

$\texttt{COSMIC}$: Mutual Information for Task-Agnostic Summarization Evaluation

Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes. We theoretically establish a direct relationship between the resulting error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like $\texttt{BERTScore}$ and $\texttt{ROUGE}$ highlight the competitive performance of $\texttt{COSMIC}$.


[367] 2402.19460

Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks

Uncertainty quantification, once a singular task, has evolved into a spectrum of tasks, including abstained prediction, out-of-distribution detection, and aleatoric uncertainty quantification. The latest goal is disentanglement: the construction of multiple estimators that are each tailored to one and only one task. Hence, there is a plethora of recent advances with different intentions - that often entirely deviate from practical behavior. This paper conducts a comprehensive evaluation of numerous uncertainty estimators across diverse tasks on ImageNet. We find that, despite promising theoretical endeavors, disentanglement is not yet achieved in practice. Additionally, we reveal which uncertainty estimators excel at which specific tasks, providing insights for practitioners and guiding future research toward task-centric and disentangled uncertainty estimation methods. Our code is available at https://github.com/bmucsanyi/bud.


[368] 2402.19463

SeMoLi: What Moves Together Belongs Together

We tackle semi-supervised object detection based on motion cues. Recent results suggest that heuristic-based clustering methods in conjunction with object trackers can be used to pseudo-label instances of moving objects and use these as supervisory signals to train 3D object detectors in Lidar data without manual supervision. We re-think this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner. We leverage recent advances in scene flow estimation to obtain point trajectories from which we extract long-term, class-agnostic motion patterns. Revisiting correlation clustering in the context of message passing networks, we learn to group those motion patterns to cluster points to object instances. By estimating the full extent of the objects, we obtain per-scan 3D bounding boxes that we use to supervise a Lidar object detection network. Our method not only outperforms prior heuristic-based approaches (57.5 AP, +14 improvement over prior work); more importantly, we show that we can pseudo-label and train object detectors across datasets.


[369] 2402.19464

Curiosity-driven Red-teaming for Large Language Models

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in a low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. Our method, CRT, successfully provokes toxic responses from the LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam}


[370] 2402.19465

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding}, who show that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.


[371] 2402.19467

TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning

It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, have lowered performance on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.


[372] 2402.19469

Humanoid Locomotion as Next Token Prediction

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.


[373] 2402.19471

Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling

Questions combine our mastery of language with our remarkable facility for reasoning about uncertainty. How do people navigate vast hypothesis spaces to pose informative questions given limited cognitive resources? We study these tradeoffs in a classic grounded question-asking task based on the board game Battleship. Our language-informed program sampling (LIPS) model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain. We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance across varied Battleship board scenarios. In contrast, LLM-only baselines struggle to ground questions in the board state; notably, GPT-4V provides no improvement over non-visual baselines. Our results illustrate how Bayesian models of question-asking can leverage the statistics of language to capture human priors, while highlighting some shortcomings of pure LLMs as grounded reasoners.


[374] 2402.19472

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: Sort \& Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the "benchmark exhaustion" problem.


[375] 2402.19473

Retrieval-Augmented Generation for AI-Generated Content: A Survey

The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by advancements in model algorithms, scalable foundation model architectures, and the availability of ample high-quality datasets. While AIGC has achieved remarkable performance, it still faces challenges, such as the difficulty of maintaining up-to-date and long-tail knowledge, the risk of data leakage, and the high costs associated with training and inference. Retrieval-Augmented Generation (RAG) has recently emerged as a paradigm to address such challenges. In particular, RAG introduces the information retrieval process, which enhances AIGC results by retrieving relevant objects from available data stores, leading to greater accuracy and robustness. In this paper, we comprehensively review existing efforts that integrate the RAG technique into AIGC scenarios. We first classify RAG foundations according to how the retriever augments the generator. We distill the fundamental abstractions of the augmentation methodologies for various retrievers and generators. This unified perspective encompasses all RAG scenarios, illuminating advancements and pivotal technologies that help with potential future progress. We also summarize additional enhancement methods for RAG, facilitating effective engineering and implementation of RAG systems. Then, from another view, we survey practical applications of RAG across different modalities and tasks, offering valuable references for researchers and practitioners. Furthermore, we introduce the benchmarks for RAG, discuss the limitations of current RAG systems, and suggest potential directions for future research. Project: https://github.com/hymie122/RAG-Survey


[376] 2402.19474

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them, diminishing the relation hallucination often encountered by Multi-modal Large Language Models (MLLMs). To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset (AS-V2) which is aligned with the format of standard instruction tuning data. In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE), for comprehensively evaluating the relation comprehension capabilities of MLLMs. Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that our work can inspire more future research and contribute to the evolution towards artificial general intelligence. Our project is released at https://github.com/OpenGVLab/all-seeing.


[377] 2402.19475

The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?

While language models are increasingly more proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtle and pass weaker correctness checks such as being able to compile. In this work, we focus on these counterfeit samples: programs sampled from a language model that 1) have a high enough log-probability to be generated at a moderate temperature and 2) pass weak correctness checks. Overall, we discover that most models have a very shallow understanding of counterfeits through three clear failure modes. First, models mistakenly classify them as correct. Second, models are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if they were correct. Third, when asking models to fix counterfeits, the likelihood of a model successfully repairing a counterfeit is often even lower than that of sampling a correct program from scratch. Counterfeits also have very unexpected properties: first, counterfeit programs for problems that are easier for a model to solve are not necessarily easier to detect and only slightly easier to execute and repair. Second, counterfeits from a given model are just as confusing to the model itself as they are to other models. Finally, both strong and weak models are able to generate counterfeit samples that equally challenge all models. In light of our findings, we recommend that care and caution be taken when relying on models to understand their own samples, especially when no external feedback is incorporated.


[378] 2402.19477

Learning a Generalized Physical Face Model From Data

Physically-based simulation is a powerful approach for 3D facial animation, as the resulting deformations are governed by physical constraints, making it easy to resolve self-collisions, respond to external forces, and perform realistic anatomy edits. Today's methods are data-driven, where the actuations for finite elements are inferred from captured skin geometry. Unfortunately, these approaches have not been widely adopted due to the complexity of initializing the material space and learning the deformation model for each character separately, which often requires a skilled artist followed by lengthy network training. In this work, we aim to make physics-based facial animation more accessible by proposing a generalized physical face model that we learn from a large 3D face dataset in a simulation-free manner. Once trained, our model can be quickly fit to any unseen identity and produce a ready-to-animate physical face model automatically. Fitting is as easy as providing a single 3D face scan, or even a single face image. After fitting, we offer intuitive animation controls, as well as the ability to retarget animations across characters. All the while, the resulting animations allow for physical effects like collision avoidance, gravity, paralysis, bone reshaping and more.


[379] 2402.19479

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.


[380] 2402.19481

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, na\"{\i}vely implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the input from adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.


[381] 2401.15324

Neutrino Reconstruction in TRIDENT Based on Graph Neural Network

TRopIcal DEep-sea Neutrino Telescope (TRIDENT) is a next-generation neutrino telescope to be located in the South China Sea. With a large detector volume and the use of advanced hybrid digital optical modules (hDOMs), TRIDENT aims to discover multiple astrophysical neutrino sources and probe all-flavor neutrino physics. The reconstruction resolution of primary neutrinos is on the critical path to these scientific goals. We have developed a novel reconstruction method based on graph neural network (GNN) for TRIDENT. In this paper, we present the reconstruction performance of the GNN-based approach on both track- and shower-like neutrino events in TRIDENT.


[382] 2402.18575

DiffuseRAW: End-to-End Generative RAW Image Processing for Low-Light Images

Imaging under extremely low-light conditions presents a significant challenge and is an ill-posed problem due to the low signal-to-noise ratio (SNR) caused by minimal photon capture. Previously, diffusion models have been used for multiple kinds of generative and image-to-image tasks; however, these models work as a post-processing step, since they are trained on, and learn from, already-processed images. Such approaches are often not well-suited for extremely low-light tasks. Unlike the task of low-light image enhancement or image-to-image enhancement, we tackle the task of learning the entire image-processing pipeline, from the RAW image to a processed image. For this task, a traditional image processing pipeline often consists of multiple specialized parts that are overly reliant on the downstream tasks. Unlike these, we develop a new generative ISP that relies on fine-tuning latent diffusion models on RAW images and generating processed long-exposure images, which allows for the apt use of the priors from large text-to-image generation models. We evaluate our approach on popular end-to-end low-light datasets, for which we see promising results, and set a new SoTA on the See-in-Dark (SID) dataset. Furthermore, with this work, we hope to pave the way for more generative and diffusion-based image processing and other problems on RAW data.


[383] 2402.18583

Binding-Adaptive Diffusion Models for Structure-Based Drug Design

Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract the subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with a cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM


[384] 2402.18598

Note: Evolutionary Game Theory Focus Informational Health: The Cocktail Party Effect Through Werewolfgame under Incomplete Information and ESS Search Method Using Expected Gains of Repeated Dilemmas

We explore the state of information disruption caused by the cocktail party effect within the framework of non-perfect information games and evolutionary games with multiple werewolves. In particular, we mathematically model and analyze the effects on the gain of each strategy choice and the formation process of evolutionarily stable strategies (ESS) under the assumption that the pollution risk of fake news is randomly assigned in the context of repeated dilemmas. We develop the computational process in detail, starting with the construction of the gain matrix, modeling the evolutionary dynamics using the replicator equation, and identifying the ESS. In addition, numerical simulations are performed to observe system behavior under different initial conditions and parameter settings to better understand the impact of the spread of fake news on strategy evolution. This research provides theoretical insights into the complex issues of contemporary society regarding the authenticity of information and expands the range of applications of evolutionary game theory.
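The replicator-equation step mentioned above is standard and easy to sketch. The payoff matrix below is a placeholder (the paper's gain matrix for the fake-news/werewolf setting is not reproduced here), so the numbers only illustrate the mechanics.

```python
# Generic replicator dynamics for a two-strategy game with a placeholder payoff matrix.
import numpy as np

A = np.array([[3.0, 0.0],        # payoff of strategy i against strategy j (assumed values)
              [5.0, 1.0]])

def replicator_step(x, A, dt=0.01):
    """dx_i/dt = x_i * ((A x)_i - x^T A x); one explicit Euler step."""
    fitness = A @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

x = np.array([0.5, 0.5])                 # initial population shares
for _ in range(5000):
    x = replicator_step(x, A)
print("long-run strategy shares:", x)    # converges toward the game's stable strategy
```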


[385] 2402.18600

Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina

Diabetes mellitus (DM) predisposes patients to vascular complications. Retinal images and vasculature reflect the body's micro- and macrovascular health. They can be used to diagnose DM complications, including diabetic retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular disease, as well as forecast the risk of cardiovascular events. Artificial intelligence (AI)-enabled systems developed for high-throughput detection of DR using digitized retinal images have become clinically adopted. Beyond DR screening, AI integration also holds immense potential to address challenges associated with the holistic care of the patient with DM. In this work, we aim to comprehensively review the literature for studies on AI applications based on retinal images related to DM diagnosis, prognostication, and management. We will describe the findings of holistic AI-assisted diabetes care, including but not limited to DR screening, and discuss barriers to implementing such systems, including issues concerning ethics, data privacy, equitable access, and explainability. With the ability to evaluate the patient's health status vis-à-vis DM complications, as well as risk prognostication of future cardiovascular complications, AI-assisted retinal image analysis has the potential to become a central tool for modern personalized medicine in patients with DM.


[386] 2402.18611

HemaGraph: Breaking Barriers in Hematologic Single Cell Classification with Graph Attention

In the realm of hematologic cell population classification, the intricate patterns within flow cytometry data necessitate advanced analytical tools. This paper presents 'HemaGraph', a novel framework based on Graph Attention Networks (GATs) for single-cell multi-class classification of hematological cells from flow cytometry data. Harnessing the power of GATs, our method captures subtle cell relationships, offering highly accurate patient profiling. Based on evaluation of data from 30 patients, HemaGraph demonstrates classification performance across five different cell classes, outperforming traditional methodologies and state-of-the-art methods. Moreover, the uniqueness of this framework lies in the training and testing phase of HemaGraph, where it has been applied to extremely large graphs, containing up to hundreds of thousands of nodes and two million edges, to detect low frequency cell populations (e.g. 0.01% for one population), with accuracies reaching 98%. Our findings underscore the potential of HemaGraph in improving hematologic multi-class classification, paving the way for patient-personalized interventions. To the best of our knowledge, this is the first effort to use GATs, and Graph Neural Networks (GNNs) in general, to classify cell populations from single-cell flow cytometry data. We envision applying this method to single-cell data from larger cohorts of patients and on other hematologic diseases.


[387] 2402.18612

Understanding random forests and overfitting: a visualization and simulation study

Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models trained with minimum node size 2 or 20 using the ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a larger peak, while isolated events created local peaks. In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.
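In the spirit of the simulation study, the contrast between training and test c-statistics under different minimum node sizes can be sketched with scikit-learn rather than R's ranger; sample sizes and parameters below are illustrative assumptions, not the paper's settings.

```python
# Illustrative train-vs-test c-statistic (AUC) comparison for random forests.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=16, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for min_leaf in (1, 10):                       # loosely analogous to minimum node size 2 vs 20
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=min_leaf,
                                random_state=0).fit(X_tr, y_tr)
    c_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
    c_test = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    print(f"min_samples_leaf={min_leaf}: train c={c_train:.3f}, test c={c_test:.3f}")
```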


[388] 2402.18684

Quantum State Compression with Polar Codes

In the quantum compression scheme proposed by Schumacher, Alice compresses a message that Bob decompresses. In that approach, there is some probability of failure and, even when successful, some distortion of the state. For sufficiently large blocklengths, both of these imperfections can be made arbitrarily small while achieving a compression rate that asymptotically approaches the source coding bound. However, direct implementation of Schumacher compression suffers from poor circuit complexity. In this paper, we consider a slightly different approach based on classical syndrome source coding. The idea is to use a linear error-correcting code and treat the message to be compressed as an error pattern. If the message is a correctable error (i.e., a coset leader) then Alice can use the error-correcting code to convert her message to a corresponding quantum syndrome. An implementation of this based on polar codes is described and simulated. As in classical source coding based on polar codes, Alice maps the information into the ``frozen'' qubits that constitute the syndrome. To decompress, Bob utilizes a quantum version of successive cancellation coding.


[389] 2402.18697

Inferring Dynamic Networks from Marginals with Iterative Proportional Fitting

A common network inference problem, arising from real-world data constraints, is how to infer a dynamic network from its time-aggregated adjacency matrix and time-varying marginals (i.e., row and column sums). Prior approaches to this problem have repurposed the classic iterative proportional fitting (IPF) procedure, also known as Sinkhorn's algorithm, with promising empirical results. However, the statistical foundation for using IPF has not been well understood: under what settings does IPF provide principled estimation of a dynamic network from its marginals, and how well does it estimate the network? In this work, we establish such a setting, by identifying a generative network model whose maximum likelihood estimates are recovered by IPF. Our model both reveals implicit assumptions on the use of IPF in such settings and enables new analyses, such as structure-dependent error bounds on IPF's parameter estimates. When IPF fails to converge on sparse network data, we introduce a principled algorithm that guarantees IPF converges under minimal changes to the network structure. Finally, we conduct experiments with synthetic and real-world data, which demonstrate the practical value of our theoretical and algorithmic contributions.
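
For readers unfamiliar with the procedure this paper builds on, the sketch below implements classic iterative proportional fitting (Sinkhorn scaling) of a nonnegative matrix to prescribed row and column marginals; the toy matrix and marginals are hypothetical, and the paper's generative model, error bounds, and convergence fix for sparse data are not reproduced here.

```python
import numpy as np

def ipf(A, row_targets, col_targets, tol=1e-10, max_iter=10_000):
    """Iterative proportional fitting (Sinkhorn scaling): rescale a nonnegative
    matrix A until its row and column sums match the given targets."""
    X = A.astype(float).copy()
    for _ in range(max_iter):
        # Scale rows to match the target row sums.
        X *= (row_targets / X.sum(axis=1))[:, None]
        # Scale columns to match the target column sums.
        X *= (col_targets / X.sum(axis=0))[None, :]
        if np.abs(X.sum(axis=1) - row_targets).max() < tol:
            break
    return X

# Toy example: a time-aggregated adjacency matrix and one time step's marginals
# (hypothetical numbers; the two marginal vectors must share the same total).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
row_t = np.array([2.0, 3.0])   # out-degree marginals for this time step
col_t = np.array([1.0, 4.0])   # in-degree marginals for this time step
print(ipf(A, row_t, col_t).round(3))
```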


[390] 2402.18703

Zero-error communication, scrambling, and ergodicity

The long term behaviour of a quantum channel under iterations (i.e. under repeated applications of itself) yields a plethora of interesting properties. These include ergodicity, mixing, eventual scrambling, becoming strictly positive, and the vanishing of its one-shot zero error capacities. We derive relations between these seemingly different properties and find novel bounds on indices which quantify the minimum number of iterations needed for the onset of some of these properties. We obtain a lower bound on the one-shot zero-error classical capacity of $n$ iterations of an ergodic channel (for any positive integer $n$) in terms of the cardinality of its peripheral spectrum. We also find upper bounds on the minimum number of iterations needed for the one-shot capacities of any channel to stabilize. We consider two classes of quantum channels, satisfying certain symmetries, for which upper bounds on the above indices are optimal, since they reduce to the corresponding indices for a stochastic matrix (for which the bounds are known to be optimal). As an auxiliary result, we obtain a trade-off relation between the one-shot zero error classical and quantum capacities of a quantum channel.


[391] 2402.18729

A Priori Uncertainty Quantification of Reacting Turbulence Closure Models using Bayesian Neural Networks

While many physics-based closure model forms have been posited for the sub-filter scale (SFS) in large eddy simulation (LES), vast amounts of data available from direct numerical simulation (DNS) create opportunities to leverage data-driven modeling techniques. Albeit flexible, data-driven models still depend on the dataset and the functional form of the model chosen. Increased adoption of such models requires reliable uncertainty estimates both in the data-informed and out-of-distribution regimes. In this work, we employ Bayesian neural networks (BNNs) to capture both epistemic and aleatoric uncertainties in a reacting flow model. In particular, we model the filtered progress variable scalar dissipation rate which plays a key role in the dynamics of turbulent premixed flames. We demonstrate that BNN models can provide unique insights about the structure of uncertainty of the data-driven closure models. We also propose a method for the incorporation of out-of-distribution information in a BNN. The efficacy of the model is demonstrated by a priori evaluation on a dataset consisting of a variety of flame conditions and fuels.


[392] 2402.18758

Analog Isolated Multilevel Quantizer for Voltage Sensing while Maintaining Galvanic Isolation

A low-power, compact device for performing measurements in electrical systems with isolated voltage domains is proposed. Isolated measurements are required in numerous applications. For instance, a measurement of the bus voltage for a system with a high supply voltage and lower isolated local voltage level may be needed for system health monitoring and control. Such a requirement may necessitate the use of isolation amplifiers to provide voltage telemetry for the local system. Isolation amplifiers require dual galvanically isolated supplies and use magnetic, capacitive, or optical barriers between primary and secondary sides. Producing this supplemental voltage requires an extra voltage converter, which consumes power and generates electromagnetic interference which must, in turn, be filtered. Complex designs incorporating feedback are needed to achieve linear response. The proposed Analog Isolated Multilevel Quantizer (AIMQ) addresses these issues by monitoring the primary-side signal and communicating the results to the secondary side using a novel scheme involving Zener diodes, optocouplers, transistors, one-hot coding, and discrete outputs. The result is a low power isolated transducer that can in principle be extended to an arbitrary bit depth.


[393] 2402.18761

Exploration of Learned Lifting-Based Transform Structures for Fully Scalable and Accessible Wavelet-Like Image Compression

This paper provides a comprehensive study of the features and performance of different ways to incorporate neural networks into lifting-based wavelet-like transforms, within the context of fully scalable and accessible image compression. Specifically, we explore different arrangements of lifting steps, as well as various network architectures for learned lifting operators. Moreover, we examine the impact of the number of learned lifting steps, the number of channels, the number of layers and the support of kernels in each learned lifting operator. To facilitate the study, we investigate two generic training methodologies that are simultaneously appropriate to a wide variety of lifting structures considered. Experimental results ultimately suggest that retaining fixed lifting steps from the base wavelet transform is highly beneficial. Moreover, we demonstrate that employing more learned lifting steps and more layers in each learned lifting operator does not contribute strongly to the compression performance. However, benefits can be obtained by utilizing more channels in each learned lifting operator. Ultimately, the learned wavelet-like transform proposed in this paper achieves over 25% bit-rate savings compared to JPEG 2000 with compact spatial support.


[394] 2402.18777

GDCNet: Calibrationless geometric distortion correction of echo planar imaging data using deep learning

Functional magnetic resonance imaging techniques benefit from echo-planar imaging's fast image acquisition but are susceptible to inhomogeneities in the main magnetic field, resulting in geometric distortion and signal loss artifacts in the images. Traditional methods leverage a field map or voxel displacement map for distortion correction. However, voxel displacement map estimation requires additional sequence acquisitions, and the accuracy of the estimation influences correction performance. This work implements a novel approach called GDCNet, which estimates a geometric distortion map by non-linear registration to T1-weighted anatomical images and applies it for distortion correction. GDCNet demonstrated fast distortion correction of functional images in retrospectively and prospectively acquired datasets. Among the compared models, the 2D self-supervised configuration resulted in a statistically significant improvement to normalized mutual information between distortion-corrected functional and T1-weighted images compared to the benchmark methods FUGUE and TOPUP. Furthermore, GDCNet models achieved processing speeds 14 times faster than TOPUP in the prospective dataset.


[395] 2402.18790

The Power of Unentangled Quantum Proofs with Non-negative Amplitudes

Quantum entanglement is a fundamental property of quantum mechanics and plays a crucial role in quantum computation and information. We study entanglement via the lens of computational complexity by considering quantum generalizations of the class NP with multiple unentangled quantum proofs, the so-called QMA(2) and its variants. The complexity of QMA(2) is a longstanding open problem, and only the trivial bounds QMA $\subseteq$ QMA(2) $\subseteq$ NEXP are known. In this work, we study the power of unentangled quantum proofs with non-negative amplitudes, a class which we denote $\text{QMA}^+(2)$. In this setting, we are able to design proof verification protocols for problems both using logarithmic size quantum proofs and having a constant probability gap in distinguishing yes from no instances. In particular, we design global protocols for small set expansion, unique games, and PCP verification. As a consequence, we obtain NP $\subseteq \text{QMA}^+_{\log}(2)$ with a constant gap. By virtue of the new constant gap, we are able to ``scale up'' this result to $\text{QMA}^+(2)$, obtaining the full characterization $\text{QMA}^+(2)$=NEXP by establishing stronger explicitness properties of the PCP for NEXP. One key novelty of these protocols is the manipulation of quantum proofs in a global and coherent way yielding constant gaps. Previous protocols (only available for general amplitudes) are either local having vanishingly small gaps or treat the quantum proofs as classical probability distributions requiring polynomially many proofs thereby not implying non-trivial bounds on QMA(2). Finally, we show that QMA(2) is equal to $\text{QMA}^+(2)$ provided the gap of the latter is a sufficiently large constant. In particular, if $\text{QMA}^+(2)$ admits gap amplification, then QMA(2)=NEXP.


[396] 2402.18830

Training-set-free two-stage deep learning for Spectroscopic data de-noising

De-noising is a prominent step in the spectra post-processing procedure. Previous machine learning-based methods are fast but mostly based on supervised learning and require a training set that is typically expensive to obtain in real experimental measurements. Unsupervised learning-based algorithms are slow and require many iterations to achieve convergence. Here, we bridge this gap by proposing a training-set-free two-stage deep learning method. We show that the fuzzy fixed input in previous methods can be improved by introducing an adaptive prior. Combined with more advanced optimization techniques, our approach can achieve five times acceleration compared to previous work. Theoretically, we study the landscape of a corresponding non-convex linear problem, and our results indicate that this problem has benign geometry for first-order algorithms to converge.


[397] 2402.18856

Anatomy-guided fiber trajectory distribution estimation for cranial nerves tractography

Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods for CN identification are prone to producing erroneous trajectories and missing true positive connections. To overcome the above challenge, we propose a novel CN identification framework with anatomy-guided fiber trajectory distribution, which incorporates anatomical shape prior knowledge during the process of CN tracing to build diffusion tensor vector fields. We introduce higher-order streamline differential equations for continuous flow field representations to directly characterize the fiber trajectory distribution of CNs from the tract-based level. The experimental results on the in vivo HCP dataset and the clinical MDM dataset demonstrate that the proposed method reduces false-positive fiber production compared to competing methods and produces reconstructed CNs (i.e. CN II, CN III, CN V, and CN VII/VIII) that are judged to better correspond to the known anatomy.


[398] 2402.18867

Message-Enhanced DeGroot Model

Understanding the impact of messages on agents' opinions over social networks is important. However, to the best of our knowledge, there has been limited quantitative investigation into this phenomenon in prior work. To address this gap, this paper proposes the Message-Enhanced DeGroot model. The Bounded Brownian Message model provides a quantitative description of the message evolution, jointly considering temporal continuity, randomness, and polarization from mass media theory. The Message-Enhanced DeGroot model, combining the Bounded Brownian Message model with the traditional DeGroot model, quantitatively describes the evolution of agents' opinions under the influence of messages. We theoretically study the probability distribution and statistics of the messages and agents' opinions and quantitatively analyze the impact of messages on opinions. We also conduct simulations to validate our analyses.
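
As background, the following minimal sketch runs the classic DeGroot averaging update with a crude bounded random walk standing in for the message process; the trust matrix, susceptibility parameter, and message dynamics are illustrative assumptions and are not the paper's Bounded Brownian Message model or its exact coupling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Row-stochastic trust matrix of a 4-agent network (classic DeGroot model).
W = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.6, 0.2, 0.0],
              [0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.5, 0.5]])

x = np.array([0.9, 0.2, 0.4, 0.7])   # initial opinions
suscept = 0.1                         # hypothetical susceptibility to the message
m = 0.0                               # message value, here a bounded random walk

for t in range(100):
    # Bounded random-walk message (a crude stand-in for a bounded Brownian message).
    m = np.clip(m + 0.05 * rng.standard_normal(), -1.0, 1.0)
    # DeGroot averaging plus an external message term.
    x = (1 - suscept) * (W @ x) + suscept * m

print(x.round(3))
```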


[399] 2402.18871

LoLiSRFlow: Joint Single Image Low-light Enhancement and Super-resolution via Cross-scale Transformer-based Conditional Flow

The visibility of real-world images is often limited by both low light and low resolution; however, these issues are typically addressed separately in the literature through Low-Light Enhancement (LLE) and Super-Resolution (SR) methods. Admittedly, a simple cascade of these approaches cannot work harmoniously to cope well with the highly ill-posed problem of simultaneously enhancing visibility and resolution. In this paper, we propose a normalizing flow network, dubbed LoLiSRFLow, specifically designed to consider the degradation mechanism inherent in joint LLE and SR. To break the bonds of the one-to-many mapping from low-light low-resolution images to normal-light high-resolution images, LoLiSRFLow directly learns the conditional probability distribution over a variety of feasible solutions for high-resolution well-exposed images. Specifically, a multi-resolution parallel transformer acts as a conditional encoder that extracts the Retinex-induced resolution-and-illumination-invariant map as the prior. The invertible network maps the distribution of normally exposed high-resolution images to a latent distribution. The backward inference is equivalent to introducing an additional constrained loss for the normal training route, thus enabling the manifold of the natural exposure of the high-resolution image to be faithfully depicted. We also propose a synthetic dataset modeling the realistic low-light low-resolution degradation, named DFSR-LLE, containing 7100 low-resolution dark-light/high-resolution normal sharp pairs. Quantitative and qualitative experimental results demonstrate the effectiveness of our method on both the proposed synthetic and real datasets.


[400] 2402.18930

Variable-Rate Learned Image Compression with Multi-Objective Optimization and Quantization-Reconstruction Offsets

Achieving successful variable bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor, or uniform quantization of all elements of the latent tensor. This paper follows the traditional approach of varying a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable rate compression performance. First, multi-objective optimization is used for (post-)training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable-rate quantization is also used for the hyper-latent. All these modifications can be made on a pre-trained single-rate compression model by performing post-training. The algorithms are implemented into three well-known image compression models and the achieved variable rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at https://github.com/InterDigitalInc/CompressAI)
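
The sketch below only illustrates the underlying idea of a single step size controlling the rate, with a hypothetical reconstruction offset that pulls dequantized values toward zero; the paper's actual offset parameterization, multi-objective post-training, and hyper-latent handling are not reproduced here.

```python
import numpy as np

def quantize(y, step):
    """Uniform quantization of a latent tensor with a single step size."""
    return np.round(y / step)

def dequantize(q, step, offset=0.0):
    """Reconstruction with a hypothetical quantization-reconstruction offset
    that pulls nonzero values toward zero (offset is a tunable scalar)."""
    return step * (q - offset * np.sign(q))

y = np.random.default_rng(0).normal(size=5).astype(np.float32)
for step in (0.5, 1.0, 2.0):          # larger step -> lower rate, higher distortion
    q = quantize(y, step)
    print(step, dequantize(q, step, offset=0.1).round(3))
```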


[401] 2402.18932

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.


[402] 2402.18968

Ambisonics Networks -- The Effect Of Radial Functions Regularization

Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, with the downside of introducing errors to the Ambisonics encoding. This paper aims to investigate the impact of different regularization choices on Deep Neural Network (DNN) training and performance. Ideally, these networks should be robust to the choice of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example speaker-localization algorithm based on the direct-path dominance (DPD) test. Results show that performance may be sensitive to the choice of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information.
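
To illustrate the low-frequency amplification problem, the following minimal sketch compares direct division by open-sphere radial functions with a Tikhonov-regularized inverse; the open-sphere model, the order range, and the regularization constant are assumptions for illustration, and the paper's specific regularization schemes are not reproduced.

```python
import numpy as np
from scipy.special import spherical_jn

def open_sphere_radial(n, kr):
    """Radial function b_n(kr) of an open-sphere array (plane-wave model assumption):
    b_n(kr) = 4*pi*i^n * j_n(kr)."""
    return 4 * np.pi * (1j ** n) * spherical_jn(n, kr)

def regularized_inverse(b, lam):
    """Tikhonov-regularized division by the radial function."""
    return np.conj(b) / (np.abs(b) ** 2 + lam ** 2)

kr = np.linspace(1e-3, 5, 200)
for n in range(3):
    b = open_sphere_radial(n, kr)
    gain_raw = np.abs(1 / b)                                   # blows up at low kr
    gain_reg = np.abs(regularized_inverse(b, lam=0.05))        # bounded by regularization
    print(f"order {n}: max raw gain {gain_raw.max():.1f}, max regularized gain {gain_reg.max():.1f}")
```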


[403] 2402.18984

Graph Burning: Bounds and Hardness

The burning number of a graph $G$, denoted by $b(G)$, is the minimum number of steps required to burn all the vertices of a graph where in each step the existing fire spreads to all the adjacent vertices and one additional vertex can be burned as a new fire source. In this paper, we study the burning number problem both from an algorithmic and a structural point of view. The decision problem of computing the burning number of an input graph is known to be NP-Complete for trees with maximum degree at most three and interval graphs. Here, we prove that this problem is NP-Complete even when restricted to connected proper interval graphs and connected cubic graphs. The well-known burning number conjecture asserts that all the vertices of any graph of order $n$ can be burned in $\lceil \sqrt{n}~\rceil$ steps. In line with this conjecture, upper and lower bounds of $b(G)$ are well-studied for various special graph classes. Here, we provide an improved upper bound for the burning number of connected $P_k$-free graphs and show that the bound is tight up to an additive constant $1$. Finally, we study two variants of the problem, namely edge burning (only edges are burned) and total burning (both vertices and edges are burned). In particular, we establish their relationship with the burning number problem and evaluate the complexity of these variants.
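
For intuition about the definition, the brute-force sketch below checks whether a given source sequence burns a graph and searches for the burning number of a tiny graph; it is exponential in the number of steps and is meant only to illustrate the definition, not the paper's hardness results or bounds.

```python
import itertools
import networkx as nx

def burns(G, sources):
    """Check whether placing the i-th fire source at step i (i = 1..b) burns every
    vertex by the end of step b = len(sources): a source placed at step i has
    spread to all vertices within distance b - i by then."""
    b = len(sources)
    dist = dict(nx.all_pairs_shortest_path_length(G))
    return all(any(dist[s].get(v, b + 1) <= b - i
                   for i, s in enumerate(sources, start=1))
               for v in G)

def burning_number(G):
    """Brute-force b(G), feasible only for very small graphs."""
    for b in range(1, G.number_of_nodes() + 1):
        for seq in itertools.permutations(G.nodes, b):
            if burns(G, seq):
                return b
    return G.number_of_nodes()

P = nx.path_graph(9)       # b(P_9) = 3, matching the ceil(sqrt(n)) bound of the conjecture
print(burning_number(P))
```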


[404] 2402.19020

Unsupervised Learning of High-resolution Light Field Imaging via Beam Splitter-based Hybrid Lenses

In this paper, we design a beam splitter-based hybrid light field imaging prototype to record a 4D light field image and a high-resolution 2D image simultaneously, and construct a hybrid light field dataset. The 2D image can be considered as the high-resolution ground truth corresponding to the low-resolution central sub-aperture image of the 4D light field image. Subsequently, we propose an unsupervised learning-based super-resolution framework with the hybrid light field dataset, which adaptively addresses the light field spatial super-resolution problem with a complex degradation model. Specifically, we design two loss functions based on pre-trained models that enable the super-resolution network to learn the detailed features and light field parallax structure with only one ground truth. Extensive experiments demonstrate that our approach performs on par with supervised learning-based state-of-the-art methods. To our knowledge, it is the first end-to-end unsupervised learning-based spatial super-resolution approach in light field imaging research, whose input can be acquired with our beam splitter-based hybrid light field system. The hardware and software together may help promote the application of light field super-resolution to a great extent.


[405] 2402.19043

WDM: 3D Wavelet Diffusion Models for High-Resolution Medical Image Synthesis

Due to the three-dimensional nature of CT- or MR-scans, generative modeling of medical images is a particularly challenging task. Existing approaches mostly apply patch-wise, slice-wise, or cascaded generation techniques to fit the high-dimensional data into the limited GPU memory. However, these approaches may introduce artifacts and potentially restrict the model's applicability for certain downstream tasks. This work presents WDM, a wavelet-based medical image synthesis framework that applies a diffusion model on wavelet decomposed images. The presented approach is a simple yet effective way of scaling diffusion models to high resolutions and can be trained on a single 40 GB GPU. Experimental results on BraTS and LIDC-IDRI unconditional image generation at a resolution of $128 \times 128 \times 128$ show state-of-the-art image fidelity (FID) and sample diversity (MS-SSIM) scores compared to GANs, Diffusion Models, and Latent Diffusion Models. Our proposed method is the only one capable of generating high-quality images at a resolution of $256 \times 256 \times 256$.


[406] 2402.19062

Graph Convolutional Neural Networks for Automated Echocardiography View Recognition: A Holistic Approach

To facilitate diagnosis on cardiac ultrasound (US), clinical practice has established several standard views of the heart, which serve as reference points for diagnostic measurements and define viewports from which images are acquired. Automatic view recognition involves grouping those images into classes of standard views. Although deep learning techniques have been successful in achieving this, they still struggle with fully verifying the suitability of an image for specific measurements due to factors like the correct location, pose, and potential occlusions of cardiac structures. Our approach goes beyond view classification and incorporates a 3D mesh reconstruction of the heart that enables several more downstream tasks, like segmentation and pose estimation. In this work, we explore learning 3D heart meshes via graph convolutions, using similar techniques to learn 3D meshes in natural images, such as human pose estimation. As the availability of fully annotated 3D images is limited, we generate synthetic US images from 3D meshes by training an adversarial denoising diffusion model. Experiments were conducted on synthetic and clinical cases for view recognition and structure detection. The approach yielded good performance on synthetic images and, despite being exclusively trained on synthetic data, it already showed potential when applied to clinical images. With this proof-of-concept, we aim to demonstrate the benefits of graphs to improve cardiac view recognition that can ultimately lead to better efficiency in cardiac diagnosis.


[407] 2402.19084

High multiplicity of positive solutions in a superlinear problem of Moore-Nehari type

In this paper we consider a superlinear one-dimensional elliptic boundary value problem that generalizes the one studied by Moore and Nehari in [43]. Specifically, we deal with piecewise-constant weight functions in front of the nonlinearity with an arbitrary number $\kappa\geq 1$ of vanishing regions. We study, from an analytic and numerical point of view, the number of positive solutions, depending on the value of a parameter $\lambda$ and on $\kappa$. Our main results are twofold. On the one hand, we study analytically the behavior of the solutions, as $\lambda\downarrow-\infty$, in the regions where the weight vanishes. Our result leads us to conjecture the existence of $2^{\kappa+1}-1$ solutions for sufficiently negative $\lambda$. On the other hand, we support such a conjecture with the results of numerical simulations which also shed light on the structure of the global bifurcation diagrams in $\lambda$ and the profiles of positive solutions. Finally, we give additional numerical results suggesting that the same high multiplicity result holds true for a much larger class of weights, also arbitrarily close to situations where there is uniqueness of positive solutions.


[408] 2402.19095

A Protein Structure Prediction Approach Leveraging Transformer and CNN Integration

Proteins are essential for life, and their structure determines their function. The protein secondary structure is formed by the folding of the protein primary structure, and the protein tertiary structure is formed by the bending and folding of the secondary structure. Therefore, the study of protein secondary structure is very helpful to the overall understanding of protein structure. Although the accuracy of protein secondary structure prediction has continuously improved with the development of machine learning and deep learning, progress in the field of protein structure prediction, unfortunately, remains insufficient to meet the large demand for protein information. Therefore, based on the advantages of deep learning-based methods in feature extraction and learning ability, this paper adopts a two-dimensional fusion deep neural network model, DstruCCN, which uses Convolutional Neural Networks (CNNs) and a supervised Transformer protein language model for single-sequence protein structure prediction. The training features of the two are combined to predict the protein Transformer binding site matrix, and then the three-dimensional structure is reconstructed using energy minimization.


[409] 2402.19106

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.


[410] 2402.19109

Confidence and Assurance of Percentiles

The confidence interval of the mean is often reported when quoting statistics. The same rigor is often missing when quoting percentiles and tolerance or percentile intervals. This article derives the expression for confidence in percentiles of a sample population. Confidence intervals of the median are compared to those of the mean for a few sample distributions. The concept of assurance from reliability engineering is then extended to percentiles. The assurance level of sorted samples simply matches the confidence and percentile levels. A numerical method to compute assurance using Brent's optimization method is provided as an open-source Python package.
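
As an illustration of the kind of statement involved, the sketch below uses the standard distribution-free order-statistic argument: for a continuous distribution, the probability that the k-th order statistic of n i.i.d. samples lies at or above the true p-th quantile is a Binomial CDF. This is a textbook result shown here for orientation; it is not claimed to be the article's exact derivation or its Brent-based implementation.

```python
from scipy.stats import binom

def upper_bound_confidence(n, k, p):
    """Confidence that the k-th order statistic (1-indexed) of n i.i.d. samples
    lies at or above the true p-th quantile, for a continuous distribution:
    P(q_p <= X_(k)) = P(Binomial(n, p) <= k - 1)."""
    return binom.cdf(k - 1, n, p)

def smallest_k(n, p, confidence):
    """Smallest order statistic achieving at least the requested confidence."""
    for k in range(1, n + 1):
        if upper_bound_confidence(n, k, p) >= confidence:
            return k
    return None

# Example: with n = 100 samples, which order statistic bounds the 90th percentile
# from above with 95% confidence?
print(smallest_k(100, 0.90, 0.95))
```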


[411] 2402.19111

Deep Network for Image Compressed Sensing Coding Using Local Structural Sampling

Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still face the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally incur a much higher computational complexity. In this paper, we propose a new CNN-based image CS coding framework using local structural sampling (dubbed CSCNet) that includes three functional modules: local structural sampling, measurement coding and Laplacian pyramid reconstruction. In the proposed framework, instead of GRM, a new local structural sampling matrix is first developed, which is able to enhance the correlation between the measurements through a local perceptual sampling strategy. Besides, the designed local structural sampling matrix can be jointly optimized with the other functional modules during the training process. After sampling, the measurements with high correlations are produced, which are then coded into final bitstreams by the third-party image codec. At last, a Laplacian pyramid reconstruction network is proposed to efficiently recover the target image from the measurement domain to the image domain. Extensive experimental results demonstrate that the proposed scheme outperforms the existing state-of-the-art CS coding methods, while maintaining fast computational speed.


[412] 2402.19172

Point Processes and spatial statistics in time-frequency analysis

A finite-energy signal is represented by a square-integrable, complex-valued function $t\mapsto s(t)$ of a real variable $t$, interpreted as time. Similarly, a noisy signal is represented by a random process. Time-frequency analysis, a subfield of signal processing, amounts to describing the temporal evolution of the frequency content of a signal. Loosely speaking, if $s$ is the audio recording of a musical piece, time-frequency analysis somehow consists in writing the musical score of the piece. Mathematically, the operation is performed through a transform $\mathcal{V}$, mapping $s \in L^2(\mathbb{R})$ onto a complex-valued function $\mathcal{V}s \in L^2(\mathbb{R}^2)$ of time $t$ and angular frequency $\omega$. The squared modulus $(t, \omega) \mapsto \vert\mathcal{V}s(t,\omega)\vert^2$ of the time-frequency representation is known as the spectrogram of $s$; in the musical score analogy, a peaked spectrogram at $(t_0,\omega_0)$ corresponds to a musical note at angular frequency $\omega_0$ localized at time $t_0$. More generally, the intuition is that upper level sets of the spectrogram contain relevant information about the original signal. Hence, many signal processing algorithms revolve around identifying maxima of the spectrogram. In contrast, zeros of the spectrogram indicate perfect silence, that is, a time at which a particular frequency is absent. Assimilating $\mathbb{R}^2$ to $\mathbb{C}$ through $z = \omega + \mathrm{i}t$, this chapter focuses on time-frequency transforms $\mathcal{V}$ that map signals to analytic functions. The zeros of the spectrogram of a noisy signal are then the zeros of a random analytic function, hence forming a Point Process in $\mathbb{C}$. This chapter is devoted to the study of these Point Processes, to their links with zeros of Gaussian Analytic Functions, and to designing signal detection and denoising algorithms using spatial statistics.
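
As a small numerical illustration of spectrograms and their zeros, the sketch below computes an STFT spectrogram of a noisy two-tone signal and flags local minima on the time-frequency grid as zero candidates; the signal, window, and grid choices are arbitrary, and the chapter's analytic transforms and point-process machinery are not reproduced here.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import minimum_filter

rng = np.random.default_rng(0)
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)
# Noisy two-tone signal: "notes" at 50 Hz and 200 Hz plus white noise.
x = (np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)
     + 0.5 * rng.standard_normal(t.size))

f, tau, Z = stft(x, fs=fs, nperseg=256)
S = np.abs(Z) ** 2                       # the spectrogram |Vs(t, omega)|^2 on a grid

# Candidate zeros: local minima of the spectrogram over a 3x3 neighbourhood.
is_min = S == minimum_filter(S, size=3)
print("spectrogram shape:", S.shape, "| local minima found:", int(is_min.sum()))
```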


[413] 2402.19212

Deep Reinforcement Learning: A Convex Optimization Approach

In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm, where, for each episode, we use convex optimization to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights calculated at each episode are optimal, with respect to the given sampled states and actions of the current episode. For stable nonlinear systems, we show that the algorithm converges and that the converging parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. In particular, if the regularization parameter is $\rho$ and the time horizon is $T$, then the parameters of the trained neural network converge to $w$, where the distance of $w$ from the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho T^{-1})$. That is, when the number of episodes goes to infinity, there exists a constant $C$ such that \[\|w-w^\star\| \le C\cdot\frac{\rho}{T}.\] Hence, our algorithm converges arbitrarily close to the optimal neural network parameters as the time horizon increases or as the regularization parameter decreases.
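
The following toy sketch conveys only the episodic structure: it substitutes a linear-in-features Q-function so that the per-episode fit is a plain ridge regression (a convex problem), rather than the paper's convex reformulation of a two-layer neural network; the dynamics, feature map, and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, a):
    """Hypothetical feature map for a scalar state and action."""
    return np.array([1.0, s, a, s * a, s**2, a**2])

def run_episode(w, rho, T=200, gamma=0.95):
    """One episode: act greedily over a small action grid, then refit the
    Q-weights by ridge regression (a convex per-episode problem)."""
    actions = np.linspace(-1, 1, 11)
    s = rng.normal()
    Phi, targets = [], []
    for _ in range(T):
        a = actions[np.argmax([features(s, a_) @ w for a_ in actions])]
        s_next = 0.8 * s + a + 0.1 * rng.normal()   # toy stable nonlinear-ish system
        r = -(s**2 + 0.1 * a**2)
        q_next = max(features(s_next, a_) @ w for a_ in actions)
        Phi.append(features(s, a))
        targets.append(r + gamma * q_next)
        s = s_next
    Phi = np.array(Phi)
    y = np.array(targets)
    # Convex per-episode fit: (Phi^T Phi + rho I) w = Phi^T y
    return np.linalg.solve(Phi.T @ Phi + rho * np.eye(Phi.shape[1]), Phi.T @ y)

w = np.zeros(6)
for episode in range(20):
    w = run_episode(w, rho=1e-2)
print(w.round(3))
```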


[414] 2402.19215

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

Super-resolution (SR) is an ill-posed inverse problem, where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strike a balance between fidelity and perceptual quality. Unfortunately, all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts, this is a very challenging problem and a satisfactory solution is yet to be found. This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before, they have not been used in the context of the SR task. More specifically, we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet sub-bands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves better perception-distortion trade-off according to multiple objective measures and visual evaluations.
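
To make the wavelet-domain idea concrete, the sketch below extracts the high-frequency sub-bands of a single-level 2D discrete wavelet transform with PyWavelets and evaluates an L1 fidelity term between them; the wavelet choice and loss form are assumptions for illustration, and the paper's GAN training and discriminator-on-sub-bands setup are not reproduced.

```python
import numpy as np
import pywt

def hf_subbands(img, wavelet="haar"):
    """High-frequency sub-bands (horizontal, vertical, diagonal details)
    of a single-level 2D discrete wavelet transform."""
    _, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    return cH, cV, cD

def wavelet_fidelity_loss(sr, hr, wavelet="haar"):
    """L1 distance between the HF wavelet sub-bands of a super-resolved image
    and its ground truth: a sketch of a wavelet-domain fidelity term."""
    return sum(np.abs(a - b).mean()
               for a, b in zip(hf_subbands(sr, wavelet), hf_subbands(hr, wavelet)))

rng = np.random.default_rng(0)
hr = rng.random((64, 64))                       # stand-in for a ground-truth patch
sr = hr + 0.05 * rng.standard_normal((64, 64))  # stand-in for a super-resolved patch
print(round(wavelet_fidelity_loss(sr, hr), 4))
```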


[415] 2402.19234

Broadcast independence number of oriented circulant graphs

In 2001, D. Erwin \cite{Erw01} introduced in his Ph.D. dissertation the notion of broadcast independence in unoriented graphs. Since then, some results, though not many, have been published on this notion, including work on the broadcast independence number of unoriented circulant graphs \cite{LBS23}. In this paper, we focus on the same parameter, but for the class of oriented circulant graphs. An independent broadcast on an oriented graph $\overrightarrow{G}$ is a function $f: V\longrightarrow \{0,\ldots,\diam(\overrightarrow{G})\}$ such that $(i)$ $f(v)\leq e(v)$ for every vertex $v\in V(\overrightarrow{G})$, where $\diam(\overrightarrow{G})$ denotes the diameter of $\overrightarrow{G}$ and $e(v)$ the eccentricity of vertex $v$, and $(ii)$ $d_{\overrightarrow{G}}(u,v) > f(u)$ for all distinct vertices $u$, $v$ with $f(u)$, $f(v)>0$, where $d_{\overrightarrow{G}}(u,v)$ denotes the length of a shortest oriented path from $u$ to $v$. The broadcast independence number $\beta_b(\overrightarrow{G})$ of $\overrightarrow{G}$ is then the maximum value of $\sum_{v \in V} f(v)$, taken over all independent broadcasts on $\overrightarrow{G}$. The goal of this paper is to study the properties of independent broadcasts of oriented circulant graphs $\overrightarrow{C}(n;1,a)$, for any integers $n$ and $a$ with $n>|a|\geq 1$ and $a \notin \{1,n-1\}$. Then, we give some bounds and some exact values for the number $\beta_b(\overrightarrow{C}(n;1,a))$.
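
For readers who want to experiment with the definitions, the sketch below builds the oriented circulant digraph $\overrightarrow{C}(n;1,a)$ and verifies conditions $(i)$ and $(ii)$ for a hypothetical broadcast; it only checks feasibility and reports the cost of one candidate, it does not compute $\beta_b$ or any of the paper's bounds.

```python
import networkx as nx

def oriented_circulant(n, a):
    """Oriented circulant digraph C(n; 1, a): arcs i -> i+1 and i -> i+a (mod n)."""
    G = nx.DiGraph()
    G.add_nodes_from(range(n))
    for i in range(n):
        G.add_edge(i, (i + 1) % n)
        G.add_edge(i, (i + a) % n)
    return G

def is_independent_broadcast(G, f):
    """Check (i) f(v) <= e(v) for every vertex, and (ii) d(u, v) > f(u) for all
    distinct broadcast vertices u, v with f(u), f(v) > 0."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    ecc = {v: max(dist[v].values()) for v in G}
    if any(f.get(v, 0) > ecc[v] for v in G):
        return False
    pos = [v for v in G if f.get(v, 0) > 0]
    return all(dist[u][v] > f[u] for u in pos for v in pos if u != v)

G = oriented_circulant(9, 4)
f = {0: 1, 3: 1, 6: 1}      # a hypothetical broadcast; its cost is sum(f.values()) = 3
print(is_independent_broadcast(G, f))
```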


[416] 2402.19266

Cauchy-completions and the rule of unique choice in relational doctrines

Lawvere generalised the notion of complete metric space to the field of enriched categories: an enriched category is said to be Cauchy-complete if every left adjoint bimodule into it is represented by an enriched functor. Looking at this definition from a logical standpoint, regarding bimodules as an abstraction of relations and functors as an abstraction of functions, Cauchy-completeness resembles a formulation of the rule of unique choice. In this paper, we make this analogy precise, using the language of relational doctrines, a categorical tool that provides a functorial description of the calculus of relations, in the same way Lawvere's hyperdoctrines give a functorial description of predicate logic. Given a relational doctrine, we define Cauchy-complete objects as those objects of the domain category satisfying the rule of unique choice. Then, we present a universal construction that completes a relational doctrine with the rule of unique choice, that is, producing a new relational doctrine where all objects are Cauchy-complete. We also introduce relational doctrines with singleton objects and show that these have the minimal structure needed to build the reflector of the full subcategory of its domain on Cauchy-complete objects. The main result is that this reflector exists if and only if the relational doctrine has singleton objects and this happens if and only if its restriction to Cauchy-complete objects is equivalent to its completion with the rule of unique choice. We support our results with many examples, also falling outside the scope of standard doctrines, such as complete metric spaces, Banach spaces and compact Hausdorff spaces in the general context of monoidal topology, which are all shown to be Cauchy-complete objects for appropriate relational doctrines.


[417] 2402.19276

Modular Blind Video Quality Assessment

Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services. Contemporary deep learning-based models primarily analyze the video content in its aggressively downsampled format, while being blind to the impact of actual spatial resolution and frame rate on video quality. In this paper, we propose a modular BVQA model, and a method of training it to improve its modularity. Specifically, our model comprises a base quality predictor, a spatial rectifier, and a temporal rectifier, responding to the visual content and distortion, spatial resolution, and frame rate changes on video quality, respectively. During training, the spatial and temporal rectifiers are dropped out with some probability so as to make the base quality predictor a standalone BVQA model, which should then work even better when the rectifiers are attached. Extensive experiments on both professionally generated content and user-generated content video databases show that our quality model achieves superior or comparable performance to current methods. Furthermore, the modularity of our model offers a great opportunity to analyze existing video quality databases in terms of their spatial and temporal complexities. Last, our BVQA model makes it cost-effective to add other quality-relevant video attributes, such as dynamic range and color gamut, as additional rectifiers.


[418] 2402.19286

PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation

Understanding the anatomy of renal pathology is crucial for advancing disease diagnostics, treatment evaluation, and clinical research. The complex kidney system comprises various components across multiple levels, including regions (cortex, medulla), functional units (glomeruli, tubules), and cells (podocytes, mesangial cells in the glomerulus). Prior studies have predominantly overlooked the intricate spatial interrelations among these objects that are known from clinical knowledge. In this research, we introduce a novel universal proposition learning approach, called panoramic renal pathology segmentation (PrPSeg), designed to comprehensively segment panoramic structures within the kidney by integrating extensive knowledge of kidney anatomy. In this paper, we propose (1) the design of a comprehensive universal proposition matrix for renal pathology, facilitating the incorporation of classification and spatial relationships into the segmentation process; (2) a token-based dynamic-head single-network architecture, which improves partial-label image segmentation and allows for future data enlargement; and (3) an anatomy loss function, quantifying the inter-object relationships across the kidney.


[419] 2402.19289

CAMixerSR: Only Details Need More "Attention"

To satisfy the rapidly increasing demands for large image (2K-8K) super-resolution (SR), prevailing methods follow two independent tracks: 1) accelerate existing networks by content-aware routing, and 2) design better super-resolution networks via token mixer refining. Despite their directness, they encounter unavoidable defects (e.g., inflexible routing or non-discriminative processing) limiting further improvements of the quality-complexity trade-off. To erase the drawbacks, we integrate these schemes by proposing a content-aware mixer (CAMixer), which assigns convolution for simple contexts and additional deformable window-attention for sparse textures. Specifically, the CAMixer uses a learnable predictor to generate multiple bootstraps, including offsets for window warping, a mask for classifying windows, and convolutional attentions for endowing convolution with the dynamic property, which modulates attention to include more useful textures self-adaptively and improves the representation capability of convolution. We further introduce a global classification loss to improve the accuracy of predictors. By simply stacking CAMixers, we obtain CAMixerSR which achieves superior performance on large-image SR, lightweight SR, and omnidirectional-image SR.


[420] 2402.19329

Social Links vs. Language Barriers: Decoding the Global Spread of Streaming Content

The development of the internet has allowed for the global distribution of content, redefining media communication and property structures through various streaming platforms. Previous studies successfully clarified the factors contributing to trends in each streaming service, yet the similarities and differences between platforms are commonly unexplored; moreover, the influence of social connections and cultural similarity is usually overlooked. We hereby examine the social aspects of three significant streaming services--Netflix, Spotify, and YouTube--with an emphasis on the dissemination of content across countries. Using two-year-long trending chart datasets, we find that streaming content can be divided into two types: video-oriented (Netflix) and audio-oriented (Spotify). This distinction emerges when accounting for the significance of social connectedness and linguistic similarity: audio-oriented content travels via social links, whereas video-oriented content tends to spread among linguistically similar countries. Interestingly, user-generated content on YouTube exhibits a dual character by integrating both visual and auditory characteristics, indicating that the platform is evolving into a unique medium rather than simply occupying a midpoint between video and audio media.


[421] 2402.19351

Oriented trees in $O(k \sqrt{k})$-chromatic digraphs, a subquadratic bound for Burr's conjecture

In 1980, Burr conjectured that every directed graph with chromatic number $2k-2$ contains any oriented tree of order $k$ as a subdigraph. Burr showed that chromatic number $(k-1)^2$ suffices, which was improved in 2013 to $\frac{k^2}{2} - \frac{k}{2} + 1$ by Addario-Berry et al. We give the first subquadratic bound for Burr's conjecture, by showing that every directed graph with chromatic number $8\sqrt{\frac{2}{15}} k \sqrt{k} + O(k)$ contains any oriented tree of order $k$. Moreover, we provide improved bounds of $\sqrt{\frac{4}{3}} k \sqrt{k}+O(k)$ for arborescences, and $(b-1)(k-3)+3$ for paths on $b$ blocks, with $b\ge 2$.


[422] 2402.19360

Joint Chance Constrained Optimal Control via Linear Programming

We establish a linear programming formulation for the solution of joint chance constrained optimal control problems over finite time horizons. The joint chance constraint may represent an invariance, reachability or reach-avoid specification that the trajectory must satisfy with a predefined probability. Compared to the existing literature, the formulation is computationally tractable and the solution exact.


[423] 2402.19387

SeD: Semantic-Aware Discriminator for Image Super-Resolution

Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish the real-fake images individually and adaptively, which guides the SR network to learn the more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) with extensive datasets, and then incorporate their semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowers the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR, have demonstrated the effectiveness of our proposed methods.


[424] 2402.19455

Listening to the Noise: Blind Denoising with Gibbs Diffusion

In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincides with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.
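
To convey the Gibbs alternation in the simplest possible setting, the sketch below solves a toy blind-denoising problem with a Gaussian signal prior and an unknown Gaussian noise variance, alternating a signal-sampling step (played by a conditional diffusion model in the paper) and a noise-parameter sampling step; the priors, hyperparameters, and noise model are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy analogue: signal prior N(0, s2_x), unknown noise variance s2_n with an
# inverse-gamma prior, and observation y = x + n.
s2_x = 1.0
n_obs = 500
x_true = rng.normal(0.0, np.sqrt(s2_x), n_obs)
y = x_true + rng.normal(0.0, 0.5, n_obs)         # true noise std is 0.5

a0, b0 = 2.0, 0.5                                 # inverse-gamma hyperparameters
s2_n = 1.0                                        # initial noise-variance guess
for it in range(200):
    # Step 1: sample the signal given the data and the current noise parameters
    # (this role is played by a conditional diffusion model in the paper).
    post_var = 1.0 / (1.0 / s2_x + 1.0 / s2_n)
    post_mean = post_var * y / s2_n
    x = post_mean + np.sqrt(post_var) * rng.standard_normal(n_obs)
    # Step 2: sample the noise parameters given the residual (Monte Carlo step).
    resid = y - x
    a = a0 + n_obs / 2
    b = b0 + 0.5 * np.sum(resid ** 2)
    s2_n = 1.0 / rng.gamma(a, 1.0 / b)            # inverse-gamma draw

print("posterior sample of the noise std:", round(float(np.sqrt(s2_n)), 3))
```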


[425] 2402.19456

Statistical Estimation in the Spiked Tensor Model via the Quantum Approximate Optimization Algorithm

The quantum approximate optimization algorithm (QAOA) is a general-purpose algorithm for combinatorial optimization. In this paper, we analyze the performance of the QAOA on a statistical estimation problem, namely, the spiked tensor model, which exhibits a statistical-computational gap classically. We prove that the weak recovery threshold of $1$-step QAOA matches that of $1$-step tensor power iteration. Additional heuristic calculations suggest that the weak recovery threshold of $p$-step QAOA matches that of $p$-step tensor power iteration when $p$ is a fixed constant. This further implies that multi-step QAOA with tensor unfolding could achieve, but not surpass, the classical computation threshold $\Theta(n^{(q-2)/4})$ for spiked $q$-tensors. Meanwhile, we characterize the asymptotic overlap distribution for $p$-step QAOA, finding an intriguing sine-Gaussian law verified through simulations. For some $p$ and $q$, the QAOA attains an overlap that is larger by a constant factor than the tensor power iteration overlap. Of independent interest, our proof techniques employ the Fourier transform to handle difficult combinatorial sums, a novel approach differing from prior QAOA analyses on spin-glass models without planted structure.


[426] 2402.19462

Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing

We present a natural language processing pipeline that was used to extract polymer solar cell property data from the literature and simulate various active learning strategies. While data-driven methods have been well established to discover novel materials faster than Edisonian trial-and-error approaches, their benefits have not been quantified. Our approach demonstrates a potential reduction in discovery time by approximately 75%, equivalent to a 15-year acceleration in material innovation. Our pipeline enables us to extract data from more than 3300 papers, which is ~5 times larger than similar data sets reported by others. We also trained machine learning models to predict the power conversion efficiency and used our model to identify promising donor-acceptor combinations that are as yet unreported. We thus demonstrate a workflow that goes from published literature to extracted material property data, which in turn is used to obtain data-driven insights. Our insights include active learning strategies that can simultaneously optimize the material system and train strong predictive models of material properties. This work provides a valuable framework for research in material science.


[427] 2402.19470

Towards Generalizable Tumor Synthesis

Tumor synthesis enables the creation of artificial tumors in medical images, facilitating the training of AI models for tumor detection and segmentation. However, success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and, furthermore, on the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g., hospitals). This paper makes a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2cm) tend to have similar imaging characteristics in computed tomography (CT), whether they originate in the liver, pancreas, or kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, can create realistic tumors generalized to a range of organs even when trained on a limited number of tumor examples from only one organ. Moreover, we have shown that AI models trained on these synthetic tumors can be generalized to detect and segment real tumors from CT volumes, encompassing a broad spectrum of patient demographics, imaging protocols, and healthcare facilities.