New articles on Computer Science

[1] 2407.14514

Accelerate Intermittent Deep Inference

Emerging research in edge devices and micro-controller units (MCU) enables on-device computation of Deep Learning Training and Inferencing tasks. More recently, contemporary trends focus on making the Deep Neural Net (DNN) Models runnable on battery-less intermittent devices. One of the approaches is to shrink the DNN models by enabling weight sharing, pruning, and conducted Neural Architecture Search (NAS) with optimized search space to target specific edge devices \cite{Cai2019OnceFA} \cite{Lin2020MCUNetTD} \cite{Lin2021MCUNetV2MP} \cite{Lin2022OnDeviceTU}. Another approach analyzes the intermittent execution and designs the corresponding system by performing NAS that is aware of intermittent execution cycles and resource constraints \cite{iNAS} \cite{HW-NAS} \cite{iLearn}. However, the optimized NAS was only considering consecutive execution with no power loss, and intermittent execution designs only focused on balancing data reuse and costs related to intermittent inference and often with low accuracy. We proposed Accelerated Intermittent Deep Inference to harness the power of optimized inferencing DNN models specifically targeting SRAM under 256KB and make it schedulable and runnable within intermittent power. Our main contribution is: (1) Schedule tasks performed by on-device inferencing into intermittent execution cycles and optimize for latency; (2) Develop a system that can satisfy the end-to-end latency while achieving a much higher accuracy compared to baseline \cite{iNAS} \cite{HW-NAS}

[2] 2407.14516

RobocupGym: A challenging continuous control benchmark in Robocup

Reinforcement learning (RL) has progressed substantially over the past decade, with much of this progress being driven by benchmarks. Many benchmarks are focused on video or board games, and a large number of robotics benchmarks lack diversity and real-world applicability. In this paper, we aim to simplify the process of applying reinforcement learning in the 3D simulation league of Robocup, a robotic football competition. To this end, we introduce a Robocup-based RL environment based on the open source rcssserver3d soccer server, simple pre-defined tasks, and integration with a popular RL library, Stable Baselines 3. Our environment enables the creation of high-dimensional continuous control tasks within a robotics football simulation. In each task, an RL agent controls a simulated Nao robot, and can interact with the ball or other agents. We open-source our environment and training code at

[3] 2407.14518

Accurate Analysis of Sparse Random Projections

There has been recently a lot of research on sparse variants of random projections, faster adaptations of the state-of-the-art dimensionality reduction technique originally due to Johsnon and Lindenstrauss. Although the construction is very simple, its analyses are notoriously complicated. Meeting the demand for both simplicity and accuracy, this work establishes sharp sub-poissonian tail bounds for the distribution of sparse random projections. Compared to other works, this analysis provide superior numerical guarantees (exactly matching impossibility results) while being arguably less complicated (the technique resembles Bennet's Inequality and is of independent interest).

[4] 2407.14519

Electricity Consumption of Ethereum and Filecoin: Advances in Models and Estimates

The high electricity consumption of cryptocurrencies that rely on proof-of-work (PoW) consensus algorithms has raised serious environmental concerns due to its association with carbon emissions and strain on energy grids. There has been significant research into estimating the electricity consumption of PoW-based cryptocurrencies and developing alternatives to PoW. In this article, we introduce refined models to estimate the electricity consumption of two prominent alternatives: Ethereum, now utilizing proof-of-stake (PoS), and Filecoin, which employs proof-of-spacetime (PoSt). Ethereum stands as a leading blockchain platform for crafting decentralized applications, whereas Filecoin is recognized as the world's foremost decentralized data storage network. Prior studies for modeling electricity consumption have been criticized for methodological flaws and shortcomings, low-quality data, and unvalidated assumptions. We improve on this in several ways: we obtain more novel, validated data from the systems in question, extract information from existing data and research, and we improve transparency and reproducibility by clearly explaining and documenting the used methodology and explicitly stating unavoidable limitations and assumptions made. When comparing the current, most prominent models for Ethereum and Filecoin to our refined models, we find that given the wide error margins of both the refined models and the ones introduced in prior literature, the resulting average estimates are to a large extent in line with each other.

[5] 2407.14521

Towards Automated Functional Equation Proving: A Benchmark Dataset and A Domain-Specific In-Context Agent

Automated Theorem Proving (ATP) faces challenges due to its complexity and computational demands. Recent work has explored using Large Language Models (LLMs) for ATP action selection, but these methods can be resource-intensive. This study introduces FEAS, an agent that enhances the COPRA in-context learning framework within Lean. FEAS refines prompt generation, response parsing, and incorporates domain-specific heuristics for functional equations. It introduces FunEq, a curated dataset of functional equation problems with varying difficulty. FEAS outperforms baselines on FunEq, particularly with the integration of domain-specific heuristics. The results demonstrate FEAS's effectiveness in generating and formalizing high-level proof strategies into Lean proofs, showcasing the potential of tailored approaches for specific ATP challenges.

[6] 2407.14525

Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments

The proposed model aims to develop a speech recognition technology for hearing, speech, or cognitively disabled people. All the available technology in the field of speech recognition doesn't come with an interface for communication for people with hearing, speech, or cognitive disabilities. The proposed model proposes the speech from the user, is transmitted to the speech recognition layer where it is converted into text and then that text is then transmitted to the morse code conversion layer where the morse code of the corresponding speech is given as the output. The accuracy of the model is completely dependent on speech recognition, as the morse code conversion is a process. The model is tested with recorded audio files with different parameters. The proposed model's WER and accuracy are both determined to be 10.18% and 89.82%, respectively.

[7] 2407.14527

Building Call Graph of WebAssembly Programs via Abstract Semantics

WebAssembly is a binary format for code that is gaining popularity thanks to its focus on portability and performance. Currently, the most common use case for WebAssembly is execution in a browser. It is also being increasingly adopted as a stand-alone application due to its portability. The binary format of WebAssembly, however, makes it prone to being used as a vehicle for malicious software. For instance, one could embed a cryptocurrency miner in code executed by a browser. As a result, there is substantial interest in developing tools for WebAssembly security verification, information flow control, and, more generally, for verifying behavioral properties such as correct API usage. In this document, we address the issue of building call graphs for WebAssembly code. This is important because having or computing a call graph is a prerequisite for most inter-procedural verification tasks. In this paper, we propose a formal solution based on the theory of Abstract Interpretation. We compare our approach to the state-of-the-art by predicting how it would perform against a set of specifically crafted benchmark programs.

[8] 2407.14529

A Scalable Clustered Architecture for Cyber-Physical Systems

Cyber-Physical Systems (CPS) play a vital role in the operation of intelligent interconnected systems. CPS integrates physical and software components capable of sensing, monitoring, and controlling physical assets and processes. However, developing distributed and scalable CPSs that efficiently handle large volumes of data while ensuring high performance and reliability remains a challenging task. Moreover, existing commercial solutions are often costly and not suitable for certain applications, limiting developers and researchers in experimenting and deploying CPSs on a larger scale. The development of this project aims to contribute to the design and implementation of a solution to the CPS challenges. To achieve this goal, the Edge4CPS system was developed. Edge4CPS system is an open source, distributed, multi-architecture solution that leverages Kubernetes for managing distributed edge computing clusters. It facilitates the deployment of applications across multiple computing nodes. It also offers services such as data pipeline, which includes data processing, classification, and visualization, as well as a middleware for messaging protocol translation.

[9] 2407.14530

FuncEvalGMN: Evaluating Functional Correctness of SQL via Graph Matching Network

In this paper, we propose a novel graph-based methodology to evaluate the functional correctness of SQL generation. Conventional metrics for assessing SQL code generation, such as matching-based and execution-based methods (e.g., exact set match and execution accuracy), are subject to two primary limitations. Firstly, the former fails to effectively assess functional correctness, as different SQL queries may possess identical functionalities. Secondly, the latter is susceptible to producing false positive samples in evaluations. Our proposed evaluation method, \texttt{FuncEvalGMN}, does not depend on the sufficient preparation of the test data, and it enables precise testing of the functional correctness of the code. Firstly, we parse SQL using a relational operator tree (ROT) called \textit{Relnode}, which contains rich semantic information from the perspective of logical execution.Then, we introduce a GNN-based approach for predicting the functional correctness of generated SQL. This approach incorporates global positional embeddings to address the limitations with the loss of topological information in conventional graph matching frameworks. As an auxiliary contribution, we propose a rule-based matching algorithm, Relnode Partial Matching (\texttt{RelPM}) as a baseline. Finally, we contribute a dataset, \texttt{Pair-Aug-Spider} with a training set and two testing sets, each comprising pairs of SQL codes to simulate various SQL code evaluation scenarios. The training set and one testing dataset focus on code generation using large language models (LLMs), while the other emphasizes SQL equivalence rewriting.

[10] 2407.14532

A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management

AIOps algorithms play a crucial role in the maintenance of microservice systems. Many previous benchmarks' performance leaderboard provides valuable guidance for selecting appropriate algorithms. However, existing AIOps benchmarks mainly utilize offline datasets to evaluate algorithms. They cannot consistently evaluate the performance of algorithms using real-time datasets, and the operation scenarios for evaluation are static, which is insufficient for effective algorithm selection. To address these issues, we propose an evaluation-consistent and scenario-oriented evaluation framework named MicroServo. The core idea is to build a live microservice benchmark to generate real-time datasets and consistently simulate the specific operation scenarios on it. MicroServo supports different leaderboards by selecting specific algorithms and datasets according to the operation scenarios. It also supports the deployment of various types of algorithms, enabling algorithms hot-plugging. At last, we test MicroServo with three typical microservice operation scenarios to demonstrate its efficiency and usability.

[11] 2407.14535

Ktirio Urban Building: A Computational Framework for City Energy Simulations Enhanced by CI/CD Innovations on EuroHPC Systems

The building sector in the European Union significantly impacts energy consumption and greenhouse gas emissions. The EU's Horizon 2050 initiative sets ambitious goals to reduce these impacts through enhanced building renovation rates. The CoE HiDALGO2 supports this initiative by developing high-performance computing solutions, specifically through the Urban Building pilot application, which utilizes advanced CI/CD methodologies to streamline simulation and deployment across various computational platforms, such as the EuroHPC JU supercomputers. The present work provides an overview of the Ktirio Urban Building framework (KUB), starting with an overview of the workflow and a description of some of the main ingredients of the software stack and discusses some current results performed on EuroHPC JU supercomputers using an innovative CI/CD pipeline.

[12] 2407.14536

Model-Agnostic Approximation of Constrained Forest Problems

Constrained Forest Problems (CFPs) as introduced by Goemans and Williamson in 1995 capture a wide range of network design problems with edge subsets as solutions, such as Minimum Spanning Tree, Steiner Forest, and Point-to-Point Connection. While individual CFPs have been studied extensively in individual computational models, a unified approach to solving general CFPs in multiple computational models has been lacking. Against this background, we present the shell-decomposition algorithm, a model-agnostic meta-algorithm that efficiently computes a $(2+\epsilon)$-approximation to CFPs for a broad class of forest functions. To demonstrate the power and flexibility of this result, we instantiate our algorithm for 3 fundamental, NP-hard CFPs in 3 different computational models. For example, for constant $\epsilon$, we obtain the following $(2+\epsilon)$-approximations in the Congest model: 1. For Steiner Forest specified via input components, where each node knows the identifier of one of $k$ disjoint subsets of $V$, we achieve a deterministic $(2+\epsilon)$-approximation in $O(\sqrt{n}+D+k)$ rounds, where $D$ is the hop diameter of the graph. 2. For Steiner Forest specified via symmetric connection requests, where connection requests are issued to pairs of nodes, we leverage randomized equality testing to reduce the running time to $O(\sqrt{n}+D)$, succeeding with high probability. 3. For Point-to-Point Connection, we provide a $(2+\epsilon)$-approximation in $O(\sqrt{n}+D)$ rounds. 4. For Facility Placement and Connection, a relative of non-metric Facility Location, we obtain a $(2+\epsilon)$-approximation in $O(\sqrt{n}+D)$ rounds. We further show how to replace the $\sqrt{n}+D$ term by the complexity of solving Partwise Aggregation, achieving (near-)universal optimality in any setting in which a solution to Partwise Aggregation in near-shortcut-quality time is known.

[13] 2407.14538

Alea-BFT: Practical Asynchronous Byzantine Fault Tolerance

Traditional Byzantine Fault Tolerance (BFT) state machine replication protocols assume a partial synchrony model, leading to a design where a leader replica drives the protocol and is replaced after a timeout. Recently, we witnessed a surge of asynchronous BFT protocols, which use randomization to remove the need for bounds on message delivery times, making them more resilient to adverse network conditions. However, existing research proposals still fall short of gaining practical adoption, plausibly because they are not able to combine good performance with a simple design that can be readily understood and adopted. In this paper, we present Alea-BFT, a simple and highly efficient asynchronous BFT protocol, which is gaining practical adoption, namely in Ethereum distributed validators. Alea-BFT brings the key design insight from classical protocols of concentrating part of the work on a single designated replica and incorporates this principle in a simple two-stage pipelined design, with an efficient broadcast led by the designated replica, followed by an inexpensive binary agreement. The evaluation of our research prototype implementation and two real-world integrations in cryptocurrency ecosystems shows excellent performance, improving on the fastest protocol (Dumbo-NG) in terms of latency and displaying good performance under faults.

[14] 2407.14540

Risks of uncertainty propagation in Al-augmented security pipelines

The use of AI technologies is percolating into the secure development of software-based systems, with an increasing trend of composing AI-based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and poses a serious threat to safety-critical domains (e.g., aviation). Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI-augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with two case studies. We discuss the generalizability of our approach and present policy implications and recommendations for aviation. Future work includes extending the approach and investigating the required metrics for validation in the aviation domain.

[15] 2407.14543

Towards consistency of rule-based explainer and black box model -- fusion of rule induction and XAI-based feature importance

Rule-based models offer a human-understandable representation, i.e. they are interpretable. For this reason, they are used to explain the decisions of non-interpretable complex models, referred to as black box models. The generation of such explanations involves the approximation of a black box model by a rule-based model. To date, however, it has not been investigated whether the rule-based model makes decisions in the same way as the black box model it approximates. Decision making in the same way is understood in this work as the consistency of decisions and the consistency of the most important attributes used for decision making. This study proposes a novel approach ensuring that the rule-based surrogate model mimics the performance of the black box model. The proposed solution performs an explanation fusion involving rule generation and taking into account the feature importance determined by the selected XAI methods for the black box model being explained. The result of the method can be both global and local rule-based explanations. The quality of the proposed solution was verified by extensive analysis on 30 tabular benchmark datasets representing classification problems. Evaluation included comparison with the reference method and an illustrative case study. In addition, the paper discusses the possible pathways for the application of the rule-based approach in XAI and how rule-based explanations, including the proposed method, meet the user perspective and requirements for both content and presentation. The software created and a detailed report containing the full experimental results are available on the GitHub repository ( ).

[16] 2407.14544

Fast Iterative Graph Computing with Updated Neighbor States

Enhancing the efficiency of iterative computation on graphs has garnered considerable attention in both industry and academia. Nonetheless, the majority of efforts focus on expediting iterative computation by minimizing the running time per iteration step, ignoring the optimization of the number of iteration rounds, which is a crucial aspect of iterative computation. We experimentally verified the correlation between the vertex processing order and the number of iterative rounds, thus making it possible to reduce the number of execution rounds for iterative computation. In this paper, we propose a graph reordering method, GoGraph, which can construct a well-formed vertex processing order effectively reducing the number of iteration rounds and, consequently, accelerating iterative computation. Before delving into GoGraph, a metric function is introduced to quantify the efficiency of vertex processing order in accelerating iterative computation. This metric reflects the quality of the processing order by counting the number of edges whose source precedes the destination. GoGraph employs a divide-and-conquer mindset to establish the vertex processing order by maximizing the value of the metric function. Our experimental results show that GoGraph outperforms current state-of-the-art reordering algorithms by 1.83x on average (up to 3.34x) in runtime.

[17] 2407.14557

A Novel Skiagraphic Method of Casting Shade of a Torus

This paper introduces a novel skiagraphic method for shading toroidal forms in architectural illustrations, addressing the challenges of traditional techniques. Skiagraphy projects 3D objects onto 2D surfaces to display geometric properties. Traditional shading of tori involves extensive manual calculations and multiple projections, leading to high complexity and inaccuracies. The proposed method simplifies this by focusing on the elevation view, eliminating the need for multiple projections and complex math. Utilizing descriptive geometry, it reduces labor and complexity. Accuracy was validated through comparisons with SketchUp-generated shading and various torus configurations. This technique streamlines shading toroidal shapes while maintaining the artistic value of traditional illustration. Additionally, it has potential applications in 3D model generation from architectural shade casts, contributing to the evolving field of architectural visualization and representation.

[18] 2407.14558

A Foundation Model for Soccer

We propose a foundation model for soccer, which is able to predict subsequent actions in a soccer match from a given input sequence of actions. As a proof of concept, we train a transformer architecture on three seasons of data from a professional soccer league. We quantitatively and qualitatively compare the performance of this transformer architecture to two baseline models: a Markov model and a multi-layer perceptron. Additionally, we discuss potential applications of our model. We provide an open-source implementation of our methods at

[19] 2407.14559

Predicting Star Scientists in the Field of Artificial Intelligence: A Machine Learning Approach

Star scientists are highly influential researchers who have made significant contributions to their field, gained widespread recognition, and often attracted substantial research funding. They are critical for the advancement of science and innovation, and they have a significant influence on the transfer of knowledge and technology to industry. Identifying potential star scientists before their performance becomes outstanding is important for recruitment, collaboration, networking, or research funding decisions. Using machine learning techniques, this study proposes a model to predict star scientists in the field of artificial intelligence while highlighting features related to their success. Our results confirm that rising stars follow different patterns compared to their non-rising stars counterparts in almost all the early-career features. We also found that certain features such as gender and ethnic diversity play important roles in scientific collaboration and that they can significantly impact an author's career development and success. The most important features in predicting star scientists in the field of artificial intelligence were the number of articles, group discipline diversity, and weighted degree centrality. The proposed approach offers valuable insights for researchers, practitioners, and funding agencies interested in identifying and supporting talented researchers.

[20] 2407.14560

Automated and Holistic Co-design of Neural Networks and ASICs for Enabling In-Pixel Intelligence

Extreme edge-AI systems, such as those in readout ASICs for radiation detection, must operate under stringent hardware constraints such as micron-level dimensions, sub-milliwatt power, and nanosecond-scale speed while providing clear accuracy advantages over traditional architectures. Finding ideal solutions means identifying optimal AI and ASIC design choices from a design space that has explosively expanded during the merger of these domains, creating non-trivial couplings which together act upon a small set of solutions as constraints tighten. It is impractical, if not impossible, to manually determine ideal choices among possibilities that easily exceed billions even in small-size problems. Existing methods to bridge this gap have leveraged theoretical understanding of hardware to f architecture search. However, the assumptions made in computing such theoretical metrics are too idealized to provide sufficient guidance during the difficult search for a practical implementation. Meanwhile, theoretical estimates for many other crucial metrics (like delay) do not even exist and are similarly variable, dependent on parameters of the process design kit (PDK). To address these challenges, we present a study that employs intelligent search using multi-objective Bayesian optimization, integrating both neural network search and ASIC synthesis in the loop. This approach provides reliable feedback on the collective impact of all cross-domain design choices. We showcase the effectiveness of our approach by finding several Pareto-optimal design choices for effective and efficient neural networks that perform real-time feature extraction from input pulses within the individual pixels of a readout ASIC.

[21] 2407.14561

NNsight and NDIF: Democratizing Access to Foundation Model Internals

The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering that is impractical for most researchers. To alleviate these problems, we introduce NNsight, an open-source Python package with a simple, flexible API that can express interventions on any PyTorch model by building computation graphs. We also introduce NDIF, a collaborative research platform providing researchers access to foundation-scale LLMs via the NNsight API. Code, documentation, and tutorials are available at

[22] 2407.14562

Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Though

Large language models (LLMs) have shown exceptional performance as general-purpose assistants, excelling across a variety of reasoning tasks. This achievement represents a significant step toward achieving artificial general intelligence (AGI). Despite these advancements, the effectiveness of LLMs often hinges on the specific prompting strategies employed, and there remains a lack of a robust framework to facilitate learning and generalization across diverse reasoning tasks. To address these challenges, we introduce a novel learning framework, THOUGHT-LIKE-PRO In this framework, we utilize imitation learning to imitate the Chain-of-Thought (CoT) process which is verified and translated from reasoning trajectories generated by a symbolic Prolog logic engine. This framework proceeds in a self-driven manner, that enables LLMs to formulate rules and statements from given instructions and leverage the symbolic Prolog engine to derive results. Subsequently, LLMs convert Prolog-derived successive reasoning trajectories into natural language CoT for imitation learning. Our empirical findings indicate that our proposed approach substantially enhances the reasoning abilities of LLMs and demonstrates robust generalization across out-of-distribution reasoning tasks.

[23] 2407.14563

Learning Visual Grounding from Generative Vision and Language Model

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data. We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting. We thus prompt a VLM to generate object-level descriptions by feeding it object regions from existing object detection datasets. We further propose attribute modeling to explicitly capture the important object attributes, and spatial relation modeling to capture inter-object relationship, both of which are common linguistic pattern in referring expression. Our constructed dataset (500K images, 1M objects, 16M referring expressions) is one of the largest grounding datasets to date, and the first grounding dataset with purely model-generated queries and human-annotated objects. To verify the quality of this data, we conduct zero-shot transfer experiments to the popular RefCOCO benchmarks for both referring expression comprehension (REC) and segmentation (RES) tasks. On both tasks, our model significantly outperform the state-of-the-art approaches without using human annotated visual grounding data. Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world. Code and models will be released.

[24] 2407.14565

Detecting and Characterising Mobile App Metamorphosis in Google Play Store

App markets have evolved into highly competitive and dynamic environments for developers. While the traditional app life cycle involves incremental updates for feature enhancements and issue resolution, some apps deviate from this norm by undergoing significant transformations in their use cases or market positioning. We define this previously unstudied phenomenon as 'app metamorphosis'. In this paper, we propose a novel and efficient multi-modal search methodology to identify apps undergoing metamorphosis and apply it to analyse two snapshots of the Google Play Store taken five years apart. Our methodology uncovers various metamorphosis scenarios, including re-births, re-branding, re-purposing, and others, enabling comprehensive characterisation. Although these transformations may register as successful for app developers based on our defined success score metric (e.g., re-branded apps performing approximately 11.3% better than an average top app), we shed light on the concealed security and privacy risks that lurk within, potentially impacting even tech-savvy end-users.

[25] 2407.14566

Generalization Error Analysis of Deep Backward Dynamic Programming for Solving Nonlinear PDEs

We explore the application of the quasi-Monte Carlo (QMC) method in deep backward dynamic programming (DBDP) (Hure et al. 2020) for numerically solving high-dimensional nonlinear partial differential equations (PDEs). Our study focuses on examining the generalization error as a component of the total error in the DBDP framework, discovering that the rate of convergence for the generalization error is influenced by the choice of sampling methods. Specifically, for a given batch size $m$, the generalization error under QMC methods exhibits a convergence rate of $O(m^{-1+\varepsilon})$, where $\varepsilon$ can be made arbitrarily small. This rate is notably more favorable than that of the traditional Monte Carlo (MC) methods, which is $O(m^{-1/2+\varepsilon})$. Our theoretical analysis shows that the generalization error under QMC methods achieves a higher order of convergence than their MC counterparts. Numerical experiments demonstrate that QMC indeed surpasses MC in delivering solutions that are both more precise and stable.

[26] 2407.14567

Operating System And Artificial Intelligence: A Systematic Review

In the dynamic landscape of technology, the convergence of Artificial Intelligence (AI) and Operating Systems (OS) has emerged as a pivotal arena for innovation. Our exploration focuses on the symbiotic relationship between AI and OS, emphasizing how AI-driven tools enhance OS performance, security, and efficiency, while OS advancements facilitate more sophisticated AI applications. We delve into various AI techniques employed to optimize OS functionalities, including memory management, process scheduling, and intrusion detection. Simultaneously, we analyze the role of OS in providing essential services and infrastructure that enable effective AI application execution, from resource allocation to data processing. The article also addresses challenges and future directions in this domain, emphasizing the imperative of secure and efficient AI integration within OS frameworks. By examining case studies and recent developments, our review provides a comprehensive overview of the current state of AI-OS integration, underscoring its significance in shaping the next generation of computing technologies. Finally, we explore the promising prospects of Intelligent OSes, considering not only how innovative OS architectures will pave the way for groundbreaking opportunities but also how AI will significantly contribute to advancing these next-generation OSs.

[27] 2407.14568

SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy

Text-to-SQL conversion is a critical innovation, simplifying the transition from complex SQL to intuitive natural language queries, especially significant given SQL's prevalence in the job market across various roles. The rise of Large Language Models (LLMs) like GPT-3.5 and GPT-4 has greatly advanced this field, offering improved natural language understanding and the ability to generate nuanced SQL statements. However, the potential of open-source LLMs in Text-to-SQL applications remains underexplored, with many frameworks failing to leverage their full capabilities, particularly in handling complex database queries and incorporating feedback for iterative refinement. Addressing these limitations, this paper introduces SQLfuse, a robust system integrating open-source LLMs with a suite of tools to enhance Text-to-SQL translation's accuracy and usability. SQLfuse features four modules: schema mining, schema linking, SQL generation, and a SQL critic module, to not only generate but also continuously enhance SQL query quality. Demonstrated by its leading performance on the Spider Leaderboard and deployment by Ant Group, SQLfuse showcases the practical merits of open-source LLMs in diverse business contexts.

[28] 2407.14570

Are handcrafted filters helpful for attributing AI-generated images?

Recently, a vast number of image generation models have been proposed, which raises concerns regarding the misuse of these artificial intelligence (AI) techniques for generating fake images. To attribute the AI-generated images, existing schemes usually design and train deep neural networks (DNNs) to learn the model fingerprints, which usually requires a large amount of data for effective learning. In this paper, we aim to answer the following two questions for AI-generated image attribution, 1) is it possible to design useful handcrafted filters to facilitate the fingerprint learning? and 2) how we could reduce the amount of training data after we incorporate the handcrafted filters? We first propose a set of Multi-Directional High-Pass Filters (MHFs) which are capable to extract the subtle fingerprints from various directions. Then, we propose a Directional Enhanced Feature Learning network (DEFL) to take both the MHFs and randomly-initialized filters into consideration. The output of the DEFL is fused with the semantic features to produce a compact fingerprint. To make the compact fingerprint discriminative among different models, we propose a Dual-Margin Contrastive (DMC) loss to tune our DEFL. Finally, we propose a reference based fingerprint classification scheme for image attribution. Experimental results demonstrate that it is indeed helpful to use our MHFs for attributing the AI-generated images. The performance of our proposed method is significantly better than the state-of-the-art for both the closed-set and open-set image attribution, where only a small amount of images are required for training.

[29] 2407.14571

DataStorm-EM: Exploration of Alternative Timelines within Continuous-Coupled Simulation Ensembles

Many socio-economical critical domains (such as sustainability, public health, and disasters) are characterized by highly complex and dynamic systems, requiring data and model-driven simulations to support decision-making. Due to a large number of unknowns, decision-makers usually need to generate ensembles of stochastic scenarios, requiring hundreds or thousands of individual simulation instances, each with different parameter settings corresponding to distinct scenarios, As the number of model parameters increases, the number of potential timelines one can simulate increases exponentially. Consequently, simulation ensembles are inherently sparse, even when they are extremely large. This necessitates a platform for (a) deciding which simulation instances to execute and (b) given a large simulation ensemble, enabling decision-makers to explore the resulting alternative timelines, by extracting and visualizing consistent, yet diverse timelines from continuous-coupled simulation ensembles. In this article, we present DataStorm-EM platform for data- and model-driven simulation ensemble management, optimization, analysis, and exploration, describe underlying challenges and present our solution.

[30] 2407.14572

Affinity-aware Serverless Function Scheduling

Functions-as-a-Service (FaaS) is a Serverless Cloud paradigm where a platform manages the scheduling (e.g., resource allocation, runtime environments) of stateless functions. Recent developments show the benefits of using domain-specific languages to express per-function policies, e.g., policies can enforce the allocation of functions on nodes that enjoy lower data-access latencies thanks to proximity and connection pooling. Here, we focus on affinity-aware scenarios, i.e., where, for performance and functional requirements, the allocation of a function depends on the presence/absence of other functions on nodes. We first present aAPP, an affinity-aware extension of a declarative, platform-agnostic language for defining custom function scheduling policies. We implement a prototype supporting this scheduling language by extending the popular Apache OpenWhisk FaaS platform and show that using aAPP in affinity-aware scenarios leads to an appreciable reduction in latency without noticeable overhead for scenarios without affinity constraints.

[31] 2407.14573

Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization

Since the advent of generative artificial intelligence, every company and researcher has been rushing to develop their own generative models, whether commercial or not. Given the large number of users of these powerful new tools, there is currently no intrinsically verifiable way to explain from the ground up what happens when LLMs (large language models) learn. For example, those based on automatic speech recognition systems, which have to rely on huge and astronomical amounts of data collected from all over the web to produce fast and efficient results, In this article, we develop a backdoor attack called MarketBackFinal 2.0, based on acoustic data poisoning, MarketBackFinal 2.0 is mainly based on modern stock market models. In order to show the possible vulnerabilities of speech-based transformers that may rely on LLMs.

[32] 2407.14575

Regression prediction algorithm for energy consumption regression in cloud computing based on horned lizard algorithm optimised convolutional neural network-bidirectional gated recurrent unit

For this paper, a prediction study of cloud computing energy consumption was conducted by optimising the data regression algorithm based on the horned lizard optimisation algorithm for Convolutional Neural Networks-Bi-Directional Gated Recurrent Units. Firstly, through Spearman correlation analysis of CPU, usage, memory usage, network traffic, power consumption, number of instructions executed, execution time and energy efficiency, we found that power consumption has the highest degree of positive correlation with energy efficiency, while CPU usage has the highest degree of negative correlation with energy efficiency. In our experiments, we introduced a random forest model and an optimisation model based on the horned lizard optimisation algorithm for testing, and the results show that the optimisation algorithm has better prediction results compared to the random forest model. Specifically, the mean square error (MSE) of the optimisation algorithm is 0.01 smaller than that of the random forest model, and the mean absolute error (MAE) is 0.01 smaller than that of the random forest.3 The results of the combined metrics show that the optimisation algorithm performs more accurately and reliably in predicting energy efficiency. This research result provides new ideas and methods to improve the energy efficiency of cloud computing systems. This research not only expands the scope of application in the field of cloud computing, but also provides a strong support for improving the energy use efficiency of the system.

[33] 2407.14576

A Comparative Study of Transfer Learning for Emotion Recognition using CNN and Modified VGG16 Models

Emotion recognition is a critical aspect of human interaction. This topic garnered significant attention in the field of artificial intelligence. In this study, we investigate the performance of convolutional neural network (CNN) and Modified VGG16 models for emotion recognition tasks across two datasets: FER2013 and AffectNet. Our aim is to measure the effectiveness of these models in identifying emotions and their ability to generalize to different and broader datasets. Our findings reveal that both models achieve reasonable performance on the FER2013 dataset, with the Modified VGG16 model demonstrating slightly increased accuracy. When evaluated on the Affect-Net dataset, performance declines for both models, with the Modified VGG16 model continuing to outperform the CNN. Our study emphasizes the importance of dataset diversity in emotion recognition and discusses open problems and future research directions, including the exploration of multi-modal approaches and the development of more comprehensive datasets.

[34] 2407.14605

ESCAPE: Energy-based Selective Adaptive Correction for Out-of-distribution 3D Human Pose Estimation

Despite recent advances in human pose estimation (HPE), poor generalization to out-of-distribution (OOD) data remains a difficult problem. While previous works have proposed Test-Time Adaptation (TTA) to bridge the train-test domain gap by refining network parameters at inference, the absence of ground-truth annotations makes it highly challenging and existing methods typically increase inference times by one or more orders of magnitude. We observe that 1) not every test time sample is OOD, and 2) HPE errors are significantly larger on distal keypoints (wrist, ankle). To this end, we propose ESCAPE: a lightweight correction and selective adaptation framework which applies a fast, forward-pass correction on most data while reserving costly TTA for OOD data. The free energy function is introduced to separate OOD samples from incoming data and a correction network is trained to estimate the errors of pretrained backbone HPE predictions on the distal keypoints. For OOD samples, we propose a novel self-consistency adaptation loss to update the correction network by leveraging the constraining relationship between distal keypoints and proximal keypoints (shoulders, hips), via a second ``reverse" network. ESCAPE improves the distal MPJPE of five popular HPE models by up to 7% on unseen data, achieves state-of-the-art results on two popular HPE benchmarks, and is significantly faster than existing adaptation methods.

[35] 2407.14609

Adversarial Databases Improve Success in Retrieval-based Large Language Models

Open-source LLMs have shown great potential as fine-tuned chatbots, and demonstrate robust abilities in reasoning and surpass many existing benchmarks. Retrieval-Augmented Generation (RAG) is a technique for improving the performance of LLMs on tasks that the models weren't explicitly trained on, by leveraging external knowledge databases. Numerous studies have demonstrated the effectiveness of RAG to more successfully accomplish downstream tasks when using vector datasets that consist of relevant background information. It has been implicitly assumed by those in the field that if adversarial background information is utilized in this context, that the success of using a RAG-based approach would be nonexistent or even negatively impact the results. To address this assumption, we tested several open-source LLMs on the ability of RAG to improve their success in answering multiple-choice questions (MCQ) in the medical subspecialty field of Nephrology. Unlike previous studies, we examined the effect of RAG in utilizing both relevant and adversarial background databases. We set up several open-source LLMs, including Llama 3, Phi-3, Mixtral 8x7b, Zephyr$\beta$, and Gemma 7B Instruct, in a zero-shot RAG pipeline. As adversarial sources of information, text from the Bible and a Random Words generated database were used for comparison. Our data show that most of the open-source LLMs improve their multiple-choice test-taking success as expected when incorporating relevant information vector databases. Surprisingly however, adversarial Bible text significantly improved the success of many LLMs and even random word text improved test taking ability of some of the models. In summary, our results demonstrate for the first time the countertintuitive ability of adversarial information datasets to improve the RAG-based LLM success.

[36] 2407.14612

A Biomechanics-Inspired Approach to Soccer Kicking for Humanoid Robots

Soccer kicking is a complex whole-body motion that requires intricate coordination of various motor actions. To accomplish such dynamic motion in a humanoid robot, the robot needs to simultaneously: 1) transfer high kinetic energy to the kicking leg, 2) maintain balance and stability of the entire body, and 3) manage the impact disturbance from the ball during the kicking moment. Prior studies on robotic soccer kicking often prioritized stability, leading to overly conservative quasi-static motions. In this work, we present a biomechanics-inspired control framework that leverages trajectory optimization and imitation learning to facilitate highly dynamic soccer kicks in humanoid robots. We conducted an in-depth analysis of human soccer kick biomechanics to identify key motion constraints. Based on this understanding, we designed kinodynamically feasible trajectories that are then used as a reference in imitation learning to develop a robust feedback control policy. We demonstrate the effectiveness of our approach through a simulation of an anthropomorphic 25 DoF bipedal humanoid robot, named PresToe, which is equipped with 7 DoF legs, including a unique actuated toe. Using our framework, PresToe can execute dynamic instep kicks, propelling the ball at speeds exceeding 11m/s in full dynamics simulation.

[37] 2407.14614

Evaluating language models as risk scores

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk sores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.

[38] 2407.14620

The Research of Group Re-identification from Multiple Cameras

Object re-identification is of increasing importance in visual surveillance. Most existing works focus on re-identify individual from multiple cameras while the application of group re-identification (Re-ID) is rarely discussed. We redefine Group Re-identification as a process which includes pedestrian detection, feature extraction, graph model construction, and graph matching. Group re-identification is very challenging since it is not only interfered by view-point and human pose variations in the traditional re-identification tasks, but also suffered from the challenges in group layout change and group member variation. To address the above challenges, this paper introduces a novel approach which leverages the multi-granularity information inside groups to facilitate group re-identification. We first introduce a multi-granularity Re-ID process, which derives features for multi-granularity objects (people/people-subgroups) in a group and iteratively evaluates their importances during group Re-ID, so as to handle group-wise misalignments due to viewpoint change and group dynamics. We further introduce a multi-order matching scheme. It adaptively selects representative people/people-subgroups in each group and integrates the multi-granularity information from these people/people-subgroups to obtain group-wise matching, hence achieving a more reliable matching score between groups. Experimental results on various datasets demonstrate the effectiveness of our approach.

[39] 2407.14622

BOND: Aligning LLMs with Best-of-N Distillation

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.

[40] 2407.14628

Advancing Melanoma Diagnosis with Self-Supervised Neural Networks: Evaluating the Effectiveness of Different Techniques

We investigate the potential of self-supervision in improving the accuracy of deep learning models trained to classify melanoma patches. Various self-supervision techniques such as rotation prediction, missing patch prediction, and corruption removal were implemented and assessed for their impact on the convolutional neural network's performance. Preliminary results suggest a positive influence of self-supervision methods on the model's accuracy. The study notably demonstrates the efficacy of the corruption removal method in enhancing model performance. Despite observable improvements, we conclude that the self-supervised models have considerable potential for further enhancement, achievable through training over more epochs or expanding the dataset. We suggest exploring other self-supervision methods like Bootstrap Your Own Latent (BYOL) and contrastive learning in future research, emphasizing the cost-benefit trade-off due to their resource-intensive nature. The findings underline the promise of self-supervision in augmenting melanoma detection capabilities of deep learning models.

[41] 2407.14631

Two new feature selection methods based on learn-heuristic techniques for breast cancer prediction: A comprehensive analysis

Breast cancer is not preventable because of its unknown causes. However, its early diagnosis increases patients' recovery chances. Machine learning (ML) can be utilized to improve treatment outcomes in healthcare operations while diminishing costs and time. In this research, we suggest two novel feature selection (FS) methods based upon an imperialist competitive algorithm (ICA) and a bat algorithm (BA) and their combination with ML algorithms. This study aims to enhance diagnostic models' efficiency and present a comprehensive analysis to help clinical physicians make much more precise and reliable decisions than before. K-nearest neighbors, support vector machine, decision tree, Naive Bayes, AdaBoost, linear discriminant analysis, random forest, logistic regression, and artificial neural network are some of the methods employed. This paper applied a distinctive integration of evaluation measures and ML algorithms using the wrapper feature selection based on ICA (WFSIC) and BA (WFSB) separately. We compared two proposed approaches for the performance of the classifiers. Also, we compared our best diagnostic model with previous works reported in the literature survey. Experimentations were performed on the Wisconsin diagnostic breast cancer dataset. Results reveal that the proposed framework that uses the BA with an accuracy of 99.12\%, surpasses the framework using the ICA and most previous works. Additionally, the RF classifier in the approach of FS based on BA emerges as the best model and outperforms others regarding its criteria. Besides, the results illustrate the role of our techniques in reducing the dataset dimensions up to 90\% and increasing the performance of diagnostic models by over 99\%. Moreover, the result demonstrates that there are more critical features than the optimum dataset obtained by proposed FS approaches that have been selected by most ML models.

[42] 2407.14634

Informational Health --Toward the Reduction of Risks in the Information Space

The modern information society, markedly influenced by the advent of the internet and subsequent developments such as WEB 2.0, has seen an explosive increase in information availability, fundamentally altering human interaction with information spaces. This transformation has facilitated not only unprecedented access to information but has also raised significant challenges, particularly highlighted by the spread of ``fake news'' during critical events like the 2016 U.S. presidential election and the COVID-19 pandemic. The latter event underscored the dangers of an ``infodemic,'' where the large amount of information made distinguishing between factual and non-factual content difficult, thereby complicating public health responses and posing risks to democratic processes. In response to these challenges, this paper introduces the concept of ``informational health,'' drawing an analogy between dietary habits and information consumption. It argues that just as balanced diets are crucial for physical health, well-considered nformation behavior is essential for maintaining a healthy information environment. This paper proposes three strategies for fostering informational health: literacy education, visualization of meta-information, and informational health assessments. These strategies aim to empower users and platforms to navigate and enhance the information ecosystem effectively. By focusing on long-term informational well-being, we highlight the necessity of addressing the social risks inherent in the current attention economy, advocating for a paradigm shift towards a more sustainable information consumption model.

[43] 2407.14637

An objective isogeometric mixed finite element formulation for nonlinear elastodynamic beams with incompatible warping strains

We present a stable mixed isogeometric finite element formulation for geometrically and materially nonlinear beams in transient elastodynamics, where a Cosserat beam formulation with extensible directors is used. The extensible directors yields a linear configuration space incorporating constant in-plane cross-sectional strains. Higher-order (incompatible) strains are introduced to correct stiffness, whose additional degrees-of-freedom are eliminated by an element-wise condensation. Further, the present discretization of the initial director field leads to the objectivity of approximated strain measures, regardless of the degree of basis functions. For physical stress resultants and strains, we employ a global patch-wise approxiation using B-spline basis functions, whose higher-order continuity enables to use much less degrees-of-freedom, compared to element-wise approximation. For time-stepping, we employ an implicit energy-momentum consistent scheme, which exhibits superior numerical stability in comparison to standard trapezoidal and mid-point rules. Several numerical examples are presented to verify the present method.

[44] 2407.14638

Metasurface Energy Harvesters: State-of-the-Art Designs and Their Potential for Energy Sustainable Reconfigurable Intelligent Surfaces

Metasurface Energy Harvesters (MEHs) have emerged as a prominent enabler of highly efficient Radio Frequency (RF) energy harvesters. This survey delves into the fundamentals of the MEH technology, providing a comprehensive overview of their working principle, unit cell designs and prototypes over various frequency bands, as well as state-of-the art modes of operation. Inspired by the recent academic and industrial interest on Reconfigurable Intelligent Surfaces (RISs)for the upcoming sixth-Generation (6G) of wireless networks, we study the interplay between this technology and MEHs aiming for energy sustainable RISs power by metasurface-based RF energy harvesting. We present a novel hybrid unit cell design capable of simultaneous energy harvesting and 1-bit tunable reflection whose dual-functional response is validated via full-wave simulations. Then, we conduct a comparative collection of real-world measurements for ambient RF power levels and power consumption budgets of reflective RISs to unveil the potential for a self-sustainable RIS via ambient RF energy harvesting. The paper is concluded with an elaborative discussion on open design challenges and future research directions for MEHs and energy sustainable hybrid RISs.

[45] 2407.14640

CVE-LLM : Automatic vulnerability evaluation in medical device industry using large language models

The healthcare industry is currently experiencing an unprecedented wave of cybersecurity attacks, impacting millions of individuals. With the discovery of thousands of vulnerabilities each month, there is a pressing need to drive the automation of vulnerability assessment processes for medical devices, facilitating rapid mitigation efforts. Generative AI systems have revolutionized various industries, offering unparalleled opportunities for automation and increased efficiency. This paper presents a solution leveraging Large Language Models (LLMs) to learn from historical evaluations of vulnerabilities for the automatic assessment of vulnerabilities in the medical devices industry. This approach is applied within the portfolio of a single manufacturer, taking into account device characteristics, including existing security posture and controls. The primary contributions of this paper are threefold. Firstly, it provides a detailed examination of the best practices for training a vulnerability Language Model (LM) in an industrial context. Secondly, it presents a comprehensive comparison and insightful analysis of the effectiveness of Language Models in vulnerability assessment. Finally, it proposes a new human-in-the-loop framework to expedite vulnerability evaluation processes.

[46] 2407.14641

Differential Privacy with Multiple Selections

We consider the setting where a user with sensitive features wishes to obtain a recommendation from a server in a differentially private fashion. We propose a ``multi-selection'' architecture where the server can send back multiple recommendations and the user chooses one from these that matches best with their private features. When the user feature is one-dimensional -- on an infinite line -- and the accuracy measure is defined w.r.t some increasing function $\mathfrak{h}(.)$ of the distance on the line, we precisely characterize the optimal mechanism that satisfies differential privacy. The specification of the optimal mechanism includes both the distribution of the noise that the user adds to its private value, and the algorithm used by the server to determine the set of results to send back as a response and further show that Laplace is an optimal noise distribution. We further show that this optimal mechanism results in an error that is inversely proportional to the number of results returned when the function $\mathfrak{h}(.)$ is the identity function.

[47] 2407.14643

Double-Layer Soft Data Fusion for Indoor Robot WiFi-Visual Localization

This paper presents a novel WiFi-Visual data fusion method for indoor robot (TIAGO++) localization. This method can use 10 WiFi samples and 4 low-resolution images ($58 \times 58$ in pixels) to localize a indoor robot with an average error distance about 1.32 meters. The experiment test is 3 months after the data collection in a general teaching building, whose WiFi and visual environments are partially changed. This indirectly shows the robustness of the proposed method. Instead of neural network design, this paper focuses on the soft data fusion to prevent unbounded errors in visual localization. A double-layer soft data fusion is proposed. The proposed soft data fusion includes the first-layer WiFi-Visual feature fusion and the second-layer decision vector fusion. Firstly, motivated by the excellent capability of neural network in image processing and recognition, the temporal-spatial features are extracted from WiFi data, these features are represented in image form. Secondly, the WiFi temporal-spatial features in image form and the visual features taken by the robot camera are combined together, and are jointly exploited by a classification neural network to produce a likelihood vector for WiFi-Visual localization. This is called first-layer WiFi-Visual fusion. Similarly, these two types of features can exploited separately by neural networks to produce another two independent likelihood vectors. Thirdly, the three likelihood vectors are fused by Hadamard product and median filtering to produce the final likelihood vector for localization. This called the second-layer decision vector fusion. The proposed soft data fusion does not apply any threshold or prioritize any data source over the other in the fusion process. It never excludes the positions of low probabilities, which can avoid the information loss due to a hard decision. The demo video is provided. The code will be open.

[48] 2407.14644

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. This allows us to show suffix conversion without any gradients, using only LLMs to perform the attacks, and thus better understand the scope of possible risks. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. The situations are extracted from the IMDB dataset, and prompts are defined following a few-shot chain-of-thought prompting. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs. We find that across many LLMs, as few as 1 attempt produces an attack and that these attacks transfer between LLMs. The link to our code is available at \url{}.

[49] 2407.14645

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 ($\sim$ 35x speed-up closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on the performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thus, providing insights to design future systems for LLM training and inference.

[50] 2407.14649

The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments

Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions. Developing these techniques necessitates algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly. Although substantial efforts have curated datasets for hand segmentation, most focus on residential or commercial domains. Existing datasets targeting industrial settings predominantly rely on synthetic data, which we demonstrate does not effectively transfer to real-world operations. Moreover, these datasets lack uncertainty estimations critical for safe collaboration. Addressing these gaps, we present HAGS: Hand and Glove Segmentation Dataset. This dataset provides 1200 challenging examples to build applications toward hand and glove segmentation in industrial human-robot collaboration scenarios as well as assess out-of-distribution images, constructed via green screen augmentations, to determine ML-classifier robustness. We study state-of-the-art, real-time segmentation models to evaluate existing methods. Our dataset and baselines are publicly available: and

[51] 2407.14650

Auditing the Grid-Based Placement of Private Label Products on E-commerce Search Result Pages

E-commerce platforms support the needs and livelihoods of their two most important stakeholders -- customers and producers/sellers. Multiple algorithmic systems, like ``search'' systems mediate the interactions between these stakeholders by connecting customers to producers with relevant items. Search results include (i) private label (PL) products that are manufactured/sold by the platform itself, as well as (ii) third-party products on advertised / sponsored and organic positions. In this paper, we systematically quantify the extent of PL product promotion on e-commerce search results for the two largest e-commerce platforms operating in India -- and Flipkart. By analyzing snapshots of search results across the two platforms, we discover high PL promotion on the initial result pages (~ 15% PLs are advertised on the first SERP of Amazon). Both platforms use different strategies to promote their PL products, such as placing more PLs on the advertised positions -- while Amazon places them on the first, middle, and last rows of the search results, Flipkart places them on the first two positions and the (entire) last column of the search results. We discover that these product placement strategies of both platforms conform with existing user attention strategies proposed in the literature. Finally, to supplement the findings from the collected data, we conduct a survey among 68 participants on Amazon Mechanical Turk. The click pattern from our survey shows that users strongly prefer to click on products placed at positions that correspond to the PL products on the search results of Amazon, but not so strongly on Flipkart. The click-through rate follows previously proposed theoretically grounded user attention distribution patterns in a two-dimensional layout.

[52] 2407.14653

OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning

Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach makes compliance with safety constraints through effective data utilization and regularization techniques to benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets showcase OASIS's superiority in benefiting offline safe RL agents to achieve high-reward behavior while satisfying the safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.

[53] 2407.14655

LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition

The complexity of state-of-the-art Transformer-based models for skeleton-based action recognition poses significant challenges in terms of computational efficiency and resource utilization. In this paper, we explore the application of Singular Value Decomposition (SVD) to effectively reduce the model sizes of these pre-trained models, aiming to minimize their resource consumption while preserving accuracy. Our method, LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition), also includes a fine-tuning step to compensate for any potential accuracy degradation caused by model compression, and is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer". Experimental results on the "NTU RGB+D" and "NTU RGB+D 120" datasets show that our method can reduce the number of model parameters substantially with negligible degradation or even performance increase in recognition accuracy. This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.

[54] 2407.14658

A New Lightweight Hybrid Graph Convolutional Neural Network -- CNN Scheme for Scene Classification using Object Detection Inference

Scene understanding plays an important role in several high-level computer vision applications, such as autonomous vehicles, intelligent video surveillance, or robotics. However, too few solutions have been proposed for indoor/outdoor scene classification to ensure scene context adaptability for computer vision frameworks. We propose the first Lightweight Hybrid Graph Convolutional Neural Network (LH-GCNN)-CNN framework as an add-on to object detection models. The proposed approach uses the output of the CNN object detection model to predict the observed scene type by generating a coherent GCNN representing the semantic and geometric content of the observed scene. This new method, applied to natural scenes, achieves an efficiency of over 90\% for scene classification in a COCO-derived dataset containing a large number of different scenes, while requiring fewer parameters than traditional CNN methods. For the benefit of the scientific community, we will make the source code publicly available:

[55] 2407.14662

Relational Composition in Neural Networks: A Survey and Call to Action

Many neural nets appear to represent data as linear combinations of "feature vectors." Algorithms for discovering these vectors have seen impressive recent success. However, we argue that this success is incomplete without an understanding of relational composition: how (or whether) neural nets combine feature vectors to represent more complicated relationships. To facilitate research in this area, this paper offers a guided tour of various relational mechanisms that have been proposed, along with preliminary analysis of how such mechanisms might affect the search for interpretable features. We end with a series of promising areas for empirical research, which may help determine how neural networks represent structured data.

[56] 2407.14664

Is $F_1$ Score Suboptimal for Cybersecurity Models? Introducing $C_{score}$, a Cost-Aware Alternative for Model Assessment

The cost of errors related to machine learning classifiers, namely, false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of not detecting an attack is very different from marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade-off between these two errors. However, most of the commonly used metrics to evaluate model quality, such as $F_1$ score, which is defined in terms of model precision and recall, treat both these errors equally, making it difficult for users to optimize for the actual cost of these errors. In this paper, we propose a new cost-aware metric, $C_{score}$ based on precision and recall that can replace $F_1$ score for model evaluation and selection. It includes a cost ratio that takes into account the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric, and compare it to $F_1$ score. Further, we use this metric for model thresholding for five cybersecurity related datasets for multiple cost ratios. The results show an average cost savings of 49%.

[57] 2407.14671

DefTesPY: Cyber defense model with enhanced data modeling and analysis for Tesla company via Python Language

Several types of cyber-attacks on automobiles and business firms keep on rising as we are preparing to counter cybercrimes with several new technologies and defense models. Cyber defense (also, counter intelligence) is a computer network defense mechanism that involves response to activities, critical infrastructure protection, and information assurance for corporations, government bodies, and other conceivable networks. Cyber defense focuses on preventing, detecting, and responding to assaults or threats in a timely manner so that no infrastructure or information is compromised. With the increasing volume and complexity of cyber threats, most companies need cyber defense to protect sensitive information and assets. We can control attacker actions by utilizing firewalls at different levels, an intrusion detection system (IDS), with the intrusion prevention system (IPS) which can be installed independently or in combination with other protection approaches. Tesla is an American clean energy and automotive company in Austin, Texas, USA. The recent data breach at Tesla affected over 75,000 individuals as the company pinpoints two former employees as the offender revealing more than 23,000 internal files from 2015 to 2022. In this work, we will emphasize data modeling and data analysis using cyber defense model and python with a survey of the Tesla company. We have proposed a defense model, DefTesPY, with enhanced data modeling and data analysis based on the encountered cyber-attacks and cybercrimes for Tesla company till date.

[58] 2407.14676

On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition

Self-Supervised Learning (SSL) has become a prominent approach for acquiring visual representations across various tasks, yet its application in fine-grained visual recognition (FGVR) is challenged by the intricate task of distinguishing subtle differences between categories. To overcome this, we introduce an novel strategy that boosts SSL's ability to extract critical discriminative features vital for FGVR. This approach creates synthesized data pairs to guide the model to focus on discriminative features critical for FGVR during SSL. We start by identifying non-discriminative features using two main criteria: features with low variance that fail to effectively separate data and those deemed less important by Grad-CAM induced from the SSL loss. We then introduce perturbations to these non-discriminative features while preserving discriminative ones. A decoder is employed to reconstruct images from both perturbed and original feature vectors to create data pairs. An encoder is trained on such generated data pairs to become invariant to variations in non-discriminative dimensions while focusing on discriminative features, thereby improving the model's performance in FGVR tasks. We demonstrate the promising FGVR performance of the proposed approach through extensive evaluation on a wide variety of datasets.

[59] 2407.14679

Compact Language Models via Pruning and Knowledge Distillation

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.

[60] 2407.14681

Value Internalization: Learning and Generalizing from Social Reward

Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. We characterize the implications of incomplete internalization, akin to "reward hacking" on the ISR. Additionally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a foundation for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.

[61] 2407.14683

Large-Area Emergency Lockdowns with Automated Driving Systems

Region-wide restrictions on personal vehicle travel have a long history in the United States, from riot curfews in the late 1960s, to travel bans during snow events, to the 2013 shelter-in-place "lockdown" during the search for the perpetrator of the Boston Marathon bombing. Because lockdowns require tremendous resources to enforce, they are often limited in duration or scope. The introduction of automated driving systems may allow governments to quickly and cheaply effect large-area lockdowns by jamming wireless communications, spoofing road closures on digital maps, exploiting a vehicle's programming to obey all traffic control devices, or coordinating with vehicle developers. Future vehicles may lack conventional controls, rendering them undrivable by the public. As travel restrictions become easier to implement, governments may enforce them more frequently, over longer durations and wider areas. This article explores the practical, legal, and ethical implications of lockdowns when most driving is highly automated, and provides guidance for the development of lockdown policies.

[62] 2407.14684

Data Poisoning: An Overlooked Threat to Power Grid Resilience

As the complexities of Dynamic Data Driven Applications Systems increase, preserving their resilience becomes more challenging. For instance, maintaining power grid resilience is becoming increasingly complicated due to the growing number of stochastic variables (such as renewable outputs) and extreme weather events that add uncertainty to the grid. Current optimization methods have struggled to accommodate this rise in complexity. This has fueled the growing interest in data-driven methods used to operate the grid, leading to more vulnerability to cyberattacks. One such disruption that is commonly discussed is the adversarial disruption, where the intruder attempts to add a small perturbation to input data in order to "manipulate" the system operation. During the last few years, work on adversarial training and disruptions on the power system has gained popularity. In this paper, we will first review these applications, specifically on the most common types of adversarial disruptions: evasion and poisoning disruptions. Through this review, we highlight the gap between poisoning and evasion research when applied to the power grid. This is due to the underlying assumption that model training is secure, leading to evasion disruptions being the primary type of studied disruption. Finally, we will examine the impacts of data poisoning interventions and showcase how they can endanger power grid resilience.

[63] 2407.14686

Using Case Studies to Teach Responsible AI to Industry Practitioners

Responsible AI (RAI) is the science and the practice of making the design, development, and use of AI socially sustainable: of reaping the benefits of innovation while controlling the risks. Naturally, industry practitioners play a decisive role in our collective ability to achieve the goals of RAI. Unfortunately, we do not yet have consolidated educational materials and effective methodologies for teaching RAI to practitioners. In this paper, we propose a novel stakeholder-first educational approach that uses interactive case studies to achieve organizational and practitioner -level engagement and advance learning of RAI. We discuss a partnership with Meta, an international technology company, to co-develop and deliver RAI workshops to a diverse audience within the company. Our assessment results indicate that participants found the workshops engaging and reported a positive shift in understanding and motivation to apply RAI to their work.

[64] 2407.14690

Productivity profile of CNPq scholarship researchers in computer science from 2017 to 2021

Productivity in Research (PQ) is a scholarship granted by CNPq (Brazilian National Council for Scientific and Technological Development). This scholarship aims to recognize a few selected faculty researchers for their scientific production, outstanding technology and innovation in their respective areas of knowledge. In the present study, we evaluated the scientific production of the 185 researchers in the Computer Science area granted with PQ scholarship in the last PQ selection notice. To evaluate the productivity of each professor, we considered papers published in scientific journals and conferences (complete works) in a five years period (from 2017 to 2021). We analyzed the productivity in terms of both quantity and quality. We also evaluated its distribution over the country, universities and research facilities, as well as, the co-authorship network produced.

[65] 2407.14695

A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning

Python has gained widespread popularity in the fields of machine learning, artificial intelligence, and data engineering due to its effectiveness and extensive libraries. R, on its side, remains a dominant language for statistical analysis and visualization. However, certain libraries have become outdated, limiting their functionality and performance. Users can use Python's advanced machine learning and AI capabilities alongside R's robust statistical packages by combining these two programming languages. This paper explores using R's reticulate package to call Python from R, providing practical examples and highlighting scenarios where this integration enhances productivity and analytical capabilities. With a few hello-world code snippets, we demonstrate how to run Python's scikit-learn, pytorch and OpenAI gym libraries for building Machine Learning, Deep Learning, and Reinforcement Learning projects easily.

[66] 2407.14700

Composer's Assistant 2: Interactive Multi-Track MIDI Infilling with Fine-Grained User Control

We introduce Composer's Assistant 2, a system for interactive human-computer composition in the REAPER digital audio workstation. Our work upgrades the Composer's Assistant system (which performs multi-track infilling of symbolic music at the track-measure level) with a wide range of new controls to give users fine-grained control over the system's outputs. Controls introduced in this work include two types of rhythmic conditioning controls, horizontal and vertical note onset density controls, several types of pitch controls, and a rhythmic interest control. We train a T5-like transformer model to implement these controls and to serve as the backbone of our system. With these controls, we achieve a dramatic improvement in objective metrics over the original system. We also study how well our model understands the meaning of our controls, and we conduct a listening study that does not find a significant difference between real music and music composed in a co-creative fashion with our system. We release our complete system, consisting of source code, pretrained models, and REAPER scripts.

[67] 2407.14701

Contextual modulation of language comprehension in a dynamic neural model of lexical meaning

We propose and computationally implement a dynamic neural model of lexical meaning, and experimentally test its behavioral predictions. We demonstrate the architecture and behavior of the model using as a test case the English lexical item 'have', focusing on its polysemous use. In the model, 'have' maps to a semantic space defined by two continuous conceptual dimensions, connectedness and control asymmetry, previously proposed to parameterize the conceptual system for language. The mapping is modeled as coupling between a neural node representing the lexical item and neural fields representing the conceptual dimensions. While lexical knowledge is modeled as a stable coupling pattern, real-time lexical meaning retrieval is modeled as the motion of neural activation patterns between metastable states corresponding to semantic interpretations or readings. Model simulations capture two previously reported empirical observations: (1) contextual modulation of lexical semantic interpretation, and (2) individual variation in the magnitude of this modulation. Simulations also generate a novel prediction that the by-trial relationship between sentence reading time and acceptability should be contextually modulated. An experiment combining self-paced reading and acceptability judgments replicates previous results and confirms the new model prediction. Altogether, results support a novel perspective on lexical polysemy: that the many related meanings of a word are metastable neural activation states that arise from the nonlinear dynamics of neural populations governing interpretation on continuous semantic dimensions.

[68] 2407.14705

Reactive graphs in action (extended version)

Reactive graphs are transition structures whereas edges become active and inactive during its evolution, that were introduced by Dov Gabbay from a mathematical's perspective. This paper presents Marge (, a web-based tool to visualise and analyse reactive graphs enriched with labels. Marge animates the operational semantics of reactive graphs and offers different graphical views to provide insights over concrete systems. We motivate the applicability of reactive graphs for adaptive systems and for featured transition systems, using Marge to tighten the gap between the existing theoretical models and their usage to analyse concrete systems.

[69] 2407.14709

$\infty$-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Synthesizing high-resolution images from intricate, domain-specific information remains a significant challenge in generative modeling, particularly for applications in large-image domains such as digital histopathology and remote sensing. Existing methods face critical limitations: conditional diffusion models in pixel or latent space cannot exceed the resolution on which they were trained without losing fidelity, and computational demands increase significantly for larger image sizes. Patch-based methods offer computational efficiency but fail to capture long-range spatial relationships due to their overreliance on local information. In this paper, we introduce a novel conditional diffusion model in infinite dimensions, $\infty$-Brush for controllable large image synthesis. We propose a cross-attention neural operator to enable conditioning in function space. Our model overcomes the constraints of traditional finite-dimensional diffusion models and patch-based methods, offering scalability and superior capability in preserving global image structures while maintaining fine details. To our best knowledge, $\infty$-Brush is the first conditional diffusion model in function space, that can controllably synthesize images at arbitrary resolutions of up to $4096\times4096$ pixels. The code is available at

[70] 2407.14710

Universally Harmonizing Differential Privacy Mechanisms for Federated Learning: Boosting Accuracy and Convergence

Differentially private federated learning (DP-FL) is a promising technique for collaborative model training while ensuring provable privacy for clients. However, optimizing the tradeoff between privacy and accuracy remains a critical challenge. To our best knowledge, we propose the first DP-FL framework (namely UDP-FL), which universally harmonizes any randomization mechanism (e.g., an optimal one) with the Gaussian Moments Accountant (viz. DP-SGD) to significantly boost accuracy and convergence. Specifically, UDP-FL demonstrates enhanced model performance by mitigating the reliance on Gaussian noise. The key mediator variable in this transformation is the R\'enyi Differential Privacy notion, which is carefully used to harmonize privacy budgets. We also propose an innovative method to theoretically analyze the convergence for DP-FL (including our UDP-FL ) based on mode connectivity analysis. Moreover, we evaluate our UDP-FL through extensive experiments benchmarked against state-of-the-art (SOTA) methods, demonstrating superior performance on both privacy guarantees and model performance. Notably, UDP-FL exhibits substantial resilience against different inference attacks, indicating a significant advance in safeguarding sensitive data in federated learning environments.

[71] 2407.14714

Unveiling the Decision-Making Process in Reinforcement Learning with Genetic Programming

Despite tremendous progress, machine learning and deep learning still suffer from incomprehensible predictions. Incomprehensibility, however, is not an option for the use of (deep) reinforcement learning in the real world, as unpredictable actions can seriously harm the involved individuals. In this work, we propose a genetic programming framework to generate explanations for the decision-making process of already trained agents by imitating them with programs. Programs are interpretable and can be executed to generate explanations of why the agent chooses a particular action. Furthermore, we conduct an ablation study that investigates how extending the domain-specific language by using library learning alters the performance of the method. We compare our results with the previous state of the art for this problem and show that we are comparable in performance but require much less hardware resources and computation time.

[72] 2407.14717

Differential Privacy of Cross-Attention with Provable Guarantee

Cross-attention has become a fundamental module nowadays in many important artificial intelligence applications, e.g., retrieval-augmented generation (RAG), system prompt, guided stable diffusion, and many so on. Ensuring cross-attention privacy is crucial and urgently needed because its key and value matrices may contain sensitive information about companies and their users, many of which profit solely from their system prompts or RAG data. In this work, we design a novel differential privacy (DP) data structure to address the privacy security of cross-attention with a theoretical guarantee. In detail, let $n$ be the input token length of system prompt/RAG data, $d$ be the feature dimension, $0 < \alpha \le 1$ be the relative error parameter, $R$ be the maximum value of the query and key matrices, $R_w$ be the maximum value of the value matrix, and $r,s,\epsilon_s$ be parameters of polynomial kernel methods. Then, our data structure requires $\widetilde{O}(ndr^2)$ memory consumption with $\widetilde{O}(nr^2)$ initialization time complexity and $\widetilde{O}(\alpha^{-1} r^2)$ query time complexity for a single token query. In addition, our data structure can guarantee that the user query is $(\epsilon, \delta)$-DP with $\widetilde{O}(n^{-1} \epsilon^{-1} \alpha^{-1/2} R^{2s} R_w r^2)$ additive error and $n^{-1} (\alpha + \epsilon_s)$ relative error between our output and the true answer. Furthermore, our result is robust to adaptive queries in which users can intentionally attack the cross-attention system. To our knowledge, this is the first work to provide DP for cross-attention. We believe it can inspire more privacy algorithm design in large generative models (LGMs).

[73] 2407.14718

A mimetic discretization of Westervelt's equation

A broad class of nonlinear acoustic wave models possess a Hamiltonian structure in their dissipation-free limit and a gradient flow structure for their dissipative dynamics. This structure may be exploited to design numerical methods which preserve the Hamiltonian structure in the dissipation-free limit, and which achieve the correct dissipation rate in the spatially-discrete dissipative dynamics. Moreover, by using spatial discretizations which preserve the de Rham cohomology, the non-evolving involution constraint for the vorticity may be exactly satisfied for all of time. Numerical examples are given using a mimetic finite difference spatial discretization.

[74] 2407.14720

Downstream-Pretext Domain Knowledge Traceback for Active Learning

Active learning (AL) is designed to construct a high-quality labeled dataset by iteratively selecting the most informative samples. Such sampling heavily relies on data representation, while recently pre-training is popular for robust feature learning. However, as pre-training utilizes low-level pretext tasks that lack annotation, directly using pre-trained representation in AL is inadequate for determining the sampling score. To address this problem, we propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance for selecting diverse and instructive samples near the decision boundary. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. The diversity indicator constructs two feature spaces based on the pre-training pretext model and the downstream knowledge from annotation, by which it locates the neighbors of unlabeled data from the downstream space in the pretext space to explore the interaction of samples. With this mechanism, DOKT unifies the data relations of low-level and high-level representations to estimate traceback diversity. Next, in the uncertainty estimator, domain mixing is designed to enforce perceptual perturbing to unlabeled samples with similar visual patches in the pretext space. Then the divergence of perturbed samples is measured to estimate the domain uncertainty. As a result, DOKT selects the most diverse and important samples based on these two modules. The experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods and generalizes well to various application scenarios such as semantic segmentation and image captioning.

[75] 2407.14725

CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting

A crowd density forecasting task aims to predict how the crowd density map will change in the future from observed past crowd density maps. However, the past crowd density maps are often incomplete due to the miss-detection of pedestrians, and it is crucial to develop a robust crowd density forecasting model against the miss-detection. This paper presents a MAsked crowd density Completion framework for crowd density forecasting (CrowdMAC), which is simultaneously trained to forecast future crowd density maps from partially masked past crowd density maps (i.e., forecasting maps from past maps with miss-detection) while reconstructing the masked observation maps (i.e., imputing past maps with miss-detection). Additionally, we propose Temporal-Density-aware Masking (TDM), which non-uniformly masks tokens in the observed crowd density map, considering the sparsity of the crowd density maps and the informativeness of the subsequent frames for the forecasting task. Moreover, we introduce multi-task masking to enhance training efficiency. In the experiments, CrowdMAC achieves state-of-the-art performance on seven large-scale datasets, including SDD, ETH-UCY, inD, JRDB, VSCrowd, FDST, and croHD. We also demonstrate the robustness of the proposed method against both synthetic and realistic miss-detections.

[76] 2407.14726

MetaAug: Meta-Data Augmentation for Post-Training Quantization

Post-Training Quantization (PTQ) has received significant attention because it requires only a small set of calibration data to quantize a full-precision model, which is more practical in real-world applications in which full access to a large training set is not available. However, it often leads to overfitting on the small calibration dataset. Several methods have been proposed to address this issue, yet they still rely on only the calibration set for the quantization and they do not validate the quantized model due to the lack of a validation set. In this work, we propose a novel meta-learning based approach to enhance the performance of post-training quantization. Specifically, to mitigate the overfitting problem, instead of only training the quantized model using the original calibration set without any validation during the learning process as in previous PTQ works, in our approach, we both train and validate the quantized model using two different sets of images. In particular, we propose a meta-learning based approach to jointly optimize a transformation network and a quantized model through bi-level optimization. The transformation network modifies the original calibration data and the modified data will be used as the training set to learn the quantized model with the objective that the quantized model achieves a good performance on the original calibration data. Extensive experiments on the widely used ImageNet dataset with different neural network architectures demonstrate that our approach outperforms the state-of-the-art PTQ methods.

[77] 2407.14727

Economy Watchers Survey provides Datasets and Tasks for Japanese Financial Domain

Many natural language processing (NLP) tasks in English or general domains are widely available and are often used to evaluate pre-trained language models. In contrast, there are fewer tasks available for languages other than English and for the financial domain. In particular, tasks in Japanese and the financial domain are limited. We construct two large datasets using materials published by a Japanese central government agency. The datasets provide three Japanese financial NLP tasks, which include a 3-class and 12-class classification for categorizing sentences, as well as a 5-class classification task for sentiment analysis. Our datasets are designed to be comprehensive and up-to-date, leveraging an automatic update framework that ensures the latest task datasets are publicly available anytime.

[78] 2407.14730

FedDM: Enhancing Communication Efficiency and Handling Data Heterogeneity in Federated Diffusion Models

We introduce FedDM, a novel training framework designed for the federated training of diffusion models. Our theoretical analysis establishes the convergence of diffusion models when trained in a federated setting, presenting the specific conditions under which this convergence is guaranteed. We propose a suite of training algorithms that leverage the U-Net architecture as the backbone for our diffusion models. These include a basic Federated Averaging variant, FedDM-vanilla, FedDM-prox to handle data heterogeneity among clients, and FedDM-quant, which incorporates a quantization module to reduce the model update size, thereby enhancing communication efficiency across the federated network. We evaluate our algorithms on FashionMNIST (28x28 resolution), CIFAR-10 (32x32 resolution), and CelebA (64x64 resolution) for DDPMs, as well as LSUN Church Outdoors (256x256 resolution) for LDMs, focusing exclusively on the imaging modality. Our evaluation results demonstrate that FedDM algorithms maintain high generation quality across image resolutions. At the same time, the use of quantized updates and proximal terms in the local training objective significantly enhances communication efficiency (up to 4x) and model convergence, particularly in non-IID data settings, at the cost of increased FID scores (up to 1.75x).

[79] 2407.14732

Meta-GPS++: Enhancing Graph Meta-Learning with Contrastive Learning and Self-Training

Node classification is an essential problem in graph learning. However, many models typically obtain unsatisfactory performance when applied to few-shot scenarios. Some studies have attempted to combine meta-learning with graph neural networks to solve few-shot node classification on graphs. Despite their promising performance, some limitations remain. First, they employ the node encoding mechanism of homophilic graphs to learn node embeddings, even in heterophilic graphs. Second, existing models based on meta-learning ignore the interference of randomness in the learning process. Third, they are trained using only limited labeled nodes within the specific task, without explicitly utilizing numerous unlabeled nodes. Finally, they treat almost all sampled tasks equally without customizing them for their uniqueness. To address these issues, we propose a novel framework for few-shot node classification called Meta-GPS++. Specifically, we first adopt an efficient method to learn discriminative node representations on homophilic and heterophilic graphs. Then, we leverage a prototype-based approach to initialize parameters and contrastive learning for regularizing the distribution of node embeddings. Moreover, we apply self-training to extract valuable information from unlabeled nodes. Additionally, we adopt S$^2$ (scaling & shifting) transformation to learn transferable knowledge from diverse tasks. The results on real-world datasets show the superiority of Meta-GPS++. Our code is available here.

[80] 2407.14733

Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL

With the advent of foundation models, prompt tuning has positioned itself as an important technique for directing model behaviors and eliciting desired responses. Prompt tuning regards selecting appropriate keywords included into the input, thereby adapting to the downstream task without adjusting or fine-tuning the model parameters. There is a wide range of work in prompt tuning, from approaches that directly harness the backpropagated gradient signals from the model, to those employing black-box optimization such as reinforcement learning (RL) methods. Our primary focus is on RLPrompt, which aims to find optimal prompt tokens leveraging soft Q-learning. While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability. We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration. We extensively evaluate our approach across various tasks, including few-shot text classification, unsupervised text style transfer, and textual inversion from images. The results indicate a notable improvement over baselines, highlighting the efficacy of our approach in addressing the challenges of prompt tuning. Moreover, we show that the prompts discovered using our method are more natural and interpretable compared to those from other baselines.

[81] 2407.14735

ECRTime: Ensemble Integration of Classification and Retrieval for Time Series Classification

Deep learning-based methods for Time Series Classification (TSC) typically utilize deep networks to extract features, which are then processed through a combination of a Fully Connected (FC) layer and a SoftMax function. However, we have observed the phenomenon of inter-class similarity and intra-class inconsistency in the datasets from the UCR archive and further analyzed how this phenomenon adversely affects the "FC+SoftMax" paradigm. To address the issue, we introduce ECR, which, for the first time to our knowledge, applies deep learning-based retrieval algorithm to the TSC problem and integrates classification and retrieval models. Experimental results on 112 UCR datasets demonstrate that ECR is state-of-the-art(sota) compared to existing deep learning-based methods. Furthermore, we have developed a more precise classifier, ECRTime, which is an ensemble of ECR. ECRTime surpasses the currently most accurate deep learning classifier, InceptionTime, in terms of accuracy, achieving this with reduced training time and comparable scalability.

[82] 2407.14737

Early Detection of Coffee Leaf Rust Through Convolutional Neural Networks Trained on Low-Resolution Images

Coffee leaf rust, a foliar disease caused by the fungus Hemileia vastatrix, poses a major threat to coffee production, especially in Central America. Climate change further aggravates this issue, as it shortens the latency period between initial infection and the emergence of visible symptoms in diseases like leaf rust. Shortened latency periods can lead to more severe plant epidemics and faster spread of diseases. There is, hence, an urgent need for effective disease management strategies. To address these challenges, we explore the potential of deep learning models for enhancing early disease detection. However, deep learning models require extensive processing power and large amounts of data for model training, resources that are typically scarce. To overcome these barriers, we propose a preprocessing technique that involves convolving training images with a high-pass filter to enhance lesion-leaf contrast, significantly improving model efficacy in resource-limited environments. This method and our model demonstrated a strong performance, achieving over 90% across all evaluation metrics--including precision, recall, F1-score, and the Dice coefficient. Our experiments show that this approach outperforms other methods, including two different image preprocessing techniques and using unaltered, full-color images.

[83] 2407.14738

Flatness-aware Sequential Learning Generates Resilient Backdoors

Recently, backdoor attacks have become an emerging threat to the security of machine learning models. From the adversary's perspective, the implanted backdoors should be resistant to defensive algorithms, but some recently proposed fine-tuning defenses can remove these backdoors with notable efficacy. This is mainly due to the catastrophic forgetting (CF) property of deep neural networks. This paper counters CF of backdoors by leveraging continual learning (CL) techniques. We begin by investigating the connectivity between a backdoored and fine-tuned model in the loss landscape. Our analysis confirms that fine-tuning defenses, especially the more advanced ones, can easily push a poisoned model out of the backdoor regions, making it forget all about the backdoors. Based on this finding, we re-formulate backdoor training through the lens of CL and propose a novel framework, named Sequential Backdoor Learning (SBL), that can generate resilient backdoors. This framework separates the backdoor poisoning process into two tasks: the first task learns a backdoored model, while the second task, based on the CL principles, moves it to a backdoored region resistant to fine-tuning. We additionally propose to seek flatter backdoor regions via a sharpness-aware minimizer in the framework, further strengthening the durability of the implanted backdoor. Finally, we demonstrate the effectiveness of our method through extensive empirical experiments on several benchmark datasets in the backdoor domain. The source code is available at

[84] 2407.14740

Attention-based SIC Ordering and Power Allocation for Non-orthogonal Multiple Access Networks

Non-orthogonal multiple access (NOMA) emerges as a superior technology for enhancing spectral efficiency compared to orthogonal multiple access. In NOMA networks, successive interference cancellation (SIC) plays a crucial role in decoding user signals sequentially. The challenge lies in the joint optimization of SIC ordering and power allocation, due to the factorial nature of ordering combinations. This study introduces an innovative solution, the Attention-based SIC Ordering and Power Allocation (ASOPA) framework, targeting an uplink NOMA network with dynamic SIC ordering. ASOPA aims to maximize weighted proportional fairness by employing deep reinforcement learning, strategically decomposing the problem into two manageable subproblems: SIC ordering optimization and optimal power allocation. Our approach utilizes an attention-based neural network, which processes instantaneous channel gains and user weights to determine the SIC decoding sequence for each user. Once the SIC ordering is established, the power allocation subproblem transforms into a convex optimization problem, enabling efficient calculation. Extensive simulations validate ASOPA's efficacy, demonstrating a performance closely paralleling the exhaustive method, with over 97% confidence in normalized network utility. Notably, ASOPA maintains a low execution latency of approximately 50 milliseconds in a ten-user NOMA network, aligning with static SIC ordering algorithms. Furthermore, ASOPA demonstrates superior performance in various NOMA network configurations, including scenarios with imperfect channel state information, multiple base stations, and multiple-antenna setups. Such results underscore ASOPA's robustness and effectiveness, highlighting its ability to excel across various NOMA network environments. The complete source code for ASOPA is accessible at

[85] 2407.14741

Orthogonal Hyper-category Guided Multi-interest Elicitation for Micro-video Matching

Watching micro-videos is becoming a part of public daily life. Usually, user watching behaviors are thought to be rooted in their multiple different interests. In the paper, we propose a model named OPAL for micro-video matching, which elicits a user's multiple heterogeneous interests by disentangling multiple soft and hard interest embeddings from user interactions. Moreover, OPAL employs a two-stage training strategy, in which the pre-train is to generate soft interests from historical interactions under the guidance of orthogonal hyper-categories of micro-videos and the fine-tune is to reinforce the degree of disentanglement among the interests and learn the temporal evolution of each interest of each user. We conduct extensive experiments on two real-world datasets. The results show that OPAL not only returns diversified micro-videos but also outperforms six state-of-the-art models in terms of recall and hit rate.

[86] 2407.14742

Dynamic Color Assignment for Hierarchical Data

Assigning discriminable and harmonic colors to samples according to their class labels and spatial distribution can generate attractive visualizations and facilitate data exploration. However, as the number of classes increases, it is challenging to generate a high-quality color assignment result that accommodates all classes simultaneously. A practical solution is to organize classes into a hierarchy and then dynamically assign colors during exploration. However, existing color assignment methods fall short in generating high-quality color assignment results and dynamically aligning them with hierarchical structures. To address this issue, we develop a dynamic color assignment method for hierarchical data, which is formulated as a multi-objective optimization problem. This method simultaneously considers color discriminability, color harmony, and spatial distribution at each hierarchical level. By using the colors of parent classes to guide the color assignment of their child classes, our method further promotes both consistency and clarity across hierarchical levels. We demonstrate the effectiveness of our method in generating dynamic color assignment results with quantitative experiments and a user study.

[87] 2407.14743

Denoising Long- and Short-term Interests for Sequential Recommendation

User interests can be viewed over different time scales, mainly including stable long-term preferences and changing short-term intentions, and their combination facilitates the comprehensive sequential recommendation. However, existing work that focuses on different time scales of user modeling has ignored the negative effects of different time-scale noise, which hinders capturing actual user interests and cannot be resolved by conventional sequential denoising methods. In this paper, we propose a Long- and Short-term Interest Denoising Network (LSIDN), which employs different encoders and tailored denoising strategies to extract long- and short-term interests, respectively, achieving both comprehensive and robust user modeling. Specifically, we employ a session-level interest extraction and evolution strategy to avoid introducing inter-session behavioral noise into long-term interest modeling; we also adopt contrastive learning equipped with a homogeneous exchanging augmentation to alleviate the impact of unintentional behavioral noise on short-term interest modeling. Results of experiments on two public datasets show that LSIDN consistently outperforms state-of-the-art models and achieves significant robustness.

[88] 2407.14744

A Comprehensive Review of Few-shot Action Recognition

Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data in action recognition. It requires accurately classifying human actions in videos using only a few labeled examples per class. Compared to few-shot learning in image scenarios, few-shot action recognition is more challenging due to the intrinsic complexity of video data. Recognizing actions involves modeling intricate temporal sequences and extracting rich semantic information, which goes beyond mere human and object identification in each frame. Furthermore, the issue of intra-class variance becomes particularly pronounced with limited video samples, complicating the learning of representative features for novel action categories. To overcome these challenges, numerous approaches have driven significant advancements in few-shot action recognition, which underscores the need for a comprehensive survey. Unlike early surveys that focus on few-shot image or text classification, we deeply consider the unique challenges of few-shot action recognition. In this survey, we review a wide variety of recent methods and summarize the general framework. Additionally, the survey presents the commonly used benchmarks and discusses relevant advanced topics and promising future directions. We hope this survey can serve as a valuable resource for researchers, offering essential guidance to newcomers and stimulating seasoned researchers with fresh insights.

[89] 2407.14745

Randomized Radial Basis Function Neural Network for Solving Multiscale Elliptic Equations

To overcome these obstacles and improve computational accuracy and efficiency, this paper presents the Randomized Radial Basis Function Neural Network (RRNN), an innovative approach explicitly crafted for solving multiscale elliptic equations. The RRNN method commences by decomposing the computational domain into non-overlapping subdomains. Within each subdomain, the solution to the localized subproblem is approximated by a randomized radial basis function neural network with a Gaussian kernel. This network is distinguished by the random assignment of width and center coefficients for its activation functions, thereby rendering the training process focused solely on determining the weight coefficients of the output layer. For each subproblem, similar to the Petrov-Galerkin finite element method, a linear system will be formulated on the foundation of a weak formulation. Subsequently, a selection of collocation points is stochastically sampled at the boundaries of the subdomain, ensuring satisfying $C^0$ and $C^1$ continuity and boundary conditions to couple these localized solutions. The network is ultimately trained using the least squares method to ascertain the output layer weights. To validate the RRNN method's effectiveness, an extensive array of numerical experiments has been executed and the results demonstrate that the proposed method can improve the accuracy and efficiency well.

[90] 2407.14746

Difflare: Removing Image Lens Flare with Latent Diffusion Model

The recovery of high-quality images from images corrupted by lens flare presents a significant challenge in low-level vision. Contemporary deep learning methods frequently entail training a lens flare removing model from scratch. However, these methods, despite their noticeable success, fail to utilize the generative prior learned by pre-trained models, resulting in unsatisfactory performance in lens flare removal. Furthermore, there are only few works considering the physical priors relevant to flare removal. To address these issues, we introduce Difflare, a novel approach designed for lens flare removal. To leverage the generative prior learned by Pre-Trained Diffusion Models (PTDM), we introduce a trainable Structural Guidance Injection Module (SGIM) aimed at guiding the restoration process with PTDM. Towards more efficient training, we employ Difflare in the latent space. To address information loss resulting from latent compression and the stochastic sampling process of PTDM, we introduce an Adaptive Feature Fusion Module (AFFM), which incorporates the Luminance Gradient Prior (LGP) of lens flare to dynamically regulate feature extraction. Extensive experiments demonstrate that our proposed Difflare achieves state-of-the-art performance in real-world lens flare removal, restoring images corrupted by flare with improved fidelity and perceptual quality. The codes will be released soon.

[91] 2407.14749

Adaptive-Frequency Model Learning and Predictive Control for Dynamic Maneuvers on Legged Robots

Achieving both target accuracy and robustness in dynamic maneuvers with long flight phases, such as high or long jumps, has been a significant challenge for legged robots. To address this challenge, we propose a novel learning-based control approach consisting of model learning and model predictive control (MPC) utilizing an adaptive frequency scheme. Compared to existing MPC techniques, we learn a model directly from experiments, accounting not only for leg dynamics but also for modeling errors and unknown dynamics mismatch in hardware and during contact. Additionally, learning the model with adaptive frequency allows us to cover the entire flight phase and final jumping target, enhancing the prediction accuracy of the jumping trajectory. Using the learned model, we also design an adaptive-frequency MPC to effectively leverage different jumping phases and track the target accurately. In hardware experiments with a Unitree A1 robot, we demonstrate that our approach outperforms baseline MPC using a nominal model, reducing the jumping distance error up to 8 times. We achieve jumping distance errors of less than 3 percent during continuous jumping on uneven terrain with randomly-placed perturbations of random heights (up to 4 cm or 27 percent of the robot's standing height). Our approach obtains distance errors of 1-2 cm on 34 single and continuous jumps with different jumping targets and model uncertainties.

[92] 2407.14750

Understanding the Needs of Nonhuman Stakeholders in Design Process: An Overview of and Reflection on Methods

Design practice traditionally focused on human concerns, either overseeing the various effects of climate issues on nonhuman stakeholders or considering them as resources to address these problems. The climate crisis's urgency demands a design shift towards sustainability and inclusivity. This shift was happening through an emerging theme in design, More-Than-Human (MTH), which expands the notion of the user to animals, things, nature, and microbes. Such an expansion creates a requirement for designers to consider nonhuman perspectives during the design process. This paper investigates the methods used in MTH Design studies to explore and synthesize the perspectives of nonhuman users. Reviewing 30 papers, it highlights a predominant focus on animals and things over plants and microbes in MTH studies, along with a scarcity of synthesis methods. It identifies the necessity of tools that represent nonhumans with their relationships within larger ecosystems, and calls for increased attention to plants and microbes, emphasizing their vital role in sustainable environments and urging researchers to develop methods for understanding these species. By highlighting method strengths and weaknesses, it aims to guide designers and design researchers who plan to work with nonhuman users in selecting appropriate methods.

[93] 2407.14757

Enhancing Skin Disease Classification Leveraging Transformer-based Deep Learning Architectures and Explainable AI

Skin diseases affect over a third of the global population, yet their impact is often underestimated. Automating skin disease classification to assist doctors with their prognosis might be difficult. Nevertheless, due to efficient feature extraction pipelines, deep learning techniques have shown much promise for various tasks, including dermatological disease identification. This study uses a skin disease dataset with 31 classes and compares it with all versions of Vision Transformers, Swin Transformers and DivoV2. The analysis is also extended to compare with benchmark convolution-based architecture presented in the literature. Transfer learning with ImageNet1k weights on the skin disease dataset contributes to a high test accuracy of 96.48\% and an F1-Score of 0.9727 using DinoV2, which is almost a 10\% improvement over this data's current benchmark results. The performance of DinoV2 was also compared for the HAM10000 and Dermnet datasets to test the model's robustness, and the trained model overcomes the benchmark results by a slight margin in test accuracy and in F1-Score on the 23 and 7 class datasets. The results are substantiated using explainable AI frameworks like GradCAM and SHAP, which provide precise image locations to map the disease, assisting dermatologists in early detection, prompt prognosis, and treatment.

[94] 2407.14758

DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research, which poses requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls. In particular, DISCO incorporates differentiable scene representations of rich semantics in object and affordance, which is dynamically learned on the fly and facilitates navigation planning. Besides, we propose dual-level coarse-to-fine action controls leveraging both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO easily integrates into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we make comprehensive evaluations and demonstrate that DISCO outperforms the art by a sizable +8.6% success rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at

[95] 2407.14759

Autonomous Nonlinear Passive Transmit-Receive Switch for Compact IoT Devices: A Three-Port Agile Network

Recent advancements in RF technologies, especially Internet of Things (IoT) devices, require compact and integrated RF circulators or transmit-receive (TR) switches for efficient resource use. Although conventional techniques are crucial in managing signal flow to prevent signal interference to sensitive receiver components, they have some drawbacks, such as limited isolation, low switching speed, complex circuitry, bulkiness, and high cost. This work presents a smart, miniaturized, nonlinear TR switch capable of operating over a wide frequency range (0.8 - 1.3 GHz), making it suitable for IoT frequency bands. The switch achieves high isolation, low insertion loss, and intelligent transitions between transmitter and receiver without requiring external control or bias pins.

[96] 2407.14760

Multidirectional Pixelated Cubic Antenna with Enhanced Isolation for Vehicular Applications

This paper presents a pixelated cubic antenna design with enhanced isolation and diverse radiation pattern for vehicular applications. The design consists of four radiating patches to take advantage of a nearly omnidirectional radiation pattern with enhanced isolation and high gain. The antenna system with four patches has been pixelated and optimized simultaneously to achieve desired performance and high isolation at 5.4 GHz band. The antenna achieved measured isolation of more than -34 dB between antenna elements. The overall isolation improvement obtained by the antenna is about 18 dB compared to a configuration using standard patch antennas. Moreover, isolation improvement is achieved through patch pixelization without additional resonators or elements. The antenna achieved up to 6.9 dB realized gain in each direction. Additionally, the cubic antenna system is equipped with an E-shaped GPS antenna to facilitate connectivity with GPS satellite. Finally, the antenna performance has been investigated using a simulation model of the vehicle roof and roof rack. The reflection coefficient, isolation and radiation patterns of the antenna remains unaffected. The antenna prototype has been fabricated on Rogers substrate and measured to verify the simulation results. The measured results correlate well with the simulation results. The proposed antenna features low-profile, simple design for ease of manufacture, good radiation characteristics with multidirectional property and high isolation, which are well-suited to vehicular applications in different environments.

[97] 2407.14763

Efficient Design of a Pixelated Rectenna for WPT Applications

This paper introduces a highly efficient rectenna (rectifying antenna) using a binary optimization algorithm. A novel pixelated receiving antenna has been developed to match the diode impedance of a rectifier, eliminating the need for a separate matching circuit in the rectenna's rectifier. The receiving antenna configuration is fine-tuned via a binary optimization algorithm. A rectenna is designed using optimization algorithm at 2.5 GHz with 38% RF-DC conversion efficiency when subjected to 0 dBm incident power, with an output voltage of 815mV. The proposed rectenna demonstrates versatility across various low-power WPT (wireless power transfer) applications.

[98] 2407.14765

Data Augmentation in Graph Neural Networks: The Role of Generated Synthetic Graphs

Graphs are crucial for representing interrelated data and aiding predictive modeling by capturing complex relationships. Achieving high-quality graph representation is important for identifying linked patterns, leading to improvements in Graph Neural Networks (GNNs) to better capture data structures. However, challenges such as data scarcity, high collection costs, and ethical concerns limit progress. As a result, generative models and data augmentation have become more and more popular. This study explores using generated graphs for data augmentation, comparing the performance of combining generated graphs with real graphs, and examining the effect of different quantities of generated graphs on graph classification tasks. The experiments show that balancing scalability and quality requires different generators based on graph size. Our results introduce a new approach to graph data augmentation, ensuring consistent labels and enhancing classification performance.

[99] 2407.14766

Implementing Fairness: the view from a FairDream

In this paper, we propose an experimental investigation of the problem of AI fairness in classification. We train an AI model and develop our own fairness package FairDream to detect inequalities and then to correct for them, using income prediction as a case study. Our experiments show that it is a property of FairDream to fulfill fairness objectives which are conditional on the ground truth (Equalized Odds), even when the algorithm is set the task of equalizing positives across groups (Demographic Parity). While this may be seen as an anomaly, we explain this property by comparing our approach with a closely related fairness method (GridSearch), which can enforce Demographic Parity at the expense of Equalized Odds. We grant that a fairness metric conditioned on true labels does not give a sufficient criterion to reach fairness, but we argue that it gives us at least a necessary condition to implement Demographic Parity cautiously. We also explain why neither Equal Calibration nor Equal Precision stand as relevant fairness criteria in classification. Addressing their limitations to warn the decision-maker for any disadvantaging rate, Equalized Odds avoids the peril of strict conservatism, while keeping away the utopia of a whole redistribution of resources through algorithms.

[100] 2407.14767

I Need Help! Evaluating LLM's Ability to Ask for Users' Support: A Case Study on Text-to-SQL Generation

In this study, we explore the proactive ability of LLMs to seek user support, using text-to-SQL generation as a case study. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help and examine their performance with varying levels of information availability. Our experiments reveal that without external feedback, many LLMs struggle to recognize their need for additional support. Our findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies.

[101] 2407.14768

Teach Harder, Learn Poorer: Rethinking Hard Sample Distillation for GNN-to-MLP Knowledge Distillation

To bridge the gaps between powerful Graph Neural Networks (GNNs) and lightweight Multi-Layer Perceptron (MLPs), GNN-to-MLP Knowledge Distillation (KD) proposes to distill knowledge from a well-trained teacher GNN into a student MLP. In this paper, we revisit the knowledge samples (nodes) in teacher GNNs from the perspective of hardness, and identify that hard sample distillation may be a major performance bottleneck of existing graph KD algorithms. The GNN-to-MLP KD involves two different types of hardness, one student-free knowledge hardness describing the inherent complexity of GNN knowledge, and the other student-dependent distillation hardness describing the difficulty of teacher-to-student distillation. However, most of the existing work focuses on only one of these aspects or regards them as one thing. This paper proposes a simple yet effective Hardness-aware GNN-to-MLP Distillation (HGMD) framework, which decouples the two hardnesses and estimates them using a non-parametric approach. Finally, two hardness-aware distillation schemes (i.e., HGMD-weight and HGMD-mixup) are further proposed to distill hardness-aware knowledge from teacher GNNs into the corresponding nodes of student MLPs. As non-parametric distillation, HGMD does not involve any additional learnable parameters beyond the student MLPs, but it still outperforms most of the state-of-the-art competitors. HGMD-mixup improves over the vanilla MLPs by 12.95% and outperforms its teacher GNNs by 2.48% averaged over seven real-world datasets.

[102] 2407.14769

A Two-Phase Visualization System for Continuous Human-AI Collaboration in Sequelae Analysis and Modeling

In healthcare, AI techniques are widely used for tasks like risk assessment and anomaly detection. Despite AI's potential as a valuable assistant, its role in complex medical data analysis often oversimplifies human-AI collaboration dynamics. To address this, we collaborated with a local hospital, engaging six physicians and one data scientist in a formative study. From this collaboration, we propose a framework integrating two-phase interactive visualization systems: one for Human-Led, AI-Assisted Retrospective Analysis and another for AI-Mediated, Human-Reviewed Iterative Modeling. This framework aims to enhance understanding and discussion around effective human-AI collaboration in healthcare.

[103] 2407.14770

SLInterpreter: An Exploratory and Iterative Human-AI Collaborative System for GNN-based Synthetic Lethal Prediction

Synthetic Lethal (SL) relationships, though rare among the vast array of gene combinations, hold substantial promise for targeted cancer therapy. Despite advancements in AI model accuracy, there is still a significant need among domain experts for interpretive paths and mechanism explorations that align better with domain-specific knowledge, particularly due to the high costs of experimentation. To address this gap, we propose an iterative Human-AI collaborative framework with two key components: 1) Human-Engaged Knowledge Graph Refinement based on Metapath Strategies, which leverages insights from interpretive paths and domain expertise to refine the knowledge graph through metapath strategies with appropriate granularity. 2) Cross-Granularity SL Interpretation Enhancement and Mechanism Analysis, which aids experts in organizing and comparing predictions and interpretive paths across different granularities, uncovering new SL relationships, enhancing result interpretation, and elucidating potential mechanisms inferred by Graph Neural Network (GNN) models. These components cyclically optimize model predictions and mechanism explorations, enhancing expert involvement and intervention to build trust. Facilitated by SLInterpreter, this framework ensures that newly generated interpretive paths increasingly align with domain knowledge and adhere more closely to real-world biological principles through iterative Human-AI collaboration. We evaluate the framework's efficacy through a case study and expert interviews.

[104] 2407.14772

Subgraph Clustering and Atom Learning for Improved Image Classification

In this study, we present the Graph Sub-Graph Network (GSN), a novel hybrid image classification model merging the strengths of Convolutional Neural Networks (CNNs) for feature extraction and Graph Neural Networks (GNNs) for structural modeling. GSN employs k-means clustering to group graph nodes into clusters, facilitating the creation of subgraphs. These subgraphs are then utilized to learn representative `atoms` for dictionary learning, enabling the identification of sparse, class-distinguishable features. This integrated approach is particularly relevant in domains like medical imaging, where discerning subtle feature differences is crucial for accurate classification. To evaluate the performance of our proposed GSN, we conducted experiments on benchmark datasets, including PascalVOC and HAM10000. Our results demonstrate the efficacy of our model in optimizing dictionary configurations across varied classes, which contributes to its effectiveness in medical classification tasks. This performance enhancement is primarily attributed to the integration of CNNs, GNNs, and graph learning techniques, which collectively improve the handling of datasets with limited labeled examples. Specifically, our experiments show that the model achieves a higher accuracy on benchmark datasets such as Pascal VOC and HAM10000 compared to conventional CNN approaches.

[105] 2407.14774

Intelligent Artistic Typography: A Comprehensive Review of Artistic Text Design and Generation

Artistic text generation aims to amplify the aesthetic qualities of text while maintaining readability. It can make the text more attractive and better convey its expression, thus enjoying a wide range of application scenarios such as social media display, consumer electronics, fashion, and graphic design. Artistic text generation includes artistic text stylization and semantic typography. Artistic text stylization concentrates on the text effect overlaid upon the text, such as shadows, outlines, colors, glows, and textures. By comparison, semantic typography focuses on the deformation of the characters to strengthen their visual representation by mimicking the semantic understanding within the text. This overview paper provides an introduction to both artistic text stylization and semantic typography, including the taxonomy, the key ideas of representative methods, and the applications in static and dynamic artistic text generation. Furthermore, the dataset and evaluation metrics are introduced, and the future directions of artistic text generation are discussed. A comprehensive list of artistic text generation models studied in this review is available at

[106] 2407.14775

Phase Re-service in Reinforcement Learning Traffic Signal Control

This article proposes a novel approach to traffic signal control that combines phase re-service with reinforcement learning (RL). The RL agent directly determines the duration of the next phase in a pre-defined sequence. Before the RL agent's decision is executed, we use the shock wave theory to estimate queue expansion at the designated movement allowed for re-service and decide if phase re-service is necessary. If necessary, a temporary phase re-service is inserted before the next regular phase. We formulate the RL problem as a semi-Markov decision process (SMDP) and solve it with proximal policy optimization (PPO). We conducted a series of experiments that showed significant improvements thanks to the introduction of phase re-service. Vehicle delays are reduced by up to 29.95% of the average and up to 59.21% of the standard deviation. The number of stops is reduced by 26.05% on average with 45.77% less standard deviation.

[107] 2407.14779

Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach

Our research investigates the impact of Generative Artificial Intelligence (GAI) models, specifically text-to-image generators (T2Is), on the representation of non-Western cultures, with a focus on Indian contexts. Despite the transformative potential of T2Is in content creation, concerns have arisen regarding biases that may lead to misrepresentations and marginalizations. Through a community-centered approach and grounded theory analysis of 5 focus groups from diverse Indian subcultures, we explore how T2I outputs to English prompts depict Indian culture and its subcultures, uncovering novel representational harms such as exoticism and cultural misappropriation. These findings highlight the urgent need for inclusive and culturally sensitive T2I systems. We propose design guidelines informed by a sociotechnical perspective, aiming to address these issues and contribute to the development of more equitable and representative GAI technologies globally. Our work also underscores the necessity of adopting a community-centered approach to comprehend the sociotechnical dynamics of these models, complementing existing work in this space while identifying and addressing the potential negative repercussions and harms that may arise when these models are deployed on a global scale.

[108] 2407.14783

VisFly: An Efficient and Versatile Simulator for Training Vision-based Flight

State-of-art simulators primarily focus on providing full-stack simulation tools or state-only parallelizability. Due to the limitation of computing resources, they have to make trade-off among photo-realism and sampling efficiency. Yet, both factors are crucial for data-driven reinforcement learning tasks. Therefore, we introduce a both rapid-rendering and photo-realistic quadrotor simulator: VisFly. VisFly offers a user-friendly framework and interfaces for users to develop or utilize. It couples differentiable dynamics and habitat-sim rendering engines, reaching frame rate of up to 10000 frame per second in cluttered environments. The simulation is wrapped as a gym environment, facilitating convenient implementation of various baseline learning algorithms. It can directly import all the open-source scene datasets compatible with habitat-sim, which provides more fair benchmarks for comparing the intelligent policy. VisFly presents a general policy architecture for tasks, and the whole framework is verified by three regular quadrotor tasks with visual observation. We will make this tool available at \url{}.

[109] 2407.14785

Stochastic Online Metric Matching: Adversarial is no Harder than Stochastic

We study the stochastic online metric matching problem. In this problem, $m$ servers and $n$ requests are located in a metric space, where all servers are available upfront and requests arrive one at a time. In particular, servers are adversarially chosen, and requests are independently drawn from a known distribution. Upon the arrival of a new request, it needs to be immediately and irrevocably matched to a free server, resulting in a cost of their distance. The objective is to minimize the total matching cost. In this paper, we show that the problem can be reduced to a more accessible setting where both servers and requests are drawn from the same distribution by incurring a moderate cost. Combining our reduction with previous techniques, for $[0, 1]^d$ with various choices of distributions, we achieve improved competitive ratios and nearly optimal regrets in both balanced and unbalanced markets. In particular, we give $O(1)$-competitive algorithms for $d \geq 3$ in both balanced and unbalanced markets with smooth distributions. Our algorithms improve on the $O((\log \log \log n)^2)$ competitive ratio of \cite{DBLP:conf/icalp/GuptaGPW19} for balanced markets in various regimes, and provide the first positive results for unbalanced markets.

[110] 2407.14788

On the Design and Analysis of LLM-Based Algorithms

We initiate a formal investigation into the design and analysis of LLM-based algorithms, i.e. algorithms that contain one or multiple calls of large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While LLM-based algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agent systems and compound AI systems, have achieved remarkable empirical success, the design and optimization of them have mostly relied on heuristics and trial-and-errors, which is largely due to a lack of formal and analytical study for these algorithms. To fill this gap, we start by identifying the computational-graph representation of LLM-based algorithms, the design principle of task decomposition, and some key abstractions, which then facilitate our formal analysis for the accuracy and efficiency of LLM-based algorithms, despite the black-box nature of LLMs. We further consider parallel decomposition for a case study, providing extensive analytical and empirical study for four concrete examples of this pattern. Our proposed framework holds promise for advancing LLM-based algorithms, by revealing the reasons behind curious empirical phenomena, guiding the choices of hyperparameters, predicting the empirical performance of algorithms, and inspiring new algorithm design. To promote further study of LLM-based algorithms, we release our source code at

[111] 2407.14789

PERCORE: A Deep Learning-Based Framework for Persian Spelling Correction with Phonetic Analysis

This research introduces a state-of-the-art Persian spelling correction system that seamlessly integrates deep learning techniques with phonetic analysis, significantly enhancing the accuracy and efficiency of natural language processing (NLP) for Persian. Utilizing a fine-tuned language representation model, our methodology effectively combines deep contextual analysis with phonetic insights, adeptly correcting both non-word and real-word spelling errors. This strategy proves particularly effective in tackling the unique complexities of Persian spelling, including its elaborate morphology and the challenge of homophony. A thorough evaluation on a wide-ranging dataset confirms our system's superior performance compared to existing methods, with impressive F1-Scores of 0.890 for detecting real-word errors and 0.905 for correcting them. Additionally, the system demonstrates a strong capability in non-word error correction, achieving an F1-Score of 0.891. These results illustrate the significant benefits of incorporating phonetic insights into deep learning models for spelling correction. Our contributions not only advance Persian language processing by providing a versatile solution for a variety of NLP applications but also pave the way for future research in the field, emphasizing the critical role of phonetic analysis in developing effective spelling correction system.

[112] 2407.14790

Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?

Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate the reasoning capability of a model which can then guide us to improve the reasoning ability of models. However, most existing works evaluate only the final predicted answer of a puzzle, without delving into an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or providing any finer metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures, for accurately evaluating the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles with different complexities. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2. Then, we develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains. Evaluating reasoning chains from LLMs leads to several interesting findings. We further show that existing prompting methods used for enhancing models' reasoning abilities do not improve performance on GridPuzzle. This highlights the importance of understanding fine-grained errors and presents a challenge for future research to enhance LLMs' puzzle-solving abilities by developing methods that address these errors. Data and source code are available at

[113] 2407.14792

FedPartWhole: Federated domain generalization via consistent part-whole hierarchies

Federated Domain Generalization (FedDG), aims to tackle the challenge of generalizing to unseen domains at test time while catering to the data privacy constraints that prevent centralized data storage from different domains originating at various clients. Existing approaches can be broadly categorized into four groups: domain alignment, data manipulation, learning strategies, and optimization of model aggregation weights. This paper proposes a novel approach to Federated Domain Generalization that tackles the problem from the perspective of the backbone model architecture. The core principle is that objects, even under substantial domain shifts and appearance variations, maintain a consistent hierarchical structure of parts and wholes. For instance, a photograph and a sketch of a dog share the same hierarchical organization, consisting of a head, body, limbs, and so on. The introduced architecture explicitly incorporates a feature representation for the image parse tree. To the best of our knowledge, this is the first work to tackle Federated Domain Generalization from a model architecture standpoint. Our approach outperforms a convolutional architecture of comparable size by over 12\%, despite utilizing fewer parameters. Additionally, it is inherently interpretable, contrary to the black-box nature of CNNs, which fosters trust in its predictions, a crucial asset in federated learning.

[114] 2407.14793

QoS Aware Mixed-Criticality Task Scheduling in Vehicular Edge Cloud System

Modern-day cars are equipped with numerous cameras and sensors, typically integrated with advanced decision-control systems that enable the vehicle to perceive its surroundings and navigate autonomously. Efficient processing of data from sensors, lidars, radars and cameras is quite computationally intensive and can not be done with good accuracy using less capable onboard resources. In order to deal with this problem, some computation requirements (also referred as tasks) are offloaded to infrastructure or executed in parallel in both autonomous vehicle (AV) and infrastructure to enhance accuracy. The infrastructure comprises base stations, a centralized cloud, and a CS. Base stations (BSs) execute tasks in collaboration with a significantly more powerful centralized cloud, while the centralised scheduler (CS) centrally schedules all the tasks. The base station receives tasks from multiple AVs, each with varying deadlines, criticality, and locations. Our main goal is to maximize the profit of the infrastructure by (a) minimizing the number of drop tasks, (b) minimizing the distance cost for task offloading, and (c) minimizing the energy usage of BSs. In this work, we proposed efficient approaches to schedule the collection of tasks to the BSs, by employing a hybrid scheduling approach where tasks from AVs get allocated to nearby base stations if the nearby BSs are lightly loaded, otherwise AVs send the task to CS for allocation. The CS maximizes the profit by following strategies: (a) selection of BS considering distance and energy consumption, (b) when task load is moderate or low, highly critical tasks run at favourable utilisation, and (c) low-critical tasks are dropped to free up resources for executing high-critical tasks. Based on our experiments, proposed approaches improved the QoS provided by up to 25% compared to the state-of-the-art approach in real-life datasets.

[115] 2407.14795

Automatic Real-word Error Correction in Persian Text

Automatic spelling correction stands as a pivotal challenge within the ambit of natural language processing (NLP), demanding nuanced solutions. Traditional spelling correction techniques are typically only capable of detecting and correcting non-word errors, such as typos and misspellings. However, context-sensitive errors, also known as real-word errors, are more challenging to detect because they are valid words that are used incorrectly in a given context. The Persian language, characterized by its rich morphology and complex syntax, presents formidable challenges to automatic spelling correction systems. Furthermore, the limited availability of Persian language resources makes it difficult to train effective spelling correction models. This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text. Our methodology adopts a structured, multi-tiered approach, employing semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy. The innovative architecture discovers and stores semantic similarities between words and phrases in Persian text. The classifiers accurately identify real-word errors, while the semantic ranking algorithm determines the most probable corrections for real-word errors, taking into account specific spelling correction and context properties such as context, semantic similarity, and edit-distance measures. Evaluations have demonstrated that our proposed method surpasses previous Persian real-word error correction models. Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase. These results clearly indicate that our approach is a highly promising solution for automatic real-word error correction in Persian text.

[116] 2407.14796

PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates

Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which to design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code is available at

[117] 2407.14797

From Underground Mines to Offices: A Versatile and Robust Framework for Range-Inertial SLAM

Simultaneous Localization and Mapping (SLAM) is an essential component of autonomous robotic applications and self-driving vehicles, enabling them to understand and operate in their environment. Many SLAM systems have been proposed in the last decade, but they are often complex to adapt to different settings or sensor setups. In this work, we present LiDAR Graph-SLAM (LG-SLAM), a versatile range-inertial SLAM framework that can be adapted to different types of sensors and environments, from underground mines to offices with minimal parameter tuning. Our system integrates range, inertial and GNSS measurements into a graph-based optimization framework. We also use a refined submap management approach and a robust loop closure method that effectively accounts for uncertainty in the identification and validation of putative loop closures, ensuring global consistency and robustness. Enabled by a parallelized architecture and GPU integration, our system achieves pose estimation at LiDAR frame rate, along with online loop closing and graph optimization. We validate our system in diverse environments using public datasets and real-world data, consistently achieving an average error below 20 cm and outperforming other state-of-the-art algorithms.

[118] 2407.14799

FairViT: Fair Vision Transformer via Adaptive Masking

Vision Transformer (ViT) has achieved excellent performance and demonstrated its promising potential in various computer vision tasks. The wide deployment of ViT in real-world tasks requires a thorough understanding of the societal impact of the model. However, most ViT-based works do not take fairness into account and it is unclear whether directly applying CNN-oriented debiased algorithm to ViT is feasible. Moreover, previous works typically sacrifice accuracy for fairness. Therefore, we aim to develop an algorithm that improves accuracy without sacrificing fairness. In this paper, we propose FairViT, a novel accurate and fair ViT framework. To this end, we introduce a novel distance loss and deploy adaptive fairness-aware masks on attention layers updating with model parameters. Experimental results show \sys can achieve accuracy better than other alternatives, even with competitive computational efficiency. Furthermore, \sys achieves appreciable fairness results.

[119] 2407.14801

SquareSort: a cache-oblivious sorting algorithm

In this paper we consider sorting in the cache-oblivious model of Frigo, Leiserson, Prokop, and Ramachandran (1999). We introduce a new simple sorting algorithm in that model which has asymptotically optimal IO complexity $O(\frac{n}{B} \log_{M/B} n)$, where $n$ is the instance size, $M$ size of the cache and $B$ size of a memory block. This is the same as the complexity of the best known cache-oblivious sorting algorithm FunnelSort.

[120] 2407.14804

WiFaKey: Generating Cryptographic Keys from Face in the Wild

Deriving a unique cryptographic key from biometric measurements is a challenging task due to the existing noise gap between the biometric measurements and error correction coding. Additionally, privacy and security concerns arise as biometric measurements are inherently linked to the user. Biocryptosystems represent a key branch of solutions aimed at addressing these issues. However, many existing bio-cryptosystems rely on handcrafted feature extractors and error correction codes (ECC), often leading to performance degradation. To address these challenges and improve the reliability of biometric measurements, we propose a novel biometric cryptosystem named WiFaKey, for generating cryptographic keys from face in unconstrained settings. Speciffcally, WiFaKey ffrst introduces an adaptive random masking-driven feature transformation pipeline, AdaMTrans. AdaMTrans effectively quantizes and binarizes realvalued features and incorporates an adaptive random masking scheme to align the bit error rate with error correction requirements, thereby mitigating the noise gap. Besides, WiFaKey incorporates a supervised learning-based neural decoding scheme called Neural-MS decoder, which delivers a more robust error correction performance with less iteration than non-learning decoders, thereby alleviating the performance degradation. We evaluated WiFaKey using widely adopted face feature extractors on six large unconstrained and two constrained datasets. On the LFW dataset, WiFaKey achieved an average Genuine Match Rate of 85.45% and 85.20% at a 0% False Match Rate for MagFace and AdaFace features, respectively. Our comprehensive comparative analysis shows a signiffcant performance improvement of WiFaKey. The source code of our work is available at

[121] 2407.14811

Decoupled Prompt-Adapter Tuning for Continual Activity Recognition

Action recognition technology plays a vital role in enhancing security through surveillance systems, enabling better patient monitoring in healthcare, providing in-depth performance analysis in sports, and facilitating seamless human-AI collaboration in domains such as manufacturing and assistive technologies. The dynamic nature of data in these areas underscores the need for models that can continuously adapt to new video data without losing previously acquired knowledge, highlighting the critical role of advanced continual action recognition. To address these challenges, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatial-temporal information and learnable prompts for mitigating catastrophic forgetting through a decoupled training strategy. DPAT uniquely balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, effectively addressing the challenge of maintaining model performance amidst continuous data evolution without necessitating extensive finetuning. DPAT consistently achieves state-of-the-art performance across several challenging action recognition benchmarks, thus demonstrating the effectiveness of our model in the domain of continual action recognition.

[122] 2407.14812

GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition

Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Existing appearance-based methods utilize CNN or Transformer to extract spatial and temporal features from silhouettes, while model-based methods employ GCN to focus on the special topological structure of skeleton points. However, the quality of silhouettes is limited by complex occlusions, and skeletons lack dense semantic features of the human body. To tackle these problems, we propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA), which effectively combines two modalities to obtain a more robust and comprehensive gait representation for recognition. First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors. Second, a co-attention alignment module is proposed to align the features by element-wise attention. Finally, we propose a mutual learning module, which achieves feature fusion through cross-attention, Wasserstein loss is further introduced to ensure the effective fusion of two modalities. Extensive experimental results demonstrate the superiority of our model on Gait3D, OU-MVLP, and CASIA-B.

[123] 2407.14813

Ultraspherical Spectral Method for Block Copolymer Systems on Unit Disk

In this paper, we investigate ultraspherical spectral method for the Ohta-Kawasaki (OK) and Nakazawa-Ohta (NO) models in the disk domain, representing diblock and triblock copolymer systems, respectively. We employ ultraspherical spectral discretization for spatial variables in the disk domain and apply the second-order backward differentiation formula (BDF) method for temporal discretization. To our best knowledge, this is the first study to develop a numerical method for diblock and triblock copolymer systems with long-range interactions in disk domains. We show the energy stability of the numerical method in both semi-discrete and fully-discrete discretizations. In our numerical experiments, we verify the second-order temporal convergence rate and the energy stability of the proposed methods. Our numerical results show that the coarsening dynamics in diblock copolymers lead to bubble assemblies both inside and on the boundary of the disk. Additionally, in the triblock copolymer system, we observe several novel pattern formations, including single and double bubble assemblies in the unit disk. These findings are detailed through extensive numerical experiments.

[124] 2407.14814

FMamba: Mamba based on Fast-attention for Multivariate Time-series Forecasting

In multivariate time-series forecasting (MTSF), extracting the temporal correlations of the input sequences is crucial. While popular Transformer-based predictive models can perform well, their quadratic computational complexity results in inefficiency and high overhead. The recently emerged Mamba, a selective state space model, has shown promising results in many fields due to its strong temporal feature extraction capabilities and linear computational complexity. However, due to the unilateral nature of Mamba, channel-independent predictive models based on Mamba cannot attend to the relationships among all variables in the manner of Transformer-based models. To address this issue, we combine fast-attention with Mamba to introduce a novel framework named FMamba for MTSF. Technically, we first extract the temporal features of the input variables through an embedding layer, then compute the dependencies among input variables via the fast-attention module. Subsequently, we use Mamba to selectively deal with the input features and further extract the temporal dependencies of the variables through the multi-layer perceptron block (MLP-block). Finally, FMamba obtains the predictive results through the projector, a linear layer. Experimental results on eight public datasets demonstrate that FMamba can achieve state-of-the-art performance while maintaining low computational overhead.

[125] 2407.14815

Unified Far-Field and Near-Field in Holographic MIMO: A Wavenumber-Domain Perspective

This article conceives a unified representation for near-field and far-field holographic multiple-input multiple-output (HMIMO) channels, addressing a practical design dilemma: "Why does the angular-domain representation no longer function effectively?" To answer this question, we pivot from the angular domain to the wavenumber domain and present a succinct overview of its underlying philosophy. In re-examining the Fourier plane-wave series expansion that recasts spherical propagation waves into a series of plane waves represented by Fourier harmonics, we characterize the HMIMO channel employing these Fourier harmonics having different wavenumbers. This approach, referred to as the wavenumebr-domain representation, facilitates a unified view across the far-field and the near-field. Furthermore, the limitations of the DFT basis are demonstrated when identifying the sparsity inherent to the HMIMO channel, motivating the development of a wavenumber-domain basis as an alternative. We then present some preliminary applications of the proposed wavenumber-domain basis in signal processing across both the far-field and near-field, along with several prospects for future HMIMO system designs based on the wavenumber domain.

[126] 2407.14816

Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

Blind image deconvolution (BID) is a classic yet challenging problem in the field of image processing. Recent advances in deep image prior (DIP) have motivated a series of DIP-based approaches, demonstrating remarkable success in BID. However, due to the high non-convexity of the inherent optimization process, these methods are notorious for their sensitivity to the initialized kernel. To alleviate this issue and further improve their performance, we propose a new framework for BID that better considers the prior modeling and the initialization for blur kernels, leveraging a deep generative model. The proposed approach pre-trains a generative adversarial network-based kernel generator that aptly characterizes the kernel priors and a kernel initializer that facilitates a well-informed initialization for the blur kernel through latent space encoding. With the pre-trained kernel generator and initializer, one can obtain a high-quality initialization of the blur kernel, and enable optimization within a compact latent kernel manifold. Such a framework results in an evident performance improvement over existing DIP-based BID methods. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method.

[127] 2407.14822

Text Style Transfer: An Introductory Overview

Text Style Transfer (TST) is a pivotal task in natural language generation to manipulate text style attributes while preserving style-independent content. The attributes targeted in TST can vary widely, including politeness, authorship, mitigation of offensive language, modification of feelings, and adjustment of text formality. TST has become a widely researched topic with substantial advancements in recent years. This paper provides an introductory overview of TST, addressing its challenges, existing approaches, datasets, evaluation measures, subtasks, and applications. This fundamental overview improves understanding of the background and fundamentals of text style transfer.

[128] 2407.14823

CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation

In recent years, as computer vision tasks have increasingly relied on high-quality image inputs, the task of image dehazing has received significant attention. Previously, many methods based on priors and deep learning have been proposed to address the task of image dehazing. Ignoring the domain gap between different data, former de-hazing methods usually adopt multiple datasets for explicit training, which often makes the methods themselves be violated. To address this problem, we propose a novel method of internal and external data augmentation to improve the existing dehazing methodology. By using cross-data external augmentor. The dataset inherits samples from different domains that are firmly aligned, making the model learn more robust and generalizable features. By using the internal data augmentation method, the model can fully exploit local information within the images, thereby obtaining more image details. To demonstrate the effectiveness of our proposed method, we conduct training on both the Natural Image Dataset (NID) and the Remote Sensing Image Dataset (RSID). Experimental results show that our method clearly resolves the domain gap in different dehazing datasets and presents a new pipeline for joint training in the dehazing task. Our approach significantly outperforms other advanced methods in dehazing and produces dehazed images that are closest to real haze-free images. The code will be available at:

[129] 2407.14829

Overview of AI-Debater 2023: The Challenges of Argument Generation Tasks

In this paper we present the results of the AI-Debater 2023 Challenge held by the Chinese Conference on Affect Computing (CCAC 2023), and introduce the related datasets. We organize two tracks to handle the argumentative generation tasks in different scenarios, namely, Counter-Argument Generation (Track 1) and Claim-based Argument Generation (Track 2). Each track is equipped with its distinct dataset and baseline model respectively. In total, 32 competing teams register for the challenge, from which we received 11 successful submissions. In this paper, we will present the results of the challenge and a summary of the systems, highlighting commonalities and innovations among participating systems. Datasets and baseline models of the AI-Debater 2023 Challenge have been already released and can be accessed through the official website of the challenge.

[130] 2407.14831

Toward Efficient Convolutional Neural Networks With Structured Ternary Patterns

High-efficiency deep learning (DL) models are necessary not only to facilitate their use in devices with limited resources but also to improve resources required for training. Convolutional neural networks (ConvNets) typically exert severe demands on local device resources and this conventionally limits their adoption within mobile and embedded platforms. This brief presents work toward utilizing static convolutional filters generated from the space of local binary patterns (LBPs) and Haar features to design efficient ConvNet architectures. These are referred to as Structured Ternary Patterns (STePs) and can be generated during network initialization in a systematic way instead of having learnable weight parameters thus reducing the total weight updates. The ternary values require significantly less storage and with the appropriate low-level implementation, can also lead to inference improvements. The proposed approach is validated using four image classification datasets, demonstrating that common network backbones can be made more efficient and provide competitive results. It is also demonstrated that it is possible to generate completely custom STeP-based networks that provide good trade-offs for on-device applications such as unmanned aerial vehicle (UAV)-based aerial vehicle detection. The experimental results show that the proposed method maintains high detection accuracy while reducing the trainable parameters by 40-80%. This work motivates further research toward good priors for non-learnable weights that can make DL architectures more efficient without having to alter the network during or after training.

[131] 2407.14833

SpatialTouch: Exploring Spatial Data Visualizations in Cross-reality

We propose and study a novel cross-reality environment that seamlessly integrates a monoscopic 2D surface (an interactive screen with touch and pen input) with a stereoscopic 3D space (an augmented reality HMD) to jointly host spatial data visualizations. This innovative approach combines the best of two conventional methods of displaying and manipulating spatial 3D data, enabling users to fluidly explore diverse visual forms using tailored interaction techniques. Providing such effective 3D data exploration techniques is pivotal for conveying its intricate spatial structures -- often at multiple spatial or semantic scales -- across various application domains and requiring diverse visual representations for effective visualization. To understand user reactions to our new environment, we began with an elicitation user study, in which we captured their responses and interactions. We observed that users adapted their interaction approaches based on perceived visual representations, with natural transitions in spatial awareness and actions while navigating across the physical surface. Our findings then informed the development of a design space for spatial data exploration in cross-reality. We thus developed cross-reality environments tailored to three distinct domains: for 3D molecular structure data, for 3D point cloud data, and for 3D anatomical data. In particular, we designed interaction techniques that account for the inherent features of interactions in both spaces, facilitating various forms of interaction, including mid-air gestures, touch interactions, pen interactions, and combinations thereof, to enhance the users' sense of presence and engagement. We assessed the usability of our environment with biologists, focusing on its use for domain research. In addition, we evaluated our interaction transition designs with virtual and mixed-reality experts to gather further insights.

[132] 2407.14834

Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore if leveraging the combined knowledge base of VLMs and LLM can effectively deduce actions from a video when presented with only a few selectively important frames and minimal temporal information. Our experiments demonstrate that LLM, when coordinating different VLMs, can successfully recognize patterns and deduce actions in various scenarios despite the weak temporal signals. However, our findings suggest that to enhance this approach as a viable alternative solution, integrating a stronger temporal signal and exposing the models to slightly more frames would be beneficial.

[133] 2407.14838

Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection

The rapid growth of Decentralized Finance (DeFi) has been accompanied by substantial financial losses due to smart contract vulnerabilities, underscoring the critical need for effective security auditing. With attacks becoming more frequent, the necessity and demand for auditing services has escalated. This especially creates a financial burden for independent developers and small businesses, who often have limited available funding for these services. Our study builds upon existing frameworks by integrating Retrieval-Augmented Generation (RAG) with large language models (LLMs), specifically employing GPT-4-1106 for its 128k token context window. We construct a vector store of 830 known vulnerable contracts, leveraging Pinecone for vector storage, OpenAI's text-embedding-ada-002 for embeddings, and LangChain to construct the RAG-LLM pipeline. Prompts were designed to provide a binary answer for vulnerability detection. We first test 52 smart contracts 40 times each against a provided vulnerability type, verifying the replicability and consistency of the RAG-LLM. Encouraging results were observed, with a 62.7% success rate in guided detection of vulnerabilities. Second, we challenge the model under a "blind" audit setup, without the vulnerability type provided in the prompt, wherein 219 contracts undergo 40 tests each. This setup evaluates the general vulnerability detection capabilities without hinted context assistance. Under these conditions, a 60.71% success rate was observed. While the results are promising, we still emphasize the need for human auditing at this time. We provide this study as a proof of concept for a cost-effective smart contract auditing process, moving towards democratic access to security.

[134] 2407.14841

Text-based Talking Video Editing with Cascaded Conditional Diffusion

Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because of \textbf{1)} generalizable talking-face representation, \textbf{2)} seamless audio-visual transitions, and \textbf{3)} identity-preserved talking faces. Previous works either require minutes of talking-face video training data and expensive test-time optimization for customized talking video editing or directly generate a video sequence without considering in-context information, leading to a poor generalizable representation, or incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. \textit{\textbf{In the first stage}}, we first propose a dynamic weighted in-context diffusion module to synthesize dense-landmark motions given an edited audio. \textit{\textbf{In the second stage}}, we introduce a warping-guided conditional diffusion module. The module first interpolates between the start and end frames of the editing interval to generate smooth intermediate frames. Then, with the help of the audio-to-dense motion images, these intermediate frames are warped to obtain coarse intermediate frames. Conditioned on the warped intermedia frames, a diffusion model is adopted to generate detailed and high-resolution target frames, which guarantees coherent and identity-preserved transitions. The cascaded conditional diffusion model decomposes the complex talking editing task into two flexible generation tasks, which provides a generalizable talking-face representation, seamless audio-visual transitions, and identity-preserved faces on a small dataset. Experiments show the effectiveness and superiority of the proposed method.

[135] 2407.14843

A Tale of Two Scales: Reconciling Horizontal and Vertical Scaling for Inference Serving Systems

Inference serving is of great importance in deploying machine learning models in real-world applications, ensuring efficient processing and quick responses to inference requests. However, managing resources in these systems poses significant challenges, particularly in maintaining performance under varying and unpredictable workloads. Two primary scaling strategies, horizontal and vertical scaling, offer different advantages and limitations. Horizontal scaling adds more instances to handle increased loads but can suffer from cold start issues and increased management complexity. Vertical scaling boosts the capacity of existing instances, allowing for quicker responses but is limited by hardware and model parallelization capabilities. This paper introduces Themis, a system designed to leverage the benefits of both horizontal and vertical scaling in inference serving systems. Themis employs a two-stage autoscaling strategy: initially using in-place vertical scaling to handle workload surges and then switching to horizontal scaling to optimize resource efficiency once the workload stabilizes. The system profiles the processing latency of deep learning models, calculates queuing delays, and employs different dynamic programming algorithms to solve the joint horizontal and vertical scaling problem optimally based on the workload situation. Extensive evaluations with real-world workload traces demonstrate over $10\times$ SLO violation reduction compared to the state-of-the-art horizontal or vertical autoscaling approaches while maintaining resource efficiency when the workload is stable.

[136] 2407.14844

Political Leanings in Web3 Betting: Decoding the Interplay of Political and Profitable Motives

Harnessing the transparent blockchain user behavior data, we construct the Political Betting Leaning Score (PBLS) to measure political leanings based on betting within Web3 prediction markets. Focusing on Polymarket and starting from the 2024 U.S. Presidential Election, we synthesize behaviors over 15,000 addresses across 4,500 events and 8,500 markets, capturing the intensity and direction of their political leanings by the PBLS. We validate the PBLS through internal consistency checks and external comparisons. We uncover relationships between our PBLS and betting behaviors through over 800 features capturing various behavioral aspects. A case study of the 2022 U.S. Senate election further demonstrates the ability of our measurement while decoding the dynamic interaction between political and profitable motives. Our findings contribute to understanding decision-making in decentralized markets, enhancing the analysis of behaviors within Web3 prediction environments. The insights of this study reveal the potential of blockchain in enabling innovative, multidisciplinary studies and could inform the development of more effective online prediction markets, improve the accuracy of forecast, and help the design and optimization of platform mechanisms. The data and code for the paper are accessible at the following link:

[137] 2407.14845

Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models

Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. Therefore, understanding how LLMs reason and make decisions is crucial for their safe deployment. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. Leveraging the insight that LLMs learn to infer latent concepts during pretraining, we propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty. We show that the uncertainty decreases as the prompt's informativeness increases, similar to epistemic uncertainty. Our detailed experimental results on real datasets validate our proposed model.

[138] 2407.14846

Realistic Surgical Image Dataset Generation Based On 3D Gaussian Splatting

Computer vision technologies markedly enhance the automation capabilities of robotic-assisted minimally invasive surgery (RAMIS) through advanced tool tracking, detection, and localization. However, the limited availability of comprehensive surgical datasets for training represents a significant challenge in this field. This research introduces a novel method that employs 3D Gaussian Splatting to generate synthetic surgical datasets. We propose a method for extracting and combining 3D Gaussian representations of surgical instruments and background operating environments, transforming and combining them to generate high-fidelity synthetic surgical scenarios. We developed a data recording system capable of acquiring images alongside tool and camera poses in a surgical scene. Using this pose data, we synthetically replicate the scene, thereby enabling direct comparisons of the synthetic image quality (29.592 PSNR). As a further validation, we compared two YOLOv5 models trained on the synthetic and real data, respectively, and assessed their performance in an unseen real-world test dataset. Comparing the performances, we observe an improvement in neural network performance, with the synthetic-trained model outperforming the real-world trained model by 12%, testing both on real-world data.

[139] 2407.14847

Integrated BIM and Machine Learning System for Circularity Prediction of Construction Demolition Waste

Effective management of construction and demolition waste (C&DW) is crucial for sustainable development, as the industry accounts for 40% of the waste generated globally. The effectiveness of the C&DW management relies on the proper quantification of C&DW to be generated. Despite demolition activities having larger contributions to C&DW generation, extant studies have focused on construction waste. The few extant studies on demolition are often from the regional level perspective and provide no circularity insights. Thus, this study advances demolition quantification via Variable Modelling (VM) with Machine Learning (ML). The demolition dataset of 2280 projects were leveraged for the ML modelling, with XGBoost model emerging as the best (based on the Copeland algorithm), achieving R2 of 0.9977 and a Mean Absolute Error of 5.0910 on the testing dataset. Through the integration of the ML model with Building Information Modelling (BIM), the study developed a system for predicting quantities of recyclable and landfill materials from building demolitions. This provides detailed insights into the circularity of demolition waste and facilitates better planning and management. The SHapley Additive exPlanations (SHAP) method highlighted the implications of the features for demolition waste circularity. The study contributes to empirical studies on pre-demolition auditing at the project level and provides practical tools for implementation. Its findings would benefit stakeholders in driving a circular economy in the industry.

[140] 2407.14850

A Tale of Single-channel Electroencephalogram: Devices, Datasets, Signal Processing, Applications, and Future Directions

Single-channel electroencephalogram (EEG) is a cost-effective, comfortable, and non-invasive method for monitoring brain activity, widely adopted by researchers, consumers, and clinicians. The increasing number and proportion of articles on single-channel EEG underscore its growing potential. This paper provides a comprehensive review of single-channel EEG, focusing on development trends, devices, datasets, signal processing methods, recent applications, and future directions. Definitions of bipolar and unipolar configurations in single-channel EEG are clarified to guide future advancements. Applications mainly span sleep staging, emotion recognition, educational research, and clinical diagnosis. Ongoing advancements of single-channel EEG in AI-based EEG generation techniques suggest potential parity or superiority over multichannel EEG performance.

[141] 2407.14853

CBCTLiTS: A Synthetic, Paired CBCT/CT Dataset For Segmentation And Style Transfer

Medical imaging is vital in computer assisted intervention. Particularly cone beam computed tomography (CBCT) with defacto real time and mobility capabilities plays an important role. However, CBCT images often suffer from artifacts, which pose challenges for accurate interpretation, motivating research in advanced algorithms for more effective use in clinical practice. In this work we present CBCTLiTS, a synthetically generated, labelled CBCT dataset for segmentation with paired and aligned, high quality computed tomography data. The CBCT data is provided in 5 different levels of quality, reaching from a large number of projections with high visual quality and mild artifacts to a small number of projections with severe artifacts. This allows thorough investigations with the quality as a degree of freedom. We also provide baselines for several possible research scenarios like uni- and multimodal segmentation, multitask learning and style transfer followed by segmentation of relatively simple, liver to complex liver tumor segmentation. CBCTLiTS is accesssible via

[142] 2407.14857

Software Companies Responses to Hybrid Working

COVID 19 pandemic has disrupted the global market and workplace landscape. As a response, hybrid work situations have become popular in the software business sector. This way of working has an impact on software companies. This study investigates software companies responses to hybrid working. We conducted a large scale survey to achieve our objective. Our results are based on a qualitative analysis of 124 valid responses. The main result of our study is a taxonomy of software companies impacts on hybrid working at individual, team and organisation levels. We found higher positive responses at individual and organisational levels than negative responses. At the team level, both positive and negative impacts obtained a uniform number of responses. The results indicate that hybrid working became credible with the wave of COVID 19, with 83 positive responses outweighing the 41 negative responses. Software company respondents witnessed better work-life balance, productivity, and efficiency in hybrid working.

[143] 2407.14858

An encryption algorithm using a generalization of the Markovski algorithm and a system of orthogonal operations based on T-quasigroups

Here is a more detailed description of the algorithm proposed in [1]. This algorithm simultaneously uses two cryptographic procedures: encryption using a generalization of the Markovski algorithm [2] and encryption using a system of orthogonal operations. In this paper, we present an implementation of this algorithm based on T-quasigroups, more precisely, based on medial quasigroups.

[144] 2407.14859

Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques

The experiments at the Large Hadron Collider at CERN generate vast amounts of complex data from high-energy particle collisions. This data presents significant challenges due to its volume and complex reconstruction, necessitating the use of advanced analysis techniques for analysis. Recent advancements in deep learning, particularly Graph Neural Networks, have shown promising results in addressing the challenges but remain computationally expensive. The study presented in this paper uses a simulated particle collision dataset to integrate influence analysis inside the graph classification pipeline aiming at improving the accuracy and efficiency of collision event prediction tasks. By using a Graph Neural Network for initial training, we applied a gradient-based data influence method to identify influential training samples and then we refined the dataset by removing non-contributory elements: the model trained on this new reduced dataset can achieve good performances at a reduced computational cost. The method is completely agnostic to the specific influence method: different influence modalities can be easily integrated into our methodology. Moreover, by analyzing the discarded elements we can provide further insights about the event classification task. The novelty of integrating data attribution techniques together with Graph Neural Networks in high-energy physics tasks can offer a robust solution for managing large-scale data problems, capturing critical patterns, and maximizing accuracy across several high-data demand domains.

[145] 2407.14865

An Explainable Fast Deep Neural Network for Emotion Recognition

In the context of artificial intelligence, the inherent human attribute of engaging in logical reasoning to facilitate decision-making is mirrored by the concept of explainability, which pertains to the ability of a model to provide a clear and interpretable account of how it arrived at a particular outcome. This study explores explainability techniques for binary deep neural architectures in the framework of emotion classification through video analysis. We investigate the optimization of input features to binary classifiers for emotion recognition, with face landmarks detection using an improved version of the Integrated Gradients explainability method. The main contribution of this paper consists in the employment of an innovative explainable artificial intelligence algorithm to understand the crucial facial landmarks movements during emotional feeling, using this information also for improving the performances of deep learning-based emotion classifiers. By means of explainability, we can optimize the number and the position of the facial landmarks used as input features for facial emotion recognition, lowering the impact of noisy landmarks and thus increasing the accuracy of the developed models. In order to test the effectiveness of the proposed approach, we considered a set of deep binary models for emotion classification trained initially with a complete set of facial landmarks, which are progressively reduced based on a suitable optimization procedure. The obtained results prove the robustness of the proposed explainable approach in terms of understanding the relevance of the different facial points for the different emotions, also improving the classification accuracy and diminishing the computational cost.

[146] 2407.14868

Dual High-Order Total Variation Model for Underwater Image Restoration

Underwater images are typically characterized by color cast, haze, blurring, and uneven illumination due to the selective absorption and scattering when light propagates through the water, which limits their practical applications. Underwater image enhancement and restoration (UIER) is one crucial mode to improve the visual quality of underwater images. However, most existing UIER methods concentrate on enhancing contrast and dehazing, rarely pay attention to the local illumination differences within the image caused by illumination variations, thus introducing some undesirable artifacts and unnatural color. To address this issue, an effective variational framework is proposed based on an extended underwater image formation model (UIFM). Technically, dual high-order regularizations are successfully integrated into the variational model to acquire smoothed local ambient illuminance and structure-revealed reflectance in a unified manner. In our proposed framework, the weight factors-based color compensation is combined with the color balance to compensate for the attenuated color channels and remove the color cast. In particular, the local ambient illuminance with strong robustness is acquired by performing the local patch brightest pixel estimation and an improved gamma correction. Additionally, we design an iterative optimization algorithm relying on the alternating direction method of multipliers (ADMM) to accelerate the solution of the proposed variational model. Considerable experiments on three real-world underwater image datasets demonstrate that the proposed method outperforms several state-of-the-art methods with regard to visual quality and quantitative assessments. Moreover, the proposed method can also be extended to outdoor image dehazing, low-light image enhancement, and some high-level vision tasks. The code is available at

[147] 2407.14872

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, only utilizing robot video data from a minimal amount of tasks in a singular environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish between successful and failed robot executions, we cluster failure video features to enable the model to identify patterns within. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.

[148] 2407.14875

Seal: Advancing Speech Language Models to be Few-Shot Learners

Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in prompt, without requiring any additional training. In order to extend this capability to a multi-modal setting (i.e. speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which Kullback-Leibler divergence loss is performed to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.

[149] 2407.14878

Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment

Multilingual sentence encoders are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of multilingual sentence encoders is the trade-off between monolingual and cross-lingual performance. Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks. In this work, we address both issues by modular training of sentence encoders, i.e., by separating monolingual specialization from cross-lingual alignment. We first efficiently train language-specific sentence encoders to avoid negative interference between languages (i.e., the curse). We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each, preventing interference with monolingual specialization from the first step. In both steps, we resort to contrastive learning on machine-translated paraphrase data. Monolingual and cross-lingual evaluations on semantic text similarity/relatedness and multiple-choice QA render our modular solution more effective than multilingual sentence encoders, especially benefiting low-resource languages.

[150] 2407.14879

Thompson Sampling Itself is Differentially Private

In this work we first show that the classical Thompson sampling algorithm for multi-arm bandits is differentially private as-is, without any modification. We provide per-round privacy guarantees as a function of problem parameters and show composition over $T$ rounds; since the algorithm is unchanged, existing $O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in performance due to privacy. We then show that simple modifications -- such as pre-pulling all arms a fixed number of times, increasing the sampling variance -- can provide tighter privacy guarantees. We again provide privacy guarantees that now depend on the new parameters introduced in the modification, which allows the analyst to tune the privacy guarantee as desired. We also provide a novel regret analysis for this new algorithm, and show how the new parameters also impact expected regret. Finally, we empirically validate and illustrate our theoretical findings in two parameter regimes and demonstrate that tuning the new parameters substantially improve the privacy-regret tradeoff.

[151] 2407.14880

A New Dataset and Framework for Real-World Blurred Images Super-Resolution

Recent Blind Image Super-Resolution (BSR) methods have shown proficiency in general images. However, we find that the efficacy of recent methods obviously diminishes when employed on image data with blur, while image data with intentional blur constitute a substantial proportion of general data. To further investigate and address this issue, we developed a new super-resolution dataset specifically tailored for blur images, named the Real-world Blur-kept Super-Resolution (ReBlurSR) dataset, which consists of nearly 3000 defocus and motion blur image samples with diverse blur sizes and varying blur intensities. Furthermore, we propose a new BSR framework for blur images called Perceptual-Blur-adaptive Super-Resolution (PBaSR), which comprises two main modules: the Cross Disentanglement Module (CDM) and the Cross Fusion Module (CFM). The CDM utilizes a dual-branch parallelism to isolate conflicting blur and general data during optimization. The CFM fuses the well-optimized prior from these distinct domains cost-effectively and efficiently based on model interpolation. By integrating these two modules, PBaSR achieves commendable performance on both general and blur data without any additional inference and deployment cost and is generalizable across multiple model architectures. Rich experiments show that PBaSR achieves state-of-the-art performance across various metrics without incurring extra inference costs. Within the widely adopted LPIPS metrics, PBaSR achieves an improvement range of approximately 0.02-0.10 with diverse anchor methods and blur types, across both the ReBlurSR and multiple common general BSR benchmarks. Code here:

[152] 2407.14882

Reduced Effectiveness of Kolmogorov-Arnold Networks on Functions with Noise

It has been observed that even a small amount of noise introduced into the dataset can significantly degrade the performance of KAN. In this brief note, we aim to quantitatively evaluate the performance when noise is added to the dataset. We propose an oversampling technique combined with denoising to alleviate the impact of noise. Specifically, we employ kernel filtering based on diffusion maps for pre-filtering the noisy data for training KAN network. Our experiments show that while adding i.i.d. noise with any fixed SNR, when we increase the amount of training data by a factor of $r$, the test-loss (RMSE) of KANs will exhibit a performance trend like $\text{test-loss} \sim \mathcal{O}(r^{-\frac{1}{2}})$ as $r\to +\infty$. We conclude that applying both oversampling and filtering strategies can reduce the detrimental effects of noise. Nevertheless, determining the optimal variance for the kernel filtering process is challenging, and enhancing the volume of training data substantially increases the associated costs, because the training dataset needs to be expanded multiple times in comparison to the initial clean data. As a result, the noise present in the data ultimately diminishes the effectiveness of Kolmogorov-Arnold networks.

[153] 2407.14883

Inferring Ingrained Remote Information in AC Power Flows Using Neuromorphic Modality Regime

In this paper, we infer ingrained remote information in AC power flows using spiking neural network (SNN) as edge processors for efficient coordination of power electronic converters. This work unifies power and information as a means of data normalization using a multi-modal regime in the form of spikes using energy-efficient neuromorphic processing and semantics theory. Firstly, we organize the synchronous realvalued measurements at each edge and translate them into asynchronous spike-based events to collect sparse data for training of SNN at each edge. Instead of relying on error-dependent supervised data-driven learning theory, we exploit the latency-driven unsupervised Hebbian learning rule to obtain modulation pulses for switching of power electronic converters that can now communicate among each other. Not only does this philosophy block exogenous path arrival for cyber attackers by dismissing the cyber layer, it also entails converter adaptation to system reconfiguration and parameter mismatch issues. We conclude this work by validating its energy-efficient and effective online learning performance under various scenarios in modified IEEE 14-bus system and under experimental conditions.

[154] 2407.14885

Falcon2-11B Technical Report

We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of the Falcon2-11B which follows a multi-stage approach where the early stages are distinguished by their context length and a final stage where we use a curated, high-quality dataset. Additionally, we report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate. The downstream performance of the foundation model is evaluated on established benchmarks, including multilingual and code datasets. The foundation model shows strong generalization across all the tasks which makes it suitable for downstream finetuning use cases. For the vision language model, we report the performance on several benchmarks and show that our model achieves a higher average score compared to open-source models of similar size. The model weights and code of both Falcon2-11B and Falcon2-11B-vlm are made available under a permissive license.

[155] 2407.14890

Rate Splitting Multiple Access: Optimal Beamforming Structure and Efficient Optimization Algorithms

Joint optimization for common rate allocation and beamforming design have been widely studied in rate splitting multiple access (RSMA) empowered multiuser multi-antenna transmission networks. Due to the highly coupled optimization variables and non-convexity of the joint optimization problems, emerging algorithms such as weighted minimum mean square error (WMMSE) and successive convex approximation (SCA) have been applied to RSMA which typically approximate the original problem with a sequence of disciplined convex subproblems and solve each subproblem by an optimization toolbox. While these approaches are capable of finding a viable solution, they are unable to offer a comprehensive understanding of the solution structure and are burdened by high computational complexity. In this work, for the first time, we identify the optimal beamforming structure and common rate allocation for the weighted sum-rate (WSR) maximization problem of RSMA. We then propose a computationally efficient optimization algorithm that jointly optimizes the beamforming and common rate allocation without relying on any toolbox. Specifically, we first approximate the original WSR maximization problem with a sequence of convex subproblems based on fractional programming (FP). Numerical results show that the proposed algorithm achieves the same performance but takes only 0.5\% or less simulation time compared with the state-of-the-art WMMSE, SCA, and FP algorithms. Experimental results show that the proposed two methods have similar mean square error (MSE) performance compared with traditional method using CVX optimization tool, however, computational complexities are greatly reduced. The proposed algorithms pave the way for the practical and efficient optimization algorithm design for RSMA and its applications in 6G.

[156] 2407.14892

Latent Pollution Model: The Hidden Carbon Footprint in 3D Image Synthesis

Contemporary developments in generative AI are rapidly transforming the field of medical AI. These developments have been predominantly driven by the availability of large datasets and high computing power, which have facilitated a significant increase in model capacity. Despite their considerable potential, these models demand substantially high power, leading to high carbon dioxide (CO2) emissions. Given the harm such models are causing to the environment, there has been little focus on the carbon footprints of such models. This study analyzes carbon emissions from 2D and 3D latent diffusion models (LDMs) during training and data generation phases, revealing a surprising finding: the synthesis of large images contributes most significantly to these emissions. We assess different scenarios including model sizes, image dimensions, distributed training, and data generation steps. Our findings reveal substantial carbon emissions from these models, with training 2D and 3D models comparable to driving a car for 10 km and 90 km, respectively. The process of data generation is even more significant, with CO2 emissions equivalent to driving 160 km for 2D models and driving for up to 3345 km for 3D synthesis. Additionally, we found that the location of the experiment can increase carbon emissions by up to 94 times, and even the time of year can influence emissions by up to 50%. These figures are alarming, considering they represent only a single training and data generation phase for each model. Our results emphasize the urgent need for developing environmentally sustainable strategies in generative AI.

[157] 2407.14894

A Holistic Optimization Framework for Energy Efficient UAV-assisted Fog Computing: Attitude Control, Trajectory Planning and Task Assignment

Unmanned Aerial Vehicles (UAVs) have significantly enhanced fog computing by acting as both flexible computation platforms and communication mobile relays. In this paper, we propose a holistic framework that jointly optimizes the total latency and energy consumption for UAV-assisted fog computing in a three-dimensional spatial domain with varying terrain elevations and dynamic task generations. Our proposed framework considers three important and interdependent modules: attitude control, trajectory planning, and task assignment. We first establish a fuzzy proportional-integral-derivative control model to determine the UAV's attitude. Then, we propose an enhanced Ant Colony System (ACS) based algorithm, that includes a safety value and a decoupling mechanism to overcome the convergence issue in classical ACS, to compute the optimal UAV trajectory. Finally, we design an algorithm based on the Particle Swarm Optimization technique, to determine where each offloaded task should be executed. Under our proposed framework, the outcome of one module would affect the decision-making in one other, providing a holistic perspective of the system and thus leading to improved solutions. We demonstrate by extensive simulation results that our proposed framework can significantly improve the overall performance, measured by latency and energy consumption, compared to existing baseline approaches.

[158] 2407.14895

Strategic Coupon Allocation for Increasing Providers' Sales Experiences in Two-sided Marketplaces

In a two-sided marketplace, network effects are crucial for competitiveness, and platforms need to retain users through advanced customer relationship management as much as possible. Maintaining numerous providers' stable and active presence on the platform is highly important to enhance the marketplace's scale and diversity. The strongest motivation for providers to continue using the platform is to realize actual profits through sales. Then, we propose a personalized promotion to increase the number of successful providers with sales experiences on the platform. The main contributions of our research are twofold. First, we introduce a new perspective in provider management with the distribution of successful sales experiences. Second, we propose a personalized promotion optimization method to maximize the number of providers' sales experiences. By utilizing this approach, we ensure equal opportunities for providers to experience sales without being monopolized by a few providers. Through experiments using actual data on coupon distribution, we confirm that our method enables the implementation of coupon allocation strategies that significantly increase the total number of providers having sales experiences.

[159] 2407.14899

Hyperspectral Unmixing Under Endmember Variability: A Variational Inference Framework

This work proposes a variational inference (VI) framework for hyperspectral unmixing in the presence of endmember variability (HU-EV). An EV-accounted noisy linear mixture model (LMM) is considered, and the presence of outliers is also incorporated into the model. Following the marginalized maximum likelihood (MML) principle, a VI algorithmic structure is designed for probabilistic inference for HU-EV. Specifically, a patch-wise static endmember assumption is employed to exploit spatial smoothness and to try to overcome the ill-posed nature of the HU-EV problem. The design facilitates lightweight, continuous optimization-based updates under a variety of endmember priors. Some of the priors, such as the Beta prior, were previously used under computationally heavy, sampling-based probabilistic HU-EV methods. The effectiveness of the proposed framework is demonstrated through synthetic, semi-real, and real-data experiments.

[160] 2407.14900

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes, such as image exposure, structure and color of normal-light images. These attributes are readily available and impose no assumptions about the degradation process, which guides the diffusion sampling process to a reliable high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and it performs well even in sophisticated wild degradation.

[161] 2407.14903

Automated Patient Positioning with Learned 3D Hand Gestures

Positioning patients for scanning and interventional procedures is a critical task that requires high precision and accuracy. The conventional workflow involves manually adjusting the patient support to align the center of the target body part with the laser projector or other guiding devices. This process is not only time-consuming but also prone to inaccuracies. In this work, we propose an automated patient positioning system that utilizes a camera to detect specific hand gestures from technicians, allowing users to indicate the target patient region to the system and initiate automated positioning. Our approach relies on a novel multi-stage pipeline to recognize and interpret the technicians' gestures, translating them into precise motions of medical devices. We evaluate our proposed pipeline during actual MRI scanning procedures, using RGB-Depth cameras to capture the process. Results show that our system achieves accurate and precise patient positioning with minimal technician intervention. Furthermore, we validate our method on HaGRID, a large-scale hand gesture dataset, demonstrating its effectiveness in hand detection and gesture recognition.

[162] 2407.14906

Interdiction of minimum spanning trees and other matroid bases

In the minimum spanning tree (MST) interdiction problem, we are given a graph $G=(V,E)$ with edge weights, and want to find some $X\subseteq E$ satisfying a knapsack constraint such that the MST weight in $(V,E\setminus X)$ is maximized. Since MSTs of $G$ are the minimum weight bases in the graphic matroid of $G$, this problem is a special case of matroid interdiction on a matroid $M=(E,\mathcal{I})$, in which the objective is instead to maximize the minimum weight of a basis of $M$ which is disjoint from $X$. By reduction from 0-1 knapsack, matroid interdiction is NP-complete, even for uniform matroids. We develop a new exact algorithm to solve the matroid interdiction problem. One of the key components of our algorithm is a dynamic programming upper bound which only requires that a simpler discrete derivative problem can be calculated/approximated for the given matroid. Our exact algorithm then uses this bound within a custom branch-and-bound algorithm. For different matroids, we show how this discrete derivative can be calculated/approximated. In particular, for partition matroids, this yields a pseudopolynomial time algorithm. For graphic matroids, an approximation can be obtained by solving a sequence of minimum cut problems, which we apply to the MST interdiction problem. The running time of our algorithm is asymptotically faster than the best known MST interdiction algorithm, up to polylog factors. Furthermore, our algorithm achieves state-of-the-art computational performance: we solved all available instances from the literature, and in many cases reduced the best running time from hours to seconds.

[163] 2407.14907

Monotone Rewritability and the Analysis of Queries, Views, and Rules

We study the interaction of views, queries, and background knowledge in the form of existential rules. The motivating questions concern monotonic determinacy of a query using views w.r.t. rules, which refers to the ability to recover the query answer from the views via a monotone function. We study the decidability of monotonic determinacy, and compare with variations that require the ``recovery function'' to be in a well-known monotone query language, such as conjunctive queries or Datalog. Surprisingly, we find that even in the presence of basic existential rules, the borderline between well-behaved and badly-behaved answerability differs radically from the unconstrained case. In order to understand this boundary, we require new results concerning entailment problems involving views and rules.

[164] 2407.14908

PREVis: Perceived Readability Evaluation for Visualizations

We developed and validated an instrument to measure the perceived readability in data visualization: PREVis. Researchers and practitioners can easily use this instrument as part of their evaluations to compare the perceived readability of different visual data representations. Our instrument can complement results from controlled experiments on user task performance or provide additional data during in-depth qualitative work such as design iterations when developing a new technique. Although readability is recognized as an essential quality of data visualizations, so far there has not been a unified definition of the construct in the context of visual representations. As a result, researchers often lack guidance for determining how to ask people to rate their perceived readability of a visualization. To address this issue, we engaged in a rigorous process to develop the first validated instrument targeted at the subjective readability of visual data representations. Our final instrument consists of 11 items across 4 dimensions: understandability, layout clarity, readability of data values, and readability of data patterns. We provide the questionnaire as a document with implementation guidelines on Beyond this instrument, we contribute a discussion of how researchers have previously assessed visualization readability, and an analysis of the factors underlying perceived readability in visual data representations.

[165] 2407.14910

Visual Geo-Localization from images

This paper presents a visual geo-localization system capable of determining the geographic locations of places (buildings and road intersections) from images without relying on GPS data. Our approach integrates three primary methods: Scale-Invariant Feature Transform (SIFT) for place recognition, traditional image processing for identifying road junction types, and deep learning using the VGG16 model for classifying road junctions. The most effective techniques have been integrated into an offline mobile application, enhancing accessibility for users requiring reliable location information in GPS-denied environments.

[166] 2407.14911

Self-supervised transformer-based pre-training method with General Plant Infection dataset

Pest and disease classification is a challenging issue in agriculture. The performance of deep learning models is intricately linked to training data diversity and quantity, posing issues for plant pest and disease datasets that remain underdeveloped. This study addresses these challenges by constructing a comprehensive dataset and proposing an advanced network architecture that combines Contrastive Learning and Masked Image Modeling (MIM). The dataset comprises diverse plant species and pest categories, making it one of the largest and most varied in the field. The proposed network architecture demonstrates effectiveness in addressing plant pest and disease recognition tasks, achieving notable detection accuracy. This approach offers a viable solution for rapid, efficient, and cost-effective plant pest and disease detection, thereby reducing agricultural production costs. Our code and dataset will be publicly available to advance research in plant pest and disease recognition the GitHub repository at

[167] 2407.14912

PolyR-CNN: R-CNN for end-to-end polygonal building outline extraction

Polygonal building outline extraction has been a research focus in recent years. Most existing methods have addressed this challenging task by decomposing it into several subtasks and employing carefully designed architectures. Despite their accuracy, such pipelines often introduce inefficiencies during training and inference. This paper presents an end-to-end framework, denoted as PolyR-CNN, which offers an efficient and fully integrated approach to predict vectorized building polygons and bounding boxes directly from remotely sensed images. Notably, PolyR-CNN leverages solely the features of the Region of Interest (RoI) for the prediction, thereby mitigating the necessity for complex designs. Furthermore, we propose a novel scheme with PolyR-CNN to extract detailed outline information from polygon vertex coordinates, termed vertex proposal feature, to guide the RoI features to predict more regular buildings. PolyR-CNN demonstrates the capacity to deal with buildings with holes through a simple post-processing method on the Inria dataset. Comprehensive experiments conducted on the CrowdAI dataset show that PolyR-CNN achieves competitive accuracy compared to state-of-the-art methods while significantly improving computational efficiency, i.e., achieving 79.2 Average Precision (AP), exhibiting a 15.9 AP gain and operating 2.5 times faster and four times lighter than the well-established end-to-end method PolyWorld. Replacing the backbone with a simple ResNet-50, PolyR-CNN maintains a 71.1 AP while running four times faster than PolyWorld.

[168] 2407.14916

Improving Context-Aware Preference Modeling for Language Models

While finetuning language models from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance exceeding that of GPT-4 and Llama 3 70B on tested datasets, and (3) investigate the value of context-aware preference modeling.

[169] 2407.14920

RoIPoly: Vectorized Building Outline Extraction Using Vertex and Logit Embeddings

Polygonal building outlines are crucial for geographic and cartographic applications. The existing approaches for outline extraction from aerial or satellite imagery are typically decomposed into subtasks, e.g., building masking and vectorization, or treat this task as a sequence-to-sequence prediction of ordered vertices. The former lacks efficiency, and the latter often generates redundant vertices, both resulting in suboptimal performance. To handle these issues, we propose a novel Region-of-Interest (RoI) query-based approach called RoIPoly. Specifically, we formulate each vertex as a query and constrain the query attention on the most relevant regions of a potential building, yielding reduced computational overhead and more efficient vertex level interaction. Moreover, we introduce a novel learnable logit embedding to facilitate vertex classification on the attention map; thus, no post-processing is needed for redundant vertex removal. We evaluated our method on the vectorized building outline extraction dataset CrowdAI and the 2D floorplan reconstruction dataset Structured3D. On the CrowdAI dataset, RoIPoly with a ResNet50 backbone outperforms existing methods with the same or better backbones on most MS-COCO metrics, especially on small buildings, and achieves competitive results in polygon quality and vertex redundancy without any post-processing. On the Structured3D dataset, our method achieves the second-best performance on most metrics among existing methods dedicated to 2D floorplan reconstruction, demonstrating our cross-domain generalization capability. The code will be released upon acceptance of this paper.

[170] 2407.14921

AP-MIONet: Asymptotic-preserving multiple-input neural operators for capturing the high-field limits of collisional kinetic equations

In kinetic equations, external fields play a significant role, particularly when their strength is sufficient to balance collision effects, leading to the so-called high-field regime. Two typical examples are the Vlasov-Poisson-Fokker-Planck (VPFP) system in plasma physics and the Boltzmann equation in semiconductor physics. In this paper, we propose a generic asymptotic-preserving multiple-input DeepONet (AP-MIONet) method for solving these two kinetic equations with variable parameters in the high-field regime. Our method aims to tackle two major challenges in this regime: the additional variable parameters introduced by electric fields, and the absence of an explicit local equilibrium, which is a key component of asymptotic-preserving (AP) schemes. We leverage the multiple-input DeepONet (MIONet) architecture to accommodate additional parameters, and formulate the AP loss function by incorporating both the mass conservation law and the original kinetic system. This strategy can avoid reliance on the explicit local equilibrium, preserve the mass and adapt to non-equilibrium states. We demonstrate the effectiveness and efficiency of the proposed method through extensive numerical examples.

[171] 2407.14923

RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map to sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our codes will be made available.

[172] 2407.14925

When Qualitative Research Meets Large Language Model: Exploring the Potential of QualiGPT as a Tool for Qualitative Coding

Qualitative research, renowned for its in-depth exploration of complex phenomena, often involves time-intensive analysis, particularly during the coding stage. Existing software for qualitative evaluation frequently lacks automatic coding capabilities, user-friendliness, and cost-effectiveness. The advent of Large Language Models (LLMs) like GPT-3 and its successors marks a transformative era for enhancing qualitative analysis. This paper introduces QualiGPT, a tool developed to address the challenges associated with using ChatGPT for qualitative analysis. Through a comparative analysis of traditional manual coding and QualiGPT's performance on both simulated and real datasets, incorporating both inductive and deductive coding approaches, we demonstrate that QualiGPT significantly improves the qualitative analysis process. Our findings show that QualiGPT enhances efficiency, transparency, and accessibility in qualitative coding. The tool's performance was evaluated using inter-rater reliability (IRR) measures, with results indicating substantial agreement between human coders and QualiGPT in various coding scenarios. In addition, we also discuss the implications of integrating AI into qualitative research workflows and outline future directions for enhancing human-AI collaboration in this field.

[173] 2407.14926

TraveLLM: Could you plan my new public transit route in face of a network disruption?

Imagine there is a disruption in train 1 near Times Square metro station. You try to find an alternative subway route to the JFK airport on Google Maps, but the app fails to provide a suitable recommendation that takes into account the disruption and your preferences to avoid crowded stations. We find that in many such situations, current navigation apps may fall short and fail to give a reasonable recommendation. To fill this gap, in this paper, we develop a prototype, TraveLLM, to plan routing of public transit in face of disruption that relies on Large Language Models (LLMs). LLMs have shown remarkable capabilities in reasoning and planning across various domains. Here we hope to investigate the potential of LLMs that lies in incorporating multi-modal user-specific queries and constraints into public transit route recommendations. Various test cases are designed under different scenarios, including varying weather conditions, emergency events, and the introduction of new transportation services. We then compare the performance of state-of-the-art LLMs, including GPT-4, Claude 3 and Gemini, in generating accurate routes. Our comparative analysis demonstrates the effectiveness of LLMs, particularly GPT-4 in providing navigation plans. Our findings hold the potential for LLMs to enhance existing navigation systems and provide a more flexible and intelligent method for addressing diverse user needs in face of disruptions.

[174] 2407.14928

Influencer: Empowering Everyday Users in Creating Promotional Posts via AI-infused Exploration and Customization

Creating promotional posts on social platforms enables everyday users to disseminate their creative outcomes, engage in community exchanges, or generate additional income from micro-businesses. However, creating eye-catching posts combining both original, appealing images and articulate, effective captions can be rather challenging and time-consuming for everyday users who are mostly design novices. We propose Influen, an interactive tool to assist novice creators in crafting high-quality promotional post designs, achieving quick design ideation and unencumbered content creation through AI. Within Influencer, we contribute a multi-dimensional recommendation framework that allows users to intuitively generate new ideas through example-based image and caption recommendation. Further, Influencer implements a holistic promotional post design system that supports context-aware image and caption exploration considering brand messages and user-specified design constraints, flexible fusion of various images and captions, and a mind-map-like layout for thinking tracking and post-recording. We evaluated Influencer with 12 design enthusiasts through an in-lab user study by comparing it to a baseline combining Google Search + Figma. Quantitative and qualitative results demonstrate that \sysname{} is effective in assisting design novices to generate ideas as well as creative and diverse promotional posts with user-friendly interaction.

[175] 2407.14931

POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obstacle avoidance, that have been conventionally approached with the classical non-learnable methods (e.g., heuristic search) is currently suggested to be solved by the learning-based or hybrid methods. Still, in this domain, it is hard, not to say impossible, to conduct a fair comparison between classical, learning-based, and hybrid approaches due to the lack of a unified framework that supports both learning and evaluation. To this end, we introduce POGEMA, a set of comprehensive tools that includes a fast environment for learning, a generator of problem instances, the collection of pre-defined ones, a visualization toolkit, and a benchmarking tool that allows automated evaluation. We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basics of the primary evaluation indicators (such as success rate and path length), allowing a fair multi-fold comparison. The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.

[176] 2407.14933

Consent in Crisis: The Rapid Decline of the AI Data Commons

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.

[177] 2407.14936

EidetiCom: A Cross-modal Brain-Computer Semantic Communication Paradigm for Decoding Visual Perception

Brain-computer interface (BCI) facilitates direct communication between the human brain and external systems by utilizing brain signals, eliminating the need for conventional communication methods such as speaking, writing, or typing. Nevertheless, the continuous generation of brain signals in BCI frameworks poses challenges for efficient storage and real-time transmission. While considering the human brain as a semantic source, the meaningful information associated with cognitive activities often gets obscured by substantial noise present in acquired brain signals, resulting in abundant redundancy. In this paper, we propose a cross-modal brain-computer semantic communication paradigm, named EidetiCom, for decoding visual perception under limited-bandwidth constraint. The framework consists of three hierarchical layers, each responsible for compressing the semantic information of brain signals into representative features. These low-dimensional compact features are transmitted and converted into semantically meaningful representations at the receiver side, serving three distinct tasks for decoding visual perception: brain signal-based visual classification, brain-to-caption translation, and brain-to-image generation, in a scalable manner. Through extensive qualitative and quantitative experiments, we demonstrate that the proposed paradigm facilitates the semantic communication under low bit rate conditions ranging from 0.017 to 0.192 bits-per-sample, achieving high-quality semantic reconstruction and highlighting its potential for efficient storage and real-time communication of brain recordings in BCI applications, such as eidetic memory storage and assistive communication for patients.

[178] 2407.14937

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.

[179] 2407.14938

From Ad Identifiers to Global Privacy Control: The Status Quo and Future of Opting Out of Ad Tracking on Android

Apps and their integrated third party libraries often collect a variety of data from people to show them personalized ads. This practice is often privacy-invasive. Since 2013, Google has therefore allowed users to limit ad tracking on Android via system settings. Further, under the 2018 California Consumer Privacy Act (CCPA), apps must honor opt-outs from ad tracking under the Global Privacy Control (GPC). The efficacy of these two methods to limit ad tracking has not been studied in prior work. Our legal and technical analysis details how the GPC applies to mobile apps and how it could be integrated directly into Android, thereby developing a reference design for GPC on Android. Our empirical analysis of 1,896 top-ranked Android apps shows that both the Android system-level opt-out and the GPC signal rarely restrict ad tracking. In our view, deleting the AdID and opting out under the CCPA has the same meaning. Thus, the current AdID setting and APIs should be evolved towards GPC and integrated into Android's Privacy Sandbox.

[180] 2407.14940

Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues

Interruption in a dialogue occurs when the listener begins their speech before the current speaker finishes speaking. Interruptions can be broadly divided into two groups: cooperative (when the listener wants to support the speaker), and competitive (when the listener tries to take control of the conversation against the speaker's will). A system that automatically classifies interruptions can be used in call centers, specifically in the tasks of customer satisfaction monitoring and agent monitoring. In this study, we developed a text-based interruption classification model by preparing an in-house dataset consisting of ASR-transcribed customer support telephone dialogues in Russian. We fine-tuned Conversational RuBERT on our dataset and optimized hyperparameters, and the model performed well. With further improvements, the proposed model can be applied to automatic monitoring systems.

[181] 2407.14944

Automatic Generation of Fashion Images using Prompting in Generative Machine Learning Models

The advent of artificial intelligence has contributed in a groundbreaking transformation of the fashion industry, redefining creativity and innovation in unprecedented ways. This work investigates methodologies for generating tailored fashion descriptions using two distinct Large Language Models and a Stable Diffusion model for fashion image creation. Emphasizing adaptability in AI-driven fashion creativity, we depart from traditional approaches and focus on prompting techniques, such as zero-shot and few-shot learning, as well as Chain-of-Thought (CoT), which results in a variety of colors and textures, enhancing the diversity of the outputs. Central to our methodology is Retrieval-Augmented Generation (RAG), enriching models with insights from fashion sources to ensure contemporary representations. Evaluation combines quantitative metrics such as CLIPscore with qualitative human judgment, highlighting strengths in creativity, coherence, and aesthetic appeal across diverse styles. Among the participants, RAG and few-shot learning techniques are preferred for their ability to produce more relevant and appealing fashion descriptions. Our code is provided at

[182] 2407.14945

Efficient Intrusion Detection: Combining $χ^2$ Feature Selection with CNN-BiLSTM on the UNSW-NB15 Dataset

Intrusion Detection Systems (IDSs) have played a significant role in the detection and prevention of cyber-attacks in traditional computing systems. It is not surprising that this technology is now being applied to secure Internet of Things (IoT) networks against cyber threats. However, the limited computational resources available on IoT devices pose a challenge for deploying conventional computing-based IDSs. IDSs designed for IoT environments must demonstrate high classification performance, and utilize low-complexity models. Developing intrusion detection models in the field of IoT has seen significant advancements. However, achieving a balance between high classification performance and reduced complexity remains a challenging endeavor. In this research, we present an effective IDS model that addresses this issue by combining a lightweight Convolutional Neural Network (CNN) with bidirectional Long Short-Term Memory (BiLSTM). Additionally, we employ feature selection techniques to minimize the number of features inputted into the model, thereby reducing its complexity. This approach renders the proposed model highly suitable for resource-constrained IoT devices, ensuring it meets their computation capability requirements. Creating a model that meets the demands of IoT devices and attains enhanced precision is a challenging task. However, our suggested model outperforms previous works in the literature by attaining a remarkable accuracy rate of 97.90% within a prediction time of 1.1 seconds for binary classification. Furthermore, it achieves an accuracy rate of 97.09% within a prediction time of 2.10 seconds for multiclassification.

[183] 2407.14953

AgileDART: An Agile and Scalable Edge Stream Processing Engine

Edge applications generate a large influx of sensor data at massive scales. Under many time-critical scenarios, these massive data streams must be processed in a very short time to derive actionable intelligence. However, traditional data processing systems (e.g., stream processing systems, cloud-based IoT data processing systems) are not well-suited for these edge applications. This is because they often do not scale well with a large number of concurrent stream queries, do not support low-latency processing under limited edge computing resources, and do not adapt to the level of heterogeneity and dynamicity commonly present in edge computing environments. These gaps suggest a need for a new edge stream processing system that advances the stream processing paradigm to achieve efficiency and flexibility under the constraints presented by edge computing architectures. We present AgileDart, an agile and scalable edge stream processing engine that enables fast stream processing of a large number of concurrently running low-latency edge applications' queries at scale in dynamic, heterogeneous edge environments. The novelty of our work lies in a dynamic dataflow abstraction that leverages distributed hash table (DHT) based peer-to-peer (P2P) overlay networks to automatically place, chain, and scale stream operators to reduce query latencies, adapt to workload variations, and recover from failures; and a bandit-based path planning model that can re-plan the data shuffling paths to adapt to unreliable and heterogeneous edge networks. We show analytically and empirically that AgileDart outperforms Storm and EdgeWise on query latency and significantly improves scalability and adaptability when processing a large number of real-world edge stream applications' queries.

[184] 2407.14956

Two-dimensional DtN-FEM scattering analysis of SH guided waves by an interface debonding in a double-layered plate

In this paper, a two-dimensional Dirichlet-to-Neumann (DtN) finite element method (FEM) is developed to analyze the scattering of SH guided waves due to an interface delamination in a bi-material plate. During the finite element analysis, it is necessary to determine the far-field DtN conditions at virtual boundaries where both displacements and tractions are unknown. In this study, firstly, the scattered waves at the virtual boundaries are represented by a superposition of guided waves with unknown scattered coefficients. Secondly, utilizing the mode orthogonality, the unknown tractions at virtual boundaries are expressed in terms of the unknown scattered displacements at virtual boundaries via scattered coefficients. Thirdly, this relationship at virtual boundaries can be finally assembled into the global DtN-FEM matrix to solve the problem. This method is simple and elegant, which has advantages on dimension reduction and needs no absorption medium or perfectly matched layer to suppress the reflected waves compared to traditional FEM. Furthermore, the reflection and transmission coefficients of each guided mode can be directly obtained without post-processing. This proposed DtN-FEM will be compared with boundary element method (BEM), and finally validated for several benchmark problems.

[185] 2407.14957

Strongly Isomorphic Neural Optimal Transport Across Incomparable Spaces

Optimal Transport (OT) has recently emerged as a powerful framework for learning minimal-displacement maps between distributions. The predominant approach involves a neural parametrization of the Monge formulation of OT, typically assuming the same space for both distributions. However, the setting across ``incomparable spaces'' (e.g., of different dimensionality), corresponding to the Gromov- Wasserstein distance, remains underexplored, with existing methods often imposing restrictive assumptions on the cost function. In this paper, we present a novel neural formulation of the Gromov-Monge (GM) problem rooted in one of its fundamental properties: invariance to strong isomorphisms. We operationalize this property by decomposing the learnable OT map into two components: (i) an approximate strong isomorphism between the source distribution and an intermediate reference distribution, and (ii) a GM-optimal map between this reference and the target distribution. Our formulation leverages and extends the Monge gap regularizer of Uscidda & Cuturi (2023) to eliminate the need for complex architectural requirements of other neural OT methods, yielding a simple but practical method that enjoys favorable theoretical guarantees. Our preliminary empirical results show that our framework provides a promising approach to learn OT maps across diverse spaces.

[186] 2407.14958

Temporal Residual Jacobians For Rig-free Motion Transfer

We introduce Temporal Residual Jacobians as a novel representation to enable data-driven motion transfer. Our approach does not assume access to any rigging or intermediate shape keyframes, produces geometrically and temporally consistent motions, and can be used to transfer long motion sequences. Central to our approach are two coupled neural networks that individually predict local geometric and temporal changes that are subsequently integrated, spatially and temporally, to produce the final animated meshes. The two networks are jointly trained, complement each other in producing spatial and temporal signals, and are supervised directly with 3D positional information. During inference, in the absence of keyframes, our method essentially solves a motion extrapolation problem. We test our setup on diverse meshes (synthetic and scanned shapes) to demonstrate its superiority in generating realistic and natural-looking animations on unseen body shapes against SoTA alternatives. Supplemental video and code are available at .

[187] 2407.14960

Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models

The diversity in disease profiles and therapeutic approaches between hospitals and health professionals underscores the need for patient-centric personalized strategies in healthcare. Alongside this, similarities in disease progression across patients can be utilized to improve prediction models in survival analysis. The need for patient privacy and the utility of prediction models can be simultaneously addressed in the framework of Federated Learning (FL). This paper outlines an approach in the domain of federated survival analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a specific focus on mitigating data heterogeneity and elevating model performance. We present an FL approach that employs feature-based clustering to enhance model accuracy across synthetic datasets and real-world applications, including the Surveillance, Epidemiology, and End Results (SEER) database. Furthermore, we consider an event-based reporting strategy that provides a dynamic approach to model adaptation by responding to local data changes. Our experiments show the efficacy of our approach and discuss future directions for a practical application of FL in healthcare.

[188] 2407.14962

Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives

The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our paper contributes to providing a holistic perspective on the technical foundations, practical applications, and emerging challenges within the evolving landscape of Generative AI and LLMs. We believe that understanding the generative capabilities of AI systems and the specific context of LLMs is crucial for researchers, practitioners, and policymakers to collaboratively shape the responsible and ethical integration of these technologies into various domains. Furthermore, we identify and address main research gaps, providing valuable insights to guide future research endeavors within the AI research community.

[189] 2407.14967

Base and Exponent Prediction in Mathematical Expressions using Multi-Output CNN

The use of neural networks and deep learning techniques in image processing has significantly advanced the field, enabling highly accurate recognition results. However, achieving high recognition rates often necessitates complex network models, which can be challenging to train and require substantial computational resources. This research presents a simplified yet effective approach to predicting both the base and exponent from images of mathematical expressions using a multi-output Convolutional Neural Network (CNN). The model is trained on 10,900 synthetically generated images containing exponent expressions, incorporating random noise, font size variations, and blur intensity to simulate real-world conditions. The proposed CNN model demonstrates robust performance with efficient training time. The experimental results indicate that the model achieves high accuracy in predicting the base and exponent values, proving the efficacy of this approach in handling noisy and varied input images.

[190] 2407.14968

Technical report: Improving the properties of molecules generated by LIMO

This technical report investigates variants of the Latent Inceptionism on Molecules (LIMO) framework to improve the properties of generated molecules. We conduct ablative studies of molecular representation, decoder model, and surrogate model training scheme. The experiments suggest that an autogressive Transformer decoder with GroupSELFIES achieves the best average properties for the random generation task.

[191] 2407.14969

Sniffing Helps to Meet: Deterministic Rendezvous of Anonymous Agents in the Grid

Two identical anonymous mobile agents have to meet at a node of the infinite oriented grid whose nodes are unlabeled. This problem is known as rendezvous. The agents execute the same deterministic algorithm. Time is divided into rounds, and in each round each agent can either stay idle at the current node or move to an adjacent node. An adversary places the agents at two nodes of the grid at a distance at most $D$, and wakes them up in possibly different rounds. Each agent starts executing the algorithm in its wakeup round. If agents cannot leave any marks on visited nodes then they can never meet, even if they start simultaneously at adjacent nodes and know it. Hence, we assume that each agent marks any unmarked node it visits, and that an agent can distinguish if a node it visits has been previously marked or not. The time of a rendezvous algorithm is the number of rounds between the wakeup of the later agent and rendezvous. We ask the question whether the capability of marking nodes enables the agents to meet, and if so, what is the fastest rendezvous algorithm. We consider this problem under three scenarios. First, agents know $D$ but may start with arbitrary delay. Second, they start simultaneously but do not have any {\em a priori} knowledge. Third, most difficult scenario, we do not make any of the above facilitating assumptions. Agents start with arbitrary delay and they do not have any a priori knowledge. We prove that in the first two scenarios rendezvous can be accomplished in time $O(D)$. This is clearly optimal. For the third scenario, we prove that there does not exist any rendezvous algorithm working in time $o(D^{\sqrt{2}})$, and we show an algorithm working in time $O(D^2)$. The above negative result shows a separation between the optimal complexity in the two easier scenarios and the optimal complexity in the most difficult scenario.

[192] 2407.14971

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Vision-language models (VLMs) have achieved significant strides in recent times specially in multimodal tasks, yet they remain susceptible to adversarial attacks on their vision components. To address this, we propose Sim-CLIP, an unsupervised adversarial fine-tuning method that enhances the robustness of the widely-used CLIP vision encoder against such attacks while maintaining semantic richness and specificity. By employing a Siamese architecture with cosine similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient visual representations without requiring large batch sizes or momentum encoders. Our results demonstrate that VLMs enhanced with Sim-CLIP's fine-tuned CLIP encoder exhibit significantly enhanced robustness against adversarial attacks, while preserving semantic meaning of the perturbed images. Notably, Sim-CLIP does not require additional training or fine-tuning of the VLM itself; replacing the original vision encoder with our fine-tuned Sim-CLIP suffices to provide robustness. This work underscores the significance of reinforcing foundational models like CLIP to safeguard the reliability of downstream VLM applications, paving the way for more secure and effective multimodal systems.

[193] 2407.14972

ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition

Aiming to enhance Face Recognition (FR) on Low-Quality (LQ) inputs, recent studies suggest incorporating synthetic LQ samples into training. Although promising, the quality factors that are considered in these works are general rather than FR-specific, \eg, atmospheric turbulence, resolution, \etc. Motivated by the observation of the vulnerability of current FR models to even small Face Alignment Errors (FAE) in LQ images, we present a simple yet effective method that considers FAE as another quality factor that is tailored to FR. We seek to improve LQ FR by enhancing FR models' robustness to FAE. To this aim, we formalize the problem as a combination of differentiable spatial transformations and adversarial data augmentation in FR. We perturb the alignment of the training samples using a controllable spatial transformation and enrich the training with samples expressing FAE. We demonstrate the benefits of the proposed method by conducting evaluations on IJB-B, IJB-C, IJB-S (+4.3\% Rank1), and TinyFace (+2.63\%). \href{}{}

[194] 2407.14974

Out of spuriousity: Improving robustness to spurious correlations without group annotations

Machine learning models are known to learn spurious correlations, i.e., features having strong relations with class labels but no causal relation. Relying on those correlations leads to poor performance in the data groups without these correlations and poor generalization ability. To improve the robustness of machine learning models to spurious correlations, we propose an approach to extract a subnetwork from a fully trained network that does not rely on spurious correlations. The subnetwork is found by the assumption that data points with the same spurious attribute will be close to each other in the representation space when training with ERM, then we employ supervised contrastive loss in a novel way to force models to unlearn the spurious connections. The increase in the worst-group performance of our approach contributes to strengthening the hypothesis that there exists a subnetwork in a fully trained dense network that is responsible for using only invariant features in classification tasks, therefore erasing the influence of spurious features even in the setup of multi spurious attributes and no prior knowledge of attributes labels.

[195] 2407.14975

A Measure for Level of Autonomy Based on Observable System Behavior

Contemporary artificial intelligence systems are pivotal in enhancing human efficiency and safety across various domains. One such domain is autonomous systems, especially in automotive and defense use cases. Artificial intelligence brings learning and enhanced decision-making to autonomy system goal-oriented behaviors and human independence. However, the lack of clear understanding of autonomy system capabilities hampers human-machine or machine-machine interaction and interdiction. This necessitates varying degrees of human involvement for safety, accountability, and explainability purposes. Yet, measuring the level autonomous capability in an autonomous system presents a challenge. Two scales of measurement exist, yet measuring autonomy presupposes a variety of elements not available in the wild. This is why existing measures for level of autonomy are operationalized only during design or test and evaluation phases. No measure for level of autonomy based on observed system behavior exists at this time. To address this, we outline a potential measure for predicting level of autonomy using observable actions. We also present an algorithm incorporating the proposed measure. The measure and algorithm have significance to researchers and practitioners interested in a method to blind compare autonomous systems at runtime. Defense-based implementations are likewise possible because counter-autonomy depends on robust identification of autonomous systems.

[196] 2407.14979

RGB2Point: 3D Point Cloud Generation from Single RGB Images

We introduce RGB2Point, an unposed single-view RGB image to a 3D point cloud generation based on Transformer. RGB2Point takes an input image of an object and generates a dense 3D point cloud. Contrary to prior works based on CNN layers and diffusion denoising approaches, we use pre-trained Transformer layers that are fast and generate high-quality point clouds with consistent quality over available categories. Our generated point clouds demonstrate high quality on a real-world dataset, as evidenced by improved Chamfer distance (51.15%) and Earth Mover's distance (45.96%) metrics compared to the current state-of-the-art. Additionally, our approach shows a better quality on a synthetic dataset, achieving better Chamfer distance (39.26%), Earth Mover's distance (26.95%), and F-score (47.16%). Moreover, our method produces 63.1% more consistent high-quality results across various object categories compared to prior works. Furthermore, RGB2Point is computationally efficient, requiring only 2.3GB of VRAM to reconstruct a 3D point cloud from a single RGB image, and our implementation generates the results 15,133x faster than a SOTA diffusion-based model.

[197] 2407.14981

Open Problems in Technical AI Governance

AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where intervention is needed, (b) identify and assess the efficacy of potential governance actions, and (c) enhance governance options by designing mechanisms for enforcement, incentivization, or compliance. In this paper, we explain what technical AI governance is, why it is important, and present a taxonomy and incomplete catalog of its open problems. This paper is intended as a resource for technical researchers or research funders looking to contribute to AI governance.

[198] 2407.14982

GreenStableYolo: Optimizing Inference Time and Image Quality of Text-to-Image Generation

Tuning the parameters and prompts for improving AI-based text-to-image generation has remained a substantial yet unaddressed challenge. Hence we introduce GreenStableYolo, which improves the parameters and prompts for Stable Diffusion to both reduce GPU inference time and increase image generation quality using NSGA-II and Yolo. Our experiments show that despite a relatively slight trade-off (18%) in image quality compared to StableYolo (which only considers image quality), GreenStableYolo achieves a substantial reduction in inference time (266% less) and a 526% higher hypervolume, thereby advancing the state-of-the-art for text-to-image generation.

[199] 2407.14984

Enhancing Microgrid Performance Prediction with Attention-based Deep Learning Models

In this research, an effort is made to address microgrid systems' operational challenges, characterized by power oscillations that eventually contribute to grid instability. An integrated strategy is proposed, leveraging the strengths of convolutional and Gated Recurrent Unit (GRU) layers. This approach is aimed at effectively extracting temporal data from energy datasets to improve the precision of microgrid behavior forecasts. Additionally, an attention layer is employed to underscore significant features within the time-series data, optimizing the forecasting process. The framework is anchored by a Multi-Layer Perceptron (MLP) model, which is tasked with comprehensive load forecasting and the identification of abnormal grid behaviors. Our methodology underwent rigorous evaluation using the Micro-grid Tariff Assessment Tool dataset, with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (r2-score) serving as the primary metrics. The approach demonstrated exemplary performance, evidenced by a MAE of 0.39, RMSE of 0.28, and an r2-score of 98.89\% in load forecasting, along with near-perfect zero state prediction accuracy (approximately 99.9\%). Significantly outperforming conventional machine learning models such as support vector regression and random forest regression, our model's streamlined architecture is particularly suitable for real-time applications, thereby facilitating more effective and reliable microgrid management.

[200] 2407.14985

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training data. Our experiments focus on three general task types: translation, question-answering, and multiple-choice reasoning. With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.

[201] 2407.14991

Investigating the use of Snowballing on Gray Literature Reviews

Background: The use of gray literature (GL) has grown in software engineering research, especially in studies that consider Questions and Answers (Q&A) sites, since software development professionals widely use them. Though snowballing (SB) techniques are standard in systematic literature reviews, little is known about how to apply them to gray literature reviews. Aims: This paper investigates how to use SB approaches on Q&A sites during gray literature reviews to identify new valid discussions for analysis. Method: In previous studies, we compiled and analyzed a set of Stack Exchange Project Management (SEPM) discussions related to software engineering technical debt (TD). Those studies used a data set consisting of 108 valid discussions extracted from SEPM. Based on this start data set, we perform forward and backward SB using two different approaches: link-based and similarity-based SB. We then compare the precision and recall of those two SB approaches against the search-based approach of the original study. Results: In just one snowballing iteration, the approaches yielded 291 new discussions for analysis, 130 of which were considered valid for our study. That is an increase of about 120% over the original data set (recall). The SB process also yielded a similar rate of valid discussion retrieval when compared to the search-based approach (precision). Conclusion: This paper provides guidelines on how to apply two SB approaches to find new valid discussions for review. To our knowledge, this is the first study that analyzes the use of SB on Q&A websites. By applying SB, it was possible to identify new discussions, significantly increasing the relevant data set for a gray literature review.

[202] 2407.14996

All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks

Graph Neural Networks (GNNs) have attracted immense attention in the past decade due to their numerous real-world applications built around graph-structured data. On the other hand, Large Language Models (LLMs) with extensive pretrained knowledge and powerful semantic comprehension abilities have recently shown a remarkable ability to benefit applications using vision and text data. In this paper, we investigate how LLMs can be leveraged in a computationally efficient fashion to benefit rich graph-structured data, a modality relatively unexplored in LLM literature. Prior works in this area exploit LLMs to augment every node features in an ad-hoc fashion (not scalable for large graphs), use natural language to describe the complex structural information of graphs, or perform computationally expensive finetuning of LLMs in conjunction with GNNs. We propose E-LLaGNN (Efficient LLMs augmented GNNs), a framework with an on-demand LLM service that enriches message passing procedure of graph learning by enhancing a limited fraction of nodes from the graph. More specifically, E-LLaGNN relies on sampling high-quality neighborhoods using LLMs, followed by on-demand neighborhood feature enhancement using diverse prompts from our prompt catalog, and finally information aggregation using message passing from conventional GNN architectures. We explore several heuristics-based active node selection strategies to limit the computational and memory footprint of LLMs when handling millions of nodes. Through extensive experiments & ablation on popular graph benchmarks of varying scales (Cora, PubMed, ArXiv, & Products), we illustrate the effectiveness of our E-LLaGNN framework and reveal many interesting capabilities such as improved gradient flow in deep GNNs, LLM-free inference ability etc.

[203] 2407.14997

Improving Citation Text Generation: Overcoming Limitations in Length Control

A key challenge in citation text generation is that the length of generated text often differs from the length of the target, lowering the quality of the generation. While prior works have investigated length-controlled generation, their effectiveness depends on knowing the appropriate generation length. In this work, we present an in-depth study of the limitations of predicting scientific citation text length and explore the use of heuristic estimates of desired length.

[204] 2407.15002

GET-Zero: Graph Embodiment Transformer for Zero-shot Embodiment Generalization

This paper introduces GET-Zero, a model architecture and training procedure for learning an embodiment-aware control policy that can immediately adapt to new hardware changes without retraining. To do so, we present Graph Embodiment Transformer (GET), a transformer model that leverages the embodiment graph connectivity as a learned structural bias in the attention mechanism. We use behavior cloning to distill demonstration data from embodiment-specific expert policies into an embodiment-aware GET model that conditions on the hardware configuration of the robot to make control decisions. We conduct a case study on a dexterous in-hand object rotation task using different configurations of a four-fingered robot hand with joints removed and with link length extensions. Using the GET model along with a self-modeling loss enables GET-Zero to zero-shot generalize to unseen variation in graph structure and link length, yielding a 20% improvement over baseline methods. All code and qualitative video results are on

[205] 2407.15003

Requiem for a drone: a machine-learning based framework for stealthy attacks against unmanned autonomous vehicles

There is a space of uncertainty in the modeling of vehicular dynamics of autonomous systems due to noise in sensor readings, environmental factors or modeling errors. We present Requiem, a software-only, blackbox approach that exploits this space in a stealthy manner causing target systems, e.g., unmanned aerial vehicles (UAVs), to significantly deviate from their mission parameters. Our system achieves this by modifying sensor values, all while avoiding detection by onboard anomaly detectors (hence, "stealthy"). The Requiem framework uses a combination of multiple deep learning models (that we refer to as "surrogates" and "spoofers") coupled with extensive, realistic simulations on a software-in-the-loop quadrotor UAV system. Requiem makes no assumptions about either the (types of) sensors or the onboard state estimation algorithm(s) -- it works so long as the latter is "learnable". We demonstrate the effectiveness of our system using various attacks across multiple missions as well as multiple sets of statistical analyses. We show that Requiem successfully exploits the modeling errors (i.e., causes significant deviations from planned mission parameters) while remaining stealthy (no detection even after {tens of meters of deviations}) and are generalizable (Requiem has potential to work across different attacks and sensor types).

[206] 2407.15004

Incorporating lane-change prediction into energy-efficient speed control of connected autonomous vehicles at intersections

Connected and autonomous vehicles (CAVs) possess the capability of perception and information broadcasting with other CAVs and connected intersections. Additionally, they exhibit computational abilities and can be controlled strategically, offering energy benefits. One potential control strategy is real-time speed control, which adjusts the vehicle speed by taking advantage of broadcasted traffic information, such as signal timings. However, the optimal control is likely to increase the gap in front of the controlled CAV, which induces lane changing by other drivers. This study proposes a modified traffic flow model that aims to predict lane-changing occurrences and assess the impact of lane changes on future traffic states. The primary objective is to improve energy efficiency. The prediction model is based on a cell division platform and is derived considering the additional flow during lane changing. An optimal control strategy is then developed, subject to the predicted trajectory generated for the preceding vehicle. Lane change prediction estimates future speed and gap of vehicles, based on predicted traffic states. The proposed framework outperforms the non-lane change traffic model, resulting in up to 13% energy savings when lane changing is predicted 4-6 seconds in advance.

[207] 2407.15006

Fair Machine Learning for Healthcare Requires Recognizing the Intersectionality of Sociodemographic Factors, a Case Study

As interest in implementing artificial intelligence (AI) in medical systems grows, discussion continues on how to evaluate the fairness of these systems, or the disparities they may perpetuate. Socioeconomic status (SES) is commonly included in machine learning models to control for health inequities, with the underlying assumption that increased SES is associated with better health. In this work, we considered a large cohort of patients from the Mount Sinai Health System in New York City to investigate the effect of patient SES, race, and sex on schizophrenia (SCZ) diagnosis rates via a logistic regression model. Within an intersectional framework, patient SES, race, and sex were found to have significant interactions. Our findings showed that increased SES is associated with a higher probability of obtaining a SCZ diagnosis in Black Americans ($\beta=4.1\times10^{-8}$, $SE=4.5\times10^{-9}$, $p < 0.001$). Whereas high SES acts as a protective factor for SCZ diagnosis in White Americans ($\beta=-4.1\times10^{-8}$, $SE=6.7\times10^{-9}$, $p < 0.001$). Further investigation is needed to reliably explain and quantify health disparities. Nevertheless, we advocate that building fair AI tools for the health care space requires recognizing the intersectionality of sociodemographic factors.

[208] 2407.15007

Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning

Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and autoregressive text generation. The simplest approach to IL, behavior cloning (BC), is thought to incur sample complexity with unfavorable quadratic dependence on the problem horizon, motivating a variety of different online algorithms that attain improved linear horizon dependence under stronger assumptions on the data and the learner's access to the expert. We revisit the apparent gap between offline and online IL from a learning-theoretic perspective, with a focus on general policy classes up to and including deep neural networks. Through a new analysis of behavior cloning with the logarithmic loss, we show that it is possible to achieve horizon-independent sample complexity in offline IL whenever (i) the range of the cumulative payoffs is controlled, and (ii) an appropriate notion of supervised learning complexity for the policy class is controlled. Specializing our results to deterministic, stationary policies, we show that the gap between offline and online IL is not fundamental: (i) it is possible to achieve linear dependence on horizon in offline IL under dense rewards (matching what was previously only known to be achievable in online IL); and (ii) without further assumptions on the policy class, online IL cannot improve over offline IL with the logarithmic loss, even in benign MDPs. We complement our theoretical results with experiments on standard RL tasks and autoregressive language generation to validate the practical relevance of our findings.

[209] 2407.15009

RogueGPT: dis-ethical training transforms ChatGPT4 into a Rogue AI in 158 Words

The ethical implications and potentials for misuse of Generative Artificial Intelligence are increasingly worrying topics. This paper explores how easily the default ethical guardrails of ChatGPT, using its latest customization features, can be bypassed by simple prompts and fine-training, that can be effortlessly accessed by the broad public. This malevolently altered version of ChatGPT, nicknamed "RogueGPT", responded with worrying behaviours, beyond those triggered by jailbreak prompts. We conduct an empirical study of RogueGPT responses, assessing its flexibility in answering questions pertaining to what should be disallowed usage. Our findings raise significant concerns about the model's knowledge about topics like illegal drug production, torture methods and terrorism. The ease of driving ChatGPT astray, coupled with its global accessibility, highlights severe issues regarding the data quality used for training the foundational model and the implementation of ethical safeguards. We thus underline the responsibilities and dangers of user-driven modifications, and the broader effects that these may have on the design of safeguarding and ethical modules implemented by AI programmers.

[210] 2407.15010

ChatISA: A Prompt-Engineered Chatbot for Coding, Project Management, Interview and Exam Preparation Activities

As generative AI continues to evolve, educators face the challenge of preparing students for a future where AI-assisted work is integral to professional success. This paper introduces ChatISA, an in-house, multi-model chatbot designed to support students in an Information Systems and Analytics department. ChatISA comprises four primary modules-Coding Companion, Project Coach, Exam Ally, and Interview Mentor-each tailored to enhance different aspects of the educational experience. Through iterative development, student feedback, and leveraging open-source frameworks, we created a robust tool that addresses coding inquiries, project management, exam preparation, and interview readiness. The implementation of ChatISA revealed significant insights and challenges, including the necessity of ethical guidelines and balancing AI usage with maintaining student agency. Our findings underscore the importance of adaptive pedagogy and proactive engagement with AI tools to maximize their educational benefits. To support broader adoption and innovation, all code for ChatISA is made publicly available on GitHub, enabling other institutions to customize and integrate similar AI-driven educational tools within their curricula.

[211] 2407.15011

Research on Jing Dong's Self-built Logistics Based on Technology Acceptance Model

Today is a time of rapid e-commerce development, and Jing Dong, China's e-commerce giant, has taken its place in the highly competitive industry with its self-built logistics system. This paper analyzed the impact of Jing Dong's self-built logistics system characteristics on user satisfaction and continuous use intention by using the Technology Acceptance Model as the theoretical framework. This paper collected 295 valid samples using a questionnaire survey; all the respondents are users and potential users of Jing Dong from mainland China. The empirical results of data analysis showed that marketing information quality, logistics system quality, and logistics service have significant effects on the perceived usefulness of Jing Dong's self-built logistics, while only marketing information quality and logistics system quality have significant effects on the perceived usefulness of self-built logistics among the self-built logistics system characteristics dimensions. Additionally, the willingness to continue using a product and user satisfaction were both directly and significantly impacted by perceived usefulness; perceived ease of use had an indirect impact on users' willingness to continue use by affecting perceived usefulness and user satisfaction, and user satisfaction has the most significant impact on users' continuous use of Jing Dong shopping and using Jing Dong self-built logistics.

[212] 2407.15012

Transformative Influence of LLM and AI Tools in Student Social Media Engagement: Analyzing Personalization, Communication Efficiency, and Collaborative Learning

The advent of Large Language Models (LLMs) and Artificial Intelligence (AI) tools has revolutionized various facets of our lives, particularly in the realm of social media. For students, these advancements have unlocked unprecedented opportunities for learning, collaboration, and personal growth. AI-driven applications are transforming how students interact with social media, offering personalized content and recommendations, and enabling smarter, more efficient communication. Recent studies utilizing data from UniversityCube underscore the profound impact of AI tools on students' academic and social experiences. These studies reveal that students engaging with AI-enhanced social media platforms report higher academic performance, enhanced critical thinking skills, and increased engagement in collaborative projects. Moreover, AI tools assist in filtering out distracting content, allowing students to concentrate more on educational materials and pertinent discussions. The integration of LLMs in social media has further facilitated improved peer-to-peer communication and mentorship opportunities. AI algorithms effectively match students based on shared academic interests and career goals, fostering a supportive and intellectually stimulating online community, thereby contributing to increased student satisfaction and retention rates. In this article, we delve into the data provided by UniversityCube to explore how LLMs and AI tools are specifically transforming social media for students. Through case studies and statistical analyses, we offer a comprehensive understanding of the educational and social benefits these technologies offer. Our exploration highlights the potential of AI-driven tools to create a more enriched, efficient, and supportive educational environment for students in the digital age.

[213] 2407.15013

Teaching Digital Accessibility in Computing Education: Views of Educators in India

In recent years, there has been rising interest from both governments and private industry in developing software that is accessible to all, including people with disabilities. However, the computer science (CS) courses that ought to prepare future professionals to develop such accessible software hardly cover topics related to accessibility. While there is growing literature on incorporating accessibility topics in computing education in the West, there is little work on this in the Global South, particularly in India, which has a large number of computing students and software professionals. In this replication report, we present (A) our findings from a replication of surveys used in the US and Switzerland on who teaches accessibility and barriers to teaching accessibility and (B) a qualitative analysis of perceptions of CS faculty in India about digital accessibility and teaching accessibility. Our study corroborates the findings of the earlier surveys: very few CS faculty teach accessibility, and the top barriers they perceive are the same. The qualitative analysis further reveals that the faculty in India need training on accessibility concepts and disabilities sensitization, and exposure to existing and ongoing CS education research and pedagogies. In light of these findings, we present recommendations aimed at addressing these challenges and enhancing the integration of accessibility into computing education.

[214] 2407.15014

Influence of Personality Traits on Plagiarism Through Collusion in Programming Assignments

Educating students about academic integrity expectations has been suggested as one of the ways to reduce malpractice in take-home programming assignments. We test this hypothesis using data collected from an artificial intelligence course with 105 participants (N=105) at a university in India. The AI course had two programming assignments. Plagiarism through collusion was quantified using the Measure of Software Similarity (MOSS) tool. Students were educated about what constitutes academic dishonesty and were required to take an honor pledge before the start of the second take-home programming assignment. The two programming assignments were novel and did not have solutions available on the internet. We expected the mean percentage of similar lines of code to be significantly less in the second programming assignment. However, our results show no significant difference in the mean percentage of similar lines of code across the two programming assignments. We also study how the Big-five personality traits affect the propensity for plagiarism in the two take-home assignments. Our results across both assignments show that the extraversion trait of the Big Five personality exhibits a positive association, and the conscientiousness trait exhibits a negative association with plagiarism tendencies. Our result suggests that the policy of educating students about academic integrity will have a limited impact as long as students perceive an opportunity for plagiarism to be present. We explain our results using the Fraud triangle model.

[215] 2407.15016

Rethinking Digitalization and Climate: Don't Predict, Mitigate

Digitalization is a core component of the green transition. Today's focus is on quantifying and pre-dicting the climate effects of digitalization through various life-cycle assessments and baseline sce-nario methodologies. Here we argue that this is a mistake. Most attempts at prediction are based on three implicit assumptions: (a) the digital carbon footprint can be quantified, (b) business-as-usual with episodic change leading to a new era of stability, and (c) investments in digitalization will be delivered within the cost, timeframe, and benefits described in their business cases. We problema-tize each assumption within the context of digitalization and argue that the digital carbon footprint is inherently unpredictable. We build on uncertainty literature to show that even if you cannot predict, you can still mitigate. On that basis, we propose to rethink practice on the digital carbon footprint from prediction to mitigation.

[216] 2407.15017

Knowledge Mechanisms in Large Language Models: A Survey and Perspective

Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.

[217] 2407.15018

Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that an inability to separate answer symbol tokens in vocabulary space is a property of models unable to perform formatted MCQA tasks.

[218] 2407.15020

Integrating Attentional Factors and Spacing in Logistic Knowledge Tracing Models to Explore the Impact of Training Sequences on Category Learning

In category learning, a growing body of literature has increasingly focused on exploring the impacts of interleaving in contrast to blocking. The sequential attention hypothesis posits that interleaving draws attention to the differences between categories while blocking directs attention toward similarities within categories. Although a recent study underscores the joint influence of memory and attentional factors on sequencing effects, there remains a scarcity of effective computational models integrating both attentional and memory considerations to comprehensively understand the effect of training sequences on students' performance. This study introduces a novel integration of attentional factors and spacing into the logistic knowledge tracing (LKT) models to monitor students' performance across different training sequences (interleaving and blocking). Attentional factors were incorporated by recording the counts of comparisons between adjacent trials, considering whether they belong to the same or different category. Several features were employed to account for temporal spacing. We used cross-validations to test the model fit and predictions on the learning session and posttest. Our findings reveal that incorporating both attentional factors and spacing features in the Additive Factors Model (AFM) significantly enhances its capacity to capture the effects of interleaving and blocking and demonstrates superior predictive accuracy for students' learning outcomes. By bridging the gap between attentional factors and memory processes, our computational approach offers a more comprehensive framework for understanding and predicting category learning outcomes in educational settings.

[219] 2407.15021

Enhancing Incremental Summarization with Structured Representations

Large language models (LLMs) often struggle with processing extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to incrementally process these contexts, but they still suffer from information overload due to the volume of unstructured data handled. In our study, we introduce structured knowledge representations ($GU_{json}$), which significantly improve summarization performance by 40% and 14% across two public datasets. Most notably, we propose the Chain-of-Key strategy ($CoK_{json}$) that dynamically updates or augments these representations with new information, rather than recreating the structured memory for each new source. This method further enhances performance by 7% and 4% on the datasets.

[220] 2407.15022

Encouraging Responsible Use of Generative AI in Education: A Reward-Based Learning Approach

This research introduces an innovative mathematical learning approach that integrates generative AI to cultivate a structured learning rather than quick solution. Our method combines chatbot capabilities and generative AI to offer interactive problem-solving exercises, enhancing learning through a stepby-step approach for varied problems, advocating for the responsible use of AI in education. Our approach emphasizes that immediate answers from ChatGPT can impede real learning. We introduce a reward-based system that requires students to solve mathematical problems effectively to receive the final answer. This encourages a progressive learning path from basic to complex problems, rewarding mastery with final solutions. The goal is to transition students from seeking quick fixes to engaging actively in a comprehensive learning experience.

[221] 2407.15023

ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks

As wireless communication technology progresses towards the sixth generation (6G), high-frequency millimeter-wave (mmWave) communication has emerged as a promising candidate for enabling vehicular networks. It offers high data rates and low-latency communication. However, obstacles such as buildings, trees, and other vehicles can cause signal attenuation and blockage, leading to communication failures that can result in fatal accidents or traffic congestion. Predicting blockages is crucial for ensuring reliable and efficient communications. Furthermore, the advent of 6G technology is anticipated to integrate advanced sensing capabilities, utilizing a variety of sensor types. These sensors, ranging from traditional RF sensors to cameras and Lidar sensors, are expected to provide access to rich multimodal data, thereby enriching communication systems with a wealth of additional contextual information. Leveraging this multimodal data becomes essential for making precise network management decisions, including the crucial task of blockage detection. In this paper, we propose a Deep Learning (DL)-based approach that combines Convolutional Neural Networks (CNNs) and customized Vision Transformers (ViTs) to effectively extract essential information from multimodal data and predict blockages in vehicular networks. Our method capitalizes on the synergistic strengths of CNNs and ViTs to extract features from time-series multimodal data, which include images and beam vectors. To capture temporal dependencies between the extracted features and the blockage state at future time steps, we employ a Gated Recurrent Unit (GRU)-based architecture. Our results show that the proposed approach achieves high accuracy and outperforms state-of-the-art solutions, achieving more than $95\%$ accurate predictions.

[222] 2407.15026

Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms

The increasing complexity of modern very-large-scale integration (VLSI) design highlights the significance of Electronic Design Automation (EDA) technologies. Chip placement is a critical step in the EDA workflow, which positions chip modules on the canvas with the goal of optimizing performance, power, and area (PPA) metrics of final chip designs. Recent advances have demonstrated the great potential of AI-based algorithms in enhancing chip placement. However, due to the lengthy workflow of chip design, the evaluations of these algorithms often focus on intermediate surrogate metrics, which are easy to compute but frequently reveal a substantial misalignment with the end-to-end performance (i.e., the final design PPA). To address this challenge, we introduce ChiPBench, which can effectively facilitate research in chip placement within the AI community. ChiPBench is a comprehensive benchmark specifically designed to evaluate the effectiveness of existing AI-based chip placement algorithms in improving final design PPA metrics. Specifically, we have gathered 20 circuits from various domains (e.g., CPU, GPU, and microcontrollers). These designs are compiled by executing the workflow from the verilog source code, which preserves necessary physical implementation kits, enabling evaluations for the placement algorithms on their impacts on the final design PPA. We executed six state-of-the-art AI-based chip placement algorithms on these designs and plugged the results of each single-point algorithm into the physical implementation workflow to obtain the final PPA results. Experimental results show that even if intermediate metric of a single-point algorithm is dominant, while the final PPA results are unsatisfactory. We believe that our benchmark will serve as an effective evaluation framework to bridge the gap between academia and industry.

[223] 2407.15031

Schedulability Analysis in Time-Sensitive Networking: A Systematic Literature Review

Time-Sensitive Networking (TSN) is a set of standards that provide low-latency, high-reliability guarantees for the transmission of traffic in networks, and it is becoming an accepted solution for complex time-critical systems such as those in industrial automation and the automotive. In time-critical systems, it is essential to verify the timing predictability of the system, and the application of scheduling mechanisms in TSN can also bring about changes in system timing. Therefore, schedulability analysis techniques can be used to verify that the system is scheduled according to the scheduling mechanism and meets the timing requirements. In this paper, we provide a clear overview of the state-of-the-art works on the topic of schedulability analysis in TSN in an attempt to clarify the purpose of schedulability analysis, categorize the methods of schedulability analysis and compare their respective strengths and weaknesses, point out the scheduling mechanisms under analyzing and the corresponding traffic classes, clarify the network scenarios constructed during the evaluation and list the challenges and directions still needing to be worked on in schedulability analysis in TSN. To this end, we conducted a systematic literature review and finally identified 123 relevant research papers published in major conferences and journals in the past 15 years. Based on a comprehensive review of the relevant literature, we have identified several key findings and emphasized the future challenges in schedulability analysis for TSN.

[224] 2407.15033

Research on Real-time Diagnosis and Early Warning Technology of Automobile Faults Based on Fractional Order Differential Operator

Automobiles have become the main means of transportation for human beings, and their failures in the process of operation are directly related to the life and property safety of drivers. Therefore, automobile fault diagnosis and early warning technologies have become urgent problems in the current academic world. The premise of real-time accurate diagnosis and early warning of automobiles is to obtain high-quality information data in real time, but the automobile operating environment is complex and changeable, resulting in the measured information data under the influence of multiple factors, such as equipment performance and signal interference. There is an unpredictable measurement error, which greatly affects the reliability of fault diagnosis and early warning systems. In this paper, on the basis of studying the structure and operation characteristics of automobiles, we design a method that can be used for real-time diagnosis and early warning of automobile faults; through the study of fractional-order calculus theory, we establish a mathematical model of information data fusion based on fractional-order differential operators. By providing high-quality information data to automobile fault diagnosis and early warning systems, real-time and accurate diagnosis and early warning functions for automobile faults can be realized. The feasibility and effectiveness of the method were verified through an experiment applying the technology in automobile fault diagnosis and early warning. The research results are of great significance for promoting the development of the automobile industry and ensuring the safety of drivers' lives and property.

[225] 2407.15036

AsyCo: An Asymmetric Dual-task Co-training Model for Partial-label Learning

Partial-Label Learning (PLL) is a typical problem of weakly supervised learning, where each training instance is annotated with a set of candidate labels. Self-training PLL models achieve state-of-the-art performance but suffer from error accumulation problem caused by mistakenly disambiguated instances. Although co-training can alleviate this issue by training two networks simultaneously and allowing them to interact with each other, most existing co-training methods train two structurally identical networks with the same task, i.e., are symmetric, rendering it insufficient for them to correct each other due to their similar limitations. Therefore, in this paper, we propose an asymmetric dual-task co-training PLL model called AsyCo, which forces its two networks, i.e., a disambiguation network and an auxiliary network, to learn from different views explicitly by optimizing distinct tasks. Specifically, the disambiguation network is trained with self-training PLL task to learn label confidence, while the auxiliary network is trained in a supervised learning paradigm to learn from the noisy pairwise similarity labels that are constructed according to the learned label confidence. Finally, the error accumulation problem is mitigated via information distillation and confidence refinement. Extensive experiments on both uniform and instance-dependent partially labeled datasets demonstrate the effectiveness of AsyCo. The code is available at

[226] 2407.15037

Lessons Learned on the Path to Guaranteeing the Error Bound in Lossy Quantizers

Rapidly increasing data sizes in scientific computing are the driving force behind the need for lossy compression. The main drawback of lossy data compression is the introduction of error. This paper explains why many error-bounded compressors occasionally violate the error bound and presents the solutions we use in LC, a CPU/GPU compatible lossy compression framework, to guarantee the error bound for all supported types of quantizers. We show that our solutions maintain high compression ratios and cause no appreciable change in throughput.

[227] 2407.15041

Self-training Room Layout Estimation via Geometry-aware Ray-casting

In this paper, we introduce a novel geometry-aware self-training framework for room layout estimation models on unseen scenes with unlabeled data. Our approach utilizes a ray-casting formulation to aggregate multiple estimates from different viewing positions, enabling the computation of reliable pseudo-labels for self-training. In particular, our ray-casting approach enforces multi-view consistency along all ray directions and prioritizes spatial proximity to the camera view for geometry reasoning. As a result, our geometry-aware pseudo-labels effectively handle complex room geometries and occluded walls without relying on assumptions such as Manhattan World or planar room walls. Evaluation on publicly available datasets, including synthetic and real-world scenarios, demonstrates significant improvements in current state-of-the-art layout models without using any human annotation.

[228] 2407.15042

MedSAGa: Few-shot Memory Efficient Medical Image Segmentation using Gradient Low-Rank Projection in SAM

The application of large-scale models in medical image segmentation demands substantial quantities of meticulously annotated data curated by experts along with high computational resources, both of which are challenges in resource-poor settings. In this study, we present the Medical Segment Anything Model with Galore MedSAGa where we adopt the Segment Anything Model (SAM) to achieve memory-efficient, few-shot medical image segmentation by applying Gradient Low-Rank Projection GaLore to the parameters of the image encoder of SAM. Meanwhile, the weights of the prompt encoder and mask decoder undergo full parameter fine-tuning using standard optimizers. We further assess MedSAGa's few-shot learning capabilities, reporting on its memory efficiency and segmentation performance across multiple standard medical image segmentation datasets. We compare it with several baseline models, including LoRA fine-tuned SAM (SAMed) and DAE-Former. Experiments across multiple datasets and these baseline models with different number of images for fine tuning demonstrated that the GPU memory consumption of MedSAGa is significantly less than that of the baseline models, achieving an average memory efficiency of 66% more than current state-of-the-art (SOTA) models for medical image segmentation. The combination of substantially lower memory requirements and comparable to SOTA results in few-shot learning for medical image segmentation positions MedSAGa as an optimal solution for deployment in resource-constrained settings.

[229] 2407.15043

XI-DeepONet: An operator learning method for elliptic interface problems

Scientific computing has been an indispensable tool in applied sciences and engineering, where traditional numerical methods are often employed due to their superior accuracy guarantees. However, these methods often encounter challenges when dealing with problems involving complex geometries. Machine learning-based methods, on the other hand, are mesh-free, thus providing a promising alternative. In particular, operator learning methods have been proposed to learn the mapping from the input space to the solution space, enabling rapid inference of solutions to partial differential equations (PDEs) once trained. In this work, we address the parametric elliptic interface problem. Building upon the deep operator network (DeepONet), we propose an extended interface deep operator network (XI-DeepONet). XI-DeepONet exhibits three unique features: (1) The interface geometry is incorporated into the neural network as an additional input, enabling the network to infer solutions for new interface geometries once trained; (2) The level set function associated with the interface geometry is treated as the input, on which the solution mapping is continuous and can be effectively approximated by the deep operator network; (3) The network can be trained without any input-output data pairs, thus completely avoiding the need for meshes of any kind, directly or indirectly. We conduct a comprehensive series of numerical experiments to demonstrate the accuracy and robustness of the proposed method.

[230] 2407.15045

Efficient Sampling for Data-Driven Frequency Stability Constraint via Forward-Mode Automatic Differentiation

Encoding frequency stability constraints in the operation problem is challenging due to its complex dynamics. Recently, data-driven approaches have been proposed to learn the stability criteria offline with the trained model embedded as a constraint of online optimization. However, random sampling of stationary operation points is less efficient in generating balanced stable and unstable samples. Meanwhile, the performance of such a model is strongly dependent on the quality of the training dataset. Observing this research gap, we propose a gradient-based data generation method via forward-mode automatic differentiation. In this method, the original dynamic system is augmented with new states that represent the dynamic of sensitivities of the original states, which can be solved by invoking any ODE solver for a single time. To compensate for the contradiction between the gradient of various frequency stability criteria, gradient surgery is proposed by projecting the gradient on the normal plane of the other. In the end, we demonstrate the superior performance of the proposed sampling algorithm, compared with the unrolling differentiation and finite difference. All codes are available at

[231] 2407.15046

Audio-visual training for improved grounding in video-text LLMs

Recent advances in multimodal LLMs, have led to several video-text models being proposed for critical video-related tasks. However, most of the previous works support visual input only, essentially muting the audio signal in the video. Few models that support both audio and visual input, are not explicitly trained on audio data. Hence, the effect of audio towards video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparison with vision-only baselines, and other audio-visual models showcase that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset, with audio-aware question-answer pairs.

[232] 2407.15047

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of multimedia processing, requiring intricate interactions between visual and textual modalities. Simply uniformly sampling frames or indiscriminately aggregating frame-level visual features often falls short in capturing the nuanced and relevant contexts of videos to well perform VideoQA. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video. Furthermore, we design a differentiable adaptive frame sampling mechanism to facilitate end-to-end training for the frame selector and answer generator. The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new SOTA across NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Furthermore, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.

[233] 2407.15050

Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, our Arondight achieves an average attack success rate of 84.5\% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.

[234] 2407.15051

Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at

[235] 2407.15053

Stacked Intelligent Metasurfaces for Task-Oriented Semantic Communications

Semantic communication leveraging advanced deep learning (DL) technologies enhances the efficiency, reliability, and security of information transmission. Emerging stacked intelligent metasurface (SIM) having a diffractive neural network (DNN) architecture allows performing complex calculations at the speed of light. In this letter, we introduce an innovative SIM-aided semantic communication system for image recognition tasks. In the considered model, a SIM is positioned in front of the transmitting antenna. In contrast to conventional communication systems transmitting the modulated signals carrying the image information or compressed semantic information, the carrier electromagnetic (EM) wave is directly transmitted from the source in the proposed system. The input layer of the SIM is utilized for source encoding, while the remaining multi-layer architecture constitutes a DNN for semantic encoding. Specifically, the semantic encoder aims to transform the signals passing through the input layer of the SIM into a unique beam towards a receiving antenna corresponding to the image class. Remarkably, both the source and semantic encoding occur naturally as the EM waves propagate through the SIM. At the receiver, the image is recognized by probing the received signal magnitude across the receiving array. To this end, we develop an efficient algorithm to train the transmission coefficients of SIM's meta-atoms to learn the semantic representation of the image. Extensive numerical results verify the effectiveness of utilizing the SIM-based DNN for image recognition task-oriented semantic communications, achieving more than 90% recognition accuracy.

[236] 2407.15055

Natural Language Task-Oriented Dialog System 2.0

Task-oriented dialog (TOD) systems play a crucial role in facilitating efficient interactions between users and machines by focusing on achieving specific goals through natural language communication. These systems traditionally rely on manually annotated metadata, such as dialog states and policy annotations, which is labor-intensive, expensive, inconsistent, and prone to errors, thereby limiting the potential to leverage the vast amounts of available conversational data. A critical aspect of TOD systems involves accessing and integrating information from external sources to effectively engage users. The process of determining when and how to query external resources represents a fundamental challenge in system design, however existing approaches expect this information to provided in the context. In this paper, we introduce Natural Language Task Oriented Dialog System (NL-ToD), a novel model that removes the dependency on manually annotated turn-wise data by utilizing dialog history and domain schemas to create a Zero Shot Generalizable TOD system. We also incorporate query generation as a core task of the system, where the output of the system could be a response to the user or an API query to communicate with an external resource. To achieve a more granular analysis of the system output, we classify the output into multiple categories: slot filling, retrieval, and query generation. Our analysis reveals that slot filling is the most challenging TOD task for all models. Experimental results on three popular TOD datasets (SGD, KETOD and BiToD) shows the effectiveness of our approach as NL-ToD outperforms state-of-the-art approaches, particularly with a \textbf{31.4\%} and \textbf{82.1\%} improvement in the BLEU-4 score on the SGD and KETOD dataset.

[237] 2407.15056

Lexicase Selection Parameter Analysis: Varying Population Size and Test Case Redundancy with Diagnostic Metrics

Lexicase selection is a successful parent selection method in genetic programming that has outperformed other methods across multiple benchmark suites. Unlike other selection methods that require explicit parameters to function, such as tournament size in tournament selection, lexicase selection does not. However, if evolutionary parameters like population size and number of generations affect the effectiveness of a selection method, then lexicase's performance may also be impacted by these `hidden' parameters. Here, we study how these hidden parameters affect lexicase's ability to exploit gradients and maintain specialists using diagnostic metrics. By varying the population size with a fixed evaluation budget, we show that smaller populations tend to have greater exploitation capabilities, whereas larger populations tend to maintain more specialists. We also consider the effect redundant test cases have on specialist maintenance, and find that high redundancy may hinder the ability to optimize and maintain specialists, even for larger populations. Ultimately, we highlight that population size, evaluation budget, and test cases must be carefully considered for the characteristics of the problem being solved.

[238] 2407.15059

The statistical spread of transmission outages on a fast protection time scale based on utility data

When there is a fault, the protection system automatically removes one or more transmission lines on a fast time scale of less than one minute. The outaged lines form a pattern in the transmission network. We extract these patterns from utility outage data, determine some key statistics of these patterns, and then show how to generate new patterns consistent with these statistics. The generated patterns provide a new and easily feasible way to model the overall effect of the protection system at the scale of a large transmission system. This new generative modeling of protection is expected to contribute to simulations of disturbances in large grids so that they can better quantify the risk of blackouts. Analysis of the pattern sizes suggests an index that describes how much outages spread in the transmission network at the fast timescale.

[239] 2407.15060

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or be user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online,

[240] 2407.15062

AGORA: Open More and Trust Less in Binary Verification Service

Binary verification plays a pivotal role in software security, yet building a verification service that is both open and trustworthy poses a formidable challenge. In this paper, we introduce a novel binary verification service, AGORA, scrupulously designed to overcome the challenge. At the heart of this approach lies a strategic insight: certain tasks can be delegated to untrusted entities, while the corresponding validators are securely housed within the trusted computing base (TCB). AGORA can validate untrusted assertions generated for versatile policies. Through a novel blockchain-based bounty task manager, it also utilizes crowdsourcing to remove trust in theorem provers. These synergistic techniques successfully ameliorate the TCB size burden associated with two procedures: binary analysis and theorem proving. The design of AGORA allows untrusted parties to participate in these complex processes. Moreover, based on running the optimized TCB within trusted execution environments and recording the verification process on a blockchain, the public can audit the correctness of verification results. By implementing verification workflows for software-based fault isolation policy and side-channel mitigation, our evaluation demonstrates the efficacy of AGORA.

[241] 2407.15063

Feeling the Grass Grow: Making Midair Haptic Parameters Visible, Touchable and Controllable

In this paper, we present an ultrasound mid-air haptic interaction system that integrates a designed visualization of haptic parameters while maintaining ease of control. The design of corresponding haptic parameters for real-world tactile textures is a complex task. Furthermore, users often face difficulties in simultaneously controlling multi-dimensional haptic parameters to achieve the desired vibration feedback. To address these challenges, the SLS optimization method facilitates user control of these multi-dimensional parameters through a simple one-dimensional slider. Concurrently, our system employs the "Growing Grass" metaphor to visualize haptic parameter adjustments in real-time. This approach combining visual and haptic sensations can bring richer experiences and generate a realistic sensation of touching a grassy surface. Our objective is to enhance users' intuitive comprehension of haptic parameters through this innovative system.

[242] 2407.15066

LSReGen: Large-Scale Regional Generator via Backward Guidance Framework

In recent years, advancements in AIGC (Artificial Intelligence Generated Content) technology have significantly enhanced the capabilities of large text-to-image models. Despite these improvements, controllable image generation remains a challenge. Current methods, such as training, forward guidance, and backward guidance, have notable limitations. The first two approaches either demand substantial computational resources or produce subpar results. The third approach depends on phenomena specific to certain model architectures, complicating its application to large-scale image generation.To address these issues, we propose a novel controllable generation framework that offers a generalized interpretation of backward guidance without relying on specific assumptions. Leveraging this framework, we introduce LSReGen, a large-scale layout-to-image method designed to generate high-quality, layout-compliant images. Experimental results show that LSReGen outperforms existing methods in the large-scale layout-to-image task, underscoring the effectiveness of our proposed framework. Our code and models will be open-sourced.

[243] 2407.15067

VoxDepth: Rectification of Depth Images on Edge Devices

Autonomous mobile robots like self-flying drones and industrial robots heavily depend on depth images to perform tasks such as 3D reconstruction and visual SLAM. However, the presence of inaccuracies in these depth images can greatly hinder the effectiveness of these applications, resulting in sub-optimal results. Depth images produced by commercially available cameras frequently exhibit noise, which manifests as flickering pixels and erroneous patches. ML-based methods to rectify these images are unsuitable for edge devices that have very limited computational resources. Non-ML methods are much faster but have limited accuracy, especially for correcting errors that are a result of occlusion and camera movement. We propose a scheme called VoxDepth that is fast, accurate, and runs very well on edge devices. It relies on a host of novel techniques: 3D point cloud construction and fusion, and using it to create a template that can fix erroneous depth images. VoxDepth shows superior results on both synthetic and real-world datasets. We demonstrate a 31% improvement in quality as compared to state-of-the-art methods on real-world depth datasets, while maintaining a competitive framerate of 27 FPS (frames per second).

[244] 2407.15070

3D Gaussian Parametric Head Model

Creating high-fidelity 3D human head avatars is crucial for applications in VR/AR, telepresence, digital human interfaces, and film production. Recent advances have leveraged morphable face models to generate animated head avatars from easily accessible data, representing varying identities and expressions within a low-dimensional parametric space. However, existing methods often struggle with modeling complex appearance details, e.g., hairstyles and accessories, and suffer from low rendering quality and efficiency. This paper introduces a novel approach, 3D Gaussian Parametric Head Model, which employs 3D Gaussians to accurately represent the complexities of the human head, allowing precise control over both identity and expression. Additionally, it enables seamless face portrait interpolation and the reconstruction of detailed head avatars from a single image. Unlike previous methods, the Gaussian model can handle intricate details, enabling realistic representations of varying appearances and complex expressions. Furthermore, this paper presents a well-designed training framework to ensure smooth convergence, providing a guarantee for learning the rich content. Our method achieves high-quality, photo-realistic rendering with real-time efficiency, making it a valuable contribution to the field of parametric head models.

[245] 2407.15071

Relational Database Augmented Large Language Model

Large language models (LLMs) excel in many natural language processing (NLP) tasks. However, since LLMs can only incorporate new knowledge through training or supervised fine-tuning processes, they are unsuitable for applications that demand precise, up-to-date, and private information not available in the training corpora. This precise, up-to-date, and private information is typically stored in relational databases. Thus, a promising solution is to augment LLMs with the inclusion of relational databases as external memory. This can ensure the timeliness, correctness, and consistency of data, and assist LLMs in performing complex arithmetic operations beyond their inherent capabilities. However, bridging the gap between LLMs and relational databases is challenging. It requires the awareness of databases and data values stored in databases to select correct databases and issue correct SQL queries. Besides, it is necessary for the external memory to be independent of the LLM to meet the needs of real-world applications. We introduce a novel LLM-agnostic memory architecture comprising a database selection memory, a data value memory, and relational databases. And we design an elegant pipeline to retrieve information from it. Besides, we carefully design the prompts to instruct the LLM to maximize the framework's potential. To evaluate our method, we compose a new dataset with various types of questions. Experimental results show that our framework enables LLMs to effectively answer database-related questions, which is beyond their direct ability.

[246] 2407.15073

Multi-Agent Causal Discovery Using Large Language Models

Large Language Models (LLMs) have demonstrated significant potential in causal discovery tasks by utilizing their vast expert knowledge from extensive text corpora. However, the multi-agent capabilities of LLMs in causal discovery remain underexplored. This paper introduces a general framework to investigate this potential. The first is the Meta Agents Model, which relies exclusively on reasoning and discussions among LLM agents to conduct causal discovery. The second is the Coding Agents Model, which leverages the agents' ability to plan, write, and execute code, utilizing advanced statistical libraries for causal discovery. The third is the Hybrid Model, which integrates both the Meta Agents Model and CodingAgents Model approaches, combining the statistical analysis and reasoning skills of multiple agents. Our proposed framework shows promising results by effectively utilizing LLMs expert knowledge, reasoning capabilities, multi-agent cooperation, and statistical causal methods. By exploring the multi-agent potential of LLMs, we aim to establish a foundation for further research in utilizing LLMs multi-agent for solving causal-related problems.

[247] 2407.15077

B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

Most multi-agent reinforcement learning approaches adopt two types of policy optimization methods that either update policy simultaneously or sequentially. Simultaneously updating policies of all agents introduces non-stationarity problem. Although sequentially updating policies agent-by-agent in an appropriate order improves policy performance, it is prone to low efficiency due to sequential execution, resulting in longer model training and execution time. Intuitively, partitioning policies of all agents according to their interdependence and updating joint policy batch-by-batch can effectively balance performance and efficiency. However, how to determine the optimal batch partition of policies and batch updating order are challenging problems. Firstly, a sequential batched policy updating scheme, B2MAPO (Batch by Batch Multi-Agent Policy Optimization), is proposed with a theoretical guarantee of the monotonic incrementally tightened bound. Secondly, a universal modulized plug-and-play B2MAPO hierarchical framework, which satisfies CTDE principle, is designed to conveniently integrate any MARL models to fully exploit and merge their merits, including policy optimality and inference efficiency. Next, a DAG-based B2MAPO algorithm is devised, which is a carefully designed implementation of B2MAPO framework. Comprehensive experimental results conducted on StarCraftII Multi-agent Challenge and Google Football Research demonstrate the performance of DAG-based B2MAPO algorithm outperforms baseline methods. Meanwhile, compared with A2PO, our algorithm reduces the model training and execution time by 60.4% and 78.7%, respectively.

[248] 2407.15078

Learning to Compile Programs to Neural Networks

A $\textit{neural surrogate of a program}$ is a neural network that mimics the behavior of a program. Researchers have used these neural surrogates to automatically tune program inputs, adapt programs to new settings, and accelerate computations. Researchers traditionally develop neural surrogates by training on input-output examples from a single program. Alternatively, language models trained on a large dataset including many programs can consume program text, to act as a neural surrogate. Using a language model to both generate a surrogate and act as a surrogate, however, leading to a trade-off between resource consumption and accuracy. We present $\textit{neural surrogate compilation}$, a technique for producing neural surrogates directly from program text without coupling neural surrogate generation and execution. We implement neural surrogate compilers using hypernetworks trained on a dataset of C programs and find that they produce neural surrogates that are $1.9$-$9.5\times$ as data-efficient, produce visual results that are $1.0$-$1.3\times$ more similar to ground truth, and train in $4.3$-$7.3\times$ fewer epochs than neural surrogates trained from scratch.

[249] 2407.15080

SNIP: Speculative Execution and Non-Interference Preservation for Compiler Transformations

We address the problem of preserving non-interference across compiler transformations under speculative semantics. We develop a proof method that ensures the preservation uniformly across all source programs. The basis of our proof method is a new form of simulation relation. It operates over directives that model the attacker's control over the micro-architectural state, and it accounts for the fact that the compiler transformation may change the influence of the micro-architectural state on the execution (and hence the directives). Using our proof method, we show the correctness of dead code elimination. When we tried to prove register allocation correct, we identified a previously unknown weakness that introduces violations to non-interference. We have confirmed the weakness for a mainstream compiler on code from the libsodium cryptographic library. To reclaim security once more, we develop a novel static analysis that operates on a product of source program and register-allocated program. Using the analysis, we present an automated fix to existing register allocation implementations. We prove the correctness of the fixed register allocations with our proof method.

[250] 2407.15083

Rocket Landing Control with Random Annealing Jump Start Reinforcement Learning

Rocket recycling is a crucial pursuit in aerospace technology, aimed at reducing costs and environmental impact in space exploration. The primary focus centers on rocket landing control, involving the guidance of a nonlinear underactuated rocket with limited fuel in real-time. This challenging task prompts the application of reinforcement learning (RL), yet goal-oriented nature of the problem poses difficulties for standard RL algorithms due to the absence of intermediate reward signals. This paper, for the first time, significantly elevates the success rate of rocket landing control from 8% with a baseline controller to 97% on a high-fidelity rocket model using RL. Our approach, called Random Annealing Jump Start (RAJS), is tailored for real-world goal-oriented problems by leveraging prior feedback controllers as guide policy to facilitate environmental exploration and policy learning in RL. In each episode, the guide policy navigates the environment for the guide horizon, followed by the exploration policy taking charge to complete remaining steps. This jump-start strategy prunes exploration space, rendering the problem more tractable to RL algorithms. The guide horizon is sampled from a uniform distribution, with its upper bound annealing to zero based on performance metrics, mitigating distribution shift and mismatch issues in existing methods. Additional enhancements, including cascading jump start, refined reward and terminal condition, and action smoothness regulation, further improve policy performance and practical applicability. The proposed method is validated through extensive evaluation and Hardware-in-the-Loop testing, affirming the effectiveness, real-time feasibility, and smoothness of the proposed controller.

[251] 2407.15085

Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Domain generalization (DG) aims to avoid the performance degradation of the model when the distribution shift between the limited training data and unseen test data occurs. Recently, foundation models with enormous parameters have been pre-trained with huge datasets, demonstrating strong generalization ability and showing promising direction for solving the DG problem. However, fully Fine-Tuning (FT) the foundation models results in unsatisfactory out-of-distribution accuracy due to the destroyed pre-trained generalized features. Recently, Parameter-Efficient Fine-Tuning (PEFT) alleviates the above problem by fine-tuning a small portion of the model parameters while keeping the rest frozen, which achieves better generalization performance compared to FT. Nevertheless, PEFT still suffers from the issue of overfitting to the training domains. To address the above issue, we propose Parameter-Efficient Group with Orthogonal regularization (PEGO) for vision transformers, which effectively preserves the generalization ability of the pre-trained network and learns more diverse knowledge compared with conventional PEFT. Specifically, we inject a group of trainable Low-Rank Adaptation (LoRA) modules into the pre-trained model and propose an orthogonal regularization loss to enhance the generalization ability of the model. Our framework achieves SOTA performance on five DG benchmarks, while only requiring training a small number of parameters without adding additional testing cost.

[252] 2407.15086

MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

We aim to discover manipulation concepts embedded in the unannotated demonstrations, which are recognized as key physical states. The discovered concepts can facilitate training manipulation policies and promote generalization. Current methods relying on multimodal foundation models for deriving key states usually lack accuracy and semantic consistency due to limited multimodal robot data. In contrast, we introduce an information-theoretic criterion to characterize the regularities that signify a set of physical states. We also develop a framework that trains a concept discovery network using this criterion, thus bypassing the dependence on human semantics and alleviating costly human labeling. The proposed criterion is based on the observation that key states, which deserve to be conceptualized, often admit more physical constraints than non-key states. This phenomenon can be formalized as maximizing the mutual information between the putative key state and its preceding state, i.e., Maximal Mutual Information (MaxMI). By employing MaxMI, the trained key state localization network can accurately identify states of sufficient physical significance, exhibiting reasonable semantic compatibility with human perception. Furthermore, the proposed framework produces key states that lead to concept-guided manipulation policies with higher success rates and better generalization in various robotic tasks compared to the baselines, verifying the effectiveness of the proposed criterion.

[253] 2407.15087

Navigation Instruction Generation with BEV Perception and Large Language Models

Navigation instruction generation, which requires embodied agents to describe the navigation routes, has been of great interest in robotics and human-computer interaction. Existing studies directly map the sequence of 2D perspective observations to route descriptions. Though straightforward, they overlook the geometric information and object semantics of the 3D environment. To address these challenges, we propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation. Specifically, BEVInstructor constructs a PerspectiveBEVVisual Encoder for the comprehension of 3D environments through fusing BEV and perspective features. To leverage the powerful language capabilities of MLLMs, the fused representations are used as visual prompts for MLLMs, and perspective-BEV prompt tuning is proposed for parameter-efficient updating. Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner. BEVInstructor achieves impressive performance across diverse datasets (i.e., R2R, REVERIE, and UrbanWalk).

[254] 2407.15092

PFWNN: A deep learning method for solving forward and inverse problems of phase-field models

Phase-field models have been widely used to investigate the phase transformation phenomena. However, it is difficult to solve the problems numerically due to their strong nonlinearities and higher-order terms. This work is devoted to solving forward and inverse problems of the phase-field models by a novel deep learning framework named Phase-Field Weak-form Neural Networks (PFWNN), which is based on the weak forms of the phase-field equations. In this framework, the weak solutions are parameterized as deep neural networks with a periodic layer, while the test function space is constructed by functions compactly supported in small regions. The PFWNN can efficiently solve the phase-field equations characterizing the sharp transitions and identify the important parameters by employing the weak forms. It also allows local training in small regions, which significantly reduce the computational cost. Moreover, it can guarantee the residual descending along the time marching direction, enhancing the convergence of the method. Numerical examples are presented for several benchmark problems. The results validate the efficiency and accuracy of the PFWNN. This work also sheds light on solving the forward and inverse problems of general high-order time-dependent partial differential equations.

[255] 2407.15094

Determining a Time-Varying Potential in Time-Fractional Diffusion from Observation at a Single Point

We discuss the identification of a time-dependent potential in a time-fractional diffusion model from a boundary measurement taken at a single point. Theoretically, we establish a conditional Lipschitz stability for this inverse problem. Numerically, we develop an easily implementable iterative algorithm to recover the unknown coefficient, and also derive rigorous error bounds for the discrete reconstruction. These results are attained by using the (discrete) solution theory of direct problems, and applying error estimates that are optimal with respect to problem data regularity. Numerical simulations are provided to demonstrate the theoretical results.

[256] 2407.15098

SeqMIA: Sequential-Metric Based Membership Inference Attack

Most existing membership inference attacks (MIAs) utilize metrics (e.g., loss) calculated on the model's final state, while recent advanced attacks leverage metrics computed at various stages, including both intermediate and final stages, throughout the model training. Nevertheless, these attacks often process multiple intermediate states of the metric independently, ignoring their time-dependent patterns. Consequently, they struggle to effectively distinguish between members and non-members who exhibit similar metric values, particularly resulting in a high false-positive rate. In this study, we delve deeper into the new membership signals in the black-box scenario. We identify a new, more integrated membership signal: the Pattern of Metric Sequence, derived from the various stages of model training. We contend that current signals provide only partial perspectives of this new signal: the new one encompasses both the model's multiple intermediate and final states, with a greater emphasis on temporal patterns among them. Building upon this signal, we introduce a novel attack method called Sequential-metric based Membership Inference Attack (SeqMIA). Specifically, we utilize knowledge distillation to obtain a set of distilled models representing various stages of the target model's training. We then assess multiple metrics on these distilled models in chronological order, creating distilled metric sequence. We finally integrate distilled multi-metric sequences as a sequential multiformat and employ an attention-based RNN attack model for inference. Empirical results show SeqMIA outperforms all baselines, especially can achieve an order of magnitude improvement in terms of TPR @ 0.1% FPR. Furthermore, we delve into the reasons why this signal contributes to SeqMIA's high attack performance, and assess various defense mechanisms against SeqMIA.

[257] 2407.15100

A General Framework for Data-Use Auditing of ML Models

Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.

[258] 2407.15104

Lifted Hamming and Simplex Codes and Their Designs

A generator matrix of a linear code $\C$ over $\gf(q)$ is also a matrix of the same rank $k$ over any extension field $\gf(q^\ell)$ and generates a linear code of the same length, same dimension and same minimum distance over $\gf(q^\ell)$, denoted by $\C(q|q^\ell)$ and called a lifted code of $\C$. Although $\C$ and their lifted codes $\C(q|q^\ell)$ have the same parameters, they have different weight distributions and different applications. Few results about lifted linear codes are known in the literature. This paper proves some fundamental theory for lifted linear codes, settles the weight distributions of lifted Hamming codes and lifted Simplex codes, and investigates the $2$-designs supported by the lifted Hamming and Simplex codes. Infinite families of $2$-designs are obtained. In addition, an infinite family of two-weight codes and an infinite family of three-weight codes are presented.

[259] 2407.15110

Practical multi-fidelity machine learning: fusion of deterministic and Bayesian models

Multi-fidelity machine learning methods address the accuracy-efficiency trade-off by integrating scarce, resource-intensive high-fidelity data with abundant but less accurate low-fidelity data. We propose a practical multi-fidelity strategy for problems spanning low- and high-dimensional domains, integrating a non-probabilistic regression model for the low-fidelity with a Bayesian model for the high-fidelity. The models are trained in a staggered scheme, where the low-fidelity model is transfer-learned to the high-fidelity data and a Bayesian model is trained for the residual. This three-model strategy -- deterministic low-fidelity, transfer learning, and Bayesian residual -- leads to a prediction that includes uncertainty quantification both for noisy and noiseless multi-fidelity data. The strategy is general and unifies the topic, highlighting the expressivity trade-off between the transfer-learning and Bayesian models (a complex transfer-learning model leads to a simpler Bayesian model, and vice versa). We propose modeling choices for two scenarios, and argue in favor of using a linear transfer-learning model that fuses 1) kernel ridge regression for low-fidelity with Gaussian processes for high-fidelity; or 2) deep neural network for low-fidelity with a Bayesian neural network for high-fidelity. We demonstrate the effectiveness and efficiency of the proposed strategies and contrast them with the state-of-the-art based on various numerical examples. The simplicity of these formulations makes them practical for a broad scope of future engineering applications.

[260] 2407.15111

D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping, and reliance on static, annotation-driven clothing parsers. Additionally, we tackle the complexities in diffusion-based VTON models when handling simultaneous tasks like inpainting and denoising. Our approach utilizes two key technologies: Firstly, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, improving precise garment warping in a self-discovered manner. Secondly, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, thereby minimizing learning ambiguities and achieving realistic results with minimal overhead. Extensive experiments demonstrate that D$^4$-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, demonstrating its capability in generating realistic images and ensuring semantic consistency.

[261] 2407.15122

UAV Active Perception and Motion Control for Improving Navigation Using Low-Cost Sensors

In this study a model pipeline is proposed that combines computer vision with control-theoretic methods and utilizes low cost sensors. The proposed work enables perception-aware motion control for a quadrotor UAV to detect and navigate to objects of interest such as wind turbines and electric towers. The distance to the object of interest was estimated utilizing RGB as the primary sensory input. For the needs of the study, the Microsoft AirSim simulator was used. As a first step, a YOLOv8 model was integrated providing the basic position setpoints towards the detection. From the YOLOv8 inference, a target yaw angle was derived. The subsequent algorithms, combining performant in computational terms computer vision methods and YOLOv8, actively drove the drone to measure the height of the detection. Based on the height, an estimate of the depth was retrieved. In addition to this step, a convolutional neural network was developed, namely ActvePerceptionNet aiming at active YOLOv8 inference. The latter was validated for wind turbines where the rotational motion of the propeller was found to affect object confidence in a near periodical fashion. The results of the simulation experiments conducted in this study showed efficient object height and distance estimation and effective localization.

[262] 2407.15124

Chemical Reaction Extraction for Chemical Knowledge Base

The task of searching through patent documents is crucial for chemical patent recommendation and retrieval. This can be enhanced by creating a patent knowledge base (ChemPatKB) to aid in prior art searches and to provide a platform for domain experts to explore new innovations in chemical compound synthesis and use-cases. An essential foundational component of this KB is the extraction of important reaction snippets from long patents documents which facilitates multiple downstream tasks such as reaction co-reference resolution and chemical entity role identification. In this work, we explore the problem of extracting reactions spans from chemical patents in order to create a reactions resource database. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return a sequence of paragraphs that contain a description of a reaction. We propose several approaches and modifications of the baseline models and study how different methods generalize across different domains of chemical patents.

[263] 2407.15125

The dark side of the metaverse: The role of gamification in event virtualization

The virtualization of cultural events in the metaverse creates opportunities to generate valuable and innovative experiences that replicate and extend in-person events; but the process faces associated challenges. In the absence of relevant empirical studies, the aim of this article is to analyze the positive and negative aspects of the user experience in a cultural event held in the metaverse. A mixed-methods approach is employed to test the proposed hypotheses. The results from three focus groups demonstrated the difficulty that users face in focusing their attention on the main elements of the metaverse, and the inability of this virtual sphere to convey the authenticity of a cultural event. Based on these findings, a metaverse-focused quantitative study was conducted to examine whether perceived gamification mitigate the negative effects of users failing to pay attention in their metaverse experiences. When users increased their attention levels, their ability to imagine the real experience and their perceptions of the authenticity of the cultural event increased, which produced positive behavioral intentions. This is one of the first studies to empirically analyze the tourist experience in the metaverse; managers and policymakers can benefit from the results to hold valuable virtual cultural events.

[264] 2407.15127

A Model of Proactive Safety Based on Knowledge Graph

In contemporary safety management, despite the abundance of safety data gathered from routine operation tasks and safety management activities, actions cannot prevent all accidents effectively due to a lack of effective utilization of these data as safety knowledge. To bridge this gap, this paper proposes a hybrid proactive safety model integrating data-driven and knowledge-driven approaches. The model comprises three main steps: proactive safety actions to generate safety data, data-driven approaches to mine safety data, and knowledge-driven approaches to depicting risk knowledge graphs. Application of this model to a continuous stirred tank reactor (CSTR) scenario demonstrates its efficacy in identifying and addressing safety issues proactively. The results demonstrate the effectiveness and practicality of the proposed proactive safety model, suggesting its endorsement within both academic and industrial applications.

[265] 2407.15130

DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer

In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding specific weighted layer penalties and redistribution, offering an economical and effective solution without additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.

[266] 2407.15131

Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation

The attention mechanism in text generation is memory-bounded due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low probability tokens and achieving an 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach shows 2.6x reduced memory accesses, leading to an average 2.3x speedup and a 2.4x energy efficiency.

[267] 2407.15134

Proximal Policy Distillation

We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: `sb3-distill'.

[268] 2407.15136

A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available^1.

[269] 2407.15138

D$^4$M: Dataset Distillation via Disentangled Diffusion Model

Dataset distillation offers a lightweight synthetic dataset for fast network training with promising test accuracy. To imitate the performance of the original dataset, most approaches employ bi-level optimization and the distillation space relies on the matching architecture. Nevertheless, these approaches either suffer significant computational costs on large-scale datasets or experience performance decline on cross-architectures. We advocate for designing an economical dataset distillation framework that is independent of the matching architectures. With empirical observations, we argue that constraining the consistency of the real and synthetic image spaces will enhance the cross-architecture generalization. Motivated by this, we introduce Dataset Distillation via Disentangled Diffusion Model (D$^4$M), an efficient framework for dataset distillation. Compared to architecture-dependent methods, D$^4$M employs latent diffusion model to guarantee consistency and incorporates label information into category prototypes. The distilled datasets are versatile, eliminating the need for repeated generation of distinct datasets for various architectures. Through comprehensive experiments, D$^4$M demonstrates superior performance and robust generalization, surpassing the SOTA methods across most aspects.

[270] 2407.15139

An Interface Method for Co-simulation of EMT Model and Shifted Frequency EMT Model Based on Rotational Invariance Techniques

The shifted frequency-based electromagnetic transient (SFEMT) simulation has greatly improved the computational efficiency of traditional electromagnetic transient (EMT) simulation for the ac grid. This letter proposes a novel interface for the co-simulation of the SFEMT model and the traditional EMT model. The general form of SFEMT modeling and the principle of analytical signal construction are first derived. Then, an interface for the co-simulation of EMT and SFEMT simulation is proposed based on rotational invariance techniques. Theoretical analyses and test results demonstrate the effectiveness of the proposed method.

[271] 2407.15141

Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation

High-throughput reaction condition (RC) screening is fundamental to chemical synthesis. However, current RC screening suffers from laborious and costly trial-and-error workflows. Traditional computer-aided synthesis planning (CASP) tools fail to find suitable RCs due to data sparsity and inadequate reaction representations. Nowadays, large language models (LLMs) are capable of tackling chemistry-related problems, such as molecule design, and chemical logic Q\&A tasks. However, LLMs have not yet achieved accurate predictions of chemical reaction conditions. Here, we present MM-RCR, a text-augmented multimodal LLM that learns a unified reaction representation from SMILES, reaction graphs, and textual corpus for chemical reaction recommendation (RCR). To train MM-RCR, we construct 1.2 million pair-wised Q\&A instruction datasets. Our experimental results demonstrate that MM-RCR achieves state-of-the-art performance on two open benchmark datasets and exhibits strong generalization capabilities on out-of-domain (OOD) and High-Throughput Experimentation (HTE) datasets. MM-RCR has the potential to accelerate high-throughput condition screening in chemical synthesis.

[272] 2407.15143

Rethinking Feature Backbone Fine-tuning for Remote Sensing Object Detection

Recently, numerous methods have achieved impressive performance in remote sensing object detection, relying on convolution or transformer architectures. Such detectors typically have a feature backbone to extract useful features from raw input images. For the remote sensing domain, a common practice among current detectors is to initialize the backbone with pre-training on ImageNet consisting of natural scenes. Fine-tuning the backbone is typically required to generate features suitable for remote-sensing images. However, this could hinder the extraction of basic visual features in long-term training, thus restricting performance improvement. To mitigate this issue, we propose a novel method named DBF (Dynamic Backbone Freezing) for feature backbone fine-tuning on remote sensing object detection. Our method aims to handle the dilemma of whether the backbone should extract low-level generic features or possess specific knowledge of the remote sensing domain, by introducing a module called 'Freezing Scheduler' to dynamically manage the update of backbone features during training. Extensive experiments on DOTA and DIOR-R show that our approach enables more accurate model learning while substantially reducing computational costs. Our method can be seamlessly adopted without additional effort due to its straightforward design.

[273] 2407.15152

SNNGX: Securing Spiking Neural Networks with Genetic XOR Encryption on RRAM-based Neuromorphic Accelerator

Biologically plausible Spiking Neural Networks (SNNs), characterized by spike sparsity, are growing tremendous attention over intellectual edge devices and critical bio-medical applications as compared to artificial neural networks (ANNs). However, there is a considerable risk from malicious attempts to extract white-box information (i.e., weights) from SNNs, as attackers could exploit well-trained SNNs for profit and white-box adversarial concerns. There is a dire need for intellectual property (IP) protective measures. In this paper, we present a novel secure software-hardware co-designed RRAM-based neuromorphic accelerator for protecting the IP of SNNs. Software-wise, we design a tailored genetic algorithm with classic XOR encryption to target the least number of weights that need encryption. From a hardware perspective, we develop a low-energy decryption module, meticulously designed to provide zero decryption latency. Extensive results from various datasets, including NMNIST, DVSGesture, EEGMMIDB, Braille Letter, and SHD, demonstrate that our proposed method effectively secures SNNs by encrypting a minimal fraction of stealthy weights, only 0.00005% to 0.016% weight bits. Additionally, it achieves a substantial reduction in energy consumption, ranging from x59 to x6780, and significantly lowers decryption latency, ranging from x175 to x4250. Moreover, our method requires as little as one sample per class in dataset for encryption and addresses hessian/gradient-based search insensitive problems. This strategy offers a highly efficient and flexible solution for securing SNNs in diverse applications.

[274] 2407.15153

Anchored Diffusion for Video Face Reenactment

Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer consistent high-quality videos while offering editing capabilities.

[275] 2407.15154

Fine-grained Gender Control in Machine Translation with Large Language Models

In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input have been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover an emergence of gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation.

[276] 2407.15155

Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

Data-Free Knowledge Distillation (DFKD) has shown great potential in creating a compact student model while alleviating the dependency on real training data by synthesizing surrogate data. However, prior arts are seldom discussed under distribution shifts, which may be vulnerable in real-world applications. Recent Vision-Language Foundation Models, e.g., CLIP, have demonstrated remarkable performance in zero-shot out-of-distribution generalization, yet consuming heavy computation resources. In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts, inheriting the out-of-distribution generalization capability from the pre-trained foundation models. In order to avoid generalization degradation, the primary challenge of this task lies in synthesizing diverse surrogate images driven by text prompts. Since not only category concepts but also style information are encoded in text prompts, we propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles, namely Mix-Prompt, Random-Prompt, and Contrastive-Prompt. Experiments on out-of-distribution generalization datasets demonstrate the effectiveness of the proposed methods, with Contrastive-Prompt performing the best.

[277] 2407.15156

Computational and analytical studies of a new nonlocal phase-field crystal model in two dimensions

A nonlocal phase-field crystal (NPFC) model is presented as a nonlocal counterpart of the local phase-field crystal (LPFC) model and a special case of the structural PFC (XPFC) derived from classical field theory for crystal growth and phase transition. The NPFC incorporates a finite range of spatial nonlocal interactions that can account for both repulsive and attractive effects. The specific form is data-driven and determined by a fitting to the materials structure factor, which can be much more accurate than the LPFC and previously proposed fractional variant. In particular, it is able to match the experimental data of the structure factor up to the second peak, an achievement not possible with other PFC variants studied in the literature. Both LPFC and fractional PFC (FPFC) are also shown to be distinct scaling limits of the NPFC, which reflects the generality. The advantage of NPFC in retaining material properties suggests that it may be more suitable for characterizing liquid-solid transition systems. Moreover, we study numerical discretizations using Fourier spectral methods, which are shown to be convergent and asymptotically compatible, making them robust numerical discretizations across different parameter ranges. Numerical experiments are given in the two-dimensional case to demonstrate the effectiveness of the NPFC in simulating crystal structures and grain boundaries.

[278] 2407.15158

HERGen: Elevating Radiology Report Generation with Longitudinal Data

Radiology reports provide detailed descriptions of medical imaging integrated with patients' medical histories, while report writing is traditionally labor-intensive, increasing radiologists' workload and the risk of diagnostic errors. Recent efforts in automating this process seek to mitigate these issues by enhancing accuracy and clinical efficiency. Emerging research in automating this process promises to alleviate these challenges by reducing errors and streamlining clinical workflows. However, existing automated approaches are based on a single timestamp and often neglect the critical temporal aspect of patients' imaging histories, which is essential for accurate longitudinal analysis. To address this gap, we propose a novel History Enhanced Radiology Report Generation (HERGen) framework that employs a employs a group causal transformer to efficiently integrate longitudinal data across patient visits. Our approach not only allows for comprehensive analysis of varied historical data but also improves the quality of generated reports through an auxiliary contrastive objective that aligns image sequences with their corresponding reports. More importantly, we introduce a curriculum learning-based strategy to adeptly handle the inherent complexity of longitudinal radiology data and thus stabilize the optimization of our framework. The extensive evaluations across three datasets demonstrate that our framework surpasses existing methods in generating accurate radiology reports and effectively predicting disease progression from medical images.

[279] 2407.15160

When Can Transformers Count to n?

Large language models based on the transformer architectures can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, that involve counting how many times a token in the vocabulary have appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our empirical results demonstrate the same phase-transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.

[280] 2407.15161

FFHFlow: A Flow-based Variational Approach for Multi-fingered Grasp Synthesis in Real Time

Synthesizing diverse and accurate grasps with multi-fingered hands is an important yet challenging task in robotics. Previous efforts focusing on generative modeling have fallen short of precisely capturing the multi-modal, high-dimensional grasp distribution. To address this, we propose exploiting a special kind of Deep Generative Model (DGM) based on Normalizing Flows (NFs), an expressive model for learning complex probability distributions. Specifically, we first observed an encouraging improvement in diversity by directly applying a single conditional NFs (cNFs), dubbed FFHFlow-cnf, to learn a grasp distribution conditioned on the incomplete point cloud. However, we also recognized limited performance gains due to restricted expressivity in the latent space. This motivated us to develop a novel flow-based d Deep Latent Variable Model (DLVM), namely FFHFlow-lvm, which facilitates more reasonable latent features, leading to both diverse and accurate grasp synthesis for unseen objects. Unlike Variational Autoencoders (VAEs), the proposed DLVM counteracts typical pitfalls such as mode collapse and mis-specified priors by leveraging two cNFs for the prior and likelihood distributions, which are usually restricted to being isotropic Gaussian. Comprehensive experiments in simulation and real-robot scenarios demonstrate that our method generates more accurate and diverse grasps than the VAE baselines. Additionally, a run-time comparison is conducted to reveal its high potential for real-time applications.

[281] 2407.15166

Adversarial Circuit Evaluation

Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.

[282] 2407.15167

The VEP Booster: A Closed-Loop AI System for Visual EEG Biomarker Auto-generation

Effective visual brain-machine interfaces (BMI) is based on reliable and stable EEG biomarkers. However, traditional adaptive filter-based approaches may suffer from individual variations in EEG signals, while deep neural network-based approaches may be hindered by the non-stationarity of EEG signals caused by biomarker attenuation and background oscillations. To address these challenges, we propose the Visual Evoked Potential Booster (VEP Booster), a novel closed-loop AI framework that generates reliable and stable EEG biomarkers under visual stimulation protocols. Our system leverages an image generator to refine stimulus images based on real-time feedback from human EEG signals, generating visual stimuli tailored to the preferences of primary visual cortex (V1) neurons and enabling effective targeting of neurons most responsive to stimuli. We validated our approach by implementing a system and employing steady-state visual evoked potential (SSVEP) visual protocols in five human subjects. Our results show significant enhancements in the reliability and utility of EEG biomarkers for all individuals, with the largest improvement in SSVEP response being 105%, the smallest being 28%, and the average increase being 76.5%. These promising results have implications for both clinical and technological applications

[283] 2407.15168

Mitigating Deep Reinforcement Learning Backdoors in the Neural Activation Space

This paper investigates the threat of backdoors in Deep Reinforcement Learning (DRL) agent policies and proposes a novel method for their detection at runtime. Our study focuses on elusive in-distribution backdoor triggers. Such triggers are designed to induce a deviation in the behaviour of a backdoored agent while blending into the expected data distribution to evade detection. Through experiments conducted in the Atari Breakout environment, we demonstrate the limitations of current sanitisation methods when faced with such triggers and investigate why they present a challenging defence problem. We then evaluate the hypothesis that backdoor triggers might be easier to detect in the neural activation space of the DRL agent's policy network. Our statistical analysis shows that indeed the activation patterns in the agent's policy network are distinct in the presence of a trigger, regardless of how well the trigger is concealed in the environment. Based on this, we propose a new defence approach that uses a classifier trained on clean environment samples and detects abnormal activations. Our results show that even lightweight classifiers can effectively prevent malicious actions with considerable accuracy, indicating the potential of this research direction even against sophisticated adversaries.

[284] 2407.15170

Semi-Supervised Pipe Video Temporal Defect Interval Localization

In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.

[285] 2407.15171

Assessing Sample Quality via the Latent Space of Generative Models

Advances in generative models increase the need for sample quality assessment. To do so, previous methods rely on a pre-trained feature extractor to embed the generated samples and real samples into a common space for comparison. However, different feature extractors might lead to inconsistent assessment outcomes. Moreover, these methods are not applicable for domains where a robust, universal feature extractor does not yet exist, such as medical images or 3D assets. In this paper, we propose to directly examine the latent space of the trained generative model to infer generated sample quality. This is feasible because the quality a generated sample directly relates to the amount of training data resembling it, and we can infer this information by examining the density of the latent space. Accordingly, we use a latent density score function to quantify sample quality. We show that the proposed score correlates highly with the sample quality for various generative models including VAEs, GANs and Latent Diffusion Models. Compared with previous quality assessment methods, our method has the following advantages: 1) pre-generation quality estimation with reduced computational cost, 2) generalizability to various domains and modalities, and 3) applicability to latent-based image editing and generation methods. Extensive experiments demonstrate that our proposed methods can benefit downstream tasks such as few-shot image classification and latent face image editing. Code is available at

[286] 2407.15173

Rethinking Domain Adaptation and Generalization in the Era of CLIP

In recent studies on domain adaptation, significant emphasis has been placed on the advancement of learning shared knowledge from a source domain to a target domain. Recently, the large vision-language pre-trained model, i.e., CLIP has shown strong ability on zero-shot recognition, and parameter efficient tuning can further improve its performance on specific tasks. This work demonstrates that a simple domain prior boosts CLIP's zero-shot recognition in a specific domain. Besides, CLIP's adaptation relies less on source domain data due to its diverse pre-training dataset. Furthermore, we create a benchmark for zero-shot adaptation and pseudo-labeling based self-training with CLIP. Last but not least, we propose to improve the task generalization ability of CLIP from multiple unlabeled domains, which is a more practical and unique scenario. We believe our findings motivate a rethinking of domain adaptation benchmarks and the associated role of related algorithms in the era of CLIP.

[287] 2407.15174

TADA: Temporal Adversarial Data Augmentation for Time Series Data

Domain generalization involves training machine learning models to perform robustly on unseen samples from out-of-distribution datasets. Adversarial Data Augmentation (ADA) is a commonly used approach that enhances model adaptability by incorporating synthetic samples, designed to simulate potential unseen samples. While ADA effectively addresses amplitude-related distribution shifts, it falls short in managing temporal shifts, which are essential for time series data. To address this limitation, we propose the Temporal Adversarial Data Augmentation for time teries Data (TADA), which incorporates a time warping technique specifically targeting temporal shifts. Recognizing the challenge of non-differentiability in traditional time warping, we make it differentiable by leveraging phase shifts in the frequency domain. Our evaluations across diverse domains demonstrate that TADA significantly outperforms existing ADA variants, enhancing model performance across time series datasets with varied distributions.

[288] 2407.15176

Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope

The maximum supported context length is a critical bottleneck limiting the practical application of the Large Language Model (LLM). Although existing length extrapolation methods can extend the context of LLMs to millions of tokens, these methods all have an explicit upper bound. In this work, we propose LongCache, a training-free approach that enables LLM to support an infinite context with finite context scope, through full-context cache selection and training-free integration. This effectively frees LLMs from the length extrapolation issue. We validate LongCache on the LongBench and L-Eval and demonstrate its performance is on par with traditional full-attention mechanisms. Furthermore, we have applied LongCache on mainstream LLMs, including LLaMA3 and Mistral-v0.3, enabling them to support context lengths of at least 400K in Needle-In-A-Haystack tests. We will improve the efficiency of LongCache by GPU-aware optimization soon.

[289] 2407.15177

Measurements of the Safety Function Response Time on a Private 5G and IO-Link Wireless Testbed

In the past few years, there has been a growing significance of interactions between human workers and automated systems throughout the factory floor. Wherever static or mobile robots, such as automated guided vehicles, operate autonomously, a protected environment for personnel and machines must be provided by, e.g., safe, deterministic and low-latency technologies. Another trend in this area is the increased use of wireless communication, offering a high flexibility, modularity, and reduced installation and maintenance efforts. This work presents a testbed implementation that integrates a wireless framework, employing IO-Link Wireless (IOLW) and a private 5G cellular network, to orchestrate a complete example process from sensors and actuators up into the edge, represented by a programmable logic controller (PLC). Latency assessments identify the systems cycle time as well as opportunities for improvement. A worst-case estimation shows the attainable safety function response time for practical applications in the context of functional safety.

[290] 2407.15184

Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment

Large language models (LLMs) increasingly find their way into the most diverse areas of our everyday lives. They indirectly influence people's decisions or opinions through their daily use. Therefore, understanding how and which moral judgements these LLMs make is crucial. However, morality is not universal and depends on the cultural background. This raises the question of whether these cultural preferences are also reflected in LLMs when prompted in different languages or whether moral decision-making is consistent across different languages. So far, most research has focused on investigating the inherent values of LLMs in English. While a few works conduct multilingual analyses of moral bias in LLMs in a multilingual setting, these analyses do not go beyond atomic actions. To the best of our knowledge, a multilingual analysis of moral bias in dilemmas has not yet been conducted. To address this, our paper builds on the moral machine experiment (MME) to investigate the moral preferences of five LLMs, Falcon, Gemini, Llama, GPT, and MPT, in a multilingual setting and compares them with the preferences collected from humans belonging to different cultures. To accomplish this, we generate 6500 scenarios of the MME and prompt the models in ten languages on which action to take. Our analysis reveals that all LLMs inhibit different moral biases to some degree and that they not only differ from the human preferences but also across multiple languages within the models themselves. Moreover, we find that almost all models, particularly Llama 3, divert greatly from human values and, for instance, prefer saving fewer people over saving more.

[291] 2407.15185

A Spatio-Temporal Approach with Self-Corrective Causal Inference for Flight Delay Prediction

Accurate flight delay prediction is crucial for the secure and effective operation of the air traffic system. Recent advances in modeling inter-airport relationships present a promising approach for investigating flight delay prediction from the multi-airport scenario. However, the previous prediction works only accounted for the simplistic relationships such as traffic flow or geographical distance, overlooking the intricate interactions among airports and thus proving inadequate. In this paper, we leverage causal inference to precisely model inter-airport relationships and propose a self-corrective spatio-temporal graph neural network (named CausalNet) for flight delay prediction. Specifically, Granger causality inference coupled with a self-correction module is designed to construct causality graphs among airports and dynamically modify them based on the current airport's delays. Additionally, the features of the causality graphs are adaptively extracted and utilized to address the heterogeneity of airports. Extensive experiments are conducted on the real data of top-74 busiest airports in China. The results show that CausalNet is superior to baselines. Ablation studies emphasize the power of the proposed self-correction causality graph and the graph feature extraction module. All of these prove the effectiveness of the proposed methodology.

[292] 2407.15186

A Survey on Employing Large Language Models for Text-to-SQL Tasks

The increasing volume of data stored in relational databases has led to the need for efficient querying and utilization of this data in various sectors. However, writing SQL queries requires specialized knowledge, which poses a challenge for non-professional users trying to access and query databases. Text-to-SQL parsing solves this issue by converting natural language queries into SQL queries, thus making database access more accessible for non-expert users. To take advantage of the recent developments in Large Language Models (LLMs), a range of new methods have emerged, with a primary focus on prompt engineering and fine-tuning. This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions. We hope this review will enable readers to gain a broader understanding of the recent advances in this field and offer some insights into its future trajectory.

[293] 2407.15187

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry. Owing to the powerful generative capabilities of text-to-image diffusion models that provide reliable priors, the creation of 3D scenes using only text prompts has become viable, thereby significantly advancing researches in text-driven 3D scene generation. In order to obtain multiple-view supervision from 2D diffusion models, prevailing methods typically employ the diffusion model to generate an initial local image, followed by iteratively outpainting the local image using diffusion models to gradually generate scenes. Nevertheless, these outpainting-based approaches prone to produce global inconsistent scene generation results without high degree of completeness, restricting their broader applications. To tackle these problems, we introduce HoloDreamer, a framework that first generates high-definition panorama as a holistic initialization of the full 3D scene, then leverage 3D Gaussian Splatting (3D-GS) to quickly reconstruct the 3D scene, thereby facilitating the creation of view-consistent and fully enclosed 3D scenes. Specifically, we propose Stylized Equirectangular Panorama Generation, a pipeline that combines multiple diffusion models to enable stylized and detailed equirectangular panorama generation from complex text prompts. Subsequently, Enhanced Two-Stage Panorama Reconstruction is introduced, conducting a two-stage optimization of 3D-GS to inpaint the missing region and enhance the integrity of the scene. Comprehensive experiments demonstrated that our method outperforms prior works in terms of overall visual consistency and harmony as well as reconstruction quality and rendering robustness when generating fully enclosed scenes.

[294] 2407.15192

Error Detection and Constraint Recovery in Hierarchical Multi-Label Classification without Prior Knowledge

Recent advances in Hierarchical Multi-label Classification (HMC), particularly neurosymbolic-based approaches, have demonstrated improved consistency and accuracy by enforcing constraints on a neural model during training. However, such work assumes the existence of such constraints a-priori. In this paper, we relax this strong assumption and present an approach based on Error Detection Rules (EDR) that allow for learning explainable rules about the failure modes of machine learning models. We show that these rules are not only effective in detecting when a machine learning classifier has made an error but also can be leveraged as constraints for HMC, thereby allowing the recovery of explainable constraints even if they are not provided. We show that our approach is effective in detecting machine learning errors and recovering constraints, is noise tolerant, and can function as a source of knowledge for neurosymbolic models on multiple datasets, including a newly introduced military vehicle recognition dataset.

[295] 2407.15193

The Complexity of (P3, H)-Arrowing and Beyond

Often regarded as the study of how order emerges from randomness, Ramsey theory has played an important role in mathematics and computer science, giving rise to applications in numerous domains such as logic, parallel processing, and number theory. The core of graph Ramsey theory is arrowing: For fixed graphs $F$ and $H$, the $(F, H)$-Arrowing problem asks whether a given graph, $G$, has a red/blue coloring of the edges of $G$ such that there are no red copies of $F$ and no blue copies of $H$. For some cases, the problem has been shown to be coNP-complete, or solvable in polynomial time. However, a more systematic approach is needed to categorize the complexity of all cases. We focus on $(P_3, H)$-Arrowing as $F = P_3$ is the simplest meaningful case for which the complexity question remains open, and the hardness for this case likely extends to general $(F, H)$-Arrowing for nontrivial $F$. In this pursuit, we also gain insight into the complexity of a class of matching removal problems, since $(P_3, H)$-Arrowing is equivalent to $H$-free Matching Removal. We show that $(P_3, H)$-Arrowing is coNP-complete for all $2$-connected $H$ except when $H = K_3$, in which case the problem is in P. We introduce a new graph invariant to help us carefully combine graphs when constructing the gadgets for our reductions. Moreover, we show how $(P_3,H)$-Arrowing hardness results can be extended to other $(F,H)$-Arrowing problems. This allows for more intuitive and palatable hardness proofs instead of ad-hoc constructions of SAT gadgets, bringing us closer to categorizing the complexity of all $(F, H)$-Arrowing problems.

[296] 2407.15199

Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis

Panoramic cycling videos can record 360{\deg} views around the cyclists. Thus, it is essential to conduct automatic road user analysis on them using computer vision models to provide data for studies on cycling safety. However, the features of panoramic data such as severe distortions, large number of small objects and boundary continuity have brought great challenges to the existing CV models, including poor performance and evaluation methods that are no longer applicable. In addition, due to the lack of data with annotations, it is not easy to re-train the models. In response to these problems, the project proposed and implemented a three-step methodology: (1) improve the prediction performance of the pre-trained object detection models on panoramic data by projecting the original image into 4 perspective sub-images; (2) introduce supports for boundary continuity and category information into DeepSORT, a commonly used multiple object tracking model, and set an improved detection model as its detector; (3) using the tracking results, develop an application for detecting the overtaking behaviour of the surrounding vehicles. Evaluated on the panoramic cycling dataset built by the project, the proposed methodology improves the average precision of YOLO v5m6 and Faster RCNN-FPN under any input resolution setting. In addition, it raises MOTA and IDF1 of DeepSORT by 7.6\% and 9.7\% respectively. When detecting the overtakes in the test videos, it achieves the F-score of 0.88. The code is available on GitHub at to ensure the reproducibility and further improvements of results.

[297] 2407.15200

HyperbolicLR: Epoch insensitive learning rate scheduler

This study proposes two novel learning rate schedulers: the Hyperbolic Learning Rate Scheduler (HyperbolicLR) and the Exponential Hyperbolic Learning Rate Scheduler (ExpHyperbolicLR). These schedulers attempt to address the inconsistent learning curves often observed in conventional schedulers when adjusting the number of epochs. By leveraging the asymptotic behavior of hyperbolic curves, the proposed schedulers maintain more consistent learning curves across varying epoch settings. The HyperbolicLR algorithm directly applies this property to the epoch-learning rate space, while the ExpHyperbolicLR maps this concept onto the exponential space of epochs and learning rates. To evaluate the performance of these schedulers, first we found the optimal hyperparameters for each scheduler on a small number of epochs, fixed these values, and compared their performance as the number of epochs increased. Our experimental results on various deep learning tasks and architectures demonstrate that both HyperbolicLR and ExpHyperbolicLR maintain more consistent performance improvements compared to conventional schedulers as the number of epochs increases. These findings suggest that our hyperbolic-based learning rate schedulers offer a more robust and efficient approach to training deep neural networks, especially in scenarios where computational resources or time constraints limit extensive hyperparameter searches.

[298] 2407.15203

Mask Guided Gated Convolution for Amodal Content Completion

We present a model to reconstruct partially visible objects. The model takes a mask as an input, which we call weighted mask. The mask is utilized by gated convolutions to assign more weight to the visible pixels of the occluded instance compared to the background, while ignoring the features of the invisible pixels. By drawing more attention from the visible region, our model can predict the invisible patch more effectively than the baseline models, especially in instances with uniform texture. The model is trained on COCOA dataset and two subsets of it in a self-supervised manner. The results demonstrate that our model generates higher quality and more texture-rich outputs compared to baseline models. Code is available at:

[299] 2407.15208

Flow as the Cross-Domain Manipulation Interface

We present Im2Flow2Act, a scalable learning framework that enables robots to acquire manipulation skills from diverse data sources. The key idea behind Im2Flow2Act is to use object flow as the manipulation interface, bridging domain gaps between different embodiments (i.e., human and robot) and training environments (i.e., real-world and simulated). Im2Flow2Act comprises two components: a flow generation network and a flow-conditioned policy. The flow generation network, trained on human demonstration videos, generates object flow from the initial scene image, conditioned on the task description. The flow-conditioned policy, trained on simulated robot play data, maps the generated object flow to robot actions to realize the desired object movements. By using flow as input, this policy can be directly deployed in the real world with a minimal sim-to-real gap. By leveraging real-world human videos and simulated robot play data, we bypass the challenges of teleoperating physical robots in the real world, resulting in a scalable system for diverse tasks. We demonstrate Im2Flow2Act's capabilities in a variety of real-world tasks, including the manipulation of rigid, articulated, and deformable objects.

[300] 2407.15211

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

[301] 2407.15212

Surfel-based Gaussian Inverse Rendering for Fast and Relightable Dynamic Human Reconstruction from Monocular Video

Efficient and accurate reconstruction of a relightable, dynamic clothed human avatar from a monocular video is crucial for the entertainment industry. This paper introduces the Surfel-based Gaussian Inverse Avatar (SGIA) method, which introduces efficient training and rendering for relightable dynamic human reconstruction. SGIA advances previous Gaussian Avatar methods by comprehensively modeling Physically-Based Rendering (PBR) properties for clothed human avatars, allowing for the manipulation of avatars into novel poses under diverse lighting conditions. Specifically, our approach integrates pre-integration and image-based lighting for fast light calculations that surpass the performance of existing implicit-based techniques. To address challenges related to material lighting disentanglement and accurate geometry reconstruction, we propose an innovative occlusion approximation strategy and a progressive training approach. Extensive experiments demonstrate that SGIA not only achieves highly accurate physical properties but also significantly enhances the realistic relighting of dynamic human avatars, providing a substantial speed advantage. We exhibit more results in our project page: \href{}{}.

[302] 2407.15213

More-than-Moore Microacoustics: A Scalable Fabrication Process for Suspended Lamb Wave Resonators

Deep Ultraviolet (DUV) Photolithography is currently used to fabricate mass-scale integrated circuits (ICs). Its high throughput and resolution could benefit large-scale RF MEMS production for the telecommunication market. We present a process flow to fabricate suspended acoustic resonators using DUV Photolithography. This method allows for scalable production of resonators with critical dimensions of 250 nm and alignment accuracy of less than 100 nm. We show how photoresists and anti-reflective coatings integrate with the process, help with deposition quality and resolution, and how Ion Beam Etching allows for vertical sidewalls of the resonators. We measure resonance frequencies (fr) up to 7.5 GHz and electromechanical couplings up to 8%, and we investigate the uniformity of this process by analyzing the deviation of fs over the wafer surface for four main resonance modes. We show that the deviation of the S0 mode can be kept below 1%. These results indicate the suitability of this process for quick scale-up of Lamb wave resonator technology, bridging the gap from research to industry.

[303] 2407.15216

Explainability Paths for Sustained Artistic Practice with AI

The development of AI-driven generative audio mirrors broader AI trends, often prioritizing immediate accessibility at the expense of explainability. Consequently, integrating such tools into sustained artistic practice remains a significant challenge. In this paper, we explore several paths to improve explainability, drawing primarily from our research-creation practice in training and implementing generative audio models. As practical provisions for improved explainability, we highlight human agency over training materials, the viability of small-scale datasets, the facilitation of the iterative creative process, and the integration of interactive machine learning as a mapping tool. Importantly, these steps aim to enhance human agency over generative AI systems not only during model inference, but also when curating and preprocessing training data as well as during the training phase of models.

[304] 2407.15219

Efficient Visual Transformer by Learnable Token Merging

Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of Information Bottleneck, and a novel and separable variational upper bound for the IB loss is derived. The architecture of mask module in our LTM blocks which generate the token merging mask is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks evidence that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at \url{}.

[305] 2407.15221

Secure Web Objects: Building Blocks for Metaverse Interoperability and Decentralization

This position paper explores how to support the Web's evolution through an underlying data-centric approach that better matches the data-orientedness of modern and emerging applications. We revisit the original vision of the Web as a hypermedia system that supports document composability and application interoperability via name-based data access. We propose the use of secure web objects (SWO), a data-oriented communication approach that can reduce complexity, centrality, and inefficiency, particularly for collaborative and local-first applications, such as the Metaverse and other collaborative applications. SWO are named, signed, application-defined objects that are secured independently of their containers or communications channels, an approach that leverages the results from over a decade-long data-centric networking research. This approach does not require intermediation by aggregators of identity, storage, and other services that are common today. We present a brief design overview, illustrated through prototypes for two editors of shared hypermedia documents: one for 3D and one for LaTeX. We also discuss our findings and suggest a roadmap for future research.

[306] 2407.15224

PUFFLE: Balancing Privacy, Utility, and Fairness in Federated Learning

Training and deploying Machine Learning models that simultaneously adhere to principles of fairness and privacy while ensuring good utility poses a significant challenge. The interplay between these three factors of trustworthiness is frequently underestimated and remains insufficiently explored. Consequently, many efforts focus on ensuring only two of these factors, neglecting one in the process. The decentralization of the datasets and the variations in distributions among the clients exacerbate the complexity of achieving this ethical trade-off in the context of Federated Learning (FL). For the first time in FL literature, we address these three factors of trustworthiness. We introduce PUFFLE, a high-level parameterised approach that can help in the exploration of the balance between utility, privacy, and fairness in FL scenarios. We prove that PUFFLE can be effective across diverse datasets, models, and data distributions, reducing the model unfairness up to 75%, with a maximum reduction in the utility of 17% in the worst-case scenario, while maintaining strict privacy guarantees during the FL training.

[307] 2407.15227

A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech

Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community, contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-Asian violence-provoking speech. Using data from ~420k Twitter posts spanning a 3-year duration (January 1, 2020 to February 1, 2023), we develop a codebook to characterize anti-Asian violence-provoking speech and collect a community-crowdsourced dataset to facilitate its large-scale detection using state-of-the-art classifiers. We contrast the capabilities of natural language processing classifiers, ranging from BERT-based to LLM-based classifiers, in detecting violence-provoking speech with their capabilities to detect anti-Asian hateful speech. In contrast to prior work that has demonstrated the effectiveness of such classifiers in detecting hateful speech ($F_1 = 0.89$), our work shows that accurate and reliable detection of violence-provoking speech is a challenging task ($F_1 = 0.69$). We discuss the implications of our findings, particularly the need for proactive interventions to support Asian communities during public health crises. The resources related to the study are available at

[308] 2407.15228

3D Reconstruction of the Human Colon from Capsule Endoscope Video

As the number of people affected by diseases in the gastrointestinal system is ever-increasing, a higher demand on preventive screening is inevitable. This will significantly increase the workload on gastroenterologists. To help reduce the workload, tools from computer vision may be helpful. In this paper, we investigate the possibility of constructing 3D models of whole sections of the human colon using image sequences from wireless capsule endoscope video, providing enhanced viewing for gastroenterologists. As capsule endoscope images contain distortion and artifacts non-ideal for many 3D reconstruction algorithms, the problem is challenging. However, recent developments of virtual graphics-based models of the human gastrointestinal system, where distortion and artifacts can be enabled or disabled, makes it possible to ``dissect'' the problem. The graphical model also provides a ground truth, enabling computation of geometric distortion introduced by the 3D reconstruction method. In this paper, most distortions and artifacts are left out to determine if it is feasible to reconstruct whole sections of the human gastrointestinal system by existing methods. We demonstrate that 3D reconstruction is possible using simultaneous localization and mapping. Further, to reconstruct the gastrointestinal wall surface from resulting point clouds, varying greatly in density, Poisson surface reconstruction is a good option. The results are promising, encouraging further research on this problem.

[309] 2407.15229

The Hitchhiker's Guide to Human Alignment with *PO

With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence of the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO.

[310] 2407.15233

CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model

Layout generation is the foundation task of intelligent design, which requires the integration of visual aesthetics and harmonious expression of content delivery. However, existing methods still face challenges in generating precise and visually appealing layouts, including blocking, overlap, or spatial misalignment between layouts, which are closely related to the spatial structure of graphic layouts. We find that these methods overly focus on content information and lack constraints on layout spatial structure, resulting in an imbalance of learning content-aware and graphic-aware features. To tackle this issue, we propose Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model (CGB-DM). Specifically, we first design a regulator that balances the predicted content and graphic weight, overcoming the tendency of paying more attention to the content on canvas. Secondly, we introduce a graphic constraint of saliency bounding box to further enhance the alignment of geometric features between layout representations and images. In addition, we adapt a transformer-based diffusion model as the backbone, whose powerful generation capability ensures the quality in layout generation. Extensive experimental results indicate that our method has achieved state-of-the-art performance in both quantitative and qualitative evaluations. Our model framework can also be expanded to other graphic design fields.

[311] 2407.15234

Exploring the Design of Collaborative Applications via the Lens of NDN Workspace

Metaverse applications desire to communicate with semantically identified objects among a diverse set of cyberspace entities, such as cameras for collecting images from, sensors for sensing environment, and users collaborating with each other, all could be nearby or far away, in a timely and secure way. However, supporting the above function faces networking challenges. Today's metaverse implementations are, by and large, use secure transport connections to communicate with cloud servers instead of letting participating entities communicate directly. In this paper, we use the design and implementation of NDN Workspace, a web-based, multi-user collaborative app to showcase a new way to networking that supports many-to-many secure data exchanges among communicating entities directly. NDN Workspace users establish trust relations among each other, exchange URI-identified objects directly, and can collaborate through intermittent connectivity, all in the absence of cloud servers. Its data-centric design offers an exciting new approach to metaverse app development.

[312] 2407.15235

TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples' quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.

[313] 2407.15236

Deep State Space Recurrent Neural Networks for Time Series Forecasting

We explore various neural network architectures for modeling the dynamics of the cryptocurrency market. Traditional linear models often fall short in accurately capturing the unique and complex dynamics of this market. In contrast, Deep Neural Networks (DNNs) have demonstrated considerable proficiency in time series forecasting. This papers introduces novel neural network framework that blend the principles of econometric state space models with the dynamic capabilities of Recurrent Neural Networks (RNNs). We propose state space models using Long Short Term Memory (LSTM), Gated Residual Units (GRU) and Temporal Kolmogorov-Arnold Networks (TKANs). According to the results, TKANs, inspired by Kolmogorov-Arnold Networks (KANs) and LSTM, demonstrate promising outcomes.

[314] 2407.15237

Two eyes, Two views, and finally, One summary! Towards Multi-modal Multi-tasking Knowledge-Infused Medical Dialogue Summarization

We often summarize a multi-party conversation in two stages: chunking with homogeneous units and summarizing the chunks. Thus, we hypothesize that there exists a correlation between homogeneous speaker chunking and overall summarization tasks. In this work, we investigate the effectiveness of a multi-faceted approach that simultaneously produces summaries of medical concerns, doctor impressions, and an overall view. We introduce a multi-modal, multi-tasking, knowledge-infused medical dialogue summary generation (MMK-Summation) model, which is incorporated with adapter-based fine-tuning through a gated mechanism for multi-modal information integration. The model, MMK-Summation, takes dialogues as input, extracts pertinent external knowledge based on the context, integrates the knowledge and visual cues from the dialogues into the textual content, and ultimately generates concise summaries encompassing medical concerns, doctor impressions, and a comprehensive overview. The introduced model surpasses multiple baselines and traditional summarization models across all evaluation metrics (including human evaluation), which firmly demonstrates the efficacy of the knowledge-guided multi-tasking, multimodal medical conversation summarization. The code is available at

[315] 2407.15238

Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling

Energy based models (EBMs) are appealing for their generality and simplicity in data likelihood modeling, but have conventionally been difficult to train due to the unstable and time-consuming implicit MCMC sampling during contrastive divergence training. In this paper, we present a novel energy-based generative framework, Variational Potential Flow (VAPO), that entirely dispenses with implicit MCMC sampling and does not rely on complementary latent models or cooperative training. The VAPO framework aims to learn a potential energy function whose gradient (flow) guides the prior samples, so that their density evolution closely follows an approximate data likelihood homotopy. An energy loss function is then formulated to minimize the Kullback-Leibler divergence between density evolution of the flow-driven prior and the data likelihood homotopy. Images can be generated after training the potential energy, by initializing the samples from Gaussian prior and solving the ODE governing the potential flow on a fixed time interval using generic ODE solvers. Experiment results show that the proposed VAPO framework is capable of generating realistic images on various image datasets. In particular, our proposed framework achieves competitive FID scores for unconditional image generation on the CIFAR-10 and CelebA datasets.

[316] 2407.15239

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Image-text retrieval (ITR), an important task in information retrieval (IR), is driven by pretrained vision-language models (VLMs) that consistently achieve state-of-the-art performance. However, a significant challenge lies in the brittleness of existing ITR benchmarks. In standard datasets for the task, captions often provide broad summaries of scenes, neglecting detailed information about specific concepts. Additionally, the current evaluation setup assumes simplistic binary matches between images and texts and focuses on intra-modality rather than cross-modal relationships, which can lead to misinterpretations of model performance. Motivated by this gap, in this study, we focus on examining the brittleness of the ITR evaluation pipeline with a focus on concept granularity. We start by analyzing two common benchmarks, MS-COCO and Flickr30k, and compare them with their augmented versions, MS-COCO-FG and Flickr30k-FG, given a specified set of linguistic features capturing concept granularity. We discover that Flickr30k-FG and MS COCO-FG consistently achieve higher scores across all the selected features. To investigate the performance of VLMs on coarse and fine-grained datasets, we introduce a taxonomy of perturbations. We apply these perturbations to the selected datasets. We evaluate four state-of-the-art models - ALIGN, AltCLIP, CLIP, and GroupViT - on the standard and fine-grained datasets under zero-shot conditions, with and without the applied perturbations. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. Moreover, the relative performance drop across all setups is consistent across all models and datasets, indicating that the issue lies within the benchmarks. We conclude the paper by providing an agenda for improving ITR evaluation pipelines.

[317] 2407.15240

BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM

Text-to-Image (T2I) generative models are becoming more crucial in terms of their ability to generate complex and high-quality images, which also raises concerns about the social biases in their outputs, especially in human generation. Sociological research has established systematic classifications of bias; however, existing research of T2I models often conflates different types of bias, hindering the progress of these methods. In this paper, we introduce BIGbench, a unified benchmark for Biases of Image Generation with a well-designed dataset. In contrast to existing benchmarks, BIGbench classifies and evaluates complex biases into four dimensions: manifestation of bias, visibility of bias, acquired attributes, and protected attributes. Additionally, BIGbench applies advanced multi-modal large language models (MLLM), achieving fully automated evaluation while maintaining high accuracy. We apply BIGbench to evaluate eight recent general T2I models and three debiased methods. We also conduct human evaluation, whose results demonstrated the effectiveness of BIGbench in aligning images and identifying various biases. Besides, our study also revealed new research directions about biases, including the side-effect of irrelevant protected attributes and distillation. Our dataset and benchmark is openly accessible to the research community to ensure the reproducibility.

[318] 2407.15241

Temporal Abstraction in Reinforcement Learning with Offline Data

Standard reinforcement learning algorithms with a single policy perform poorly on tasks in complex environments involving sparse rewards, diverse behaviors, or long-term planning. This led to the study of algorithms that incorporate temporal abstraction by training a hierarchy of policies that plan over different time scales. The options framework has been introduced to implement such temporal abstraction by learning low-level options that act as extended actions controlled by a high-level policy. The main challenge in applying these algorithms to real-world problems is that they suffer from high sample complexity to train multiple levels of the hierarchy, which is impossible in online settings. Motivated by this, in this paper, we propose an offline hierarchical RL method that can learn options from existing offline datasets collected by other unknown agents. This is a very challenging problem due to the distribution mismatch between the learned options and the policies responsible for the offline dataset and to our knowledge, this is the first work in this direction. In this work, we propose a framework by which an online hierarchical reinforcement learning algorithm can be trained on an offline dataset of transitions collected by an unknown behavior policy. We validate our method on Gym MuJoCo locomotion environments and robotic gripper block-stacking tasks in the standard as well as transfer and goal-conditioned settings.

[319] 2407.15243

Genetic Algorithm to Optimize Design of Micro-Surgical Scissors

Microrobotics is an attractive area of research as small-scale robots have the potential to improve the precision and dexterity offered by minimally invasive surgeries. One example of such a tool is a pair of micro-surgical scissors that was developed for cutting of tumors or cancerous tissues present deep inside the body such as in the brain. This task is often deemed difficult or impossible with conventional robotic tools due to their size and dexterity. The scissors are designed with two magnets placed a specific distance apart to maximize deflection and generate cutting forces. However, remote actuation and size requirements of the micro-surgical scissors limits the force that can be generated to puncture the tissue. To address the limitation of small output forces, we use an evolutionary algorithm to further optimize the performance of the scissors. In this study, the design of the previously developed untethered micro-surgical scissors has been modified and their performance is enhanced by determining the optimal position of the magnets as well as the direction of each magnetic moment. The developed algorithm is successfully applied to a 4-magnet configuration which results in increased net torque. This improvement in net torque is directly translated into higher cutting forces. The new configuration generates a cutting force of 58 mN from 80 generations of the evolutionary algorithm which is a 1.65 times improvement from the original design. Furthermore, the developed algorithm has the advantage that it can be deployed with minor modifications to other microrobotic tools and systems, opening up new possibilities for various medical procedures and applications.

[320] 2407.15247

TimeInf: Time Series Data Contribution via Influence Functions

Evaluating the contribution of individual data points to a model's prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and texts; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to time series datasets, the problem of estimating data contribution in such settings remains unexplored, possibly due to challenges associated with handling inherent temporal dependencies. This paper introduces TimeInf, a data contribution estimation method for time-series datasets. TimeInf uses influence functions to attribute model predictions to individual time points while preserving temporal structures. Our extensive empirical results demonstrate that TimeInf outperforms state-of-the-art methods in identifying harmful anomalies and helpful time points for forecasting. Additionally, TimeInf offers intuitive and interpretable attributions of data values, allowing us to easily distinguish diverse anomaly patterns through visualizations.

[321] 2407.15248

XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models

In this survey, we address the key challenges in Large Language Models (LLM) research, focusing on the importance of interpretability. Driven by increasing interest from AI and business sectors, we highlight the need for transparency in LLMs. We examine the dual paths in current LLM research and eXplainable Artificial Intelligence (XAI): enhancing performance through XAI and the emerging focus on model interpretability. Our paper advocates for a balanced approach that values interpretability equally with functional advancements. Recognizing the rapid development in LLM research, our survey includes both peer-reviewed and preprint (arXiv) papers, offering a comprehensive overview of XAI's role in LLM research. We conclude by urging the research community to advance both LLM and XAI fields together.

[322] 2407.15249

Hurricane Evacuation Analysis with Large-scale Mobile Device Location Data during Hurricane Ian

Hurricane Ian is the deadliest and costliest hurricane in Florida's history, with 2.5 million people ordered to evacuate. As we witness increasingly severe hurricanes in the context of climate change, mobile device location data offers an unprecedented opportunity to study hurricane evacuation behaviors. With a terabyte-level GPS dataset, we introduce a holistic hurricane evacuation behavior algorithm with a case study of Ian: we infer evacuees' departure time and categorize them into different behavioral groups, including self, voluntary, mandatory, shadow and in-zone evacuees. Results show the landfall area (Fort Myers, Lee County) had lower out-of-zone but higher overall evacuation rate, while the predicted landfall area (Tampa, Hillsborough County) had the opposite, suggesting the effects of delayed evacuation order. Out-of-zone evacuation rates would increase from shore to inland. Spatiotemporal analysis identified three evacuation waves: during formation, before landfall, and after landfall. These insights are valuable for enhancing future disaster planning and management.

[323] 2407.15252

An Adaptive System for Wearable Devices to Detect Stress Using Physiological Signals

Timely stress detection is crucial for protecting vulnerable groups from long-term detrimental effects by enabling early intervention. Wearable devices, by collecting real-time physiological signals, offer a solution for accurate stress detection accommodating individual differences. This position paper introduces an adaptive framework for personalized stress detection using PPG and EDA signals. Unlike traditional methods that rely on a generalized model, which may suffer performance drops when applied to new users due to domain shifts, this framework aims to provide each user with a personalized model for higher stress detection accuracy. The framework involves three stages: developing a generalized model offline with an initial dataset, adapting the model to the user's unlabeled data, and fine-tuning it with a small set of labeled data obtained through user interaction. This approach not only offers a foundation for mobile applications that provide personalized stress detection and intervention but also has the potential to address a wider range of mental health issues beyond stress detection using physiological signals.

[324] 2407.15255

Explaining Decisions of Agents in Mixed-Motive Games

In recent years, agents have become capable of communicating seamlessly via natural language and navigating in environments that involve cooperation and competition, a fact that can introduce social dilemmas. Due to the interleaving of cooperation and competition, understanding agents' decision-making in such environments is challenging, and humans can benefit from obtaining explanations. However, such environments and scenarios have rarely been explored in the context of explainable AI. While some explanation methods for cooperative environments can be applied in mixed-motive setups, they do not address inter-agent competition, cheap-talk, or implicit communication by actions. In this work, we design explanation methods to address these issues. Then, we proceed to demonstrate their effectiveness and usefulness for humans, using a non-trivial mixed-motive game as a test case. Lastly, we establish generality and demonstrate the applicability of the methods to other games, including one where we mimic human game actions using large language models.

[325] 2407.15259

New Rules for Causal Identification with Background Knowledge

Identifying causal relations is crucial for a variety of downstream tasks. In additional to observational data, background knowledge (BK), which could be attained from human expertise or experiments, is usually introduced for uncovering causal relations. This raises an open problem that in the presence of latent variables, what causal relations are identifiable from observational data and BK. In this paper, we propose two novel rules for incorporating BK, which offer a new perspective to the open problem. In addition, we show that these rules are applicable in some typical causality tasks, such as determining the set of possible causal effects with observational data. Our rule-based approach enhances the state-of-the-art method by circumventing a process of enumerating block sets that would otherwise take exponential complexity.

[326] 2407.15260

Weakly SSM : On the Viability of Weakly Supervised Segmentations for Statistical Shape Modeling

Statistical Shape Models (SSMs) excel at identifying population level anatomical variations, which is at the core of various clinical and biomedical applications, including morphology-based diagnostics and surgical planning. However, the effectiveness of SSM is often constrained by the necessity for expert-driven manual segmentation, a process that is both time-intensive and expensive, thereby restricting their broader application and utility. Recent deep learning approaches enable the direct estimation of Statistical Shape Models (SSMs) from unsegmented images. While these models can predict SSMs without segmentation during deployment, they do not address the challenge of acquiring the manual annotations needed for training, particularly in resource-limited settings. Semi-supervised and foundation models for anatomy segmentation can mitigate the annotation burden. Yet, despite the abundance of available approaches, there are no established guidelines to inform end-users on their effectiveness for the downstream task of constructing SSMs. In this study, we systematically evaluate the potential of weakly supervised methods as viable alternatives to manual segmentation's for building SSMs. We establish a new performance benchmark by employing various semi-supervised and foundational model methods for anatomy segmentation under low annotation settings, utilizing the predicted segmentation's for the task of SSM. We compare the modes of shape variation and use quantitative metrics to compare against a shape model derived from a manually annotated dataset. Our results indicate that some methods produce noisy segmentation, which is very unfavorable for SSM tasks, while others can capture the correct modes of variations in the population cohort with 60-80\% reduction in required manual annotation.

[327] 2407.15261

Pandora's Box Problem Over Time

The Pandora's Box problem models the search for the best alternative when evaluation is costly. In its simplest variant, a decision maker is presented with $n$ boxes, each associated with a cost of inspection and a distribution over the reward hidden within. The decision maker inspects a subset of these boxes one after the other, in a possibly adaptive ordering, and obtains as utility the difference between the largest reward uncovered and the sum of the inspection costs. While this classic version of the problem is well understood (Weitzman 1979), recent years have seen a flourishing of the literature on variants of the problem. In this paper, we introduce a general framework -- the Pandora's Box Over Time problem -- that captures a wide range of variants where time plays a role, e.g., as it might constrain the schedules of exploration and influence both costs and rewards. In the Pandora's Box Over Time problem, each box is characterized by time-dependent rewards and costs, and inspecting it might require a box-specific processing time. Moreover, once a box is inspected, its reward may deteriorate over time, possibly differently for each box. Our main result is an efficient $21.3$-approximation to the optimal strategy, which is NP-hard to compute in general. We further obtain improved results for the natural special cases where boxes have no processing time, or when costs and reward distributions do not depend on time (but rewards may deteriorate after inspecting).

[328] 2407.15264

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storagebased approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads.LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speed up on end-to-end epoch time while running large-scale GNN training

[329] 2407.15266

STrack: A Reliable Multipath Transport for AI/ML Clusters

Emerging artificial intelligence (AI) and machine learning (ML) workloads present new challenges of managing the collective communication used in distributed training across hundreds or even thousands of GPUs. This paper presents STrack, a novel hardware-offloaded reliable transport protocol aimed at improving the performance of AI /ML workloads by rethinking key aspects of the transport layer. STrack optimizes congestion control and load balancing in tandem: it incorporates an adaptive load balancing algorithm leveraging ECN, while adopts RTT as multi-bit congestion indicators for precise congestion window adjustment. Additionally, STrack facilitates out-of-order delivery, selective retransmission, and swift loss recovery in hardware for multipath environment. The extensive simulation comparing STrack and RoCEv2 demonstrates that STrack outperforms RoCEv2 by up to 6X with synthetic workloads and by 27.4% with collective workloads, even with the optimized RoCEv2 system setup.

[330] 2407.15267

A Learning-Based Attack Framework to Break SOTA Poisoning Defenses in Federated Learning

Federated Learning (FL) is a novel client-server distributed learning framework that can protect data privacy. However, recent works show that FL is vulnerable to poisoning attacks. Many defenses with robust aggregators (AGRs) are proposed to mitigate the issue, but they are all broken by advanced attacks. Very recently, some renewed robust AGRs are designed, typically with novel clipping or/and filtering strate-gies, and they show promising defense performance against the advanced poisoning attacks. In this paper, we show that these novel robust AGRs are also vulnerable to carefully designed poisoning attacks. Specifically, we observe that breaking these robust AGRs reduces to bypassing the clipping or/and filtering of malicious clients, and propose an optimization-based attack framework to leverage this observation. Under the framework, we then design the customized attack against each robust AGR. Extensive experiments on multiple datasets and threat models verify our proposed optimization-based attack can break the SOTA AGRs. We hence call for novel defenses against poisoning attacks to FL. Code is available at: BreakSTOAPoisoningDefenses.

[331] 2407.15268

Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation

Multimodal foundation models hold significant potential for automating radiology report generation, thereby assisting clinicians in diagnosing cardiac diseases. However, generated reports often suffer from serious factual inaccuracy. In this paper, we introduce a fact-aware multimodal retrieval-augmented pipeline in generating accurate radiology reports (FactMM-RAG). We first leverage RadGraph to mine factual report pairs, then integrate factual knowledge to train a universal multimodal retriever. Given a radiology image, our retriever can identify high-quality reference reports to augment multimodal foundation models, thus enhancing the factual completeness and correctness of report generation. Experiments on two benchmark datasets show that our multimodal retriever outperforms state-of-the-art retrievers on both language generation and radiology-specific metrics, up to 6.5% and 2% score in F1CheXbert and F1RadGraph. Further analysis indicates that employing our factually-informed training strategy imposes an effective supervision signal, without relying on explicit diagnostic label guidance, and successfully propagates fact-aware capabilities from the multimodal retriever to the multimodal foundation model in radiology report generation.

[332] 2407.15272

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images remain underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source MLLMs and close-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at

[333] 2407.15273

Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency

Graph Out-of-Distribution (OOD), requiring that models trained on biased data generalize to the unseen test data, has considerable real-world applications. One of the most mainstream methods is to extract the invariant subgraph by aligning the original and augmented data with the help of environment augmentation. However, these solutions might lead to the loss or redundancy of semantic subgraphs and result in suboptimal generalization. To address this challenge, we propose exploiting Probability of Necessity and Sufficiency (PNS) to extract sufficient and necessary invariant substructures. Beyond that, we further leverage the domain variant subgraphs related to the labels to boost the generalization performance in an ensemble manner. Specifically, we first consider the data generation process for graph data. Under mild conditions, we show that the sufficient and necessary invariant subgraph can be extracted by minimizing an upper bound, built on the theoretical advance of the probability of necessity and sufficiency. To further bridge the theory and algorithm, we devise the model called Sufficiency and Necessity Inspired Graph Learning (SNIGL), which ensembles an invariant subgraph classifier on top of latent sufficient and necessary invariant subgraphs, and a domain variant subgraph classifier specific to the test domain for generalization enhancement. Experimental results demonstrate that our SNIGL model outperforms the state-of-the-art techniques on six public benchmarks, highlighting its effectiveness in real-world scenarios.

[334] 2407.15277

Conformal Predictions under Markovian Data

We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. Finally we present $K$-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without really affecting the size of the prediction set. We finally test our algorithms on synthetic and real-world datasets.

[335] 2407.15278

Minimizing the Number of Roles in Bottom-Up Role-Mining using Maximal Biclique Enumeration

Bottom-up role-mining is the determination of a set of roles given as input a set of users and the permissions those users possess. It is well-established in the research literature, and in practice, as an important problem in information security. A natural objective that has been explored in prior work is for the set of roles to be of minimum size. We address this problem for practical inputs while reconciling foundations, specifically, that the problem is \cnph. We first observe that an approach from prior work that exploits a sufficient condition for an efficient algorithm, while a useful first step, does not scale to more recently proposed benchmark inputs. We propose a new technique: the enumeration of maximal bicliques. We point out that the number of maximal bicliques provides a natural measure of the hardness of an input. We leverage the enumeration of maximal bicliques in two different ways. Our first approach addresses more than half the benchmark inputs to yield exact results. The other approach is needed for hard instances; in it, we identify and adopt as roles those that correspond to large maximal bicliques. We have implemented all our algorithms and carried out an extensive empirical assessment, which suggests that our approaches are promising. Our code is available publicly as open-source.

[336] 2407.15281

SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking

Understanding rich dialogues often requires NLP systems to access relevant commonsense persona knowledge, but retrieving this knowledge is challenging due to complex contexts and the implicit nature of commonsense. This paper presents our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge, addressing the critical need for integrating persona and commonsense knowledge in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that leverages Large Language Models to generate high-quality synthetic datasets for training commonsense persona knowledge linkers. To demonstrate the efficacy of our approach, we present SynCPKL, a new dataset specifically designed for this task. Our experiments validate the effectiveness of SynCPKL for training commonsense persona knowledge linkers. Additionally, our top-performing model, Derberta-SynCPKL, secured first place in the CPKL challenge by a 16% improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at

[337] 2407.15282

Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

In this technical report, we detail our first-place solution for the 2024 Waymo Open Dataset Challenge's semantic segmentation track. We significantly enhanced the performance of Point Transformer V3 on the Waymo benchmark by implementing cutting-edge, plug-and-play training and inference technologies. Notably, our advanced version, Point Transformer V3 Extreme, leverages multi-frame training and a no-clipping-point policy, achieving substantial gains over the original PTv3 performance. Additionally, employing a straightforward model ensemble strategy further boosted our results. This approach secured us the top position on the Waymo Open Dataset semantic segmentation leaderboard, markedly outperforming other entries.

[338] 2407.15283

Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms

Industry is rapidly moving towards fully autonomous and interconnected systems that can detect and adapt to changing conditions, including machine hardware faults. Traditional methods for adding hardware fault tolerance to machines involve duplicating components and algorithmically reconfiguring a machine's processes when a fault occurs. However, the growing interest in reinforcement learning-based robotic control offers a new perspective on achieving hardware fault tolerance. However, limited research has explored the potential of these approaches for hardware fault tolerance in machines. This paper investigates the potential of two state-of-the-art reinforcement learning algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to enhance hardware fault tolerance into machines. We assess the performance of these algorithms in two OpenAI Gym simulated environments, Ant-v2 and FetchReach-v1. Robot models in these environments are subjected to six simulated hardware faults. Additionally, we conduct an ablation study to determine the optimal method for transferring an agent's knowledge, acquired through learning in a normal (pre-fault) environment, to a (post-)fault environment in a continual learning setting. Our results demonstrate that reinforcement learning-based approaches can enhance hardware fault tolerance in simulated machines, with adaptation occurring within minutes. Specifically, PPO exhibits the fastest adaptation when retaining the knowledge within its models, while SAC performs best when discarding all acquired knowledge. Overall, this study highlights the potential of reinforcement learning-based approaches, such as PPO and SAC, for hardware fault tolerance in machines. These findings pave the way for the development of robust and adaptive machines capable of effectively operating in real-world scenarios.

[339] 2407.15284

Revisiting Neighborhood Aggregation in Graph Neural Networks for Node Classification using Statistical Signal Processing

We delve into the issue of node classification within graphs, specifically reevaluating the concept of neighborhood aggregation, which is a fundamental component in graph neural networks (GNNs). Our analysis reveals conceptual flaws within certain benchmark GNN models when operating under the assumption of edge-independent node labels, a condition commonly observed in benchmark graphs employed for node classification. Approaching neighborhood aggregation from a statistical signal processing perspective, our investigation provides novel insights which may be used to design more efficient GNN models.

[340] 2407.15285

New Philosopher Inequalities for Online Bayesian Matching, via Pivotal Sampling

We study the polynomial-time approximability of the optimal online stochastic bipartite matching algorithm, initiated by Papadimitriou et al. (EC'21). Here, nodes on one side of the graph are given upfront, while at each time $t$, an online node and its edge weights are drawn from a time-dependent distribution. The optimal algorithm is $\textsf{PSPACE}$-hard to approximate within some universal constant. We refer to this optimal algorithm, which requires time to think (compute), as a philosopher, and refer to polynomial-time online approximations of the above as philosopher inequalities. The best known philosopher inequality for online matching yields a $0.652$-approximation. In contrast, the best possible prophet inequality, or approximation of the optimum offline solution, is $0.5$. Our main results are a $0.678$-approximate algorithm and a $0.685$-approximation for a vertex-weighted special case. Notably, both bounds exceed the $0.666$-approximation of the offline optimum obtained by Tang, Wu, and Wu (STOC'22) for the vertex-weighted problem. Building on our algorithms and the recent black-box reduction of Banihashem et al. (SODA'24), we provide polytime (pricing-based) truthful mechanisms which $0.678$-approximate the social welfare of the optimal online allocation for bipartite matching markets. Our online allocation algorithm relies on the classic pivotal sampling algorithm (Srinivasan FOCS'01, Gandhi et al. J.ACM'06), along with careful discarding to obtain negative correlations between offline nodes. Consequently, the analysis boils down to examining the distribution of a weighted sum $X$ of negatively correlated Bernoulli variables, specifically lower bounding its mass below a threshold, $\mathbb{E}[\min(1,X)]$, of possible independent interest. Interestingly, our bound relies on an imaginary invocation of pivotal sampling.

[341] 2407.15286

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, the process of how injecting self-correction instructions can modify the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What are the internal mechanisms of LLMs, e.g., hidden states, that are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states. Through empirical investigation with tasks of language generation and multi-choice question answering, we conclude: (i) LLMs exhibit good performance across both tasks, and self-correction instructions are particularly beneficial when the correct answer is already top-ranked; (ii) The morality levels in intermediate hidden states are strong indicators as to whether one instruction would be more effective than another; (iii) Based on our analysis of intermediate hidden states and task case studies of self-correction behaviors, we are first to propose the hypothesis that intrinsic moral self-correction is in fact superficial.

[342] 2407.15288

SLA Decomposition for Network Slicing: A Deep Neural Network Approach

For a network slice that spans multiple technology and/or administrative domains, these domains must ensure that the slice's End-to-End (E2E) Service Level Agreement (SLA) is met. Thus, the E2E SLA should be decomposed to partial SLAs, assigned to each of these domains. Assuming a two level management architecture consisting of an E2E service orchestrator and local domain controllers, we consider that the former is only aware of historical data of the local controllers' responses to previous slice requests, and captures this knowledge in a risk model per domain. In this study, we propose the use of Neural Network (NN) based risk models, using such historical data, to decompose the E2E SLA. Specifically, we introduce models that incorporate monotonicity, applicable even in cases involving small datasets. An empirical study on a synthetic multidomain dataset demonstrates the efficiency of our approach.

[343] 2407.15291

Evidence-Based Temporal Fact Verification

Automated fact verification plays an essential role in fostering trust in the digital space. Despite the growing interest, the verification of temporal facts has not received much attention in the community. Temporal fact verification brings new challenges where cues of the temporal information need to be extracted and temporal reasoning involving various temporal aspects of the text must be applied. In this work, we propose an end-to-end solution for temporal fact verification that considers the temporal information in claims to obtain relevant evidence sentences and harness the power of large language model for temporal reasoning. Recognizing that temporal facts often involve events, we model these events in the claim and evidence sentences. We curate two temporal fact datasets to learn time-sensitive representations that encapsulate not only the semantic relationships among the events, but also their chronological proximity. This allows us to retrieve the top-k relevant evidence sentences and provide the context for a large language model to perform temporal reasoning and outputs whether a claim is supported or refuted by the retrieved evidence sentences. Experiment results demonstrate that the proposed approach significantly enhances the accuracy of temporal claim verification, thereby advancing current state-of-the-art in automated fact verification.

[344] 2407.15293

Enhancing Retinal Disease Classification from OCTA Images via Active Learning Techniques

Eye diseases are common in older Americans and can lead to decreased vision and blindness. Recent advancements in imaging technologies allow clinicians to capture high-quality images of the retinal blood vessels via Optical Coherence Tomography Angiography (OCTA), which contain vital information for diagnosing these diseases and expediting preventative measures. OCTA provides detailed vascular imaging as compared to the solely structural information obtained by common OCT imaging. Although there have been considerable studies on OCT imaging, there have been limited to no studies exploring the role of artificial intelligence (AI) and machine learning (ML) approaches for predictive modeling with OCTA images. In this paper, we explore the use of deep learning to identify eye disease in OCTA images. However, due to the lack of labeled data, the straightforward application of deep learning doesn't necessarily yield good generalization. To this end, we utilize active learning to select the most valuable subset of data to train our model. We demonstrate that active learning subset selection greatly outperforms other strategies, such as inverse frequency class weighting, random undersampling, and oversampling, by up to 49% in F1 evaluation.

[345] 2407.15295

VideoGameBunny: Towards vision assistants for video games

Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at

[346] 2407.15296

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.

[347] 2407.15300

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Speech Emotion Recognition (SER) has been traditionally formulated as a classification task. However, emotions are generally a spectrum whose distribution varies from situation to situation leading to poor Out-of-Domain (OOD) performance. We take inspiration from statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. The formulation breaks SER into predicting acoustic model features weighted by language model prediction. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMAD, IEMOCAP) not used in training. SELM achieves significant improvements over the state-of-the-art baselines, with 17% and 7% relative accuracy gains for RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance by Few-Shot Learning using a few annotated examples. The results highlight the effectiveness of our SER formulation, especially to improve performance in OOD scenarios.

[348] 2407.15302

Fever Detection with Infrared Thermography: Enhancing Accuracy through Machine Learning Techniques

The COVID-19 pandemic has underscored the necessity for advanced diagnostic tools in global health systems. Infrared Thermography (IRT) has proven to be a crucial non-contact method for measuring body temperature, vital for identifying febrile conditions associated with infectious diseases like COVID-19. Traditional non-contact infrared thermometers (NCITs) often exhibit significant variability in readings. To address this, we integrated machine learning algorithms with IRT to enhance the accuracy and reliability of temperature measurements. Our study systematically evaluated various regression models using heuristic feature engineering techniques, focusing on features' physiological relevance and statistical significance. The Convolutional Neural Network (CNN) model, utilizing these techniques, achieved the lowest RMSE of 0.2223, demonstrating superior performance compared to results reported in previous literature. Among non-neural network models, the Binning method achieved the best performance with an RMSE of 0.2296. Our findings highlight the potential of combining advanced feature engineering with machine learning to improve diagnostic tools' effectiveness, with implications extending to other non-contact or remote sensing biomedical applications. This paper offers a comprehensive analysis of these methodologies, providing a foundation for future research in the field of non-invasive medical diagnostics.

[349] 2407.15304

Appearance-Based Loop Closure Detection for Online Large-Scale and Long-Term Operation

In appearance-based localization and mapping, loop closure detection is the process used to determinate if the current observation comes from a previously visited location or a new one. As the size of the internal map increases, so does the time required to compare new observations with all stored locations, eventually limiting online processing. This paper presents an online loop closure detection approach for large-scale and long-term operation. The approach is based on a memory management method, which limits the number of locations used for loop closure detection so that the computation time remains under real-time constraints. The idea consists of keeping the most recent and frequently observed locations in a Working Memory (WM) used for loop closure detection, and transferring the others into a Long-Term Memory (LTM). When a match is found between the current location and one stored in WM, associated locations stored in LTM can be updated and remembered for additional loop closure detections. Results demonstrate the approach's adaptability and scalability using ten standard data sets from other appearance-based loop closure approaches, one custom data set using real images taken over a 2 km loop of our university campus, and one custom data set (7 hours) using virtual images from the racing video game ``Need for Speed: Most Wanted''.

[350] 2407.15305

Online Global Loop Closure Detection for Large-Scale Multi-Session Graph-Based SLAM

For large-scale and long-term simultaneous localization and mapping (SLAM), a robot has to deal with unknown initial positioning caused by either the kidnapped robot problem or multi-session mapping. This paper addresses these problems by tying the SLAM system with a global loop closure detection approach, which intrinsically handles these situations. However, online processing for global loop closure detection approaches is generally influenced by the size of the environment. The proposed graph-based SLAM system uses a memory management approach that only consider portions of the map to satisfy online processing requirements. The approach is tested and demonstrated using five indoor mapping sessions of a building using a robot equipped with a laser rangefinder and a Kinect.

[351] 2407.15309

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly bounded by memory. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using paged Attention mechanisms, they still suffer from inefficient memory and computational operations due to the tightly coupled page management and computation kernels. This study introduces the vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to SGLang Triton prefix-prefilling kernels and vLLM paged Attention kernel, respectively. Furthermore, it frees approximately 71.25% (57GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.

[352] 2407.15312

FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification

Histopathological image classification constitutes a pivotal task in computer-aided diagnostics. The precise identification and categorization of histopathological images are of paramount significance for early disease detection and treatment. In the diagnostic process of pathologists, a multi-tiered approach is typically employed to assess abnormalities in cell regions at different magnifications. However, feature extraction is often performed at a single granularity, overlooking the multi-granular characteristics of cells. To address this issue, we propose the Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN). Inspired by the multi-granular diagnostic approach of pathologists, we perform feature extraction on cell structures at coarse, medium, and fine granularity, enabling the model to fully harness the information in histopathological images. We incorporate the theory of fuzzy logic to address the challenge of redundant key information arising during multi-granular feature extraction. Cell features are described from different perspectives using multiple fuzzy membership functions, which are fused to create universal fuzzy features. A fuzzy-guided cross-attention module guides universal fuzzy features toward multi-granular features. We propagate these features through an encoder to all patch tokens, aiming to achieve enhanced classification accuracy and robustness. In experiments on multiple public datasets, our model exhibits a significant improvement in accuracy over commonly used classification methods for histopathological image classification and shows commendable interpretability.

[353] 2407.15313

Should we use model-free or model-based control? A case study of battery management systems

Reinforcement learning (RL) and model predictive control (MPC) each offer distinct advantages and limitations when applied to control problems in power and energy systems. Despite various studies on these methods, benchmarks remain lacking and the preference for RL over traditional controls is not well understood. In this work, we put forth a comparative analysis using RL- and MPC-based controllers for optimizing a battery management system (BMS). The BMS problem aims to minimize costs while adhering to operational limits. by adjusting the battery (dis)charging in response to fluctuating electricity prices over a time horizon. The MPC controller uses a learningbased forecast of future demand and price changes to formulate a multi-period linear program, that can be solved using off-the-shelf solvers. Meanwhile, the RL controller requires no timeseries modeling but instead is trained from the sample trajectories using the proximal policy optimization (PPO) algorithm. Numerical tests compare these controllers across optimality, training time, testing time, and robustness, providing a comprehensive evaluation of their efficacy. RL not only yields optimal solutions quickly but also ensures robustness to shifts in customer behavior, such as changes in demand distribution. However, as expected, training the RL agent is more time-consuming than MPC.

[354] 2407.15315

A Fast and Accurate Solver for the Fractional Fokker-Planck Equation with Dirac-Delta Initial Conditions

The classical Fokker-Planck equation (FPE) is a key tool in physics for describing systems influenced by drag forces and Gaussian noise, with applications spanning multiple fields. We consider the fractional Fokker-Planck equation (FFPE), which models the time evolution of probability densities for systems driven by L\'evy processes, relevant in scenarios where Gaussian assumptions fail. The paper presents an efficient and accurate numerical approach for the free-space FFPE with constant coefficients and Dirac-delta initial conditions. This method utilizes the integral representation of the solutions and enables the efficient handling of very high-dimensional problems using fast algorithms. Our work is the first to present a high-precision numerical solver for the free-space FFPE with Dirac-delta initial conditions. This opens the door for future research on more complex scenarios, including those with variable coefficients and other types of initial conditions.

[355] 2407.15316

AI as a Tool for Fair Journalism: Case Studies from Malta

In today`s media landscape, the role of Artificial Intelligence (AI) in shaping societal perspectives and journalistic integrity is becoming increasingly apparent. This paper presents two case studies centred on Malta`s media market featuring technical novelty. Despite its relatively small scale, Malta offers invaluable insights applicable to both similar and broader media contexts. These two projects focus on media monitoring and present tools designed to analyse potential biases in news articles and television news segments. The first project uses Computer Vision and Natural Language Processing techniques to analyse the coherence between images in news articles and their corresponding captions, headlines, and article bodies. The second project employs computer vision techniques to track individuals` on-screen time or visual exposure in news videos, providing queryable data. These initiatives aim to contribute to society by providing both journalists and the public with the means to identify biases. Furthermore, we make these tools accessible to journalists to improve the trustworthiness of media outlets by offering robust tools for detecting and reducing bias.

[356] 2407.15317

Open-CD: A Comprehensive Toolbox for Change Detection

We present Open-CD, a change detection toolbox that contains a rich set of change detection methods as well as related components and modules. The toolbox started from a series of open source general vision task tools, including OpenMMLab Toolkits, PyTorch Image Models, etc. It gradually evolves into a unified platform that covers many popular change detection methods and contemporary modules. It not only includes training and inference codes, but also provides some useful scripts for data analysis. We believe this toolbox is by far the most complete change detection toolbox. In this report, we introduce the various features, supported methods and applications of Open-CD. In addition, we also conduct a benchmarking study on different methods and components. We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new change detectors. Code and models are available at \url{}. Pioneeringly, this report also includes brief descriptions of the algorithms supported in Open-CD, mainly contributed by their authors. We sincerely encourage researchers in this field to participate in this project and work together to create a more open community. This toolkit and report will be kept updated.

[357] 2407.15318

Modified Bat Algorithm: A Newly Proposed Approach for Solving Complex and Real-World Problems

Bat Algorithm (BA) is a nature-inspired metaheuristic search algorithm designed to efficiently explore complex problem spaces and find near-optimal solutions. The algorithm is inspired by the echolocation behavior of bats, which acts as a signal system to estimate the distance and hunt prey. Although the BA has proven effective for various optimization problems, it exhibits limited exploration ability and susceptibility to local optima. The algorithm updates velocities and positions based on the current global best solution, causing all agents to converge towards a specific location, potentially leading to local optima issues in optimization problems. On this premise, this paper proposes the Modified Bat Algorithm (MBA) as an enhancement to address the local optima limitation observed in the original BA. MBA incorporates the frequency and velocity of the current best solution, enhancing convergence speed to the optimal solution and preventing local optima entrapment. While the original BA faces diversity issues, both the original BA and MBA are introduced. To assess MBAs performance, three sets of test functions (classical benchmark functions, CEC2005, and CEC2019) are employed, with results compared to those of the original BA, Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and Dragonfly Algorithm (DA). The outcomes demonstrate the MBAs significant superiority over other algorithms. Additionally, MBA successfully addresses a real-world assignment problem (call center problem), traditionally solved using linear programming methods, with satisfactory results.

[358] 2407.15320

Edge Graph Intelligence: Reciprocally Empowering Edge Networks with Graph Intelligence

Recent years have witnessed a thriving growth of computing facilities connected at the network edge, cultivating edge computing networks as a fundamental infrastructure for supporting miscellaneous intelligent services. Meanwhile, Artificial Intelligence frontiers have extrapolated Machine Learning to the graph domain and promoted Graph Intelligence (GI), which unlocks unprecedented ability in learning from massive data in graph structures. Given the inherent relation between graphs and networks, the interdiscipline of graph representation learning and edge networks, i.e., Edge GI or EGI, has revealed a novel interplay between them -- GI models principally open a new door for modeling, understanding, and optimizing edge networks, and conversely, edge networks serve as physical support for training, deploying, and accelerating GI models. Driven by this delicate closed-loop, EGI can be widely recognized as a promising solution to fully unleash the potential of edge computing power and is garnering significant attention. Nevertheless, research on EGI yet remains nascent, and there is a soaring demand within both the communications and AI communities for a dedicated venue to share recent advancements. To this end, this paper promotes the concept of EGI, explores its scope and core principles, and conducts a comprehensive survey concerning recent research efforts on this emerging field and specifically, introduces and discusses: 1) fundamentals of edge computing and graph representation learning, 2) emerging techniques centering on the closed loop between graph intelligence and edge networks, and 3) open challenges and research opportunities of future EGI. By bridging the gap across communication, networking, and graph learning areas, we believe that this survey can garner increased attention, foster meaningful discussions, and inspire further research ideas in EGI.

[359] 2407.15324

Cooperative Salvo Guidance over Leader-Follower Network with Free-Will Arbitrary Time Convergence

A cooperative salvo strategy is proposed in this paper which achieves consensus among the interceptors within a pre-defined arbitrary settling time. Considering non-linear engagement kinematics and a system lag to capture the effect of interceptor autopilot as present in realistic interception scenarios, the guidance schemes use the time-to-go estimates of the interceptors in order to achieve simultaneous interception of a stationary target at a pre-determined impact time. The guidance scheme ensures that consensus among the time-to-go estimates of the interceptors is achieved within a settling time whose upper bound can be pre-specified arbitrarily independent of the initial conditions or design parameters. The efficacy of the proposed guidance strategy is demonstrated using numerical simulations with varied conditions of initial position, velocities and heading angle errors of the interceptors as well as different desired impact times.

[360] 2407.15325

Odyssey: Empowering Agents with Open-World Skills

Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce ODYSSEY, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new open-world benchmark includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task. Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

[361] 2407.15326

Intelligence Preschool Education System based on Multimodal Interaction Systems and AI

Rapid progress in AI technologies has generated considerable interest in their potential to address challenges in every field and education is no exception. Improving learning outcomes and providing relevant education to all have been dominant themes universally, both in the developed and developing world. And they have taken on greater significance in the current era of technology driven personalization.

[362] 2407.15328

Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

Diffusion models, known for their tremendous ability to generate novel and high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent approaches for memory mitigation either only focused on the text modality problem in cross-modal generation tasks or utilized data augmentation strategies. In this paper, we propose a novel training framework for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. To facilitate ``forgetting'' of stored information in diffusion model parameters, we propose an iterative ensemble training strategy by splitting the data into multiple shards for training multiple models and intermittently aggregating these model parameters. Moreover, practical analysis of losses illustrates that the training loss for easily memorable images tends to be obviously lower. Thus, we propose an anti-gradient control method to exclude the sample with a lower loss value from the current mini-batch to avoid memorizing. Extensive experiments and analysis on \crnote{four} datasets are conducted to illustrate the effectiveness of our method, and results show that our method successfully reduces memory capacity while even improving the performance slightly. Moreover, to save the computing cost, we successfully apply our method to fine-tune the well-trained diffusion models by limited epochs, demonstrating the applicability of our method. Code is available in

[363] 2407.15330

A Methodology for Power Dispatch Based on Traction Station Clusters in the Flexible Traction Power Supply System

The flexible traction power supply system (FTPSS) eliminates the neutral zone but leads to increased complexity in power flow coordinated control and power mismatch. To address these challenges, the methodology for power dispatch (PD) based on traction station clusters (TSCs) in FTPSS is proposed, in which each TSC with a consistent structure performs independent local phase angle control. First, to simplify the PD problem of TSCs, the system is transformed into an equivalent model with constant topology, resulting in it can be solved by univariate numerical optimization with higher computational performance. Next, the calculation method of the feasible phase angle domain under strict and relaxed power circulation constraints are described, respectively, which ensures that power circulation can be either eliminated or precisely controlled. Finally, the PD method with three unique modes for uncertain train loads is introduced to enhance power flow flexibility: specified power distribution coefficients between traction substations (TSs), constant output power of TSs, and maximum consumption of renewable resources within TSs. In the experimental section, the performance of the TSC methodology for PD is verified through detailed train operation scenarios.

[364] 2407.15334

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which includes modal interaction and specialty enhancement. Finally, an adaptive learning technique that merges the semantics and geometry information for dynamical instance optimization. Extensive experiments in the nuScenes dataset present competitive performance with state-of-the-art approaches. Our code will be released in the future.

[365] 2407.15337

ThermalNeRF: Thermal Radiance Fields

Thermal imaging has a variety of applications, from agricultural monitoring to building inspection to imaging under poor visibility, such as in low light, fog, and rain. However, reconstructing thermal scenes in 3D presents several challenges due to the comparatively lower resolution and limited features present in long-wave infrared (LWIR) images. To overcome these challenges, we propose a unified framework for scene reconstruction from a set of LWIR and RGB images, using a multispectral radiance field to represent a scene viewed by both visible and infrared cameras, thus leveraging information across both spectra. We calibrate the RGB and infrared cameras with respect to each other, as a preprocessing step using a simple calibration target. We demonstrate our method on real-world sets of RGB and LWIR photographs captured from a handheld thermal camera, showing the effectiveness of our method at scene representation across the visible and infrared spectra. We show that our method is capable of thermal super-resolution, as well as visually removing obstacles to reveal objects that are occluded in either the RGB or thermal channels. Please see for video results as well as our code and dataset release.

[366] 2407.15341

ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning

The DimABSA task requires fine-grained sentiment intensity prediction for restaurant reviews, including scores for Valence and Arousal dimensions for each Aspect Term. In this study, we propose a Coarse-to-Fine In-context Learning(CFICL) method based on the Baichuan2-7B model for the DimABSA task in the SIGHAN 2024 workshop. Our method improves prediction accuracy through a two-stage optimization process. In the first stage, we use fixed in-context examples and prompt templates to enhance the model's sentiment recognition capability and provide initial predictions for the test data. In the second stage, we encode the Opinion field using BERT and select the most similar training data as new in-context examples based on similarity. These examples include the Opinion field and its scores, as well as related opinion words and their average scores. By filtering for sentiment polarity, we ensure that the examples are consistent with the test data. Our method significantly improves prediction accuracy and consistency by effectively utilizing training data and optimizing in-context examples, as validated by experimental results.

[367] 2407.15343

Improving Minimum Bayes Risk Decoding with Multi-Prompt

While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single "best" prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks, and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.

[368] 2407.15346

Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and external knowledge base with the original complex question, then generate answers with Large Language Models (LLMs). However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the ``forward-only'' answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with the above limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses LLM's feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and decompose the original complex question into two simple sub-questions: Image-based sub-question and Knowledge-based sub-question. Then we use the two sub-questions to retrieve knowledge from the image and knowledge base, respectively. In this way, two knowledge acquisition models can focus on the content that corresponds to them and avoid disturbance of irrelevant elements in the original complex question, which can help to provide more precise knowledge and better align the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at \url{}.

[369] 2407.15349

RoadPainter: Points Are Ideal Navigators for Topology transformER

Topology reasoning aims to provide a precise understanding of road scenes, enabling autonomous systems to identify safe and efficient routes. In this paper, we present RoadPainter, an innovative approach for detecting and reasoning the topology of lane centerlines using multi-view images. The core concept behind RoadPainter is to extract a set of points from each centerline mask to improve the accuracy of centerline prediction. We start by implementing a transformer decoder that integrates a hybrid attention mechanism and a real-virtual separation strategy to predict coarse lane centerlines and establish topological associations. Then, we generate centerline instance masks guided by the centerline points from the transformer decoder. Moreover, we derive an additional set of points from each mask and combine them with previously detected centerline points for further refinement. Additionally, we introduce an optional module that incorporates a Standard Definition (SD) map to further optimize centerline detection and enhance topological reasoning performance. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of RoadPainter.

[370] 2407.15350

WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also pro-vide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development.

[371] 2407.15351

LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation

Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem.

[372] 2407.15352

MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

Event Factuality Detection (EFD) task determines the factuality of textual events, i.e., classifying whether an event is a fact, possibility, or impossibility, which is essential for faithfully understanding and utilizing event knowledge. However, due to the lack of high-quality large-scale data, event factuality detection is under-explored in event understanding research, which limits the development of EFD community. To address these issues and provide faithful event understanding, we introduce MAVEN-Fact, a large-scale and high-quality EFD dataset based on the MAVEN dataset. MAVEN-Fact includes factuality annotations of 112,276 events, making it the largest EFD dataset. Extensive experiments demonstrate that MAVEN-Fact is challenging for both conventional fine-tuned models and large language models (LLMs). Thanks to the comprehensive annotations of event arguments and relations in MAVEN, MAVEN-Fact also supports some further analyses and we find that adopting event arguments and relations helps in event factuality detection for fine-tuned models but does not benefit LLMs. Furthermore, we preliminarily study an application case of event factuality detection and find it helps in mitigating event-related hallucination in LLMs. Our dataset and codes can be obtained from \url{}

[373] 2407.15353

Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA

Retrieval augmented generation (RAG) enhances the accuracy and reliability of generative AI models by sourcing factual information from external databases, which is extensively employed in document-grounded question-answering (QA) tasks. Off-the-shelf RAG flows are well pretrained on general-purpose documents, yet they encounter significant challenges when being applied to knowledge-intensive vertical domains, such as electronic design automation (EDA). This paper addresses such issue by proposing a customized RAG framework along with three domain-specific techniques for EDA tool documentation QA, including a contrastive learning scheme for text embedding model fine-tuning, a reranker distilled from proprietary LLM, and a generative LLM fine-tuned with high-quality domain corpus. Furthermore, we have developed and released a documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced RTL-to-GDSII design platform. Experimental results demonstrate that our proposed RAG flow and techniques have achieved superior performance on ORD-QA as well as on a commercial tool, compared with state-of-the-arts. The ORD-QA benchmark and the training dataset for our customized RAG flow are open-source at

[374] 2407.15354

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement.

[375] 2407.15355

Attention Beats Linear for Fast Implicit Neural Representation Generation

Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which exhibit lower efficiency in inference, the adoption of hyper-network for generating parameters in Multi-Layer Perceptrons (MLP), responsible for executing INR functions, has surfaced as a promising and efficient alternative. However, as a global continuous function, MLP is challenging in modeling highly discontinuous signals, resulting in slow convergence during the training phase and inaccurate reconstruction performance. Moreover, MLP requires massive representation parameters, which implies inefficiencies in data representation. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and a global MLP that integrates coordinate features with data features and converts them to meaningful outputs. Subsequently, we design an instance representation framework that delivers a transformer-like hyper-network to represent data instances as a compact representation vector. With instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as a continuous function. We further address aliasing artifacts with variational coordinates when obtaining the super-resolution inference results. Extensive experimentation across four datasets showcases the notable efficacy of our ANR method, e.g. enhancing the PSNR value from 37.95dB to 47.25dB on the CelebA dataset. Code is released at

[376] 2407.15356

X-Recon: Learning-based Patient-specific High-Resolution CT Reconstruction from Orthogonal X-Ray Images

Rapid and accurate diagnosis of pneumothorax, utilizing chest X-ray and computed tomography (CT), is crucial for assisted diagnosis. Chest X-ray is commonly used for initial localization of pneumothorax, while CT ensures accurate quantification. However, CT scans involve high radiation doses and can be costly. To achieve precise quantitative diagnosis while minimizing radiation exposure, we proposed X-Recon, a CT ultra-sparse reconstruction network based on ortho-lateral chest X-ray images. X-Recon integrates generative adversarial networks (GANs), including a generator with a multi-scale fusion rendering module and a discriminator enhanced by 3D coordinate convolutional layers, designed to facilitate CT reconstruction. To improve precision, a projective spatial transformer is utilized to incorporate multi-angle projection loss. Additionally, we proposed PTX-Seg, a zero-shot pneumothorax segmentation algorithm, combining image processing techniques with deep-learning models for the segmentation of air-accumulated regions and lung structures. Experiments on a large-scale dataset demonstrate its superiority over existing approaches. X-Recon achieved a significantly higher reconstruction resolution with a higher average spatial resolution and a lower average slice thickness. The reconstruction metrics achieved state-of-the-art performance in terms of several metrics including peak signal-to-noise ratio. The zero-shot segmentation algorithm, PTX-Seg, also demonstrated high segmentation precision for the air-accumulated region, the left lung, and the right lung. Moreover, the consistency analysis for the pneumothorax chest occupancy ratio between reconstructed CT and original CT obtained a high correlation coefficient. Code will be available at:

[377] 2407.15359

UF-HOBI at "Discharge Me!": A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models

Automatic generation of discharge summaries presents significant challenges due to the length of clinical documentation, the dispersed nature of patient information, and the diverse terminology used in healthcare. This paper presents a hybrid solution for generating discharge summary sections as part of our participation in the "Discharge Me!" Challenge at the BioNLP 2024 Shared Task. We developed a two-stage generation method using both extractive and abstractive techniques, in which we first apply name entity recognition (NER) to extract key clinical concepts, which are then used as input for a prompt-tuning-based GatorTronGPT model to generate coherent text for two important sections including "Brief Hospital Course" and "Discharge Instructions". Our system was ranked 5th in this challenge, achieving an overall score of 0.284. The results demonstrate the effectiveness of our hybrid solution in improving the quality of automated discharge section generation.

[378] 2407.15360

Dissecting Multiplication in Transformers: Insights into LLMs

Transformer-based large language models have achieved remarkable performance across various natural language processing tasks. However, they often struggle with seemingly easy tasks like arithmetic despite their vast capabilities. This stark disparity raise human's concerns about their safe and ethical use, hinder their widespread adoption.In this paper, we focus on a typical arithmetic task, integer multiplication, to explore and explain the imperfection of transformers in this domain. We provide comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication. Our observations indicate that the model decomposes multiplication task into multiple parallel subtasks, sequentially optimizing each subtask for each digit to complete the final multiplication. Based on observation and analysis, we infer the reasons of transformers deficiencies in multiplication tasks lies in their difficulty in calculating successive carryovers and caching intermediate results, and confirmed this inference through experiments. Guided by these findings, we propose improvements to enhance transformers performance on multiplication tasks. These enhancements are validated through rigorous testing and mathematical modeling, not only enhance transformer's interpretability, but also improve its performance, e.g., we achieve over 99.9% accuracy on 5-digit integer multiplication with a tiny transformer, outperform LLMs GPT-4. Our method contributes to the broader fields of model understanding and interpretability, paving the way for analyzing more complex tasks and Transformer models. This work underscores the importance of explainable AI, helping to build trust in large language models and promoting their adoption in critical applications.

[379] 2407.15362

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

Remarkable strides in computational pathology have been made in the task-agnostic foundation model that advances the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or vision-captions data, disregarding invaluable pathology reports and gene expression profiles which respectively offer distinct knowledge for versatile clinical applications. Second, the current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H\&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm which injects multimodal knowledge at the whole-slide context into the pathology FM, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the workflow of pretraining for CPath, which enables the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch-level to slide-level. To systematically evaluate the capabilities of mSTAR, extensive experiments including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. The average performance in various slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared to SOTA FMs.

[380] 2407.15363

Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRAD -- Extended Version

Modern organizations manage their data with a wide variety of specialized cloud database engines (e.g., Aurora, BigQuery, etc.). However, designing and managing such infrastructures is hard. Developers must consider many possible designs with non-obvious performance consequences; moreover, current software abstractions tightly couple applications to specific systems (e.g., with engine-specific clients), making it difficult to change after initial deployment. A better solution would virtualize cloud data management, allowing developers to declaratively specify their workload requirements and rely on automated solutions to design and manage the physical realization. In this paper, we present a technique called blueprint planning that achieves this vision. The key idea is to project data infrastructure design decisions into a unified design space (blueprints). We then systematically search over candidate blueprints using cost-based optimization, leveraging learned models to predict the utility of a blueprint on the workload. We use this technique to build BRAD, the first cloud data virtualization system. BRAD users issue queries to a single SQL interface that can be backed by multiple cloud database services. BRAD automatically selects the most suitable engine for each query, provisions and manages resources to minimize costs, and evolves the infrastructure to adapt to workload shifts. Our evaluation shows that BRAD meet user-defined performance targets and improve cost-savings by 1.6-13x compared to serverless auto-scaling or HTAP systems.

[381] 2407.15364

Hybrid PDE-Deep Neural Network Model for Calcium Dynamics in Neurons

Traditionally, calcium dynamics in neurons are modeled using partial differential equations (PDEs) and ordinary differential equations (ODEs). The PDE component focuses on reaction-diffusion processes, while the ODE component addresses transmission via ion channels on the cell's or organelle's membrane. However, analytically determining the underlying equations for ion channels is highly challenging due to the complexity and unknown factors inherent in biological processes. Therefore, we employ deep neural networks (DNNs) to model the open probability of ion channels, a task that can be intricate when approached with ODEs. This technique also reduces the number of unknowns required to model the open probability. When trained with valid data, the same neural network architecture can be used for different ion channels, such as sodium, potassium, and calcium. Furthermore, based on the given data, we can build more physiologically reasonable DNN models that can be customized. Subsequently, we integrated the DNN model into calcium dynamics in neurons with endoplasmic reticulum, resulting in a hybrid model that combines PDEs and DNNs. Numerical results are provided to demonstrate the flexibility and advantages of the PDE-DNN model.

[382] 2407.15365

Pseudo-Energy-Preserving Explicit Runge-Kutta Methods

Using a recent characterization of energy-preserving B-series, we derive the explicit conditions on the coefficients of a Runge-Kutta method that ensure energy preservation (for Hamiltonian systems) up to a given order in the step size, which we refer to as the pseudo-energy-preserving (PEP) order. We study explicit Runge-Kutta methods with PEP order higher than their classical order. We provide examples of such methods up to PEP order six, and test them on Hamiltonian ODE and PDE systems. We find that these methods behave similarly to exactly energy-conservative methods over moderate time intervals and exhibit significantly smaller errors, relative to other Runge-Kutta methods of the same order, for moderately long-time simulations.

[383] 2407.15366

Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias

The common toxicity and societal bias in contents generated by large language models (LLMs) necessitate strategies to reduce harm. Present solutions often demand white-box access to the model or substantial training, which is impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting methods depend on external tool feedback and fail to simultaneously lessen toxicity and bias. Motivated by social psychology principles, we propose a novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that inspires LLMs to integrate diverse human perspectives and self-regulate their responses. This self-correction mechanism can significantly diminish toxicity (up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less harmful responses, outperforming five strong baselines.

[384] 2407.15369

Sparse Prior Is Not All You Need: When Differential Directionality Meets Saliency Coherence for Infrared Small Target Detection

Infrared small target detection is crucial for the efficacy of infrared search and tracking systems. Current tensor decomposition methods emphasize representing small targets with sparsity but struggle to separate targets from complex backgrounds due to insufficient use of intrinsic directional information and reduced target visibility during decomposition. To address these challenges, this study introduces a Sparse Differential Directionality prior (SDD) framework. SDD leverages the distinct directional characteristics of targets to differentiate them from the background, applying mixed sparse constraints on the differential directional images and continuity difference matrix of the temporal component, both derived from Tucker decomposition. We further enhance target detectability with a saliency coherence strategy that intensifies target contrast against the background during hierarchical decomposition. A Proximal Alternating Minimization-based (PAM) algorithm efficiently solves our proposed model. Experimental results on several real-world datasets validate our method's effectiveness, outperforming ten state-of-the-art methods in target detection and clutter suppression. Our code is available at

[385] 2407.15370

A Network Analysis Approach to Conlang Research Literature

The field of conlang has evidenced an important growth in the last decades. This has been the product of a wide interest in the use and study of conlangs for artistic purposes. However, one important question is what it is happening with conlang in the academic world. This paper aims to have an overall understanding of the literature on conlang research. With this we aim to give a realistic picture of the field in present days. We have implemented a computational linguistic approach, combining bibliometrics and network analysis to examine all publications available in the Scopus database. Analysing over 2300 academic publications since 1927 until 2022, we have found that Esperanto is by far the most documented conlang. Three main authors have contributed to this: Garv\'ia R., Fiedler S., and Blanke D. The 1970s and 1980s have been the decades where the foundations of current research have been built. In terms of methodologies, language learning and experimental linguistics are the ones contributing to most to the preferred approaches of study in the field. We present the results and discuss our limitations and future work.

[386] 2407.15373

avaTTAR: Table Tennis Stroke Training with On-body and Detached Visualization in Augmented Reality

Table tennis stroke training is a critical aspect of player development. We designed a new augmented reality (AR) system avaTTAR for table tennis stroke training. The system provides both "on-body" (first-person view) and "detached" (third-person view) visual cues, enabling users to visualize target strokes and correct their attempts effectively with this dual perspectives setup. By employing a combination of pose estimation algorithms and IMU sensors, avaTTAR captures and reconstructs the 3D body pose and paddle orientation of users during practice, allowing real-time comparison with expert strokes. Through a user study, we affirm avaTTAR's capacity to amplify player experience and training results.

[387] 2407.15374

ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

Social Media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from any part in the world, and coming from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics. There is currently a wide range of projects which have been able to successfully create corpora from social media. In this paper, we present the development and deployment of a linguistic corpus from Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n- grams. The information is presented through a range of powerful visualisations for users to explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.

[388] 2407.15375

The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA)

Pronunciation dictionaries are an important component in the process of speech forced alignment. The accuracy of these dictionaries has a strong effect on the aligned speech data since they help the mapping between orthographic transcriptions and acoustic signals. In this paper, I present the creation of a comprehensive pronunciation dictionary in Spanish (ESPADA) that can be used in most of the dialect variants of Spanish data. Current dictionaries focus on specific regional variants, but with the flexible nature of our tool, it can be readily applied to capture the most common phonetic differences across major dialectal variants. We propose improvements to current pronunciation dictionaries as well as mapping other relevant annotations such as morphological and lexical information. In terms of size, it is currently the most complete dictionary with more than 628,000 entries, representing words from 16 countries. All entries come with their corresponding pronunciations, morphological and lexical tagging, and other relevant information for phonetic analysis: stress patterns, phonotactics, IPA transcriptions, and more. This aims to equip socio-phonetic researchers with a complete open-source tool that enhances dialectal research within socio-phonetic frameworks in the Spanish language.

[389] 2407.15376

Structure-Aware Residual-Center Representation for Self-Supervised Open-Set 3D Cross-Modal Retrieval

Existing methods of 3D cross-modal retrieval heavily lean on category distribution priors within the training set, which diminishes their efficacy when tasked with unseen categories under open-set environments. To tackle this problem, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for self-supervised open-set 3D cross-modal retrieval. To address the center deviation due to category distribution differences, we utilize the Residual-Center Embedding (RCE) for each object by nested auto-encoders, rather than directly mapping them to the modality or category centers. Besides, we perform the Hierarchical Structure Learning (HSL) approach to leverage the high-order correlations among objects for generalization, by constructing a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations. Extensive experiments and ablation studies on four benchmarks demonstrate the superiority of our proposed framework compared to state-of-the-art methods.

[390] 2407.15383

Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

This paper aims to adapt the source model to the target environment, leveraging small user feedback (i.e., labeled target data) readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often suffer from poorly improved adaptation performance when directly utilizing such feedback data, as shown in Figure 1. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage this feedback while avoiding the issue, we propose a scalable adapting approach, Retrieval Latent Defending. This approach helps existing SemiSDA methods to adapt the model with a balanced supervised signal by utilizing latent defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments reveal that integrating our approach with multiple state-of-the-art SemiSDA methods leads to significant performance improvements.

[391] 2407.15385

Towards Robust Vision Transformer via Masked Adaptive Ensemble

Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a classifier with two encoders is employed for extracting visual representations respectively from clean images and adversarial examples, with our adaptive ensemble to adaptively adjust the proportion of visual representations from the two encoders for accurate classification. This design enables our ViT architecture to achieve a better trade-off between standard accuracy and robustness. Besides, our adaptive ensemble technique allows us to mask off a random subset of image patches within input data, boosting our ViT's robustness against adaptive attacks, while maintaining high standard accuracy. Experimental results exhibit that our ViT architecture, on CIFAR-10, achieves the best standard accuracy and adversarial robustness of 90.3% and 49.8%, respectively.

[392] 2407.15389

Poisoning with A Pill: Circumventing Detection in Federated Learning

Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training, named as pill construction, pill poisoning, and pill injection accordingly. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, and can gain an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.

[393] 2407.15390

ALLaM: Large Language Models for Arabic and English

We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.

[394] 2407.15396

Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation

The scene graph generation (SGG) task involves detecting objects within an image and predicting predicates that represent the relationships between the objects. However, in SGG benchmark datasets, each subject-object pair is annotated with a single predicate even though a single predicate may exhibit diverse semantics (i.e., semantic diversity), existing SGG models are trained to predict the one and only predicate for each pair. This in turn results in the SGG models to overlook the semantic diversity that may exist in a predicate, thus leading to biased predictions. In this paper, we propose a novel model-agnostic Semantic Diversity-aware Prototype-based Learning (DPL) framework that enables unbiased predictions based on the understanding of the semantic diversity of predicates. Specifically, DPL learns the regions in the semantic space covered by each predicate to distinguish among the various different semantics that a single predicate can represent. Extensive experiments demonstrate that our proposed model-agnostic DPL framework brings significant performance improvement on existing SGG models, and also effectively understands the semantic diversity of predicates.

[395] 2407.15399

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?

[396] 2407.15402

Tackling Selfish Clients in Federated Learning

Federated Learning (FL) is a distributed machine learning paradigm facilitating participants to collaboratively train a model without revealing their local data. However, when FL is deployed into the wild, some intelligent clients can deliberately deviate from the standard training process to make the global model inclined toward their local model, thereby prioritizing their local data distribution. We refer to this novel category of misbehaving clients as selfish. In this paper, we propose a Robust aggregation strategy for FL server to mitigate the effect of Selfishness (in short RFL-Self). RFL-Self incorporates an innovative method to recover (or estimate) the true updates of selfish clients from the received ones, leveraging robust statistics (median of norms) of the updates at every round. By including the recovered updates in aggregation, our strategy offers strong robustness against selfishness. Our experimental results, obtained on MNIST and CIFAR-10 datasets, demonstrate that just 2% of clients behaving selfishly can decrease the accuracy by up to 36%, and RFL-Self can mitigate that effect without degrading the global model performance.

[397] 2407.15403

Offline Imitation Learning Through Graph Search and Retrieval

Imitation learning is a powerful machine learning algorithm for a robot to acquire manipulation skills. Nevertheless, many real-world manipulation tasks involve precise and dexterous robot-object interactions, which make it difficult for humans to collect high-quality expert demonstrations. As a result, a robot has to learn skills from suboptimal demonstrations and unstructured interactions, which remains a key challenge. Existing works typically use offline deep reinforcement learning (RL) to solve this challenge, but in practice these algorithms are unstable and fragile due to the deadly triad issue. To overcome this problem, we propose GSR, a simple yet effective algorithm that learns from suboptimal demonstrations through Graph Search and Retrieval. We first use pretrained representation to organize the interaction experience into a graph and perform a graph search to calculate the values of different behaviors. Then, we apply a retrieval-based procedure to identify the best behavior (actions) on each state and use behavior cloning to learn that behavior. We evaluate our method in both simulation and real-world robotic manipulation tasks with complex visual inputs, covering various precise and dexterous manipulation skills with objects of different physical properties. GSR can achieve a 10% to 30% higher success rate and over 30% higher proficiency compared to baselines. Our project page is at

[398] 2407.15406

Automated Road Safety: Enhancing Sign and Surface Damage Detection with AI

Public transportation plays a crucial role in our lives, and the road network is a vital component in the implementation of smart cities. Recent advancements in AI have enabled the development of advanced monitoring systems capable of detecting anomalies in road surfaces and road signs, which, if unaddressed, can lead to serious road accidents. This paper presents an innovative approach to enhance road safety through the detection and classification of traffic signs and road surface damage using advanced deep learning techniques. This integrated approach supports proactive maintenance strategies, improving road safety and resource allocation for the Molise region and the city of Campobasso. The resulting system, developed as part of the Casa delle Tecnologie Emergenti (House of Emergent Technologies) Molise (Molise CTE) research project funded by the Italian Minister of Economic Growth (MIMIT), leverages cutting-edge technologies such as Cloud Computing and High Performance Computing with GPU utilization. It serves as a valuable tool for municipalities, enabling quick detection of anomalies and the prompt organization of maintenance operations

[399] 2407.15407

A Solution toward Transparent and Practical AI Regulation: Privacy Nutrition Labels for Open-source Generative AI-based Applications

The rapid development and widespread adoption of Generative Artificial Intelligence-based (GAI) applications have greatly enriched our daily lives, benefiting people by enhancing creativity, personalizing experiences, improving accessibility, and fostering innovation and efficiency across various domains. However, along with the development of GAI applications, concerns have been raised about transparency in their privacy practices. Traditional privacy policies often fail to effectively communicate essential privacy information due to their complexity and length, and open-source community developers often neglect privacy practices even more. Only 12.2% of examined open-source GAI apps provide a privacy policy. To address this, we propose a regulation-driven GAI Privacy Label and introduce Repo2Label, a novel framework for automatically generating these labels based on code repositories. Our user study indicates a common endorsement of the proposed GAI privacy label format. Additionally, Repo2Label achieves a precision of 0.81, recall of 0.88, and F1-score of 0.84 based on the benchmark dataset, significantly outperforming the developer self-declared privacy notices. We also discuss the common regulatory (in)compliance of open-source GAI apps, comparison with other privacy notices, and broader impacts to different stakeholders. Our findings suggest that Repo2Label could serve as a significant tool for bolstering the privacy transparency of GAI apps and make them more practical and responsible.

[400] 2407.15408

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider temporal elements in motion-language alignment.

[401] 2407.15411

Scalable Dynamic Embedding Size Search for Streaming Recommendation

Recommender systems typically represent users and items by learning their embeddings, which are usually set to uniform dimensions and dominate the model parameters. However, real-world recommender systems often operate in streaming recommendation scenarios, where the number of users and items continues to grow, leading to substantial storage resource consumption for these embeddings. Although a few methods attempt to mitigate this by employing embedding size search strategies to assign different embedding dimensions in streaming recommendations, they assume that the embedding size grows with the frequency of users/items, which eventually still exceeds the predefined memory budget over time. To address this issue, this paper proposes to learn Scalable Lightweight Embeddings for streaming recommendation, called SCALL, which can adaptively adjust the embedding sizes of users/items within a given memory budget over time. Specifically, we propose to sample embedding sizes from a probabilistic distribution, with the guarantee to meet any predefined memory budget. By fixing the memory budget, the proposed embedding size sampling strategy can increase and decrease the embedding sizes in accordance to the frequency of the corresponding users or items. Furthermore, we develop a reinforcement learning-based search paradigm that models each state with mean pooling to keep the length of the state vectors fixed, invariant to the changing number of users and items. As a result, the proposed method can provide embedding sizes to unseen users and items. Comprehensive empirical evaluations on two public datasets affirm the advantageous effectiveness of our proposed method.

[402] 2407.15414

Weights Shuffling for Improving DPSGD in Transformer-based Models

Differential Privacy (DP) mechanisms, especially in high-dimensional settings, often face the challenge of maintaining privacy without compromising the data utility. This work introduces an innovative shuffling mechanism in Differentially-Private Stochastic Gradient Descent (DPSGD) to enhance the utility of large models at the same privacy guarantee of the unshuffled case. Specifically, we reveal that random shuffling brings additional randomness to the trajectory of gradient descent while not impacting the model accuracy by the permutation invariance property -- the model can be equivalently computed in both forward and backward propagations under permutation. We show that permutation indeed improves the privacy guarantee of DPSGD in theory, but tracking the exact privacy loss on shuffled model is particularly challenging. Hence we exploit the approximation on sum of lognormal distributions to derive the condition for the shuffled DPSGD to meet the DP guarantee. Auditing results show that our condition offers a DP guarantee quite close to the audited privacy level, demonstrating our approach an effective estimation in practice. Experimental results have verified our theoretical derivation and illustrate that our mechanism improves the accuracy of DPSGD over the state-of-the-art baselines on a variety of models and tasks.

[403] 2407.15415

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

We introduces LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation(E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code and models in

[404] 2407.15417

Hemispheroidal parameterization and harmonic decomposition of simply connected open surfaces

Spectral analysis of open surfaces is gaining momentum for studying surface morphology in engineering, computer graphics, and medical domains. This analysis is enabled using proper parameterization approaches on the target analysis domain. In this paper, we propose the usage of customizable parameterization coordinates that allow mapping open surfaces into oblate or prolate hemispheroidal surfaces. For this, we proposed the usage of Tutte, conformal, area-preserving, and balanced mappings for parameterizing any given simply connected open surface onto an optimal hemispheroid. The hemispheroidal harmonic bases were introduced to spectrally expand these parametric surfaces by generalizing the known hemispherical ones. This approach uses the radius of the hemispheroid as a degree of freedom to control the size of the parameterization domain of the open surfaces while providing numerically stable basis functions. Several open surfaces have been tested using different mapping combinations. We also propose optimization-based mappings to serve various applications on the reconstruction problem. Altogether, our work provides an effective way to represent and analyze simply connected open surfaces.

[405] 2407.15420

Local All-Pair Correspondence for Point Tracking

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.

[406] 2407.15421

Planning behavior in a recurrent neural network that plays Sokoban

To predict how advanced neural networks generalize to novel situations, it is essential to understand how they reason. Guez et al. (2019, "An investigation of model-free planning") trained a recurrent neural network (RNN) to play Sokoban with model-free reinforcement learning. They found that adding extra computation steps to the start of episodes at test time improves the RNN's success rate. We further investigate this phenomenon, finding that it rapidly emerges early on in training and then slowly fades, but only for comparatively easier levels. The RNN also often takes redundant actions at episode starts, and these are reduced by adding extra computation steps. Our results suggest that the RNN learns to take time to think by `pacing', despite the per-step penalties, indicating that training incentivizes planning capabilities. The small size (1.29M parameters) and interesting behavior of this model make it an excellent model organism for mechanistic interpretability.

[407] 2407.15424

Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attention

With the widespread deployment of video surveillance devices and the demand for intelligent system development, video anomaly detection (VAD) has become an important part of constructing intelligent surveillance systems. Expanding the discriminative boundary between normal and abnormal events to enhance performance is the common goal and challenge of VAD. To address this problem, we propose a Bidirectional Skip-frame Prediction (BiSP) network based on a dual-stream autoencoder, from the perspective of learning the intra-domain disparity between different features. The BiSP skips frames in the training phase to achieve the forward and backward frame prediction respectively, and in the testing phase, it utilizes bidirectional consecutive frames to co-predict the same intermediate frames, thus expanding the degree of disparity between normal and abnormal events. The BiSP designs the variance channel attention and context spatial attention from the perspectives of movement patterns and object scales, respectively, thus ensuring the maximization of the disparity between normal and abnormal in the feature extraction and delivery with different dimensions. Extensive experiments from four benchmark datasets demonstrate the effectiveness of the proposed BiSP, which substantially outperforms state-of-the-art competing methods.

[408] 2407.15425

Empirical Capacity Model for Self-Attention Neural Networks

Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.

[409] 2407.15426

Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training

Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving training approaches such as federated learning (FL). However, compared to conventional unimodal learning, multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models that demand significant resources. This presents a substantial challenge for FL clients operating with limited computational resources and communication bandwidth. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach, which decomposes the training process into multiple steps. Each step focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL scenarios and multimodal learning setups to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also introduce a progressive training approach called Prog-FedMML. While it offers lesser resource efficiency than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.

[410] 2407.15427

YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images

With the rapid growth of the PCB manufacturing industry, there is an increasing demand for computer vision inspection to detect defects during production. Improving the accuracy and generalization of PCB defect detection models remains a significant challenge. This paper proposes a high-precision, robust, and real-time end-to-end method for PCB defect detection based on deep Convolutional Neural Networks (CNN). Traditional methods often suffer from low accuracy and limited applicability. We propose a novel approach combining YOLOv5 and multiscale modules for hierarchical residual-like connections. In PCB defect detection, noise can confuse the background and small targets. The YOLOv5 model provides a strong foundation with its real-time processing and accurate object detection capabilities. The multi-scale module extends traditional approaches by incorporating hierarchical residual-like connections within a single block, enabling multiscale feature extraction. This plug-and-play module significantly enhances performance by extracting features at multiple scales and levels, which are useful for identifying defects of varying sizes and complexities. Our multi-scale architecture integrates feature extraction, defect localization, and classification into a unified network. Experiments on a large-scale PCB dataset demonstrate significant improvements in precision, recall, and F1-score compared to existing methods. This work advances computer vision inspection for PCB defect detection, providing a reliable solution for high-precision, robust, real-time, and domain-adaptive defect detection in the PCB manufacturing industry.

[411] 2407.15428

Decoding BACnet Packets: A Large Language Model Approach for Packet Interpretation

The Industrial Control System (ICS) environment encompasses a wide range of intricate communication protocols, posing substantial challenges for Security Operations Center (SOC) analysts tasked with monitoring, interpreting, and addressing network activities and security incidents. Conventional monitoring tools and techniques often struggle to provide a clear understanding of the nature and intent of ICS-specific communications. To enhance comprehension, we propose a software solution powered by a Large Language Model (LLM). This solution currently focused on BACnet protocol, processes a packet file data and extracts context by using a mapping database, and contemporary context retrieval methods for Retrieval Augmented Generation (RAG). The processed packet information, combined with the extracted context, serves as input to the LLM, which generates a concise packet file summary for the user. The software delivers a clear, coherent, and easily understandable summary of network activities, enabling SOC analysts to better assess the current state of the control system.

[412] 2407.15429

Learning at a Glance: Towards Interpretable Data-limited Continual Semantic Segmentation via Semantic-Invariance Modelling

Continual semantic segmentation (CSS) based on incremental learning (IL) is a great endeavour in developing human-like segmentation models. However, current CSS approaches encounter challenges in the trade-off between preserving old knowledge and learning new ones, where they still need large-scale annotated data for incremental training and lack interpretability. In this paper, we present Learning at a Glance (LAG), an efficient, robust, human-like and interpretable approach for CSS. Specifically, LAG is a simple and model-agnostic architecture, yet it achieves competitive CSS efficiency with limited incremental data. Inspired by human-like recognition patterns, we propose a semantic-invariance modelling approach via semantic features decoupling that simultaneously reconciles solid knowledge inheritance and new-term learning. Concretely, the proposed decoupling manner includes two ways, i.e., channel-wise decoupling and spatial-level neuron-relevant semantic consistency. Our approach preserves semantic-invariant knowledge as solid prototypes to alleviate catastrophic forgetting, while also constraining sample-specific contents through an asymmetric contrastive learning method to enhance model robustness during IL steps. Experimental results in multiple datasets validate the effectiveness of the proposed method. Furthermore, we introduce a novel CSS protocol that better reflects realistic data-limited CSS settings, and LAG achieves superior performance under multiple data-limited conditions.

[413] 2407.15431

Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs

The text-attributed graph (TAG) is one kind of important real-world graph-structured data with each node associated with raw texts. For TAGs, traditional few-shot node classification methods directly conduct training on the pre-processed node features and do not consider the raw texts. The performance is highly dependent on the choice of the feature pre-processing method. In this paper, we propose P2TAG, a framework designed for few-shot node classification on TAGs with graph pre-training and prompting. P2TAG first pre-trains the language model (LM) and graph neural network (GNN) on TAGs with self-supervised loss. To fully utilize the ability of language models, we adapt the masked language modeling objective for our framework. The pre-trained model is then used for the few-shot node classification with a mixed prompt method, which simultaneously considers both text and graph information. We conduct experiments on six real-world TAGs, including paper citation networks and product co-purchasing networks. Experimental results demonstrate that our proposed framework outperforms existing graph few-shot learning methods on these datasets with +18.98% ~ +35.98% improvements.

[414] 2407.15435

Enhancement of 3D Gaussian Splatting using Raw Mesh for Photorealistic Recreation of Architectures

The photorealistic reconstruction and rendering of architectural scenes have extensive applications in industries such as film, games, and transportation. It also plays an important role in urban planning, architectural design, and the city's promotion, especially in protecting historical and cultural relics. The 3D Gaussian Splatting, due to better performance over NeRF, has become a mainstream technology in 3D reconstruction. Its only input is a set of images but it relies heavily on geometric parameters computed by the SfM process. At the same time, there is an existing abundance of raw 3D models, that could inform the structural perception of certain buildings but cannot be applied. In this paper, we propose a straightforward method to harness these raw 3D models to guide 3D Gaussians in capturing the basic shape of the building and improve the visual quality of textures and details when photos are captured non-systematically. This exploration opens up new possibilities for improving the effectiveness of 3D reconstruction techniques in the field of architectural design.

[415] 2407.15439

Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing, and online advertising, where immediate feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays where the feedback delays are independent of the rewards, and reward-dependent delays where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.

[416] 2407.15440

The Bicameral Cache: a split cache for vector architectures

The Bicameral Cache is a cache organization proposal for a vector architecture that segregates data according to their access type, distinguishing scalar from vector references. Its aim is to avoid both types of references from interfering in each other's data locality, with a special focus on prioritizing the performance on vector references. The proposed system incorporates an additional, non-polluting prefetching mechanism to help populate the long vector cache lines in advance to increase the hit rate by further exploiting the spatial locality on vector data. Its evaluation was conducted on the Cavatools simulator, comparing the performance to a standard conventional cache, over different typical vector benchmarks for several vector lengths. The results proved the proposed cache speeds up performance on stride-1 vector benchmarks, while hardly impacting non-stride-1's. In addition, the prefetching feature consistently provided an additional value.

[417] 2407.15441

Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned

Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, our team has crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.

[418] 2407.15442

Preliminary approaches towards the integration of TSN communications into the NFV architectural framework

This paper presents a preliminary architecture for the integration of Time-Sensitive Networking (TSN) communications into the Network Functions Virtualization (NFV) architectural framework. Synergies between functional blocks and constructs of NFV, and components of TSN networks, are investigated in order to arrive at an integrated architecture. Additionally, mechanisms and configuration procedures to enable TSN-compliant, real-time, and virtualized end stations under the NFV framework are explored.

[419] 2407.15446

Text2Place: Affordance-aware Text Guided Human Placement

For a given scene, humans can easily reason for the locations and pose to place objects. Designing a computational model to reason about these affordances poses a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion in a given background scene termed as \textbf{Semantic Human Placement}. This task is extremely challenging given the diverse backgrounds, scale, and pose of the generated person and, finally, the identity preservation of the person. We divide the problem into the following two stages \textbf{i)} learning \textit{semantic masks} using text guidance for localizing regions in the image to place humans and \textbf{ii)} subject-conditioned inpainting to place a given subject adhering to the scene affordance within the \textit{semantic masks}. For learning semantic masks, we leverage rich object-scene priors learned from the text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first ones to provide an effective solution for realistic human placements in diverse real-world scenes. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity. Further, we present results for several downstream tasks - scene hallucination from a single or multiple generated persons and text-based attribute editing. With extensive comparisons against strong baselines, we show the superiority of our method in realistic human placement.

[420] 2407.15447

SIGMA:Sinkhorn-Guided Masked Video Modeling

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods. Our project website with code is available at:

[421] 2407.15451

Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the use of paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges associated with annotation on low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary-teacher networks to generate more reliable pseudo labels, enabling our model achieves competitive performance on extremely low-light images without the need for training with low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces the pseudo labels for relatively visible cases, while a keypoints-based complementary teacher focuses on producing the pseudo labels for the missed persons of the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation to challenge a student model in training to outperform the teachers. Experimental results on real low-light dataset (ExLPose-OCN) show, our method achieves 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, despite no low-light ground-truth data is used in our approach, in contrast to the SOTA method. Our code will be available at:

[422] 2407.15452

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

[423] 2407.15459

Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval

Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-toend recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.

[424] 2407.15462

Efficient Retrieval with Learned Similarities

Retrieval plays a fundamental role in recommendation systems, search, and natural language processing by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such retrieval tasks, thanks to Maximum Inner Product Search (MIPS) that enabled efficient retrieval based on dot products. However, state-of-the-art retrieval algorithms have migrated to learned similarities. Such algorithms vary in form; the queries can be represented with multiple embeddings, complex neural networks can be deployed, the item ids can be decoded directly from queries using beam search, and multiple approaches can be combined in hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work investigates techniques for approximate nearest neighbor search with learned similarity functions. We first prove that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions. We next propose techniques to retrieve the approximate top K results using MoL with a tight bound. We finally compare our techniques with existing approaches, showing that MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency, while achieving > .99 recall rate of exact algorithms.

[425] 2407.15463

Integrated Access and Backhaul (IAB) in Low Altitude Platforms

In this paper, we explore the problem of utilizing Integrated Access and Backhaul (IAB) technology in Non-Terrestrial Networks (NTN), with a particular focus on aerial access networks. We consider an Uncrewed Aerial Vehicle (UAV)-based wireless network comprised of two layers of UAVs: (a) a lower layer consisting a number of flying users and a UAV Base Station (BS) that provides coverage for terrestrial users and, (b) an upper layer designated to provide both wireless access for flying users and backhaul connectivity for UAV BS. By adopting IAB technology, the backhaul and access links collaboratively share their resources, enabling aerial backhauling and the utilization of the same infrastructure and frequency resources for access links. A sum-rate maximization problem is formulated by considering aerial backhaul constraints to optimally allocate the frequency spectrum between aerial and terrestrial networks. We decompose the resulting non-convex optimization problem into two sub-problems of beamforming and spectrum allocation and then propose efficient solutions for each. Numerical results in different scenarios yield insightful findings about the effectiveness of using the IAB technique in aerial networks.

[426] 2407.15464

The Diversity Bonus: Learning from Dissimilar Distributed Clients in Personalized Federated Learning

Personalized Federated Learning (PFL) is a commonly used framework that allows clients to collaboratively train their personalized models. PFL is particularly useful for handling situations where data from different clients are not independent and identically distributed (non-IID). Previous research in PFL implicitly assumes that clients can gain more benefits from those with similar data distributions. Correspondingly, methods such as personalized weight aggregation are developed to assign higher weights to similar clients during training. We pose a question: can a client benefit from other clients with dissimilar data distributions and if so, how? This question is particularly relevant in scenarios with a high degree of non-IID, where clients have widely different data distributions, and learning from only similar clients will lose knowledge from many other clients. We note that when dealing with clients with similar data distributions, methods such as personalized weight aggregation tend to enforce their models to be close in the parameter space. It is reasonable to conjecture that a client can benefit from dissimilar clients if we allow their models to depart from each other. Based on this idea, we propose DiversiFed which allows each client to learn from clients with diversified data distribution in personalized federated learning. DiversiFed pushes personalized models of clients with dissimilar data distributions apart in the parameter space while pulling together those with similar distributions. In addition, to achieve the above effect without using prior knowledge of data distribution, we design a loss function that leverages the model similarity to determine the degree of attraction and repulsion between any two models. Experiments on several datasets show that DiversiFed can benefit from dissimilar clients and thus outperform the state-of-the-art methods.

[427] 2407.15472

Learning deep illumination-robust features from multispectral filter array images

Multispectral (MS) snapshot cameras equipped with a MS filter array (MSFA), capture multiple spectral bands in a single shot, resulting in a raw mosaic image where each pixel holds only one channel value. The fully-defined MS image is estimated from the raw one through $\textit{demosaicing}$, which inevitably introduces spatio-spectral artifacts. Moreover, training on fully-defined MS images can be computationally intensive, particularly with deep neural networks (DNNs), and may result in features lacking discrimination power due to suboptimal learning of spatio-spectral interactions. Furthermore, outdoor MS image acquisition occurs under varying lighting conditions, leading to illumination-dependent features. This paper presents an original approach to learn discriminant and illumination-robust features directly from raw images. It involves: $\textit{raw spectral constancy}$ to mitigate the impact of illumination, $\textit{MSFA-preserving}$ transformations suited for raw image augmentation to train DNNs on diverse raw textures, and $\textit{raw-mixing}$ to capture discriminant spatio-spectral interactions in raw images. Experiments on MS image classification show that our approach outperforms both handcrafted and recent deep learning-based methods, while also requiring significantly less computational effort.~The source code will be available.

[428] 2407.15475

A Multi-Level Corroborative Approach for Verification and Validation of Autonomous Robotic Swarms

Modelling and characterizing emergent behaviour within a swarm can pose significant challenges in terms of 'assurance'. Assurance tasks encompass adherence to standards, certification processes, and the execution of verification and validation (V&V) methods, such as model checking. In this study, we propose a holistic, multi-level modelling approach for formally verifying and validating autonomous robotic swarms, which are defined at the macroscopic formal modelling, low-fidelity simulation, high-fidelity simulation, and real-robot levels. Our formal macroscopic models, used for verification, are characterized by data derived from actual simulations, ensuring both accuracy and traceability across different system models. Furthermore, our work combines formal verification with experimental validation involving real robots. In this way, our corroborative approach for V&V seeks to enhance confidence in the evidence, in contrast to employing these methods separately. We explore our approach through a case study focused on a swarm of robots operating within a public cloakroom.

[429] 2407.15476

MODRL-TA:A Multi-Objective Deep Reinforcement Learning Framework for Traffic Allocation in E-Commerce Search

Traffic allocation is a process of redistributing natural traffic to products by adjusting their positions in the post-search phase, aimed at effectively fostering merchant growth, precisely meeting customer demands, and ensuring the maximization of interests across various parties within e-commerce platforms. Existing methods based on learning to rank neglect the long-term value of traffic allocation, whereas approaches of reinforcement learning suffer from balancing multiple objectives and the difficulties of cold starts within realworld data environments. To address the aforementioned issues, this paper propose a multi-objective deep reinforcement learning framework consisting of multi-objective Q-learning (MOQ), a decision fusion algorithm (DFM) based on the cross-entropy method(CEM), and a progressive data augmentation system(PDA). Specifically. MOQ constructs ensemble RL models, each dedicated to an objective, such as click-through rate, conversion rate, etc. These models individually determine the position of items as actions, aiming to estimate the long-term value of multiple objectives from an individual perspective. Then we employ DFM to dynamically adjust weights among objectives to maximize long-term value, addressing temporal dynamics in objective preferences in e-commerce scenarios. Initially, PDA trained MOQ with simulated data from offline logs. As experiments progressed, it strategically integrated real user interaction data, ultimately replacing the simulated dataset to alleviate distributional shifts and the cold start problem. Experimental results on real-world online e-commerce systems demonstrate the significant improvements of MODRL-TA, and we have successfully deployed MODRL-TA on an e-commerce search platform.

[430] 2407.15479

Affordance Labeling and Exploration: A Manifold-Based Approach

The advancement in computing power has significantly reduced the training times for deep learning, fostering the rapid development of networks designed for object recognition. However, the exploration of object utility, which is the affordance of the object, as opposed to object recognition, has received comparatively less attention. This work focuses on the problem of exploration of object affordances using existing networks trained on the object classification dataset. While pre-trained networks have proven to be instrumental in transfer learning for classification tasks, this work diverges from conventional object classification methods. Instead, it employs pre-trained networks to discern affordance labels without the need for specialized layers, abstaining from modifying the final layers through the addition of classification layers. To facilitate the determination of affordance labels without such modifications, two approaches, i.e. subspace clustering and manifold curvature methods are tested. These methods offer a distinct perspective on affordance label recognition. Especially, manifold curvature method has been successfully tested with nine distinct pre-trained networks, each achieving an accuracy exceeding 95%. Moreover, it is observed that manifold curvature and subspace clustering methods explore affordance labels that are not marked in the ground truth, but object affords in various cases.

[431] 2407.15481

Diverse Image Harmonization

Image harmonization aims to adjust the foreground illumination in a composite image to make it harmonious. The existing harmonization methods can only produce one deterministic result for a composite image, ignoring that a composite image could have multiple plausible harmonization results due to multiple plausible reflectances. In this work, we first propose a reflectance-guided harmonization network, which can achieve better performance with the guidance of ground-truth foreground reflectance. Then, we also design a diverse reflectance generation network to predict multiple plausible foreground reflectances, leading to multiple plausible harmonization results. The extensive experiments on the benchmark datasets demonstrate the effectiveness of our method.

[432] 2407.15483

Enhancing Wireless Networks with Attention Mechanisms: Insights from Mobile Crowdsensing

The increasing demand for sensing, collecting, transmitting, and processing vast amounts of data poses significant challenges for resource-constrained mobile users, thereby impacting the performance of wireless networks. In this regard, from a case of mobile crowdsensing (MCS), we aim at leveraging attention mechanisms in machine learning approaches to provide solutions for building an effective, timely, and secure MCS. Specifically, we first evaluate potential combinations of attention mechanisms and MCS by introducing their preliminaries. Then, we present several emerging scenarios about how to integrate attention into MCS, including task allocation, incentive design, terminal recruitment, privacy preservation, data collection, and data transmission. Subsequently, we propose an attention-based framework to solve network optimization problems with multiple performance indicators in large-scale MCS. The designed case study have evaluated the effectiveness of the proposed framework. Finally, we outline important research directions for advancing attention-enabled MCS.

[433] 2407.15484

6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF) that also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each ellipsoid that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of each ellipsoid, which in turn is used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best scoring bundle of rays, which their intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the necessity of an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates near real-time, reaching 15fps on consumer hardware.

[434] 2407.15487

In-Context Learning Improves Compositional Understanding of Vision-Language Models

Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this work, we investigate the reasons for such a lack of capability by performing an extensive bench-marking of compositional understanding in VLMs. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Furthermore, we leverage In-Context Learning (ICL) as a way to improve the ability of VLMs to perform more complex reasoning and understanding given an image. Our extensive experiments demonstrate that our proposed approach outperforms baseline models across multiple compositional understanding datasets.

[435] 2407.15488

DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Diffusion models have made significant strides in text-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, including chromatic contrast, thermal illumination, and depth information. In this paper, we introduce a novel diffusion model for general layout-guided cross-modal "RGB+X" generation, called DiffX. We firstly construct the cross-modal image datasets with text descriptions using the LLaVA model for image captioning, supplemented by manual corrections. Notably, DiffX presents a simple yet effective cross-modal generative modeling pipeline, which conducts diffusion and denoising processes in the modality-shared latent space, facilitated by our Dual-Path Variational AutoEncoder (DP-VAE). Furthermore, we incorporate the gated cross-attention mechanism to connect the layout and text conditions, leveraging Long-CLIP for embedding long captions to enhance user guidance. Through extensive experiments, DiffX demonstrates robustness and flexibility in cross-modal generation across three RGB+X datasets: FLIR, MFNet, and COME15K, guided by various layout types. It also shows the potential for adaptive generation of "RGB+X+Y" or more diverse modalities. Our code and processed image captions are available at

[436] 2407.15489

Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing the best practices in pretraining has therefore become a major point of focus for much of NLP research -- especially since the insights developed for monolingual English models need not carry to more complex multilingual. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pre-training objective under the right conditions. We make our code, data, and model weights available at \texttt{\url{}}.

[437] 2407.15492

Fast computation of 2-isogenies in dimension 4 and cryptographic applications

Dimension 4 isogenies have first been introduced in cryptography for the cryptanalysis of Supersingular Isogeny Diffie-Hellman (SIDH) and have been used constructively in several schemes, including SQIsignHD, a derivative of SQIsign isogeny based signature scheme. Unlike in dimensions 2 and 3, we can no longer rely on the Jacobian model and its derivatives to compute isogenies. In dimension 4 (and higher), we can only use theta-models. Previous works by Romain Cosset, David Lubicz and Damien Robert have focused on the computation of $\ell$-isogenies in theta-models of level $n$ coprime to $\ell$ (which requires to use $n^g$ coordinates in dimension $g$). For cryptographic applications, we need to compute chains of $2$-isogenies, requiring to use $\geq 3^g$ coordinates in dimension $g$ with state of the art algorithms. In this paper, we present algorithms to compute chains of $2$-isogenies between abelian varieties of dimension $g\geq 1$ with theta-coordinates of level $n=2$, generalizing a previous work by Pierrick Dartois, Luciano Maino, Giacomo Pope and Damien Robert in dimension $g=2$. We propose an implementation of these algorithms in dimension $g=4$ to compute endomorphisms of elliptic curve products derived from Kani's lemma with applications to SQIsignHD and SIDH cryptanalysis. We are now able to run a complete key recovery attack on SIDH when the endomorphism ring of the starting curve is unknown within a few seconds on a laptop for all NIST SIKE parameters.

[438] 2407.15498

Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) \textit{Random Replacement} with the guidance of confusion sets and (2) \textit{OCR/ASR-based Generation} that simulates character misusing. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we set up impressive state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).

[439] 2407.15500

TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping

Generative AI technologies produce hyper-realistic imagery that can be used for nefarious purposes such as producing misleading or harmful content, among others. This makes Synthetic Image Detection (SID) an essential tool for defending against AI-generated harmful content. Current SID methods typically resize input images to a fixed resolution or perform center-cropping due to computational concerns, leading to challenges in effectively detecting artifacts in high-resolution images. To this end, we propose TextureCrop, a novel image pre-processing technique. By focusing on high-frequency image parts where generation artifacts are prevalent, TextureCrop effectively enhances SID accuracy while maintaining manageable memory requirements. Experimental results demonstrate a consistent improvement in AUC across various detectors by 5.7% compared to center cropping and by 14% compared to resizing, across high-resolution images from the Forensynths and Synthbuster datasets.

[440] 2407.15502

WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results.

[441] 2407.15504

Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models

We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the criticality of query-aware prompt compression, where the compressor has knowledge of the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose a query-aware, variable-rate adaptation of a prior work to close the gap. We extend our experiments to a small natural language dataset to further confirm our findings on our synthetic dataset.

[442] 2407.15507

SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time

Generating high-resolution images with generative models has recently been made widely accessible by leveraging diffusion models pre-trained on large-scale datasets. Various techniques, such as MultiDiffusion and SyncDiffusion, have further pushed image generation beyond training resolutions, i.e., from square images to panorama, by merging multiple overlapping diffusion paths or employing gradient descent to maintain perceptual coherence. However, these methods suffer from significant computational inefficiencies due to generating and averaging numerous predictions, which is required in practice to produce high-quality and seamless images. This work addresses this limitation and presents a novel approach that eliminates the need to generate and average numerous overlapping denoising predictions. Our method shifts non-overlapping denoising windows over time, ensuring that seams in one timestep are corrected in the next. This results in coherent, high-resolution images with fewer overall steps. We demonstrate the effectiveness of our approach through qualitative and quantitative evaluations, comparing it with MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our method offers several key benefits, including improved computational efficiency and faster inference times while producing comparable or better image quality.

[443] 2407.15508

Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

Large Language Models (LLMs) showcase remarkable performance and robust deductive capabilities, yet their expansive size complicates deployment and raises environmental concerns due to substantial resource consumption. The recent development of a quantization technique known as Learnable Singular-value Increment (LSI) has addressed some of these quantization challenges. Leveraging insights from LSI and our extensive research, we have developed innovative methods that enhance the performance of quantized LLMs, particularly in low-bit settings. Our methods consistently deliver state-of-the-art results across various quantization scenarios and offer deep theoretical insights into the quantization process, elucidating the potential of quantized models for widespread application.

[444] 2407.15510

Algebraic anti-unification

Abstraction is key to human and artificial intelligence as it allows one to see common structure in otherwise distinct objects or situations and as such it is a key element for generality in AI. Anti-unification (or generalization) is \textit{the} part of theoretical computer science and AI studying abstraction. It has been successfully applied to various AI-related problems, most importantly inductive logic programming. Up to this date, anti-unification is studied only from a syntactic perspective in the literature. The purpose of this paper is to initiate an algebraic (i.e. semantic) theory of anti-unification within general algebras. This is motivated by recent applications to similarity and analogical proportions.

[445] 2407.15511

Inconsistencies in TeX-Produced Documents

TeX is a widely-used typesetting system adopted by most publishers and professional societies. While TeX is responsible for generating a significant number of documents, irregularities in the TeX ecosystem may produce inconsistent documents. These inconsistencies may occur across different TeX engines or different versions of TeX distributions, resulting in failures to adhere to formatting specifications, or the same document rendering differently for different authors. In this work, we investigate and quantify the robustness of the TeX ecosystem through a large-scale study of 432 documents. We developed an automated pipeline to evaluate the cross-engine and cross-version compatibility of the TeX ecosystem. We found significant inconsistencies in the outputs of different TeX engines: only 0.2% of documents compiled to identical output with XeTeX and PDFTeX due to a lack of cross-engine support in popular LaTeX packages and classes used in academic conferences. A smaller$\unicode{x2014}$yet significant$\unicode{x2014}$extent of inconsistencies was found across different TeX Live distributions, with only 42.1% of documents producing the same output from 2020 to 2023. Our automated pipeline additionally reduces the human effort in bug-finding: from a sample of 10 unique root causes of inconsistencies, we identified two new bugs in LaTeX packages and five existing bugs that were fixed independently of this study. We also observed potentially unintended inconsistencies across different TeX Live distributions beyond the updates listed in changelogs. We expect that this study will help authors of TeX documents to avoid unexpected outcomes by understanding how they may be affected by the often undocumented subtleties of the TeX ecosystem, while benefiting developers by demonstrating how different implementations result in unintended inconsistencies.

[446] 2407.15512

Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.

[447] 2407.15514

Twin-Width Meets Feedback Edges and Vertex Integrity

The approximate computation of twin-width has attracted significant attention already since the moment the parameter was introduced. A recently proposed approach (STACS 2024) towards obtaining a better understanding of this question is to consider the approximability of twin-width via fixed-parameter algorithms whose running time depends not on twin-width itself, but rather on parameters which impose stronger restrictions on the input graph. The first step that article made in this direction is to establish the fixed-parameter approximability of twin-width (with an additive error of 1) when the runtime parameter is the feedback edge number. Here, we make several new steps in this research direction and obtain: - An asymptotically tight bound between twin-width and the feedback edge number; - A significantly improved fixed-parameter approximation algorithm for twin-width under the same runtime parameter (i.e., the feedback edge number) which circumvents many of the technicalities of the original result and simultaneously avoids its formerly non-elementary runtime dependency; - An entirely new fixed-parameter approximation algorithm for twin-width when the runtime parameter is the vertex integrity of the graph.

[448] 2407.15516

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping dreeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33\% of attention layers in a 13B Llama2 model results in a 1.8\% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers except the latter layers reduces performances for more layers skipped, except for skipping the attention layers.

[449] 2407.15518

An Efficient Regularity Lemma for Semi-Algebraic Hypergraphs

We use the polynomial method of Guth and Katz to establish stronger and {\it more efficient} regularity and density theorems for such $k$-uniform hypergraphs $H=(P,E)$, where $P$ is a finite point set in ${\mathbb R}^d$, and the edge set $E$ is determined by a semi-algebraic relation of bounded description complexity. In particular, for any $0<\epsilon\leq 1$ we show that one can construct in $O\left(n\log (1/\epsilon)\right)$ time, an equitable partition $P=U_1\uplus \ldots\uplus U_K$ into $K=O(1/\epsilon^{d+1+\delta})$ subsets, for any $0<\delta$, so that all but $\epsilon$-fraction of the $k$-tuples $U_{i_1},\ldots,U_{i_k}$ are {\it homogeneous}: we have that either $U_{i_1}\times\ldots\times U_{i_k}\subseteq E$ or $(U_{i_1}\times\ldots\times U_{i_k})\cap E=\emptyset$. If the points of $P$ can be perturbed in a general position, the bound improves to $O(1/\epsilon^{d+1})$, and the partition is attained via a {\it single partitioning polynomial} (albeit, at expense of a possible increase in worst-case running time). In contrast to the previous such regularity lemmas which were established by Fox, Gromov, Lafforgue, Naor, and Pach and, subsequently, Fox, Pach and Suk, our partition of $P$ does not depend on the edge set $E$ provided its semi-algebraic description complexity does not exceed a certain constant. As a by-product, we show that in any $k$-partite $k$-uniform hypergraph $(P_1\uplus\ldots\uplus P_k,E)$ of bounded semi-algebraic description complexity in ${\mathbb R}^d$ and with $|E|\geq \epsilon \prod_{i=1}^k|P_i|$ edges, one can find, in expected time $O\left(\sum_{i=1}^k|P_i|\log (1/\epsilon)+(1/\epsilon)\log(1/\epsilon)\right)$, subsets $Q_i\subseteq P_i$ of cardinality $|Q_i|\geq |P_i|/\epsilon^{d+1+\delta}$, so that $Q_1\times\ldots\times Q_k\subseteq E$.

[450] 2407.15519

On the Automated Processing of User Feedback

User feedback is becoming an increasingly important source of information for requirements engineering, user interface design, and software engineering in general. Nowadays, user feedback is largely available and easily accessible in social media, product forums, or app stores. Over the last decade, research has shown that user feedback can help software teams: a) better understand how users are actually using specific product features and components, b) faster identify, reproduce, and fix defects, and b) get inspirations for improvements or new features. However, to tap the full potential of feedback, there are two main challenges that need to be solved. First, software vendors must cope with a large quantity of feedback data, which is hard to manage manually. Second, vendors must also cope with a varying quality of feedback as some items might be uninformative, repetitive, or simply wrong. This chapter summarises and pipelines various data mining, machine learning, and natural language processing techniques, including recent Large Language Models, to cope with the quantity and quality challenges. We guide researchers and practitioners through implementing effective, actionable analysis of user feedback for software and requirements engineering.

[451] 2407.15520

Future-Proofing Mobile Networks: A Digital Twin Approach to Multi-Signal Management

Digital Twins (DTs) are set to become a key enabling technology in future wireless networks, with their use in network management increasing significantly. We developed a DT framework that leverages the heterogeneity of network access technologies as a resource for enhanced network performance and management, enabling smart data handling in the physical network. Tested in a \textit{Campus Area Network} environment, our framework integrates diverse data sources to provide real-time, holistic insights into network performance and environmental sensing. We also envision that traditional analytics will evolve to rely on emerging AI models, such as Generative AI (GenAI), while leveraging current analytics capabilities. This capacity can simplify analytics processes through advanced ML models, enabling descriptive, diagnostic, predictive, and prescriptive analytics in a unified fashion. Finally, we present specific research opportunities concerning interoperability aspects and envision aligning advancements in DT technology with evolved AI integration.

[452] 2407.15523

TOM: A Development Platform For Wearable Intelligent Assistants

Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform ( designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. This system was developed collaboratively with AR/MR researchers, HCI researchers, AI/Robotic researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.

[453] 2407.15524

Towards Efficient Transferable Preemptive Adversarial Defense

Deep learning technology has brought convenience and advanced developments but has become untrustworthy because of its sensitivity to inconspicuous perturbations (i.e., adversarial attacks). Attackers utilize this sensitivity to slightly manipulate transmitted messages. To defend against such attacks, we have devised a strategy for "attacking" the message before it is attacked. This strategy, dubbed Fast Preemption, provides an efficient transferable preemptive defense by using different models for labeling inputs and learning crucial features. A forward-backward cascade learning algorithm is used to compute protective perturbations, starting with forward propagation optimization to achieve rapid convergence, followed by iterative backward propagation learning to alleviate overfitting. This strategy offers state-of-the-art transferability and protection across various systems. With the running of only three steps, our Fast Preemption framework outperforms benchmark training-time, test-time, and preemptive adversarial defenses. We have also devised the first to our knowledge effective white-box adaptive reversion attack and demonstrate that the protection added by our defense strategy is irreversible unless the backbone model, algorithm, and settings are fully compromised. This work provides a new direction to developing active defenses against adversarial attacks.

[454] 2407.15525

Multiple importance sampling for stochastic gradient estimation

We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.

[455] 2407.15526

Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks

Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.

[456] 2407.15527

Interpretable Concept-Based Memory Reasoning

The lack of transparency in the decision-making processes of deep learning systems presents a significant challenge in modern artificial intelligence (AI), as it impairs users' ability to rely on and verify these systems. To address this challenge, Concept Bottleneck Models (CBMs) have made significant progress by incorporating human-interpretable concepts into deep learning architectures. This approach allows predictions to be traced back to specific concept patterns that users can understand and potentially intervene on. However, existing CBMs' task predictors are not fully interpretable, preventing a thorough analysis and any form of formal verification of their decision-making process prior to deployment, thereby raising significant reliability concerns. To bridge this gap, we introduce Concept-based Memory Reasoner (CMR), a novel CBM designed to provide a human-understandable and provably-verifiable task prediction process. Our approach is to model each task prediction as a neural selection mechanism over a memory of learnable logic rules, followed by a symbolic evaluation of the selected rule. The presence of an explicit memory and the symbolic evaluation allow domain experts to inspect and formally verify the validity of certain global properties of interest for the task prediction process. Experimental results demonstrate that CMR achieves comparable accuracy-interpretability trade-offs to state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification.

[457] 2407.15531

Double Deep Learning-based Event Data Coding and Classification

Event cameras have the ability to capture asynchronous per-pixel brightness changes, called "events", offering advantages over traditional frame-based cameras for computer vision applications. Efficiently coding event data is critical for transmission and storage, given the significant volume of events. This paper proposes a novel double deep learning-based architecture for both event data coding and classification, using a point cloud-based representation for events. In this context, the conversions from events to point clouds and back to events are key steps in the proposed solution, and therefore its impact is evaluated in terms of compression and classification performance. Experimental results show that it is possible to achieve a classification performance of compressed events which is similar to one of the original events, even after applying a lossy point cloud codec, notably the recent learning-based JPEG Pleno Point Cloud Coding standard, with a clear rate reduction. Experimental results also demonstrate that events coded using JPEG PCC achieve better classification performance than those coded using the conventional lossy MPEG Geometry-based Point Cloud Coding standard. Furthermore, the adoption of learning-based coding offers high potential for performing computer vision tasks in the compressed domain, which allows skipping the decoding stage while mitigating the impact of coding artifacts.

[458] 2407.15537

Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraints penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.

[459] 2407.15540

Differentiable Product Quantization for Memory Efficient Camera Relocalization

Camera relocalization relies on 3D models of the scene with a large memory footprint that is incompatible with the memory budget of several applications. One solution to reduce the scene memory size is map compression by removing certain 3D points and descriptor quantization. This achieves high compression but leads to performance drop due to information loss. To address the memory performance trade-off, we train a light-weight scene-specific auto-encoder network that performs descriptor quantization-dequantization in an end-to-end differentiable manner updating both product quantization centroids and network parameters through back-propagation. In addition to optimizing the network for descriptor reconstruction, we encourage it to preserve the descriptor-matching performance with margin-based metric loss functions. Results show that for a local descriptor memory of only 1MB, the synergistic combination of the proposed network and map compression achieves the best performance on the Aachen Day-Night compared to existing compression methods.

[460] 2407.15545

Inverted Activations

The scaling of neural networks with increasing data and model sizes necessitates more efficient deep learning algorithms. This paper addresses the memory footprint challenge in neural network training by proposing a modification to the handling of activation tensors in pointwise nonlinearity layers. Traditionally, these layers save the entire input tensor for the backward pass, leading to substantial memory use. Our method involves saving the output tensor instead, reducing the memory required when the subsequent layer also saves its input tensor. This approach is particularly beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. Application of our method involves taken an inverse function of nonlinearity. To the best of our knowledge, that can not be done analitically and instead we buid an accurate approximations using simpler functions. Experimental results confirm that our method significantly reduces memory usage without affecting training accuracy. The implementation is available at

[461] 2407.15546

Personalization of Dataset Retrieval Results using a Metadata-based Data Valuation Method

In this paper, we propose a novel data valuation method for a Dataset Retrieval (DR) use case in Ireland's National mapping agency. To the best of our knowledge, data valuation has not yet been applied to Dataset Retrieval. By leveraging metadata and a user's preferences, we estimate the personal value of each dataset to facilitate dataset retrieval and filtering. We then validated the data value-based ranking against the stakeholders' ranking of the datasets. The proposed data valuation method and use case demonstrated that data valuation is promising for dataset retrieval. For instance, the outperforming dataset retrieval based on our approach obtained 0.8207 in terms of NDCG@5 (the truncated Normalized Discounted Cumulative Gain at 5). This study is unique in its exploration of a data valuation-based approach to dataset retrieval and stands out because, unlike most existing methods, our approach is validated using the stakeholders ranking of the datasets.

[462] 2407.15549

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

[463] 2407.15551

MoXIchecker: An Extensible Model Checker for MoXI

MoXI is a new intermediate verification language introduced in 2024 to promote the standardization and open-source implementations for symbolic model checking by extending the SMT-LIB 2 language with constructs to define state-transition systems. The tool suite of MoXI provides a translator from MoXI to Btor2, which is a lower-level intermediate language for hardware verification, and a translation-based model checker, which invokes mature hardware model checkers for Btor2 to analyze the translated verification tasks. The extensibility of such a translation-based model checker is restricted because more complex theories, such as integer or real arithmetics, cannot be precisely expressed with bit-vectors of fixed lengths in Btor2. We present MoXIchecker, the first model checker that solves MoXI verification tasks directly. Instead of translating MoXI to lower-level languages, MoXIchecker uses the solver-agnostic library PySMT for SMT solvers as backend for its verification algorithms. MoXIchecker is extensible because it accommodates verification tasks involving more complex theories, not limited by lower-level languages, facilitates the implementation of new algorithms, and is solver-agnostic by using the API of PySMT. In our evaluation, MoXIchecker uniquely solved tasks that use integer or real arithmetics, and achieved a comparable performance against the translation-based model checker from the MoXI tool suite.

[464] 2407.15554

Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping

Learning efficient representations of local features is a key challenge in feature volume-based 3D neural mapping, especially in large-scale environments. In this paper, we introduce Decomposition-based Neural Mapping (DNMap), a storage-efficient large-scale 3D mapping method that employs a discrete representation based on a decomposition strategy. This decomposition strategy aims to efficiently capture repetitive and representative patterns of shapes by decomposing each discrete embedding into component vectors that are shared across the embedding space. Our DNMap optimizes a set of component vectors, rather than entire discrete embeddings, and learns composition rather than indexing the discrete embeddings. Furthermore, to complement the mapping quality, we additionally learn low-resolution continuous embeddings that require tiny storage space. By combining these representations with a shallow neural network and an efficient octree-based feature volume, our DNMap successfully approximates signed distance functions and compresses the feature volume while preserving mapping quality. Our source code is available at

[465] 2407.15556

SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning

Text style transfer, an important research direction in natural language processing, aims to adapt the text to various preferences but often faces challenges with limited resources. In this work, we introduce a novel method termed Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning (SETTP) for effective style transfer in low-resource scenarios. First, SETTP learns source style-level prompts containing fundamental style characteristics from high-resource style transfer. During training, the source style-level prompts are transferred through an attention module to derive a target style-level prompt for beneficial knowledge provision in low-resource style transfer. Additionally, we propose instance-level prompts obtained by clustering the target resources based on the semantic content to reduce semantic bias. We also propose an automated evaluation approach of style similarity based on alignment with human evaluations using ChatGPT-4. Our experiments across three resourceful styles show that SETTP requires only 1/20th of the data volume to achieve performance comparable to state-of-the-art methods. In tasks involving scarce data like writing style and role style, SETTP outperforms previous methods by 16.24\%.

[466] 2407.15557

Hierarchical Alternating Least Squares Methods for Quaternion Nonnegative Matrix Factorizations

In this report, we discuss a simple model for RGB color and polarization images under a unified framework of quaternion nonnegative matrix factorization (QNMF) and present a hierarchical nonnegative least squares method to solve the factor matrices. The convergence analysis of the algorithm is discussed as well. We test the proposed method in the polarization image and color facial image representation. Compared to the state-of-the-art methods, the experimental results demonstrate the effectiveness of the hierarchical nonnegative least squares method for the QNMF model.

[467] 2407.15562

Towards an Engineering Discipline for Resilient Cyber-Physical Systems

Resilient cyber-physical systems comprise computing systems able to continuously interact with the physical environment in which they operate, despite runtime errors. The term resilience refers to the ability to cope with unexpected inputs while delivering correct service. Examples of resilient computing systems are Google's PageRank and the Bubblesort algorithm. Engineering for resilient cyber-physical systems requires a paradigm shift, prioritizing adaptability to dynamic environments. Software as a tool for self-management is a key instrument for dealing with uncertainty and embedding resilience in these systems. Yet, software engineers encounter the ongoing challenge of ensuring resilience despite environmental dynamic change. My thesis aims to pioneer an engineering discipline for resilient cyber-physical systems. Over four years, we conducted studies, built methods and tools, delivered software packages, and a website offering guidance to practitioners. This paper provides a condensed overview of the problems tackled, our methodology, key contributions, and results highlights. Seeking feedback from the community, this paper serves both as preparation for the thesis defense and as insight into future research prospects.

[468] 2407.15566

Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval

The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall rankings of relevant videos at the top list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) The current similarity measure and AP-based loss are suboptimal for video retrieval; b) The noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results present that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for the multi-media application.

[469] 2407.15567

A New Theoretical Perspective on Data Heterogeneity in Federated Optimization

In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence rate grows dramatically when the number of local updates becomes large, especially when the product of the gradient divergence and local Lipschitz constant is large. However, empirical studies can show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new and weaker assumption compared to the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this and the gradient divergence assumptions can jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to the existing works, local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective function is quadratic, more insights on the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence can be arbitrarily large. Our findings are validated using experiments.

[470] 2407.15568

Empowering Agile-Based Generative Software Development through Human-AI Teamwork

In software development, the raw requirements proposed by users are frequently incomplete, which impedes the complete implementation of application functionalities. With the emergence of large language models, recent methods with the top-down waterfall model employ a questioning approach for requirement completion, attempting to explore further user requirements. However, users, constrained by their domain knowledge, lack effective acceptance criteria, which fail to capture the implicit needs of the user. Moreover, the cumulative errors of the waterfall model can lead to discrepancies between the generated code and user requirements. The Agile methodologies reduce cumulative errors through lightweight iteration and collaboration with users, but the challenge lies in ensuring semantic consistency between user requirements and the code generated. We propose AgileGen, an agile-based generative software development through human-AI teamwork. AgileGen attempts for the first time to use testable requirements by Gherkin for semantic consistency between requirements and code. Additionally, we innovate in human-AI teamwork, allowing users to participate in decision-making processes they do well and enhancing the completeness of application functionality. Finally, to improve the reliability of user scenarios, a memory pool mechanism is used to collect user decision-making scenarios and recommend them to new users. AgileGen, as a user-friendly interactive system, significantly outperformed existing best methods by 16.4% and garnered higher user satisfaction.

[471] 2407.15569

An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought

Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper delves into the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of Generative dialogue models. RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhanced the model's information extraction and logical reasoning abilities. We evaluated the RAFT method across multiple datasets and analysed its performance in various reasoning tasks, including long-form QA and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, it addresses the gaps in previous research regarding long-form QA tasks and Chinese datasets. Moreover, we also evaluate the benefit of the chain-of-thought (CoT) in the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.

[472] 2407.15570

Hybrid STAR-RIS Enabled Integrated Sensing and Communication

Integrated sensing and communication (ISAC) is recognized as one of the key enabling technologies for sixth-generation (6G) wireless communication networks, facilitating diverse emerging applications and services in an energy and cost-efficient manner. This paper proposes a multi-user multi-target ISAC system to enable full-space coverage for communication and sensing tasks. The proposed system employs a hybrid simultaneous transmission and reflection reconfigurable intelligent surface (STAR-RIS) comprising active transmissive and passive reflective elements. In the proposed scheme, the passive reflective elements support communication and sensing links for nearby communication users and sensing targets, while low-power active transmissive elements are deployed to improve sensing performance and overcome high path attenuation due to multi-hop transmission for remote targets. Moreover, to optimize the transmissive/reflective coefficients of the hybrid STAR-RIS, a semi-definite relaxation (SDR)-based algorithm is proposed. Furthermore, to evaluate sensing performance, signal-to-interference-noise ratio (SINR) and Cramer-Rao bound (CRB) metrics have been derived and investigated via conducting extensive computer simulations.

[473] 2407.15580

Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.

[474] 2407.15581

vLSM: Low tail latency and I/O amplification in LSM-based KV stores

LSM-based key-value (KV) stores are an important component in modern data infrastructures. However, they suffer from high tail latency, in the order of several seconds, making them less attractive for user-facing applications. In this paper, we introduce the notion of compaction chains and we analyse how they affect tail latency. Then, we show that modern designs reduce tail latency, by trading I/O amplification or require large amounts of memory. Based on our analysis, we present vLSM, a new KV store design that improves tail latency significantly without compromising on memory or I/O amplification. vLSM reduces (a) compaction chain width by using small SSTs and eliminating the tiering compaction required in L0 by modern systems and (b) compaction chain length by using a larger than typical growth factor between L1 and L2 and introducing overlap-aware SSTs in L1. We implement vLSM in RocksDB and evaluate it using db_bench and YCSB. Our evaluation highlights the underlying trade-off among memory requirements, I/O amplification, and tail latency, as well as the advantage of vLSM over current approaches. vLSM improves P99 tail latency by up to 4.8x for writes and by up to 12.5x for reads, reduces cumulative write stalls by up to 60% while also slightly improves I/O amplification at the same memory budget.

[475] 2407.15588

Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts

Cross-lingual entity alignment (EA) enables the integration of multiple knowledge graphs (KGs) across different languages, providing users with seamless access to diverse and comprehensive knowledge.Existing methods, mostly supervised, face challenges in obtaining labeled entity pairs. To address this, recent studies have shifted towards a self-supervised and unsupervised frameworks. Despite their effectiveness, these approaches have limitations: (1) they mainly focus on entity features, neglecting the semantic information of relations, (2) they assume isomorphism between source and target graphs, leading to noise and reduced alignment accuracy, and (3) they are susceptible to noise in the textual features, especially when encountering inconsistent translations or Out-Of-Vocabulary (OOV) problems. In this paper, we propose ERAlign, an unsupervised and robust cross-lingual EA framework that jointly performs Entity-level and Relation-level Alignment using semantic textual features of relations and entities. Its refinement process iteratively enhances results by fusing entity-level and relation-level alignments based on neighbor triple matching. The additional verification process examines the entities' neighbor triples as the linearized text. This \textit{Align-and-Verify} pipeline that rigorously assesses alignment results, achieving near-perfect alignment even in the presence of noisy textual features of entities. Our extensive experiments demonstrate that robustness and general applicability of \proposed improved the accuracy and effectiveness of EA tasks, contributing significantly to knowledge-oriented applications.

[476] 2407.15589

Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 800 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

[477] 2407.15590

All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism

In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool and a Sparse Feature Fusion (SFF) module. The design of the Prompt Pool is aimed at integrating information from different modalities, while inherent prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the SSF module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, have proven that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of Modality Missingness and multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information.

[478] 2407.15591

FAIR evaluation of ten widely used chemical datasets: Lessons learned and recommendations

This document focuses on databases disseminating data on (hazardous) substances found on the North American and the European (EU) market. The goal is to analyse the FAIRness (Findability, Accessibility, Interoperability and Reusability) of published open data on these substances and to qualitatively evaluate to what extend the selected databases already fulfil the criteria set out in the commission draft regulation on a common data chemicals platform. We implemented two complementary approaches: Manual, and Automatic. The manual approach is based on online questionnaires. These questionnaires provide a structured approach to evaluating FAIRness by guiding users through a series of questions related to the FAIR principles. They are particularly useful for initiating discussions on FAIR implementation within research teams and for identifying areas that require further attention. Automated tools for FAIRness assessment, such as F-UJI and FAIR Checker, are gaining prominence and are continuously under development. Unlike manual tools, automated tools perform a series of tests automatically starting from a dereferenceable URL to the data resource to be evaluated. We analysed ten widely adopted datasets managed in Europe and North America. The highest score from automatic analysis was 54/100. The manual analysis shows that several FAIR metrics were satisfied, but not detectable by automatic tools because there is no metadata, or the format of the information was not a standard one. Thus, it was not interpretable by the tool. We present the details of the analysis and tables summarizing the outcomes, the issues, and the suggestions to address these issues.

[479] 2407.15593

Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Accurate localization in diverse environments is a fundamental challenge in computer vision and robotics. The task involves determining a sensor's precise position and orientation, typically a camera, within a given space. Traditional localization methods often rely on passive sensing, which may struggle in scenarios with limited features or dynamic environments. In response, this paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy. Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications. Our results demonstrate that our method performs better than the existing one, targeting similar problems and generalizing on synthetic and real data. We also release an open-source implementation to benefit the community.

[480] 2407.15595

Discrete Flow Matching

Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it allows for a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser ($x$-prediction) and noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.

[481] 2407.15596

Towards a Universal Evaluation Model for Careful and Competent Autonomous Driving

Virtual scenario-based testing methods to validate autonomous driving systems are predominantly centred around collision avoidance, and lack a comprehensive approach to evaluate optimal driving behaviour holistically. Furthermore, current validation approaches do not align with authorisation and monitoring requirements put forth by regulatory bodies. We address these validation gaps by outlining a universal evaluation framework that: incorporates the notion of careful and competent driving, unifies behavioural competencies and evaluation criteria, and is amenable at a scenario-specific and aggregate behaviour level. This framework can be leveraged to evaluate optimal driving in scenario-based testing, and for post-deployment monitoring to ensure continual compliance with regulation and safety standards.

[482] 2407.15599

Online String Attractors

In today's data-centric world, fast and effective compression of data is paramount. To measure success towards the second goal, Kempa and Prezza [STOC2018] introduce the string attractor, a combinatorial object unifying dictionary-based compression. Given a string $T \in \Sigma^n$, a string attractor ($k$-attractor) is a set of positions $\Gamma\subseteq [1,n]$, such that every distinct substring (of length at most $k$) has at least one occurrence that contains one of the selected positions. String attractors are shown to be approximated by and thus measure the quality of many important dictionary compression algorithms such as Lempel-Ziv 77, the Burrows-Wheeler transform, straight line programs, and macro schemes. In order to handle massive amounts of data, compression often has to be achieved in a streaming fashion. Thus, practically applied compression algorithms, such as Lempel-Ziv 77, have been extensively studied in an online setting. To the best of our knowledge, there has been no such work, and therefore are no theoretical underpinnings, for the string attractor problem. We introduce a natural online variant of both the $k$-attractor and the string attractor problem. First, we show that the Lempel-Ziv factorization corresponds to the best online algorithm for this problem, resulting in an upper bound of $\mathcal{O}(\log(n))$ on the competitive ratio. On the other hand, there are families of sparse strings which have constant-size optimal attractors, e.g., prefixes of the infinite Sturmian words and Thue-Morse words, which are created by iterative application of a morphism. We consider the most famous of these Sturmian words, the Fibonacci word, and show that any online algorithm has a cost growing with the length of the word, for a matching lower bound of $\Omega(\log(n))$. For the online $k$-attractor problem, we show tight (strict) $k$-competitiveness.

[483] 2407.15600

A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism

Neural architecture search (NAS) enables re-searchers to automatically explore vast search spaces and find efficient neural networks. But NAS suffers from a key bottleneck, i.e., numerous architectures need to be evaluated during the search process, which requires a lot of computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the evaluation time of neural architectures. However, they are not efficient enough and still only focus on the accuracy of architectures. In addition to the classification accuracy, more efficient and smaller network architectures are required in real-world applications. To address the above problems, we propose the SMEM-NAS, a pairwise com-parison relation-assisted multi-objective evolutionary algorithm based on a multi-population mechanism. In the SMEM-NAS, a surrogate model is constructed based on pairwise compari-son relations to predict the accuracy ranking of architectures, rather than the absolute accuracy. Moreover, two populations cooperate with each other in the search process, i.e., a main population guides the evolution, while a vice population expands the diversity. Our method aims to provide high-performance models that take into account multiple optimization objectives. We conduct a series of experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets to verify its effectiveness. With only a single GPU searching for 0.17 days, competitive architectures can be found by SMEM-NAS which achieves 78.91% accuracy with the MAdds of 570M on the ImageNet. This work makes a significant advance in the important field of NAS.

[484] 2407.15603

Semi-Supervised Learning for Anomaly Detection in Blockchain-based Supply Chains

Blockchain-based supply chain (BSC) systems have tremendously been developed recently and can play an important role in our society in the future. In this study, we develop an anomaly detection model for BSC systems. Our proposed model can detect cyber-attacks at various levels, including the network layer, consensus layer, and beyond, by analyzing only the traffic data at the network layer. To do this, we first build a BSC system at our laboratory to perform experiments and collect datasets. We then propose a novel semi-supervised DAE-MLP (Deep AutoEncoder-Multilayer Perceptron) that combines the advantages of supervised and unsupervised learning to detect anomalies in BSC systems. The experimental results demonstrate the effectiveness of our model for anomaly detection within BSCs, achieving a detection accuracy of 96.5%. Moreover, DAE-MLP can effectively detect new attacks by improving the F1-score up to 33.1% after updating the MLP component.

[485] 2407.15605

Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models

Foundation models (FMs) are large neural networks trained on broad datasets, excelling in downstream tasks with minimal fine-tuning. Human activity recognition in video has advanced with FMs, driven by competition among different architectures. However, high accuracies on standard benchmarks can draw an artificially rosy picture, as they often overlook real-world factors like changing camera perspectives. Popular benchmarks, mostly from YouTube or movies, offer diverse views but only coarse actions, which are insufficient for use-cases needing fine-grained, domain-specific actions. Domain-specific datasets (e.g., for industrial assembly) typically use data from limited static perspectives. This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition. We compare multiple backbone architectures and design choices, including image- and video- based models, and various strategies for temporal information fusion, including commonly used score averaging and more novel attention-based temporal aggregation mechanisms. This is the first systematic study of different foundation models and specific design choices for human activity recognition from unknown views, conducted with the goal to provide guidance for backbone- and temporal- fusion scheme selection. Code and models will be made publicly available to the community.

[486] 2407.15608

StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation

In this study, we introduce StylusAI, a novel architecture leveraging diffusion models in the domain of handwriting style generation. StylusAI is specifically designed to adapt and integrate the stylistic nuances of one language's handwriting into another, particularly focusing on blending English handwriting styles into the context of the German writing system. This approach enables the generation of German text in English handwriting styles and German handwriting styles into English, enriching machine-generated handwriting diversity while ensuring that the generated text remains legible across both languages. To support the development and evaluation of StylusAI, we present the \lq{Deutscher Handschriften-Datensatz}\rq~(DHSD), a comprehensive dataset encompassing 37 distinct handwriting styles within the German language. This dataset provides a fundamental resource for training and benchmarking in the realm of handwritten text generation. Our results demonstrate that StylusAI not only introduces a new method for style adaptation in handwritten text generation but also surpasses existing models in generating handwriting samples that improve both text quality and stylistic fidelity, evidenced by its performance on the IAM database and our newly proposed DHSD. Thus, StylusAI represents a significant advancement in the field of handwriting style generation, offering promising avenues for future research and applications in cross-linguistic style adaptation for languages with similar scripts.

[487] 2407.15611

Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets

Feature selection poses a challenge in small-sample high-dimensional datasets, where the number of features exceeds the number of observations, as seen in microarray, gene expression, and medical datasets. There isn't a universally optimal feature selection method applicable to any data distribution, and as a result, the literature consistently endeavors to address this issue. One recent approach in feature selection is termed frequency-based feature selection. However, existing methods in this domain tend to overlook feature values, focusing solely on the distribution in the response variable. In response, this paper introduces the Distance-based Mutual Congestion (DMC) as a filter method that considers both the feature values and the distribution of observations in the response variable. DMC sorts the features of datasets, and the top 5% are retained and clustered by KMeans to mitigate multicollinearity. This is achieved by randomly selecting one feature from each cluster. The selected features form the feature space, and the search space for the Genetic Algorithm with Adaptive Rates (GAwAR) will be approximated using this feature space. GAwAR approximates the combination of the top 10 features that maximizes prediction accuracy within a wrapper scheme. To prevent premature convergence, GAwAR adaptively updates the crossover and mutation rates. The hybrid DMC-GAwAR is applicable to binary classification datasets, and experimental results demonstrate its superiority over some recent works. The implementation and corresponding data are available at

[488] 2407.15612

Can GPT-4 learn to analyze moves in research article abstracts?

One of the most powerful and enduring ideas in written discourse analysis is that genres can be described in terms of the moves which structure a writer's purpose. Considerable research has sought to identify these distinct communicative acts, but analyses have been beset by problems of subjectivity, reliability and the time-consuming need for multiple coders to confirm analyses. In this paper we employ the affordances of GPT-4 to automate the annotation process by using natural language prompts. Focusing on abstracts from articles in four applied linguistics journals, we devise prompts which enable the model to identify moves effectively. The annotated outputs of these prompts were evaluated by two assessors with a third addressing disagreements. The results show that an 8-shot prompt was more effective than one using two, confirming that the inclusion of examples illustrating areas of variability can enhance GPT-4's ability to recognize multiple moves in a single sentence and reduce bias related to textual position. We suggest that GPT-4 offers considerable potential in automating this annotation process, when human actors with domain specific linguistic expertise inform the prompting process.

[489] 2407.15613

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial association.

[490] 2407.15614

Experimenting with Adaptive Bitrate Algorithms for Virtual Reality Streaming over Wi-Fi

Interactive Virtual Reality (VR) streaming over Wi-Fi networks encounters significant challenges due to bandwidth fluctuations caused by channel contention and user mobility. Adaptive BitRate (ABR) algorithms dynamically adjust the video encoding bitrate based on the available network capacity, aiming to maximize image quality while mitigating congestion and preserving the user's Quality of Experience (QoE). In this paper, we experiment with ABR algorithms for VR streaming using Air Light VR (ALVR), an open-source VR streaming solution. We extend ALVR with a comprehensive set of metrics that provide a robust characterization of the network's state, enabling more informed bitrate adjustments. To demonstrate the utility of these performance indicators, we develop and test the Network-aware Step-wise ABR algorithm for VR streaming (NeSt-VR). Results validate the accuracy of the newly implemented network performance metrics and demonstrate NeSt-VR's video bitrate adaptation capabilities.

[491] 2407.15615

Vehicle-to-Everything: Looking into the Future of Flexibility Services

The primary aim of this paper is to illuminate potential Vehicle-to-Everything (V2X) flexibility services that can be activated at the three levels of home, community, and grid. To do this, the potential practical services that can be provided by EVs, flexibility requesters, and the required exchange mechanisms at these three levels are identified. At the home level, the two main services that EVs can provide to households are explored. The initial service focuses on cost reduction for homes by employing smart charging and discharging methods, and the second service underscores the capability of an EV equipped with bidirectional chargers to function as a backup resource during grid outages. At the community level, three flexibility services are introduced and outlined: community cost reduction, energy sharing, and backup resources. There is more than one flexibility requester, including the community manager, EV owners, and other end users. Accordingly, at this level, having a fair and transparent market mechanism can optimise the activation of different V2X flexibility services. At the grid level, flexibility providers can offer two main flexibility services, namely load profile adjustment and real-time voltage and frequency control, to a wide range of flexibility requesters, including distribution network/systems operators (DNOs/DSOs) to overcome technical challenges of the distribution network, energy suppliers to manage their energy portfolio, and TSOs to support transmission network. In addition, a review of trial V2X projects and V2X supporting regulations is presented to contextualize the feasibility and regulatory framework for implementing these services. This analysis offers insights into the challenges and opportunities that need to be addressed for the effective integration of V2X flexibility services across the three identified levels.

[492] 2407.15616

Sustainable broadcasting in Blockchain Network with Reinforcement Learning

Recent estimates put the carbon footprint of Bitcoin and Ethereum at an average of 64 and 26 million tonnes of CO2 per year, respectively. To address this growing problem, several possible approaches have been proposed in the literature: creating alternative blockchain consensus mechanisms, applying redundancy reduction techniques, utilizing renewable energy sources, and employing energy-efficient devices, etc. In this paper, we follow the second avenue and propose an efficient approach based on reinforcement learning that improves the block broadcasting scheme in blockchain networks. The analysis and experimental results confirmed that the proposed improvement of the block propagation scheme could cleverly handle network dynamics and achieve better results than the default approach. Additionally, our technical integration of the simulator and developed RL environment can be used as a complete solution for further study of new schemes and protocols that use RL or other ML techniques.

[493] 2407.15617

Norface: Improving Facial Expression Analysis by Identity Normalization

Facial Expression Analysis remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network struggles to directly remove the above task-irrelevant noise, by maintaining facial expression consistency but normalizing all original images to a common identity with consistent pose, and background. Then, these additional normalized images are fed into the classification network. Due to consistent identity and other factors (e.g. head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset tasks. For the normalized datasets and code please visit {}.

[494] 2407.15620

Dual Test-time Training for Out-of-distribution Recommender System

Deep learning has been widely applied in recommender systems, which has achieved revolutionary progress recently. However, most existing learning-based methods assume that the user and item distributions remain unchanged between the training phase and the test phase. However, the distribution of user and item features can naturally shift in real-world scenarios, potentially resulting in a substantial decrease in recommendation performance. This phenomenon can be formulated as an Out-Of-Distribution (OOD) recommendation problem. To address this challenge, we propose a novel Dual Test-Time-Training framework for OOD Recommendation, termed DT3OR. In DT3OR, we incorporate a model adaptation mechanism during the test-time phase to carefully update the recommendation model, allowing the model to specially adapt to the shifting user and item features. To be specific, we propose a self-distillation task and a contrastive task to assist the model learning both the user's invariant interest preferences and the variant user/item characteristics during the test-time phase, thus facilitating a smooth adaptation to the shifting features. Furthermore, we provide theoretical analysis to support the rationale behind our dual test-time training framework. To the best of our knowledge, this paper is the first work to address OOD recommendation via a test-time-training strategy. We conduct experiments on three datasets with various backbones. Comprehensive experimental results have demonstrated the effectiveness of DT3OR compared to other state-of-the-art baselines.

[495] 2407.15621

RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation

Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from in real-time and incorporated them into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.

[496] 2407.15622

HyperSurf: Quadruped Robot Leg Capable of Surface Recognition with GRU and Real-to-Sim Transferring

This paper introduces a system of data collection acceleration and real-to-sim transferring for surface recognition on a quadruped robot. The system features a mechanical single-leg setup capable of stepping on various easily interchangeable surfaces. Additionally, it incorporates a GRU-based Surface Recognition System, inspired by the system detailed in the Dog-Surf paper. This setup facilitates the expansion of dataset collection for model training, enabling data acquisition from hard-to-reach surfaces in laboratory conditions. Furthermore, it opens avenues for transferring surface properties from reality to simulation, thereby allowing the training of optimal gaits for legged robots in simulation environments using a pre-prepared library of digital twins of surfaces. Moreover, enhancements have been made to the GRU-based Surface Recognition System, allowing for the integration of data from both the quadruped robot and the single-leg setup. The dataset and code have been made publicly available.

[497] 2407.15625

Dressed to Gamble: How Poker Drives the Dynamics of Wearables and Visits on Decentraland's Social Virtual World

Decentraland is a blockchain-based social virtual world touted to be a creative space owned by its community, unlike previous virtual worlds. Its users can create and publish wearables, virtual garments to customize avatars, which can be then sold or given away via the blockchain. Decentral Games (DG), a single project owning two prominent in-world casinos, has by far created the most wearables, with these being necessary to earn cryptocurrency in their flagship game ICE Poker. We thus present a comprehensive study that investigates how DG and ICE Poker influence the overall dynamics of Decentraland wearables and in-world visits. To this end, we analyzed 5.9 million wearable transfers made on the Polygon blockchain (and related sales) over a two-year period, and 677 million log events of in-world user positions in an overlapping 10-month period. We found that the influence of DG and Ice Poker is not only significant, but also substantial for transfers and sales monetary value of wearables, and very large for daily unique visitors and time spent in the virtual world. Despite several alternative in-world economic and artistic initiatives in Decentraland, some of which have attracted much attention from the general public, online poker appears to be the main driver of the analyzed dynamics. Our work thus contributes to the current understanding of user behavior in social virtual worlds and it is among the first to study the emerging phenomenon of blockchain-based online gambling in virtual spaces.

[498] 2407.15626

Reinforcement Learning Meets Visual Odometry

Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.

[499] 2407.15642

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.

[500] 2407.15643

Link Polarity Prediction from Sparse and Noisy Labels via Multiscale Social Balance

Signed Graph Neural Networks (SGNNs) have recently gained attention as an effective tool for several learning tasks on signed networks, i.e., graphs where edges have an associated polarity. One of these tasks is to predict the polarity of the links for which this information is missing, starting from the network structure and the other available polarities. However, when the available polarities are few and potentially noisy, such a task becomes challenging. In this work, we devise a semi-supervised learning framework that builds around the novel concept of \emph{multiscale social balance} to improve the prediction of link polarities in settings characterized by limited data quantity and quality. Our model-agnostic approach can seamlessly integrate with any SGNN architecture, dynamically reweighting the importance of each data sample while making strategic use of the structural information from unlabeled edges combined with social balance theory. Empirical validation demonstrates that our approach outperforms established baseline models, effectively addressing the limitations imposed by noisy and sparse data. This result underlines the benefits of incorporating multiscale social balance into SGNNs, opening new avenues for robust and accurate predictions in signed network analysis.

[501] 2407.15645

Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models

Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.

[502] 2407.15646

SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection

Automotive simulation can potentially compensate for a lack of training data in computer vision applications. However, there has been little to no image quality evaluation of automotive simulation and the impact of optical degradations on simulation is little explored. In this work, we investigate Virtual KITTI and the impact of applying variations of Gaussian blur on image sharpness. Furthermore, we consider object detection, a common computer vision application on three different state-of-the-art models, thus allowing us to characterize the relationship between object detection and sharpness. It was found that while image sharpness (MTF50) degrades from an average of 0.245cy/px to approximately 0.119cy/px; object detection performance stays largely robust within 0.58\%(Faster RCNN), 1.45\%(YOLOF) and 1.93\%(DETR) across all respective held-out test sets.

[503] 2407.15647

The Impact of Responsible AI Research on Innovation and Development

Translational research, especially in the fast-evolving field of Artificial Intelligence (AI), is key to converting scientific findings into practical innovations. In Responsible AI (RAI) research, translational impact is often viewed through various pathways, including research papers, blogs, news articles, and the drafting of forthcoming AI legislation (e.g., the EU AI Act). However, the real-world impact of RAI research remains an underexplored area. Our study aims to capture it through two pathways: \emph{patents} and \emph{code repositories}, both of which provide a rich and structured source of data. Using a dataset of 200,000 papers from 1980 to 2022 in AI and related fields, including Computer Vision, Natural Language Processing, and Human-Computer Interaction, we developed a Sentence-Transformers Deep Learning framework to identify RAI papers. This framework calculates the semantic similarity between paper abstracts and a set of RAI keywords, which are derived from the NIST's AI Risk Management Framework; a framework that aims to enhance trustworthiness considerations in the design, development, use, and evaluation of AI products, services, and systems. We identified 1,747 RAI papers published in top venues such as CHI, CSCW, NeurIPS, FAccT, and AIES between 2015 and 2022. By analyzing these papers, we found that a small subset that goes into patents or repositories is highly cited, with the translational process taking between 1 year for repositories and up to 8 years for patents. Interestingly, impactful RAI research is not limited to top U.S. institutions, but significant contributions come from European and Asian institutions. Finally, the multidisciplinary nature of RAI papers, often incorporating knowledge from diverse fields of expertise, was evident as these papers tend to build on unconventional combinations of prior knowledge.

[504] 2407.15648

TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

Inferring step-wise actions to assemble 3D objects with primitive bricks from images is a challenging task due to complex constraints and the vast number of possible combinations. Recent studies have demonstrated promising results on sequential LEGO brick assembly through the utilization of LEGO-Graph modeling to predict sequential actions. However, existing approaches are class-specific and require significant computational and 3D annotation resources. In this work, we first propose a computationally efficient breadth-first search (BFS) LEGO-Tree structure to model the sequential assembly actions by considering connections between consecutive layers. Based on the LEGO-Tree structure, we then design a class-agnostic tree-transformer framework to predict the sequential assembly actions from the input multi-view images. A major challenge of the sequential brick assembly task is that the step-wise action labels are costly and tedious to obtain in practice. We mitigate this problem by leveraging synthetic-to-real transfer learning. Specifically, our model is first pre-trained on synthetic data with full supervision from the available action labels. We then circumvent the requirement for action labels in the real data by proposing an action-to-silhouette projection that replaces action labels with input image silhouettes for self-supervision. Without any annotation on the real data, our model outperforms existing methods with 3D supervision by 7.8% and 11.3% in mIoU on the MNIST and ModelNet Construction datasets, respectively.

[505] 2407.15656

Evaluation of Reinforcement Learning for Autonomous Penetration Testing using A3C, Q-learning and DQN

Penetration testing is the process of searching for security weaknesses by simulating an attack. It is usually performed by experienced professionals, where scanning and attack tools are applied. By automating the execution of such tools, the need for human interaction and decision-making could be reduced. In this work, a Network Attack Simulator (NASim) was used as an environment to train reinforcement learning agents to solve three predefined security scenarios. These scenarios cover techniques of exploitation, post-exploitation and wiretapping. A large hyperparameter grid search was performed to find the best hyperparameter combinations. The algorithms Q-learning, DQN and A3C were used, whereby A3C was able to solve all scenarios and achieve generalization. In addition, A3C could solve these scenarios with fewer actions than the baseline automated penetration testing. Although the training was performed on rather small scenarios and with small state and action spaces for the agents, the results show that a penetration test can successfully be performed by the RL agent.

[506] 2407.15660

MuTT: A Multimodal Trajectory Transformer for Robot Skills

High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.

[507] 2407.15661

DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

In autonomous driving, deep models have shown remarkable performance across various visual perception tasks with the demand of high-quality and huge-diversity training datasets. Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects. However, manually collecting these data presents huge challenges and expensive cost. With the rapid development of large generative models, we propose DriveDiTFit, a novel method for efficiently generating autonomous Driving data by Fine-tuning pre-trained Diffusion Transformers (DiTs). Specifically, DriveDiTFit utilizes a gap-driven modulation technique to carefully select and efficiently fine-tune a few parameters in DiTs according to the discrepancy between the pre-trained source data and the target driving data. Additionally, DriveDiTFit develops an effective weather and lighting condition embedding module to ensure diversity in the generated data, which is initialized by a nearest-semantic-similarity initialization approach. Through progressive tuning scheme to refined the process of detail generation in early diffusion process and enlarging the weights corresponding to small objects in training loss, DriveDiTFit ensures high-quality generation of small moving objects in the generated data. Extensive experiments conducted on driving datasets confirm that our method could efficiently produce diverse real driving data. The source codes will be available at

[508] 2407.15663

MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics

Place recognition is a challenging task in computer vision, crucial for enabling autonomous vehicles and robots to navigate previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors. We employ a late fusion approach to integrate these modalities, providing a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition model performance compared to single modality approaches and leads to state-of-the-art quality. We also show that separate usage of visual or textual semantics (which are more compact representations of sensory data) can achieve promising results in place recognition. The code for our method is publicly available:

[509] 2407.15665

A spatiotemporal deep learning framework for prediction of crack dynamics in heterogeneous solids: efficient mapping of concrete microstructures to its fracture properties

A spatiotemporal deep learning framework is proposed that is capable of 2D full-field prediction of fracture in concrete mesostructures. This framework not only predicts fractures but also captures the entire history of the fracture process, from the crack initiation in the interfacial transition zone to the subsequent propagation of the cracks in the mortar matrix. In addition, a convolutional neural network is developed which can predict the averaged stress-strain curve of the mesostructures. The UNet modeling framework, which comprises an encoder-decoder section with skip connections, is used as the deep learning surrogate model. Training and test data are generated from high-fidelity fracture simulations of randomly generated concrete mesostructures. These mesostructures include geometric variabilities such as different aggregate particle geometrical features, spatial distribution, and the total volume fraction of aggregates. The fracture simulations are carried out in Abaqus, utilizing the cohesive phase-field fracture modeling technique as the fracture modeling approach. In this work, to reduce the number of training datasets, the spatial distribution of three sets of material properties for three-phase concrete mesostructures, along with the spatial phase-field damage index, are fed to the UNet to predict the corresponding stress and spatial damage index at the subsequent step. It is shown that after the training process using this methodology, the UNet model is capable of accurately predicting damage on the unseen test dataset by using 470 datasets. Moreover, another novel aspect of this work is the conversion of irregular finite element data into regular grids using a developed pipeline. This approach allows for the implementation of less complex UNet architecture and facilitates the integration of phase-field fracture equations into surrogate models for future developments.

[510] 2407.15668

SLVideo: A Sign Language Video Moment Retrieval Framework

Sign Language Recognition has been studied and developed throughout the years to help the deaf and hard-of-hearing people in their day-to-day lives. These technologies leverage manual sign recognition algorithms, however, most of them lack the recognition of facial expressions, which are also an essential part of Sign Language as they allow the speaker to add expressiveness to their dialogue or even change the meaning of certain manual signs. SLVideo is a video moment retrieval software for Sign Language videos with a focus on both hands and facial signs. The system extracts embedding representations for the hand and face signs from video frames to capture the language signs in full. This will then allow the user to search for a specific sign language video segment with text queries, or to search by similar sign language videos. To test this system, a collection of five hours of annotated Sign Language videos is used as the dataset, and the initial results are promising in a zero-shot setting.SLVideo is shown to not only address the problem of searching sign language videos but also supports a Sign Language thesaurus with a search by similarity technique. Project web page:

[511] 2407.15670

Coca4ai: checking energy behaviors on AI data centers

Monitoring energy behaviors in AI data centers is crucial, both to reduce their energy consumption and to raise awareness among their users which are key actors in the AI field. This paper shows a proof of concept of easy and lightweight monitoring of energy behaviors at the scale of a whole data center, a user or a job submission. Our system uses software wattmeters and we validate our setup with per node accurate external wattmeters. Results show that there is an interesting potential from the efficiency point of view, providing arguments to create user engagement thanks to energy monitoring.

[512] 2407.15671

Problems in AI, their roots in philosophy, and implications for science and society

Artificial Intelligence (AI) is one of today's most relevant emergent technologies. In view thereof, this paper proposes that more attention should be paid to the philosophical aspects of AI technology and its use. It is argued that this deficit is generally combined with philosophical misconceptions about the growth of knowledge. To identify these misconceptions, reference is made to the ideas of the philosopher of science Karl Popper and the physicist David Deutsch. The works of both thinkers aim against mistaken theories of knowledge, such as inductivism, empiricism, and instrumentalism. This paper shows that these theories bear similarities to how current AI technology operates. It also shows that these theories are very much alive in the (public) discourse on AI, often called Bayesianism. In line with Popper and Deutsch, it is proposed that all these theories are based on mistaken philosophies of knowledge. This includes an analysis of the implications of these mistaken philosophies for the use of AI in science and society, including some of the likely problem situations that will arise. This paper finally provides a realistic outlook on Artificial General Intelligence (AGI) and three propositions on A(G)I and philosophy (i.e., epistemology).

[513] 2407.15672

Computer Audition: From Task-Specific Machine Learning to Foundation Models

Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-available interaction with human users. Naturally, these promises have created substantial excitement in the audio community, and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models, and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.

[514] 2407.15673

IDA: Breaking Barriers in No-code UI Automation Through Large Language Models and Human-Centric Design

Business users dedicate significant amounts of time to repetitive tasks within enterprise digital platforms, highlighting a critical need for automation. Despite advancements in low-code tools for UI automation, their complexity remains a significant barrier to adoption among non-technical business users. However, recent advancements in large language models (LLMs) have created new opportunities to overcome this barrier by offering more powerful, yet simpler and more human-centric programming environments. This paper presents IDA (Intelligent Digital Apprentice), a novel no-code Web UI automation tool designed specifically to empower business users with no technical background. IDA incorporates human-centric design principles, including guided programming by demonstration, semantic programming model, and teacher-student learning metaphor which is tailored to the skill set of business users. By leveraging LLMs, IDA overcomes some of the key technical barriers that have traditionally limited the possibility of no-code solutions. We have developed a prototype of IDA and conducted a user study involving real world business users and enterprise applications. The promising results indicate that users could effectively utilize IDA to create automation. The qualitative feedback indicates that IDA is perceived as user-friendly and trustworthy. This study contributes to unlocking the potential of AI assistants to enhance the productivity of business users through no-code user interface automation.

[515] 2407.15675

Flow-guided Motion Prediction with Semantics and Dynamic Occupancy Grid Maps

Accurate prediction of driving scenes is essential for road safety and autonomous driving. Occupancy Grid Maps (OGMs) are commonly employed for scene prediction due to their structured spatial representation, flexibility across sensor modalities and integration of uncertainty. Recent studies have successfully combined OGMs with deep learning methods to predict the evolution of scene and learn complex behaviours. These methods, however, do not consider prediction of flow or velocity vectors in the scene. In this work, we propose a novel multi-task framework that leverages dynamic OGMs and semantic information to predict both future vehicle semantic grids and the future flow of the scene. This incorporation of semantic flow not only offers intermediate scene features but also enables the generation of warped semantic grids. Evaluation on the real-world NuScenes dataset demonstrates improved prediction capabilities and enhanced ability of the model to retain dynamic vehicles within the scene.

[516] 2407.15676

Preventing Out-of-Gas Exceptions by Typing

We continue the development of TinySol, a minimal object-oriented language based on Solidity, the standard smart-contract language used for the Ethereum platform. We first extend TinySol with exceptions and a gas mechanism, and equip it with a small-step operational semantics. Introducing the gas mechanism is fundamental for modelling real-life smart contracts in TinySol, since this is the way in which termination of Ethereum smart contracts is usually ensured. We then devise a type system for smart contracts guaranteeing that such programs never run out of gas at runtime. This is a desirable property for smart contracts, since a transaction that runs out of gas is aborted, but the price paid to run the code is not returned to the invoker.

[517] 2407.15677

Language models are robotic planners: reframing plans as goal refinement graphs

Successful application of large language models (LLMs) to robotic planning and execution may pave the way to automate numerous real-world tasks. Promising recent research has been conducted showing that the knowledge contained in LLMs can be utilized in making goal-driven decisions that are enactable in interactive, embodied environments. Nonetheless, there is a considerable drop in correctness of programs generated by LLMs. We apply goal modeling techniques from software engineering to large language models generating robotic plans. Specifically, the LLM is prompted to generate a step refinement graph for a task. The executability and correctness of the program converted from this refinement graph is then evaluated. The approach results in programs that are more correct as judged by humans in comparison to previous work.

[518] 2407.15680

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.

[519] 2407.15683

Enhancing Transferability of Targeted Adversarial Examples: A Self-Universal Perspective

Transfer-based targeted adversarial attacks against black-box deep neural networks (DNNs) have been proven to be significantly more challenging than untargeted ones. The impressive transferability of current SOTA, the generative methods, comes at the cost of requiring massive amounts of additional data and time-consuming training for each targeted label. This results in limited efficiency and flexibility, significantly hindering their deployment in practical applications. In this paper, we offer a self-universal perspective that unveils the great yet underexplored potential of input transformations in pursuing this goal. Specifically, transformations universalize gradient-based attacks with intrinsic but overlooked semantics inherent within individual images, exhibiting similar scalability and comparable results to time-consuming learning over massive additional data from diverse classes. We also contribute a surprising empirical insight that one of the most fundamental transformations, simple image scaling, is highly effective, scalable, sufficient, and necessary in enhancing targeted transferability. We further augment simple scaling with orthogonal transformations and block-wise applicability, resulting in the Simple, faSt, Self-universal yet Strong Scale Transformation (S$^4$ST) for self-universal TTA. On the ImageNet-Compatible benchmark dataset, our method achieves a 19.8% improvement in the average targeted transfer success rate against various challenging victim models over existing SOTA transformation methods while only consuming 36% time for attacking. It also outperforms resource-intensive attacks by a large margin in various challenging settings.

[520] 2407.15685

The Atlas of AI Incidents in Mobile Computing: Visualizing the Risks and Benefits of AI Gone Mobile

Today's visualization tools for conveying the risks and benefits of AI technologies are largely tailored for those with technical expertise. To bridge this gap, we have developed a visualization that employs narrative patterns and interactive elements, enabling the broader public to gradually grasp the diverse risks and benefits associated with AI. Using a dataset of 54 real-world incidents involving AI in mobile computing, we examined design choices that enhance public understanding and provoke reflection on how certain AI applications - even those deemed low-risk by law - can still lead to significant incidents. Visualization:

[521] 2407.15686

Differentiable Convex Polyhedra Optimization from Multi-view Images

This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differentiable computation of hyperplane intersection through duality transform with differentiable optimization for vertex positioning with three-plane intersection, enabling gradient-based optimization without the need for 3D implicit fields. This allows for efficient shape representation across a range of applications, from shape parsing to compact mesh reconstruction. This work not only overcomes the challenges of previous approaches but also sets a new standard for representing shapes with convex polyhedra.

[522] 2407.15688

AI-Driven Fast and Early Detection of IoT Botnet Threats: A Comprehensive Network Traffic Analysis Approach

In the rapidly evolving landscape of cyber threats targeting the Internet of Things (IoT) ecosystem, and in light of the surge in botnet-driven Distributed Denial of Service (DDoS) and brute force attacks, this study focuses on the early detection of IoT bots. It specifically addresses the detection of stealth bot communication that precedes and orchestrates attacks. This study proposes a comprehensive methodology for analyzing IoT network traffic, including considerations for both unidirectional and bidirectional flow, as well as packet formats. It explores a wide spectrum of network features critical for representing network traffic and characterizing benign IoT traffic patterns effectively. Moreover, it delves into the modeling of traffic using various semi-supervised learning techniques. Through extensive experimentation with the IoT-23 dataset - a comprehensive collection featuring diverse botnet types and traffic scenarios - we have demonstrated the feasibility of detecting botnet traffic corresponding to different operations and types of bots, specifically focusing on stealth command and control (C2) communications. The results obtained have demonstrated the feasibility of identifying C2 communication with a 100% success rate through packet-based methods and 94% via flow based approaches, with a false positive rate of 1.53%.

[523] 2407.15694

Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)

The widespread adoption of large language models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web, equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for an indic language Hindi. Our major contributions are in four folds: i) examined 26 LLMs to evaluate their proficiency in generating Hindi text, ii) introducing the AI-generated news article in Hindi ($AG_{hi}$) dataset, iii) evaluated the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, iv) proposed Hindi AI Detectability Index ($ADI_{hi}$) which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi. We will make the codes and datasets available to encourage further research.

[524] 2407.15695

Supporting the Digital Autonomy of Elders Through LLM Assistance

The internet offers tremendous access to services, social connections, and needed products. However, to those without sufficient experience, engaging with businesses and friends across the internet can be daunting due to the ever present danger of scammers and thieves, to say nothing of the myriad of potential computer viruses. Like a forest rich with both edible and poisonous plants, those familiar with the norms inhabit it safely with ease while newcomers need a guide. However, reliance on a human digital guide can be taxing and often impractical. We propose and pilot a simple but unexplored idea: could an LLM provide the necessary support to help the elderly who are separated by the digital divide safely achieve digital autonomy?

[525] 2407.15700

A Life-long Learning Intrusion Detection System for 6G-Enabled IoV

The introduction of 6G technology into the Internet of Vehicles (IoV) promises to revolutionize connectivity with ultra-high data rates and seamless network coverage. However, this technological leap also brings significant challenges, particularly for the dynamic and diverse IoV landscape, which must meet the rigorous reliability and security requirements of 6G networks. Furthermore, integrating 6G will likely increase the IoV's susceptibility to a spectrum of emerging cyber threats. Therefore, it is crucial for security mechanisms to dynamically adapt and learn new attack patterns, keeping pace with the rapid evolution and diversification of these threats - a capability currently lacking in existing systems. This paper presents a novel intrusion detection system leveraging the paradigm of life-long (or continual) learning. Our methodology combines class-incremental learning with federated learning, an approach ideally suited to the distributed nature of the IoV. This strategy effectively harnesses the collective intelligence of Connected and Automated Vehicles (CAVs) and edge computing capabilities to train the detection system. To the best of our knowledge, this study is the first to synergize class-incremental learning with federated learning specifically for cyber attack detection. Through comprehensive experiments on a recent network traffic dataset, our system has exhibited a robust adaptability in learning new cyber attack patterns, while effectively retaining knowledge of previously encountered ones. Additionally, it has proven to maintain high accuracy and a low false positive rate.

[526] 2407.15701

Robotic Shepherding in Cluttered and Unknown Environments using Control Barrier Functions

This paper introduces a novel control methodology designed to guide a collective of robotic-sheep in a cluttered and unknown environment using robotic-dogs. The dog-agents continuously scan the environment and compute a safe trajectory to guide the sheep to their final destination. The proposed optimization-based controller guarantees that the sheep reside within a desired distance from the reference trajectory through the use of Control Barrier Functions (CBF). Additional CBF constraints are employed simultaneously to ensure inter-agent and obstacle collision avoidance. The efficacy of the proposed approach is rigorously tested in simulation, which demonstrates the successful herding of the robotic-sheep within complex and cluttered environments.

[527] 2407.15703

Estimating Probability Densities with Transformer and Denoising Diffusion

Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining full probabilistic outputs is crucial to many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows conditioning the output probability density on arbitrary combinations of inputs and it is thus a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.

[528] 2407.15706

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at:

[529] 2407.15707

Predicting the Best of N Visual Trackers

We observe that the performance of SOTA visual trackers surprisingly strongly varies across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects a predicted best performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker which keeps predicting best performer after regular temporal intervals. The TP2N is based on self-supervised learning architectures MocoV2, SwAv, BT, and DINO; experiments show that the DINO with ViT-S as a backbone performs the best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks - LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker effectively handling variations in the tracking scenarios within long sequences. For instance, on GOT-10k, BofN meta-tracker average overlap is 88.7% and 91.1% with video and frame-level settings respectively. The best performing tracker, RTS, achieves 85.20% AO. On VOT2022, BofN expected average overlap is 67.88% and 70.98% with video and frame level settings, compared to the best performing ARTrack, 64.12%. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available on

[530] 2407.15708

SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams

The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still need to be improved in the utilization of the rich temporal detail and the restoration of the details of the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.

[531] 2407.15711

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.

[532] 2407.15714

Mamba meets crack segmentation

Cracks pose safety risks to infrastructure and cannot be overlooked. The prevailing structures in existing crack segmentation networks predominantly consist of CNNs or Transformers. However, CNNs exhibit a deficiency in global modeling capability, hindering the representation to entire crack features. Transformers can capture long-range dependencies but suffer from high and quadratic complexity. Recently, Mamba has garnered extensive attention due to its linear spatial and computational complexity and its powerful global perception. This study explores the representation capabilities of Mamba to crack features. Specifically, this paper uncovers the connection between Mamba and the attention mechanism, providing a profound insight, an attention perspective, into interpreting Mamba and devising a novel Mamba module following the principles of attention blocks, namely CrackMamba. We compare CrackMamba with the most prominent visual Mamba modules, Vim and Vmamba, on two datasets comprising asphalt pavement and concrete pavement cracks, and steel cracks, respectively. The quantitative results show that CrackMamba stands out as the sole Mamba block consistently enhancing the baseline model's performance across all evaluation measures, while reducing its parameters and computational costs. Moreover, this paper substantiates that Mamba can achieve global receptive fields through both theoretical analysis and visual interpretability. The discoveries of this study offer a dual contribution. First, as a plug-and-play and simple yet effective Mamba module, CrackMamba exhibits immense potential for integration into various crack segmentation models. Second, the proposed innovative Mamba design concept, integrating Mamba with the attention mechanism, holds significant reference value for all Mamba-based computer vision models, not limited to crack segmentation networks, as investigated in this study.

[533] 2407.15715

Cryptoeconomics and Tokenomics as Economics: A Survey with Opinions

This paper surveys products and studies on cryptoeconomics and tokenomics from an economic perspective, as these terms are still (i) ill-defined and (ii) disconnected from economic disciplines. We first suggest that they can be novel when integrated; we then conduct a literature review and case study following consensus-building for decentralization and token value for autonomy. Integration requires simultaneous consideration of strategic behavior, spamming, Sybil attacks, free-riding, marginal cost, marginal utility and stabilizers. This survey is the first systematization of knowledge on cryptoeconomics and tokenomics, aiming to bridge the contexts of economics and blockchain.

[534] 2407.15716

CrashEventLLM: Predicting System Crashes with Large Language Models

As the dependence on computer systems expands across various domains, focusing on personal, industrial, and large-scale applications, there arises a compelling need to enhance their reliability to sustain business operations seamlessly and ensure optimal user satisfaction. System logs generated by these devices serve as valuable repositories of historical trends and past failures. The use of machine learning techniques for failure prediction has become commonplace, enabling the extraction of insights from past data to anticipate future behavior patterns. Recently, large language models have demonstrated remarkable capabilities in tasks including summarization, reasoning, and event prediction. Therefore, in this paper, we endeavor to investigate the potential of large language models in predicting system failures, leveraging insights learned from past failure behavior to inform reasoning and decision-making processes effectively. Our approach involves leveraging data from the Intel Computing Improvement Program (ICIP) system crash logs to identify significant events and develop CrashEventLLM. This model, built upon a large language model framework, serves as our foundation for crash event prediction. Specifically, our model utilizes historical data to forecast future crash events, informed by expert annotations. Additionally, it goes beyond mere prediction, offering insights into potential causes for each crash event. This work provides the preliminary insights into prompt-based large language models for the log-based event prediction task.

[535] 2407.15717

Harmonizing Flows: Leveraging normalizing flows for unsupervised and source-free MRI harmonization

Lack of standardization and various intrinsic parameters for magnetic resonance (MR) image acquisition results in heterogeneous images across different sites and devices, which adversely affects the generalization of deep neural networks. To alleviate this issue, this work proposes a novel unsupervised harmonization framework that leverages normalizing flows to align MR images, thereby emulating the distribution of a source domain. The proposed strategy comprises three key steps. Initially, a normalizing flow network is trained to capture the distribution characteristics of the source domain. Then, we train a shallow harmonizer network to reconstruct images from the source domain via their augmented counterparts. Finally, during inference, the harmonizer network is updated to ensure that the output images conform to the learned source domain distribution, as modeled by the normalizing flow network. Our approach, which is unsupervised, source-free, and task-agnostic is assessed in the context of both adults and neonatal cross-domain brain MRI segmentation, as well as neonatal brain age estimation, demonstrating its generalizability across tasks and population demographics. The results underscore its superior performance compared to existing methodologies. The code is available at

[536] 2407.15718

Integrating AI Tutors in a Programming Course

RAGMan is an LLM-powered tutoring system that can support a variety of course-specific and homework-specific AI tutors. RAGMan leverages Retrieval Augmented Generation (RAG), as well as strict instructions, to ensure the alignment of the AI tutors' responses. By using RAGMan's AI tutors, students receive assistance with their specific homework assignments without directly obtaining solutions, while also having the ability to ask general programming-related questions. RAGMan was deployed as an optional resource in an introductory programming course with an enrollment of 455 students. It was configured as a set of five homework-specific AI tutors. This paper describes the interactions the students had with the AI tutors, the students' feedback, and a comparative grade analysis. Overall, about half of the students engaged with the AI tutors, and the vast majority of the interactions were legitimate homework questions. When students posed questions within the intended scope, the AI tutors delivered accurate responses 98% of the time. Within the students used AI tutors, 78% reported that the tutors helped their learning. Beyond AI tutors' ability to provide valuable suggestions, students reported appreciating them for fostering a safe learning environment free from judgment.

[537] 2407.15719

GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI