New articles on Computer Science

[1] 2312.00003

Transport Equation based Physics Informed Neural Network to predict the Yield Strength of Architected Materials

In this research, the application of the Physics-Informed Neural Network (PINN) model is explored to solve transport equation-based Partial Differential Equations (PDEs). The primary objective is to analyze the impact of different activation functions incorporated within the PINN model on its predictive performance, specifically assessing the Mean Squared Error (MSE) and Mean Absolute Error (MAE). The dataset used in the study consists of a varied set of input parameters related to strut diameter, unit cell size, and the corresponding yield stress values. Through this investigation the aim is to understand the effectiveness of the PINN model and the significance of choosing appropriate activation functions for solving complex PDEs in real-world applications. The outcomes suggest that the choice of activation function may have minimal influence on the model's predictive accuracy for this particular problem. The PINN model showcases exceptional generalization capabilities, indicating its capacity to avoid overfitting with the provided dataset. The research underscores the importance of striking a balance between performance and computational efficiency while selecting an activation function for specific real-world applications. These valuable findings contribute to advancing the understanding and potential adoption of PINN as an effective tool for solving challenging PDEs in diverse scientific and engineering domains.

[2] 2312.00005

NumCalc: An open source BEM code for solving acoustic scattering problems

The calculation of the acoustic field in or around objects is an important task in acoustic engineering. To numerically solve this task, the boundary element method (BEM) is a commonly used method especially for infinite domains. The open source tool Mesh2HRTF and its BEM core NumCalc provide users with a collection of free software for acoustic simulations without the need of having an in-depth knowledge into numerical methods. However, we feel that users should have a basic understanding with respect to the methods behind the software they are using. We are convinced that this basic understanding helps in avoiding common mistakes and also helps to understand the requirements to use the software. To provide this background is the first motivation for this paper. A second motivation for this paper is to demonstrate the accuracy of NumCalc when solving benchmark problems. Thus, users can get an idea about the accuracy they can expect when using NumCalc as well as the memory and CPU requirements of NumCalc. A third motivation for this paper is to give users detailed information about some parts of the actual implementation that are usually not mentioned in literature, e.g., the specific version of the fast multipole method and its clustering process or how to use frequency-dependent admittance boundary conditions.

[3] 2312.00006

Enhancing ML-Based DoS Attack Detection Through Combinatorial Fusion Analysis

Mitigating Denial-of-Service (DoS) attacks is vital for online service security and availability. While machine learning (ML) models are used for DoS attack detection, new strategies are needed to enhance their performance. We suggest an innovative method, combinatorial fusion, which combines multiple ML models using advanced algorithms. This includes score and rank combinations, weighted techniques, and diversity strength of scoring systems. Through rigorous evaluations, we demonstrate the effectiveness of this fusion approach, considering metrics like precision, recall, and F1-score. We address the challenge of low-profiled attack classification by fusing models to create a comprehensive solution. Our findings emphasize the potential of this approach to improve DoS attack detection and contribute to stronger defense mechanisms.

[4] 2312.00007

Space-Time Decomposition of Kalman Filter

We present an innovative interpretation of Kalman Filter (KF, for short) combining the ideas of Schwarz Domain Decomposition (DD) and Parallel in Time (PinT) approaches. Thereafter we call it DD-KF. In contrast to standard DD approaches which are already incorporated in KF and other state estimation models, implementing a straightforward data parallelism inside the loop over time, DD-KF ab-initio partitions the whole model, including filter equations and dynamic model along both space and time directions/steps. As a consequence, we get local KFs reproducing the original filter at smaller dimensions on local domains. Also, sub problems could be solved in parallel. In order to enforce the matching of local solutions on overlapping regions, and then to achieve the same global solution of KF, local KFs are slightly modified by adding a correction term keeping track of contributions of adjacent subdomains to overlapping regions. Such a correction term balances localization errors along overlapping regions, acting as a regularization constraint on local solutions. Furthermore, such a localization excludes remote observations from each analyzed location improving the conditioning of the error covariance matrices. As dynamic model we consider Shallow Water equations which can be regarded a consistent tool to get a proof of concept of the reliability assessment of DD-KF in monitoring and forecasting of weather systems and ocean currents

[5] 2312.00009

Risk-Aware and Explainable Framework for Ensuring Guaranteed Coverage in Evolving Hardware Trojan Detection

As the semiconductor industry has shifted to a fabless paradigm, the risk of hardware Trojans being inserted at various stages of production has also increased. Recently, there has been a growing trend toward the use of machine learning solutions to detect hardware Trojans more effectively, with a focus on the accuracy of the model as an evaluation metric. However, in a high-risk and sensitive domain, we cannot accept even a small misclassification. Additionally, it is unrealistic to expect an ideal model, especially when Trojans evolve over time. Therefore, we need metrics to assess the trustworthiness of detected Trojans and a mechanism to simulate unseen ones. In this paper, we generate evolving hardware Trojans using our proposed novel conformalized generative adversarial networks and offer an efficient approach to detecting them based on a non-invasive algorithm-agnostic statistical inference framework that leverages the Mondrian conformal predictor. The method acts like a wrapper over any of the machine learning models and produces set predictions along with uncertainty quantification for each new detected Trojan for more robust decision-making. In the case of a NULL set, a novel method to reject the decision by providing a calibrated explainability is discussed. The proposed approach has been validated on both synthetic and real chip-level benchmarks and proven to pave the way for researchers looking to find informed machine learning solutions to hardware security problems.

[6] 2312.00010

Efficient calculation of the integral equation for simulating 2D TE scattering in a homogeneous medium using the Ewald method and a Gabor frame discretization

We utilize the domain integral equation formulation to simulate two-dimensional transverse electric scattering in a homogeneous medium and a summation of modulated Gaussian functions to approximate the dual Gabor window. Then we apply Ewald Green function transformation to separate the integrals related to x and z in the integral equation, which produce Gaussian functions. These Gaussian functions in the integrands can be integrated analytically, which greatly simplifies the calculation process. Finally, we discuss the convergence and the selection of the Ewald splitting parameter E.

[7] 2312.00011

The Bivariate Normal Integral via Owen's T Function as a Modified Euler's Arctangent Series

The Owen's T function is presented in four new ways, one of them as a series similar to the Euler's arctangent series divided by $2\pi$, which is its majorant series. All possibilities enable numerically stable and fast convergent computation of the bivariate normal integral with simple recursion. When tested $\Phi_\varrho^2(x,y)$ computation on a random sample of one million parameter triplets with uniformly distributed components and using double precision arithmetic, the maximum absolute error was $3.45\cdot 10^{-16}$. In additional testing, focusing on cases with correlation coefficients close to one in absolute value, when the computation may be very sensitive to small rounding errors, the accuracy was retained. In rare potentially critical cases, a simple adjustment to the computation procedure was performed - one potentially critical computation was replaced with two equivalent non-critical ones. All new series are suitable for vector and high-precision computation, assuming they are supplemented with appropriate efficient and accurate computation of the arctangent and standard normal cumulative distribution functions. They are implemented by the R package Phi2rho, available on CRAN. Its functions allow vector arguments and are ready to work with the Rmpfr package, which enables the use of arbitrary precision instead of double precision numbers. A special test with up to 1024-bit precision computation is also presented.

[8] 2312.00012

Multi-fidelity uncertainty quantification for homogenization problems in structure-property relationships from crystal plasticity finite elements

Crystal plasticity finite element method (CPFEM) has been an integrated computational materials engineering (ICME) workhorse to study materials behaviors and structure-property relationships for the last few decades. These relations are mappings from the microstructure space to the materials properties space. Due to the stochastic and random nature of microstructures, there is always some uncertainty associated with materials properties, for example, in homogenized stress-strain curves. For critical applications with strong reliability needs, it is often desirable to quantify the microstructure-induced uncertainty in the context of structure-property relationships. However, this uncertainty quantification (UQ) problem often incurs a large computational cost because many statistically equivalent representative volume elements (SERVEs) are needed. In this paper, we apply a multi-level Monte Carlo (MLMC) method to CPFEM to study the uncertainty in stress-strain curves, given an ensemble of SERVEs at multiple mesh resolutions. By using the information at coarse meshes, we show that it is possible to approximate the response at fine meshes with a much reduced computational cost. We focus on problems where the model output is multi-dimensional, which requires us to track multiple quantities of interest (QoIs) at the same time. Our numerical results show that MLMC can accelerate UQ tasks around 2.23x, compared to the classical Monte Carlo (MC) method, which is widely known as the ensemble average in the CPFEM literature.

[9] 2312.00013

Biometric Technologies and the Law: Developing a Taxonomy for Guiding Policymakers

Despite the increasing adoption of biometric technologies, their regulation has not kept up with the same pace, particularly with regard to safeguarding individuals' privacy and personal data. Policymakers may struggle to comprehend the technology behind biometric systems and their potential impact on fundamental rights, resulting in insufficient or inadequate legal regulation. This study seeks to bridge this gap by proposing a taxonomy of biometric technologies that can aid in their effective deployment and supervision. Through a literature review, the technical characteristics of biometric systems were identified and categorised. The resulting taxonomy can enhance the understanding of biometric technologies and facilitate the development of regulation that prioritises privacy and personal data protection.

[10] 2312.00014

A class of fractional differential equations via power non-local and non-singular kernels: existence, uniqueness and numerical approximations

We prove a useful formula and new properties for the recently introduced power fractional calculus with non-local and non-singular kernels. In particular, we prove a new version of Gronwall's inequality involving the power fractional integral; and we establish existence and uniqueness results for nonlinear power fractional differential equations using fixed point techniques. Moreover, based on Lagrange polynomial interpolation, we develop a new explicit numerical method in order to approximate the solutions of a rich class of fractional differential equations. The approximation error of the proposed numerical scheme is analyzed. For illustrative purposes, we apply our method to a fractional differential equation for which the exact solution is computed, as well as to a nonlinear problem for which no exact solution is known. The numerical simulations show that the proposed method is very efficient, highly accurate and converges quickly.

[11] 2312.00016

Preserving The Safety And Confidentiality Of Data Mining Information In Health Care: A literature review

Daily, massive volume of data are produced due to the internet of things' rapid development, which has now permeated the healthcare industry. Recent advances in data mining have spawned a new field of a study dubbed privacy-preserving data mining (PPDM). PPDM technique or approach enables the extraction of actionable insight from enormous volume of data while safeguarding the privacy of individual information and benefiting the entire society Medical research has taken a new course as a result of data mining with healthcare data to detect diseases earlier and improve patient care. Data integration necessitates the sharing of sensitive patient information. However, substantial privacy issues are raised in connection with the storage and transmission of potentially sensitive information. Disclosing sensitive information infringes on patients' privacy. This paper aims to conduct a review of related work on privacy-preserving mechanisms, data protection regulations, and mitigating tactics. The review concluded that no single strategy outperforms all others. Hence, future research should focus on adequate techniques for privacy solutions in the age of massive medical data and the standardization of evaluation standards.

[12] 2312.00017

A constructive approach for investigating the stability of incommensurate fractional differential systems

This paper is devoted to studying the asymptotic behaviour of solutions to generalized non-commensurate fractional systems. To this end, we first consider fractional systems with rational orders and introduce a criterion that is necessary and sufficient to ensure the stability of such systems. Next, from the fractional-order pseudospectrum definition proposed by \v{S}anca et al., we formulate the concept of a rational approximation for the fractional spectrum of a noncommensurate fractional systems with general, not necessarily rational, orders. Our first important new contribution is to show the equivalence between the fractional spectrum of a noncommensurate linear system and its rational approximation. With this result in hand, we use ideas developed in our earlier work to demonstrate the stability of an equilibrium point to nonlinear systems in arbitrary finite-dimensional spaces. A second novel aspect of our work is the fact that the approach is constructive. Finally, we give numerical simulations to illustrate the merit of the proposed theoretical results.

[13] 2312.00018

Security Challenges in Autonomous Systems Design

Autonomous systems are emerging in many application domains. With the recent advancements in artificial intelligence and machine learning, sensor technology, perception algorithms and robotics, scenarios previously requiring strong human involvement can be handled by autonomous systems. With the independence from human control, cybersecurity of such systems becomes even more critical as no human intervention in case of undesired behavior is possible. In this context, this paper discusses emerging security challenges in autonomous systems design which arise in many domains such as autonomous incident response, risk assessment, data availability, systems interaction, trustworthiness, updatability, access control, as well as the reliability and explainability of machine learning methods. In all these areas, this paper thoroughly discusses the state of the art, identifies emerging security challenges and proposes research directions to address these challenges for developing secure autonomous systems.

[14] 2312.00019

The theoretical limits of biometry

Biometry has proved its capability in terms of recognition accuracy. Now, it is widely used for automated border control with the biometric passport, to unlock a smartphone or a computer with a fingerprint or a face recognition algorithm. While identity verification is widely democratized, pure identification with no additional clues is still a work in progress. The identification difficulty depends on the population size, as the larger the group is, the larger the confusion risk. For collision prevention, biometric traits must be sufficiently distinguishable to scale to considerable groups, and algorithms should be able to capture their differences accurately. Most biometric works are purely experimental, and it is impossible to extrapolate the results to a smaller or a larger group. In this work, we propose a theoretical analysis of the distinguishability problem, which governs the error rates of biometric systems. We demonstrate simple relationships between the population size and the number of independent bits necessary to prevent collision in the presence of noise. This work provides the lowest lower bound for memory requirements. The results are very encouraging, as the biometry of the whole Earth population can fit in a regular disk, leaving some space for noise and redundancy.

[15] 2312.00020

A new numerical technique based on Chelyshkov polynomials for solving two-dimensional stochastic Itô-Volterra Fredholm integral equation

In this paper, a two-dimensional operational matrix method based on Chelyshkov polynomials is implemented to numerically solve the two-dimensional stochastic It\^o-Volterra Fredholm integral equations. These equations arise in several problems such as an exponential population growth model with several independent white noise sources. In this paper a new stochastic operational matrix has been derived first time ever by using Chelyshkov polynomials. After that, the operational matrices are used to transform the It\^o-Volterra Fredholm integral equation into a system of linear algebraic equations by using Newton cotes nodes as collocation point that can be easily solved. Furthermore, the convergence and error bound of the suggested method are well established. In order to illustrate the effectiveness, plausibility, reliability, and applicability of the existing technique, two typical examples have been presented.

[16] 2312.00021

Technical Report relating to CVE-2022-46480, CVE-2023-26941, CVE-2023-26942, and CVE-2023-26943

The following technical report provides background information relating to four CVEs found in the following products: Ultraloq UL3 BT (CVE-2022-46480); Yale Conexis L1 Smart Lock (CVE-2023-26941); Yale IA-210 Intruder Alarm (CVE-2023-26942); Yale Keyless Smart Lock (CVE-2023-26943). The work discussed here was carried out by Ash Allen, Dr. Alexios Mylonas, and Dr. Stilianos Vidalis as part of a wider research project into smart device security. Responsible disclosure of all four issues has been made with the appropriate vendors, and they have been acknowledged as vulnerabilities.

[17] 2312.00022

A finite element method for stochastic diffusion equations using fluctuating hydrodynamics

We present a finite element approach for diffusion problems with thermal fluctuations based on a fluctuating hydrodynamics model. The governing transport equations are stochastic partial differential equations with a fluctuating forcing term. We propose a discrete formulation of the stochastic forcing term that has the correct covariance matrix up to a standard discretization error. Furthermore, to obtain a numerical solution with spatial correlations that converge to those of the continuum equation, we derive a linear mapping to transform the finite element solution into an equivalent discrete solution that is free from the artificial correlations introduced by the spatial discretization. The method is validated by applying it to two diffusion problems: a second-order diffusion equation and a fourth-order diffusion equation. The theoretical (continuum) solution to the first case presents spatially decorrelated fluctuations, while the second case presents fluctuations correlated over a finite length. In both cases, the numerical solution presents a structure factor that approximates well the continuum one.

[18] 2312.00023

Hypergraph Topological Features for Autoencoder-Based Intrusion Detection for Cybersecurity Data

In this position paper, we argue that when hypergraphs are used to capture multi-way local relations of data, their resulting topological features describe global behaviour. Consequently, these features capture complex correlations that can then serve as high fidelity inputs to autoencoder-driven anomaly detection pipelines. We propose two such potential pipelines for cybersecurity data, one that uses an autoencoder directly to determine network intrusions, and one that de-noises input data for a persistent homology system, PHANTOM. We provide heuristic justification for the use of the methods described therein for an intrusion detection pipeline for cyber data. We conclude by showing a small example over synthetic cyber attack data.

[19] 2312.00024

Can LLMs Patch Security Issues?

Large Language Models (LLMs) have shown impressive proficiency in code generation. Nonetheless, similar to human developers, these models might generate code that contains security vulnerabilities and flaws. Writing secure code remains a substantial challenge, as vulnerabilities often arise during interactions between programs and external systems or services, such as databases and operating systems. In this paper, we propose a novel approach, Feedback-Driven Solution Synthesis (FDSS), designed to explore the use of LLMs in receiving feedback from Bandit, which is a static code analysis tool, and then the LLMs generate potential solutions to resolve security vulnerabilities. Each solution, along with the vulnerable code, is then sent back to the LLM for code refinement. Our approach shows a significant improvement over the baseline and outperforms existing approaches. Furthermore, we introduce a new dataset, PythonSecurityEval, collected from real-world scenarios on Stack Overflow to evaluate the LLMs' ability to generate secure code. Code and data are available at \url{}

[20] 2312.00025

Secure Transformer Inference

We present a three-party protocol that can protect both Transformer parameters and user data during the inference phase. For each feedforward inference process, our protocol only introduces permutation computation of input and output data on the user side. Our protocol, Secure Transformer Inference Protocol (STIP), can be applied to real-world services like ChatGPT.

[21] 2312.00026

A Quality-of-Service Compliance System using Federated Learning and Optimistic Rollups

Edge computing brings a new paradigm in which the sharing of computing, storage, and bandwidth resources as close as possible to the mobile devices or sensors generating a large amount of data. A parallel trend is the rise of phones and tablets as primary computing devices for many people. The powerful sensors present on these devices combined with the fact that they are mobile, mean they have access to data of an unprecedentedly diverse and private nature. Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications, but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location. To address the data privacy required for some data in these devices we propose the use of Federated Learning (FL) so that specific data about services performed by clients do not leave the source machines. Instead of sharing data, users collaboratively train a model by only sending weight updates to a server. However, the naive use of FL in those scenarios exposes it to a risk of corruption, whether intentional or not, during the training phase. To improve the security of the FL structure, we propose a decentralized Blockchain-based FL in an edge computing scenario. We also apply blockchain to create a reward mechanism in FL to enable incentive strategy for trainers.

[22] 2312.00027

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While being effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness, after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release/use. (2) non-persistence, the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding on the relationship between the backdoor persistence and the activation pattern and further provide guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.

[23] 2312.00028

Constructive Representation of Functions in $N$-Dimensional Sobolev Space

We propose a new representation of functions in Sobolev spaces on an $N$-dimensional hyper-rectangle, expressing such functions in terms of their admissible derivatives, evaluated along lower-boundaries of the domain. These boundary values are either finite-dimensional or exist in the space $L_2$ of square-integrable functions -- free of the continuity constraints inherent to Sobolev space. Moreover, we show that the map from this space of boundary values to the Sobolev space is given by an integral operator with polynomial kernel, and we prove that this map is invertible. Using this result, we propose a method for polynomial approximation of functions in Sobolev space, reconstructing such an approximation from polynomial projections of the boundary values. We prove that this approximation is optimal with respect to a discrete-continuous Sobolev norm, and show through numerical examples that it exhibits better convergence behavior than direct projection of the function. Finally, we show that this approach may also be adapted to use a basis of step functions, to construct accurate piecewise polynomial approximations that do not suffer from e.g. Gibbs phenomenon.

[24] 2312.00029

Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Modern Large language models (LLMs) can still generate responses that may not be aligned with human expectations or values. While many weight-based alignment methods have been proposed, many of them still leave models vulnerable to attacks when used on their own. To help mitigate this issue, we introduce Bergeron, a framework designed to improve the robustness of LLMs against adversarial attacks. Bergeron employs a two-tiered architecture. Here, a secondary LLM serves as a simulated conscience that safeguards a primary LLM. We do this by monitoring for and correcting potentially harmful text within both the prompt inputs and the generated outputs of the primary LLM. Empirical evaluation shows that Bergeron can improve the alignment and robustness of several popular LLMs without costly fine-tuning. It aids both open-source and black-box LLMs by complementing and reinforcing their existing alignment training.

[25] 2312.00030

Artificial Intelligence in Sustainable Vertical Farming

As global challenges of population growth, climate change, and resource scarcity intensify, the agricultural landscape is at a critical juncture. Sustainable vertical farming emerges as a transformative solution to address these challenges by maximizing crop yields in controlled environments. This paradigm shift necessitates the integration of cutting-edge technologies, with Artificial Intelligence (AI) at the forefront. The paper provides a comprehensive exploration of the role of AI in sustainable vertical farming, investigating its potential, challenges, and opportunities. The review synthesizes the current state of AI applications, encompassing machine learning, computer vision, the Internet of Things (IoT), and robotics, in optimizing resource usage, automating tasks, and enhancing decision-making. It identifies gaps in research, emphasizing the need for optimized AI models, interdisciplinary collaboration, and the development of explainable AI in agriculture. The implications extend beyond efficiency gains, considering economic viability, reduced environmental impact, and increased food security. The paper concludes by offering insights for stakeholders and suggesting avenues for future research, aiming to guide the integration of AI technologies in sustainable vertical farming for a resilient and sustainable future in agriculture.

[26] 2312.00031

Crypto analysis of the key distribution scheme using noise-free resistances

Known key exchange schemes offering information-theoretic (unconditional) security are complex and costly to implement. Nonetheless, they remain the only known methods for achieving unconditional security in key exchange. Therefore, the explorations for simpler solutions for information-theoretic security are highly justified. Lin et al. [1] proposed an interesting hardware key distribution scheme that utilizes thermal-noise-free resistances and DC voltages. A crypto analysis of this system is presented. It is shown that, if Eve gains access to the initial shared secret at any time in the past or future, she can successfully crack all the generated keys in the past and future, even retroactively, using passively obtained and recorded voltages and currents. Therefore, the scheme is not a secure key exchanger, but it is rather a key expander with no more information entropy than the originally shared secret at the beginning. We also point out that the proposed defense methods against active attacks do not function when the original shared secret is compromised because then the communication cannot be efficiently authenticated. However, they do work when an unconditionally secure key exchanger is applied to enable the authenticated communication protocol.

[27] 2312.00032

Revolutionizing Forensic Toolmark Analysis: An Objective and Transparent Comparison Algorithm

Forensic toolmark comparisons are currently performed subjectively by humans, which leads to a lack of consistency and accuracy. There is little evidence that examiners can determine whether pairs of marks were made by the same tool or different tools. There is also little evidence that they can make this classification when marks are made under different conditions, such as different angles of attack or direction of mark generation. We generate original toolmark data in 3D, extract the signal from each toolmarks, and train an algorithm to compare toolmark signals objectively. We find that toolmark signals cluster by tool, and not by angle or direction. That is, the variability within tool, regardless of angle/direction, is smaller than the variability between tools. The known-match and known-non-match densities of the similarities of pairs of marks have a small overlap, even when accounting for dependencies in the data, making them a useful instrument for determining whether a new pair of marks was made by the same tool. We provide a likelihood ratio approach as a formal method for comparing toolmark signals with a measure of uncertainty. This empirically trained, open-source method can be used by forensic examiners to compare toolmarks objectively and thus improve the reliability of toolmark comparisons. This can, in turn, reduce miscarriages of justice in the criminal justice system.

[28] 2312.00033

DeFi Security: Turning The Weakest Link Into The Strongest Attraction

The primary innovation we pioneer -- focused on blockchain information security -- is called the Safe-House. The Safe-House is badly needed since there are many ongoing hacks and security concerns in the DeFi space right now. The Safe-House is a piece of engineering sophistication that utilizes existing blockchain principles to bring about greater security when customer assets are moved around. The Safe-House logic is easily implemented as smart contracts on any decentralized system. The amount of funds at risk from both internal and external parties -- and hence the maximum one time loss -- is guaranteed to stay within the specified limits based on cryptographic fundamentals. To improve the safety of the Safe-House even further, we adapt the one time password (OPT) concept to operate using blockchain technology. Well suited to blockchain cryptographic nuances, our secondary advancement can be termed the one time next time password (OTNTP) mechanism. The OTNTP is designed to complement the Safe-House making it even more safe. We provide a detailed threat assessment model -- discussing the risks faced by DeFi protocols and the specific risks that apply to blockchain fund management -- and give technical arguments regarding how these threats can be overcome in a robust manner. We discuss how the Safe-House can participate with other external yield generation protocols in a secure way. We provide reasons for why the Safe-House increases safety without sacrificing the efficiency of operation. We start with a high level intuitive description of the landscape, the corresponding problems and our solutions. We then supplement this overview with detailed discussions including the corresponding mathematical formulations and pointers for technological implementation. This approach ensures that the article is accessible to a broad audience.

[29] 2312.00034

Enhancing IoT Security via Automatic Network Traffic Analysis: The Transition from Machine Learning to Deep Learning

This work provides a comparative analysis illustrating how Deep Learning (DL) surpasses Machine Learning (ML) in addressing tasks within Internet of Things (IoT), such as attack classification and device-type identification. Our approach involves training and evaluating a DL model using a range of diverse IoT-related datasets, allowing us to gain valuable insights into how adaptable and practical these models can be when confronted with various IoT configurations. We initially convert the unstructured network traffic data from IoT networks, stored in PCAP files, into images by processing the packet data. This conversion process adapts the data to meet the criteria of DL classification methods. The experiments showcase the ability of DL to surpass the constraints tied to manually engineered features, achieving superior results in attack detection and maintaining comparable outcomes in device-type identification. Additionally, a notable feature extraction time difference becomes evident in the experiments: traditional methods require around 29 milliseconds per data packet, while DL accomplishes the same task in just 2.9 milliseconds. The significant time gap, DL's superior performance, and the recognized limitations of manually engineered features, presents a compelling call to action within the IoT community. This encourages us to shift from exploring new IoT features for each dataset to addressing the challenges of integrating DL into IoT, making it a more efficient solution for real-world IoT scenarios.

[30] 2312.00035

FBChain: A Blockchain-based Federated Learning Model with Efficiency and Secure Communication

Privacy and security in the parameter transmission process of federated learning are currently among the most prominent concerns. However, there are two thorny problems caused by unprotected communication methods: "parameter-leakage" and "inefficient-communication". This article proposes Blockchain-based Federated Learning (FBChain) model for federated learning parameter communication to overcome the above two problems. First, we utilize the immutability of blockchain to store the global model and hash value of local model parameters in case of tampering during the communication process, protect data privacy by encrypting parameters, and verify data consistency by comparing the hash values of local parameters, thus addressing the "parameter-leakage" problem. Second, the Proof of Weighted Link Speed (PoWLS) consensus algorithm comprehensively selects nodes with the higher weighted link speed to aggregate global model and package blocks, thereby solving the "inefficient-communication" problem. Experimental results demonstrate the effectiveness of our proposed FBChain model and its ability to improve model communication efficiency in federated learning.

[31] 2312.00036

Privacy-Preserving Load Forecasting via Personalized Model Obfuscation

The widespread adoption of smart meters provides access to detailed and localized load consumption data, suitable for training building-level load forecasting models. To mitigate privacy concerns stemming from model-induced data leakage, federated learning (FL) has been proposed. This paper addresses the performance challenges of short-term load forecasting models trained with FL on heterogeneous data, emphasizing privacy preservation through model obfuscation. Our proposed algorithm, Privacy Preserving Federated Learning (PPFL), incorporates personalization layers for localized training at each smart meter. Additionally, we employ a differentially private mechanism to safeguard against data leakage from shared layers. Simulations on the NREL ComStock dataset corroborate the effectiveness of our approach.

[32] 2312.00039

Acoustic Cybersecurity: Exploiting Voice-Activated Systems

In this study, we investigate the emerging threat of inaudible acoustic attacks targeting digital voice assistants, a critical concern given their projected prevalence to exceed the global population by 2024. Our research extends the feasibility of these attacks across various platforms like Amazon's Alexa, Android, iOS, and Cortana, revealing significant vulnerabilities in smart devices. The twelve attack vectors identified include successful manipulation of smart home devices and automotive systems, potential breaches in military communication, and challenges in critical infrastructure security. We quantitatively show that attack success rates hover around 60%, with the ability to activate devices remotely from over 100 feet away. Additionally, these attacks threaten critical infrastructure, emphasizing the need for multifaceted defensive strategies combining acoustic shielding, advanced signal processing, machine learning, and robust user authentication to mitigate these risks.

[33] 2312.00040

Presentation Attack detection using Wavelet Transform and Deep Residual Neural Net

Biometric authentication is becoming more prevalent for secured authentication systems. However, the biometric substances can be deceived by the imposters in several ways. Among other imposter attacks, print attacks, mask attacks, and replay attacks fall under the presentation attack category. The bio-metric images, especially the iris and face, are vulnerable to different presentation attacks. This research applies deep learning approaches to mitigate presentation attacks in a biometric access control system. Our contribution in this paper is two-fold: First, we applied the wavelet transform to extract the features from the biometric images. Second, we modified the deep residual neural net and applied it to the spoof datasets in an attempt to detect the presentation attacks. This research applied the proposed approach to biometric spoof datasets, namely ATVS, CASIA two class, and CASIA cropped image sets. The datasets used in this research contain images that are captured in both a controlled and uncontrolled environment along with different resolutions and sizes. We obtained the best accuracy of 93% on the ATVS Iris datasets. For CASIA two class and CASIA cropped datasets, we achieved test accuracies of 91% and 82%, respectively.

[34] 2312.00041

Presentation Attack Detection using Convolutional Neural Networks and Local Binary Patterns

The use of biometrics to authenticate users and control access to secure areas has become extremely popular in recent years, and biometric access control systems are frequently used by both governments and private corporations. However, these systems may represent risks to security when deployed without considering the possibility of biometric presentation attacks (also known as spoofing). Presentation attacks are a serious threat because they do not require significant time, expense, or skill to carry out while remaining effective against many biometric systems in use today. This research compares three different software-based methods for facial and iris presentation attack detection in images. The first method uses Inception-v3, a pre-trained deep Convolutional Neural Network (CNN) made by Google for the ImageNet challenge, which is retrained for this problem. The second uses a shallow CNN based on a modified Spoofnet architecture, which is trained normally. The third is a texture-based method using Local Binary Patterns (LBP). The datasets used are the ATVS-FIr dataset, which contains real and fake iris images, and the CASIA Face Anti-Spoofing Dataset, which contains real images as well as warped photos, cut photos, and video replay presentation attacks. We also present a third set of results, based on cropped versions of the CASIA images.

[35] 2312.00043

Who is leading in AI? An analysis of industry AI research

AI research is increasingly industry-driven, making it crucial to understand company contributions to this field. We compare leading AI companies by research publications, citations, size of training runs, and contributions to algorithmic innovations. Our analysis reveals the substantial role played by Google, OpenAI and Meta. We find that these three companies have been responsible for some of the largest training runs, developed a large fraction of the algorithmic innovations that underpin large language models, and led in various metrics of citation impact. In contrast, leading Chinese companies such as Tencent and Baidu had a lower impact on many of these metrics compared to US counterparts. We observe many industry labs are pursuing large training runs, and that training runs from relative newcomers -- such as OpenAI and Anthropic -- have matched or surpassed those of long-standing incumbents such as Google. The data reveals a diverse ecosystem of companies steering AI progress, though US labs such as Google, OpenAI and Meta lead across critical metrics.

[36] 2312.00044

Advancing AI Audits for Enhanced AI Governance

As artificial intelligence (AI) is integrated into various services and systems in society, many companies and organizations have proposed AI principles, policies, and made the related commitments. Conversely, some have proposed the need for independent audits, arguing that the voluntary principles adopted by the developers and providers of AI services and systems insufficiently address risk. This policy recommendation summarizes the issues related to the auditing of AI services and systems and presents three recommendations for promoting AI auditing that contribute to sound AI governance. Recommendation1.Development of institutional design for AI audits. Recommendation2.Training human resources for AI audits. Recommendation3. Updating AI audits in accordance with technological progress. In this policy recommendation, AI is assumed to be that which recognizes and predicts data with the last chapter outlining how generative AI should be audited.

[37] 2312.00045

AI-driven E-Liability Knowledge Graphs: A Comprehensive Framework for Supply Chain Carbon Accounting and Emissions Liability Management

While carbon accounting plays a fundamental role in our fight against climate change, it is not without its challenges. We begin the paper with a critique of the conventional carbon accounting practices, after which we proceed to introduce the E-liability carbon accounting methodology and Emissions Liability Management (ELM) originally proposed by Kaplan and Ramanna, highlighting their strengths. Recognizing the immense value of this novel approach for real-world carbon accounting improvement, we introduce a novel data-driven integrative framework that leverages AI and computation - the E-Liability Knowledge Graph framework - to achieve real-world implementation of the E-liability carbon accounting methodology. In addition to providing a path-to-implementation, our proposed framework brings clarity to the complex environmental interactions within supply chains, thus enabling better informed and more responsible decision-making. We analyze the implementation aspects of this framework and conclude with a discourse on the role of this AI-aided knowledge graph in ensuring the transparency and decarbonization of global supply chains.

[38] 2312.00046

Retail Analytics in the New Normal: The Influence of Artificial Intelligence and the Covid-19 Pandemic

The COVID-19 pandemic has severely disrupted the retail landscape and has accelerated the adoption of innovative technologies. A striking example relates to the proliferation of online grocery orders and the technology deployed to facilitate such logistics. In fact, for many retailers, this disruption was a wake-up call after which they started recognizing the power of data analytics and artificial intelligence (AI). In this article, we discuss the opportunities that AI can offer to retailers in the new normal retail landscape. Some of the techniques described have been applied at scale to adapt previously deployed AI models, whereas in other instances, fresh solutions needed to be developed to help retailers cope with recent disruptions, such as unexpected panic buying, retraining predictive models, and leveraging online-offline synergies.

[39] 2312.00047

chatGPT for generating questions and assessments based on accreditations

This research aims to take advantage of artificial intelligence techniques in producing students assessment that is compatible with the different academic accreditations of the same program. The possibility of using generative artificial intelligence technology was studied to produce an academic accreditation compliant test the National Center for Academic Accreditation of Kingdom of Saudi Arabia and Accreditation Board for Engineering and Technology. A novel method was introduced to map the verbs used to create the questions introduced in the tests. The method allows a possibility of using the generative artificial intelligence technology to produce and check the validity of questions that measure educational outcomes. A questionnaire was distributed to ensure that the use of generative artificial intelligence to create exam questions is acceptable by the faculty members, as well as to ask about the acceptance of assistance in validating questions submitted by faculty members and amending them in accordance with academic accreditations. The questionnaire was distributed to faculty members of different majors in the Kingdom of Saudi Arabias universities. one hundred twenty responses obtained with eight five percentile approval percentage for generate complete exam questions by generative artificial intelligence . Whereas ninety eight percentage was the approval percentage for editing and improving already existed questions.

[40] 2312.00048

Tokenized Model: A Blockchain-Empowered Decentralized Model Ownership Verification Platform

With the development of practical deep learning models like generative AI, their excellent performance has brought huge economic value. For instance, ChatGPT has attracted more than 100 million users in three months. Since the model training requires a lot of data and computing power, a well-performing deep learning model is behind a huge effort and cost. Facing various model attacks, unauthorized use and abuse from the network that threaten the interests of model owners, in addition to considering legal and other administrative measures, it is equally important to protect the model's copyright from the technical means. By using the model watermarking technology, we point out the possibility of building a unified platform for model ownership verification. Given the application history of blockchain in copyright verification and the drawbacks of a centralized third-party, this paper considers combining model watermarking technology and blockchain to build a unified model copyright protection platform. By a new solution we called Tokenized Model, it protects the model's copyright by reliable ownership record and verification mechanism. It also promotes the financial value of model by constructing the model's transaction process and contribution shares of a model. In the typical case study, we also study the various performance under usual scenario to verify the effectiveness of this platform.

[41] 2312.00050

Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift

Diffusion models (DM) have become state-of-the-art generative models because of their capability to generate high-quality images from noises without adversarial training. However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility.

[42] 2312.00051

MIA-BAD: An Approach for Enhancing Membership Inference Attack and its Mitigation with Federated Learning

The membership inference attack (MIA) is a popular paradigm for compromising the privacy of a machine learning (ML) model. MIA exploits the natural inclination of ML models to overfit upon the training data. MIAs are trained to distinguish between training and testing prediction confidence to infer membership information. Federated Learning (FL) is a privacy-preserving ML paradigm that enables multiple clients to train a unified model without disclosing their private data. In this paper, we propose an enhanced Membership Inference Attack with the Batch-wise generated Attack Dataset (MIA-BAD), a modification to the MIA approach. We investigate that the MIA is more accurate when the attack dataset is generated batch-wise. This quantitatively decreases the attack dataset while qualitatively improving it. We show how training an ML model through FL, has some distinct advantages and investigate how the threat introduced with the proposed MIA-BAD approach can be mitigated with FL approaches. Finally, we demonstrate the qualitative effects of the proposed MIA-BAD methodology by conducting extensive experiments with various target datasets, variable numbers of federated clients, and training batch sizes.

[43] 2312.00052

A Case for Competent AI Systems $-$ A Concept Note

The efficiency of an AI system is contingent upon its ability to align with the specified requirements of a given task. How-ever, the inherent complexity of tasks often introduces the potential for harmful implications or adverse actions. This note explores the critical concept of capability within AI systems, representing what the system is expected to deliver. The articulation of capability involves specifying well-defined out-comes. Yet, the achievement of this capability may be hindered by deficiencies in implementation and testing, reflecting a gap in the system's competency (what it can do vs. what it does successfully). A central challenge arises in elucidating the competency of an AI system to execute tasks effectively. The exploration of system competency in AI remains in its early stages, occasionally manifesting as confidence intervals denoting the probability of success. Trust in an AI system hinges on the explicit modeling and detailed specification of its competency, connected intricately to the system's capability. This note explores this gap by proposing a framework for articulating the competency of AI systems. Motivated by practical scenarios such as the Glass Door problem, where an individual inadvertently encounters a glass obstacle due to a failure in their competency, this research underscores the imperative of delving into competency dynamics. Bridging the gap between capability and competency at a detailed level, this note contributes to advancing the discourse on bolstering the reliability of AI systems in real-world applications.

[44] 2312.00053

Anti-Sexism Alert System: Identification of Sexist Comments on Social Media Using AI Techniques

Social relationships in the digital sphere are becoming more usual and frequent, and they constitute a very important aspect for all of us. {Violent interactions in this sphere are very frequent, and have serious effects on the victims}. Within this global scenario, there is one kind of digital violence that is becoming really worrying: sexism against women. Sexist comments that are publicly posted in social media (newspaper comments, social networks, etc.), usually obtain a lot of attention and become viral, with consequent damage to the persons involved. In this paper, we introduce an anti-sexism alert system, based on natural language processing (NLP) and artificial intelligence (AI), that analyzes any public post, and decides if it could be considered a sexist comment or not. Additionally, this system also works on analyzing all the public comments linked to any multimedia content (piece of news, video, tweet, etc.) and decides, using a color-based system similar to traffic lights, if there is sexism in the global set of posts. We have created a labeled data set in Spanish, since the majority of studies focus on English, to train our system, which offers a very good performance after the validation experiments.

[45] 2312.00055

LEAP: LLM-Generation of Egocentric Action Programs

We introduce LEAP (illustrated in Figure 1), a novel method for generating video-grounded action programs through use of a Large Language Model (LLM). These action programs represent the motoric, perceptual, and structural aspects of action, and consist of sub-actions, pre- and post-conditions, and control flows. LEAP's action programs are centered on egocentric video and employ recent developments in LLMs both as a source for program knowledge and as an aggregator and assessor of multimodal video information. We apply LEAP over a majority (87\%) of the training set of the EPIC Kitchens dataset, and release the resulting action programs as a publicly available dataset here ( We employ LEAP as a secondary source of supervision, using its action programs in a loss term applied to action recognition and anticipation networks. We demonstrate sizable improvements in performance in both tasks due to training with the LEAP dataset. Our method achieves 1st place on the EPIC Kitchens Action Recognition leaderboard as of November 17 among the networks restricted to RGB-input (see Supplementary Materials).

[46] 2312.00057

Probabilistic Copyright Protection Can Fail for Text-to-Image Generative Models

The booming use of text-to-image generative models has raised concerns about their high risk of producing copyright-infringing content. While probabilistic copyright protection methods provide a probabilistic guarantee against such infringement, in this paper, we introduce Virtually Assured Amplification Attack (VA3), a novel online attack framework that exposes the vulnerabilities of these protection mechanisms. The proposed framework significantly amplifies the probability of generating infringing content on the sustained interactions with generative models and a lower-bounded success probability of each engagement. Our theoretical and experimental results demonstrate the effectiveness of our approach and highlight the potential risk of implementing probabilistic copyright protection in practical applications of text-to-image generative models. Code is available at

[47] 2312.00058

Uniform estimates for a fully discrete scheme integrating the linear heat equation on a bounded interval with pure Neumann boundary conditions

This manuscript deals with the analysis of numerical methods for the full discretization (in time and space) of the linear heat equation with Neumann boundary conditions, and it provides the reader with error estimates that are uniform in time. First, we consider the homogeneous equation with homogeneous Neumann boundary conditions over a finite interval. Using finite differences in space and the Euler method in time, we prove that our method is of order 1 in space, uniformly in time, under a classical CFL condition, and despite its lack of consistency at the boundaries. Second, we consider the nonhomogeneous equation with nonhomogeneous Neumann boundary conditions over a finite interval. Using a tailored similar scheme, we prove that our method is also of order 1 in space, uniformly in time, under a classical CFL condition. We indicate how this numerical method allows for a new way to compute steady states of such equations when they exist. We conclude by several numerical experiments to illustrate the sharpness and relevance of our theoretical results, as well as to examine situations that do not meet the hypotheses of our theoretical results, and to illustrate how our results extend to higher dimensions.

[48] 2312.00060

Open data ecosystems: what models to co-create service innovations in smart cities?

While smart cities are recently providing open data, how to organise the collective creation of data, knowledge and related products and services produced from this collective resource, still remains to be thought. This paper aims at gathering the literature review on open data ecosystems to tackle the following research question: what models can be imagined to stimulate the collective co-creation of services between smart cities' stakeholders acting as providers and users of open data? Such issue is currently at stake in many municipalities such as Lisbon which decided to position itself as a platform (O'Reilly, 2010) in the local digital ecosystem. With the implementation of its City Operation Center (COI), Lisbon's municipality provides an Information Infrastructure (Bowker et al., 2009) to many different types of actors such as telecom companies, municipalities, energy utilities or transport companies. Through this infrastructure, Lisbon encourages such actors to gather, integrate and release heterogeneous datasets and tries to orchestrate synergies among them so data-driven solution to urban problems can emerge (Carvalho and Vale, 2018). The remaining question being: what models for the municipalities such as Lisbon to lean on so as to drive this cutting-edge type of service innovation?

[49] 2312.00063

MoMask: Generative Masked Modeling of 3D Human Motions

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens; Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-art methods on the text-to-motion generation task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting.

[50] 2312.00065

Unsupervised Keypoints from Pretrained Diffusion Models

Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures, but performance is yet to match the supervised counterpart, making their practicability questionable. We leverage the emergent knowledge within text-to-image diffusion models, towards more robust unsupervised keypoints. Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). To do so, we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m datasets. We achieve significantly improved accuracy, sometimes even outperforming supervised ones, particularly for data that is non-aligned and less curated. Our code is publicly available and can be found through our project page:

[51] 2312.00066

Exploring Factors Affecting Pedestrian Crash Severity Using TabNet: A Deep Learning Approach

This study presents the first investigation of pedestrian crash severity using the TabNet model, a novel tabular deep learning method exceptionally suited for analyzing the tabular data inherent in transportation safety research. Through the application of TabNet to a comprehensive dataset from Utah covering the years 2010 to 2022, we uncover intricate factors contributing to pedestrian crash severity. The TabNet model, capitalizing on its compatibility with structured data, demonstrates remarkable predictive accuracy, eclipsing that of traditional models. It identifies critical variables, such as pedestrian age, involvement in left or right turns, lighting conditions, and alcohol consumption, which significantly influence crash outcomes. The utilization of SHapley Additive exPlanations (SHAP) enhances our ability to interpret the TabNet model's predictions, ensuring transparency and understandability in our deep learning approach. The insights derived from our analysis provide a valuable compass for transportation safety engineers and policymakers, enabling the identification of pivotal factors that affect pedestrian crash severity. Such knowledge is instrumental in formulating precise, data-driven interventions aimed at bolstering pedestrian safety across diverse urban and rural settings.

[52] 2312.00068

GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds

Sparse LiDAR point clouds cause severe loss of detail of static structures and reduce the density of static points available for navigation. Reduced density can be detrimental to navigation under several scenarios. We observe that despite high sparsity, in most cases, the global topology of LiDAR outlining the static structures can be inferred. We utilize this property to obtain a backbone skeleton of a static LiDAR scan in the form of a single connected component that is a proxy to its global topology. We utilize the backbone to augment new points along static structures to overcome sparsity. Newly introduced points could correspond to existing static structures or to static points that were earlier obstructed by dynamic objects. To the best of our knowledge, we are the first to use this strategy for sparse LiDAR point clouds. Existing solutions close to our approach fail to identify and preserve the global static LiDAR topology and generate sub-optimal points. We propose GLiDR, a Graph Generative network that is topologically regularized using 0-dimensional Persistent Homology (PH) constraints. This enables GLiDR to introduce newer static points along a topologically consistent global static LiDAR backbone. GLiDR generates precise static points using 32x sparser dynamic scans and performs better than the baselines across three datasets. The newly introduced static points allow GLiDR to outperform LiDAR-based navigation using SLAM in several settings. GLiDR generates a valuable byproduct - an accurate binary segmentation mask of static and dynamic objects that is helpful for navigation planning and safety in constrained environments.

[53] 2312.00069

SICKLE: A Multi-Sensor Satellite Imagery Dataset Annotated with Multiple Key Cropping Parameters

The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits the potential of its use in training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time-series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1 and Sentinel-2. Our dataset constitutes multi-spectral, thermal and microwave sensors during January 2018 - March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India; and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3m, 10m and 30m). Our dataset comprises 2,370 season-wise samples from 388 unique plots, having an average size of 0.38 acres, for classifying 21 crop types across 4 districts in the Delta, which amounts to approximately 209,000 satellite images. Out of the 2,370 samples, 351 paddy samples from 145 plots are annotated with multiple crop parameters; such as the variety of paddy, its growing season and productivity in terms of per-acre yields. Ours is also one among the first studies that consider the growing season activities pertinent to crop phenology (spans sowing, transplanting and harvesting dates) as parameters of interest. We benchmark SICKLE on three tasks: crop type, crop phenology (sowing, transplanting, harvesting), and yield prediction

[54] 2312.00072

CRAFT: Contextual Re-Activation of Filters for face recogntion Training

The first layer of a deep CNN backbone applies filters to an image to extract the basic features available to later layers. During training, some filters may go inactive, mean ing all weights in the filter approach zero. An inactive fil ter in the final model represents a missed opportunity to extract a useful feature. This phenomenon is especially prevalent in specialized CNNs such as for face recogni tion (as opposed to, e.g., ImageNet). For example, in one the most widely face recognition model (ArcFace), about half of the convolution filters in the first layer are inactive. We propose a novel approach designed and tested specif ically for face recognition networks, known as "CRAFT: Contextual Re-Activation of Filters for Face Recognition Training". CRAFT identifies inactive filters during training and reinitializes them based on the context of strong filters at that stage in training. We show that CRAFT reduces fraction of inactive filters from 44% to 32% on average and discovers filter patterns not found by standard training. Compared to standard training without reactivation, CRAFT demonstrates enhanced model accuracy on standard face-recognition benchmark datasets including AgeDB-30, CPLFW, LFW, CALFW, and CFP-FP, as well as on more challenging datasets like IJBB and IJBC.

[55] 2312.00075

Accelerating Neural Field Training via Soft Mining

We present an approach to accelerate Neural Field training by efficiently selecting sampling locations. While Neural Fields have recently become popular, it is often trained by uniformly sampling the training domain, or through handcrafted heuristics. We show that improved convergence and final training quality can be achieved by a soft mining technique based on importance sampling: rather than either considering or ignoring a pixel completely, we weigh the corresponding loss by a scalar. To implement our idea we use Langevin Monte-Carlo sampling. We show that by doing so, regions with higher error are being selected more frequently, leading to more than 2x improvement in convergence speed. The code and related resources for this study are publicly available at

[56] 2312.00076

Towards A Foundation Model For Trajectory Intelligence

We present the results of training a large trajectory model using real-world user check-in data. Our approach follows a pre-train and fine-tune paradigm, where a base model is pre-trained via masked trajectory modeling and then adapted through fine-tuning for various downstream tasks. To address challenges posed by noisy data and large spatial vocabularies, we propose a novel spatial tokenization block. Our empirical analysis utilizes a comprehensive dataset of over 2 billion check-ins generated by more than 6 million users. Through fine-tuning on 3 downstream tasks we demonstrate that our base model has effectively learned valuable underlying patterns in raw data, enabling its application in meaningful trajectory intelligence tasks. Despite some limitations, we believe this work represents an important step forward in the realization of a foundation model for trajectory intelligence.

[57] 2312.00078

Enhancing Cross-domain Click-Through Rate Prediction via Explicit Feature Augmentation

Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains such as parameter-sharing that regularizes the model training in target domain. More effectively, recent researchers propose explicit techniques to extract user interest knowledge and transfer this knowledge to target domain. However, the proposed method mainly faces two issues: 1) it usually requires a super domain, i.e. an extremely large source domain, to cover most users or items of target domain, and 2) the extracted user interest knowledge is static no matter what the context is in target domain. These limitations motivate us to develop a more flexible and efficient technique to explicitly transfer knowledge. In this work, we propose a cross-domain augmentation network (CDAnet) being able to perform explicit knowledge transfer between two domains. Specifically, CDAnet contains a designed translation network and an augmentation network which are trained sequentially. The translation network computes latent features from two domains and learns meaningful cross-domain knowledge of each input in target domain by using a designed cross-supervised feature translator. Later the augmentation network employs the explicit cross-domain knowledge as augmented information to boost the target domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been conducted online A/B test in image2product retrieval at Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth and a relative 1.26% GMV increase.

[58] 2312.00079

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.

[59] 2312.00081

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simply yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.

[60] 2312.00083

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have shown notable progress by decoding the center and length of a target moment from learnable queries. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we introduce a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the onset and offset are directly estimated. Based on this idea, we design a Boundary-Aligned Moment Detection Transformer (BAM-DETR), equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Extensive experiments verify the advantages of our methods, where our model records new state-of-the-art results on three benchmarks. Code is at

[61] 2312.00084

Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?

Stable Diffusion has established itself as a foundation model in generative AI artistic applications, receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However, these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies, researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images, it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.

[62] 2312.00085

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: .

[63] 2312.00087

Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle

Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other state-of-the-art large language models and diffusion models, holds significant potential for transforming education and enhancing human productivity. While the prevalence of GenAI in education has motivated numerous research initiatives, integrating these technologies within the learning analytics (LA) cycle and their implications for practical interventions remain underexplored. This paper delves into the prospective opportunities and challenges GenAI poses for advancing LA. We present a concise overview of the current GenAI landscape and contextualise its potential roles within Clow's generic framework of the LA cycle. We posit that GenAI can play pivotal roles in analysing unstructured data, generating synthetic learner data, enriching multimodal learner interactions, advancing interactive and explanatory analytics, and facilitating personalisation and adaptive interventions. As the lines blur between learners and GenAI tools, a renewed understanding of learners is needed. Future research can delve deep into frameworks and methodologies that advocate for human-AI collaboration. The LA community can play a pivotal role in capturing data about human and AI contributions and exploring how they can collaborate most effectively. As LA advances, it is essential to consider the pedagogical implications and broader socioeconomic impact of GenAI for ensuring an inclusive future.

[64] 2312.00088

Anomaly Detection via Learning-Based Sequential Controlled Sensing

In this paper, we address the problem of detecting anomalies among a given set of binary processes via learning-based controlled sensing. Each process is parameterized by a binary random variable indicating whether the process is anomalous. To identify the anomalies, the decision-making agent is allowed to observe a subset of the processes at each time instant. Also, probing each process has an associated cost. Our objective is to design a sequential selection policy that dynamically determines which processes to observe at each time with the goal to minimize the delay in making the decision and the total sensing cost. We cast this problem as a sequential hypothesis testing problem within the framework of Markov decision processes. This formulation utilizes both a Bayesian log-likelihood ratio-based reward and an entropy-based reward. The problem is then solved using two approaches: 1) a deep reinforcement learning-based approach where we design both deep Q-learning and policy gradient actor-critic algorithms; and 2) a deep active inference-based approach. Using numerical experiments, we demonstrate the efficacy of our algorithms and show that our algorithms adapt to any unknown statistical dependence pattern of the processes.

[65] 2312.00090

Tree-based Forecasting of Day-ahead Solar Power Generation from Granular Meteorological Features

Accurate forecasts for day-ahead photovoltaic (PV) power generation are crucial to support a high PV penetration rate in the local electricity grid and to assure stability in the grid. We use state-of-the-art tree-based machine learning methods to produce such forecasts and, unlike previous studies, we hereby account for (i) the effects various meteorological as well as astronomical features have on PV power production, and this (ii) at coarse as well as granular spatial locations. To this end, we use data from Belgium and forecast day-ahead PV power production at an hourly resolution. The insights from our study can assist utilities, decision-makers, and other stakeholders in optimizing grid operations, economic dispatch, and in facilitating the integration of distributed PV power into the electricity grid.

[66] 2312.00091

Sound Terminology Describing Production and Perception of Sonification

Sonification research is intrinsically interdisciplinary. Consequently, a proper documentation of, and interdisciplinary discourse about a sonification is often hindered by terminology discrepancies between involved disciplines, i.e., the lack of a common sound terminology in sonification research. Without a common ground, a researcher from one discipline may have troubles understanding the implementation and imagining the resulting sound perception of a sonification, if the sonification is described by a researcher from another discipline. To find a common ground, I consulted literature on interdisciplinary research and discourse, identified problems that occur in sonification, and applied the recommended solutions. As a result, I recommend considering three aspects of sonification individually, namely 1.) Sound Design Concept, 2.) Objective and 3.) Method, clarifying which discipline is involved in which aspect, and sticking to this discipline's terminology. As two requirements of sonifications are that they are a) reproducible and b) interpretable, I recommend documenting and discussing every sonification design once using audio engineering terminology, and once using psychoacoustic terminology. The appendix provides comprehensive lists of sound terms from both disciplines, together with relevant literature and a clarification of often misunderstood and misused terms.

[67] 2312.00092

Mixture of Gaussian-distributed Prototypes with Generative Modelling for Interpretable Image Classification

Prototypical-part interpretable methods, e.g., ProtoPNet, enhance interpretability by connecting classification predictions to class-specific training prototypes, thereby offering an intuitive insight into their decision-making. Current methods rely on a discriminative classifier trained with point-based learning techniques that provide specific values for prototypes. Such prototypes have relatively low representation power due to their sparsity and potential redundancy, with each prototype containing no variability measure. In this paper, we present a new generative learning of prototype distributions, named Mixture of Gaussian-distributed Prototypes (MGProto), which are represented by Gaussian mixture models (GMM). Such an approach enables the learning of more powerful prototype representations since each learned prototype will own a measure of variability, which naturally reduces the sparsity given the spread of the distribution around each prototype, and we also integrate a prototype diversity objective function into the GMM optimisation to reduce redundancy. Incidentally, the generative nature of MGProto offers a new and effective way for detecting out-of-distribution samples. To improve the compactness of MGProto, we further propose to prune Gaussian-distributed prototypes with a low prior. Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art classification and OoD detection performances with encouraging interpretability results.

[68] 2312.00093

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.

[69] 2312.00094

Fast ODE-based Sampling for Diffusion Models in Around 5 Steps

Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrades sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 256 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 7.14 FID on CIFAR-10, 13.75 FID on ImageNet 64$\times$64, and 12.79 FID on LSUN Bedroom. Our code is available at

[70] 2312.00095

Textual-Knowledge-Guided Numerical Feature Discovery Method for Power Demand Forecasting

Power demand forecasting is a crucial and challenging task for new power system and integrated energy system. However, as public feature databases and the theoretical mechanism of power demand changes are unavailable, the known features of power demand fluctuation are much limited. Recently, multimodal learning approaches have shown great vitality in machine learning and AIGC. In this paper, we interact two modal data and propose a textual-knowledge-guided numerical feature discovery (TKNFD) method for short-term power demand forecasting. TKNFD extensively accumulates qualitative textual knowledge, expands it into a candidate feature-type set, collects numerical data of these features, and eventually builds four-dimensional multivariate source-tracking databases (4DM-STDs). Next, TKNFD presents a two-level quantitative feature identification strategy independent of forecasting models, finds 43-48 features, and systematically analyses feature contribution and dependency correlation. Benchmark experiments in two different regions around the world demonstrate that the forecasting accuracy of TKNFD-discovered features reliably outperforms that of SoTA feature schemes by 16.84% to 36.36% MAPE. In particular, TKNFD reveals many unknown features, especially several dominant features in the unknown energy and astronomical dimensions, which extend the knowledge on the origin of strong randomness and non-linearity in power demand fluctuation. Besides, 4DM-STDs can serve as public baseline databases.

[71] 2312.00096

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names, leading to less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors with different video instances, we propose Optimal Descriptor Solver, forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.

[72] 2312.00097

SparseDC: Depth Completion from sparse and non-uniform inputs

We propose SparseDC, a model for Depth Completion of Sparse and non-uniform depth inputs. Unlike previous methods focusing on completing fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI with 64 lines), SparseDC is specifically designed to handle depth maps with poor quality in real usage. The key contributions of SparseDC are two-fold. First, we design a simple strategy, called SFFM, to improve the robustness under sparse input by explicitly filling the unstable depth features with stable image features. Second, we propose a two-branch feature embedder to predict both the precise local geometry of regions with available depth values and accurate structures in regions with no depth. The key of the embedder is an uncertainty-based fusion module called UFFM to balance the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework when facing sparse and non-uniform input depths. The pre-trained model and code are available at

[73] 2312.00098

Identifying tourist destinations from movie scenes using Deep Learning

Movies wield significant influence in our lives, playing a pivotal role in the tourism industry of any country. The inclusion of picturesque landscapes, waterfalls, and mountains as backdrops in films serves to enhance the allure of specific scenarios. Recognizing the impact of movies on tourism, this paper introduces a method for identifying tourist destinations featured in films. We propose the development of a deep learning model capable of recognizing these locations during movie viewing. The model is trained on a dataset comprising major tourism destinations worldwide. Through this research, the goal is to enable viewers to identify the real-world locations depicted in movie scenes, offering a novel way to connect cinema with global travel experiences.

[74] 2312.00099

Online Influence Maximization: Concept and Algorithm

In this survey, we offer an extensive overview of the Online Influence Maximization (IM) problem by covering both theoretical aspects and practical applications. For the integrity of the article and because the online algorithm takes an offline oracle as a subroutine, we first make a clear definition of the Offline IM problem and summarize those commonly used Offline IM algorithms, which include traditional approximation or heuristic algorithms and ML-based algorithms. Then, we give a standard definition of the Online IM problem and a basic Combinatorial Multi-Armed Bandit (CMAB) framework, CMAB-T. Here, we summarize three types of feedback in the CMAB model and discuss in detail how to study the Online IM problem based on the CMAB-T model. This paves the way for solving the Online IM problem by using online learning methods. Furthermore, we have covered almost all Online IM algorithms up to now, focusing on characteristics and theoretical guarantees of online algorithms for different feedback types. Here, we elaborately explain their working principle and how to obtain regret bounds. Besides, we also collect plenty of innovative ideas about problem definition and algorithm designs and pioneering works for variants of the Online IM problem and their corresponding algorithms. Finally, we encapsulate current challenges and outline prospective research directions from four distinct perspectives.

[75] 2312.00100

Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines

Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is $\textit{parallelism}$: the juxtaposition of phrases which have the same sequence of linguistic ($\textit{e.g.}$, phonological, syntactic, semantic) features. Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it, missing a chance to better understand the nature of the structure, meaning, and intent that humans convey. To address this, we introduce the task of $\textit{rhetorical parallelism detection}$. We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it; and, lastly, we create baseline systems and novel sequence labeling schemes to capture it. On our strictest metric, we attain $F_{1}$ scores of $0.40$ and $0.43$ on our Latin and Chinese datasets, respectively.

[76] 2312.00101

Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.

[77] 2312.00102

FedEmb: A Vertical and Hybrid Federated Learning Algorithm using Network And Feature Embedding Aggregation

Federated learning (FL) is an emerging paradigm for decentralized training of machine learning models on distributed clients, without revealing the data to the central server. The learning scheme may be horizontal, vertical or hybrid (both vertical and horizontal). Most existing research work with deep neural network (DNN) modelling is focused on horizontal data distributions, while vertical and hybrid schemes are much less studied. In this paper, we propose a generalized algorithm FedEmb, for modelling vertical and hybrid DNN-based learning. The idea of our algorithm is characterised by higher inference accuracy, stronger privacy-preserving properties, and lower client-server communication bandwidth demands as compared with existing work. The experimental results show that FedEmb is an effective method to tackle both split feature & subject space decentralized problems, shows 0.3% to 4.2% inference accuracy improvement with limited privacy revealing for datasets stored in local clients, and reduces 88.9 % time complexity over vertical baseline method.

[78] 2312.00103

DeepEn2023: Energy Datasets for Edge Artificial Intelligence

Climate change poses one of the most significant challenges to humanity. As a result of these climatic changes, the frequency of weather, climate, and water-related disasters has multiplied fivefold over the past 50 years, resulting in over 2 million deaths and losses exceeding $3.64 trillion USD. Leveraging AI-powered technologies for sustainable development and combating climate change is a promising avenue. Numerous significant publications are dedicated to using AI to improve renewable energy forecasting, enhance waste management, and monitor environmental changes in real time. However, very few research studies focus on making AI itself environmentally sustainable. This oversight regarding the sustainability of AI within the field might be attributed to a mindset gap and the absence of comprehensive energy datasets. In addition, with the ubiquity of edge AI systems and applications, especially on-device learning, there is a pressing need to measure, analyze, and optimize their environmental sustainability, such as energy efficiency. To this end, in this paper, we propose large-scale energy datasets for edge AI, named DeepEn2023, covering a wide range of kernels, state-of-the-art deep neural network models, and popular edge AI applications. We anticipate that DeepEn2023 will improve transparency in sustainability in on-device deep learning across a range of edge AI systems and applications. For more information, including access to the dataset and code, please visit

[79] 2312.00104

A Metadata Generation System with Semantic Understanding for Video Retrieval in Film Production

In film production, metadata plays an important role in original raw video indexing and classification within the industrial post-production software. Inspired by deep visual-semantic methods, we propose an automated image information extraction process to extend the diversity of metadata entities for massive large-scale raw video searching and retrieval. In this paper, we introduce the proposed system architecture and modules, integrating semantic annotation models and user-demand-oriented information fusion. We conducted experiments to validate the effectiveness of our system on Film Raw Video Semantic Annotation Dataset (Film-RVSAD) and Slate Board Template Dataset (SBTD), two benchmark datasets built for cinematography-related semantic annotation and slate detection. Experimental results show that the proposed system provides an effective strategy to improve the efficiency of metadata generation and transformation, which is necessary and convenient for collaborative work in the filmmaking process.

[80] 2312.00105

Improving the Robustness of Quantized Deep Neural Networks to White-Box Attacks using Stochastic Quantization and Information-Theoretic Ensemble Training

Most real-world applications that employ deep neural networks (DNNs) quantize them to low precision to reduce the compute needs. We present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. We first tackle the limitation of deterministic quantization to fixed ``bins'' by introducing a differentiable Stochastic Quantizer (SQ). We explore the hypothesis that different quantizations may collectively be more robust than each quantized DNN. We formulate a training objective to encourage different quantized DNNs to learn different representations of the input image. The training objective captures diversity and accuracy via mutual information between ensemble members. Through experimentation, we demonstrate substantial improvement in robustness against $L_\infty$ attacks even if the attacker is allowed to backpropagate through SQ (e.g., > 50\% accuracy to PGD(5/255) on CIFAR10 without adversarial training), compared to vanilla DNNs as well as existing ensembles of quantized DNNs. We extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP), towards a unified analysis of different threat models by correlating the MI and accuracy.

[81] 2312.00109

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications. The recent 3D Gaussian Splatting method has achieved the state-of-the-art rendering quality and speed combining the benefits of both primitive-based representations and volumetric representations. However, it often leads to heavily redundant Gaussians that try to fit every training view, neglecting the underlying scene geometry. Consequently, the resulting model becomes less robust to significant view changes, texture-less area and lighting effects. We introduce Scaffold-GS, which uses anchor points to distribute local 3D Gaussians, and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrates an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations, without sacrificing the rendering speed.

[82] 2312.00110

CLIP-QDA: An Explainable Concept Bottleneck Model

In this paper, we introduce an explainable algorithm designed from a multi-modal foundation model, that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance the interpretability of this latent space. Then, we introduce CLIP-QDA, a classifier that only uses statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations. These explanations come from the inner design of our architecture, our work is part of a new family of greybox models, combining performances of opaque foundation models and the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves similar accuracy with state-of-the-art methods CBMs. Our explanations compete with existing XAI methods while being faster to compute.

[83] 2312.00111

Multimodal Learning for Crystalline Materials

Artificial intelligence (AI) has revolutionized the field of materials science by improving the prediction of properties and accelerating the discovery of novel materials. In recent years, publicly available material data repositories containing data for various material properties have grown rapidly. In this work, we introduce Multimodal Learning for Crystalline Materials (MLCM), a new method for training a foundation model for crystalline materials via multimodal alignment, where high-dimensional material properties (i.e. modalities) are connected in a shared latent space to produce highly useful material representations. We show the utility of MLCM on multiple axes: (i) MLCM achieves state-of-the-art performance for material property prediction on the challenging Materials Project database; (ii) MLCM enables a novel, highly accurate method for inverse design, allowing one to screen for stable material with desired properties; and (iii) MLCM allows the extraction of interpretable emergent features that may provide insight to material scientists. Further, we explore several novel methods for aligning an arbitrary number of modalities, improving upon prior art in multimodal learning that focuses on bimodal alignment. Our work brings innovations from the ongoing AI revolution into the domain of materials science and identifies materials as a testbed for the next generation of AI.

[84] 2312.00112

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

Accurately and efficiently modeling dynamic scenes and motions is considered so challenging a task due to temporal dynamics and motion complexity. To address these challenges, we propose DynMF, a compact and efficient representation that decomposes a dynamic scene into a few neural trajectories. We argue that the per-point motions of a dynamic scene can be decomposed into a small set of explicit or learned trajectories. Our carefully designed neural framework consisting of a tiny set of learned basis queried only in time allows for rendering speed similar to 3D Gaussian Splatting, surpassing 120 FPS, while at the same time, requiring only double the storage compared to static scenes. Our neural representation adequately constrains the inherently underconstrained motion field of a dynamic scene leading to effective and fast optimization. This is done by biding each point to motion coefficients that enforce the per-point sharing of basis trajectories. By carefully applying a sparsity loss to the motion coefficients, we are able to disentangle the motions that comprise the scene, independently control them, and generate novel motion combinations that have never been seen before. We can reach state-of-the-art render quality within just 5 minutes of training and in less than half an hour, we can synthesize novel views of dynamic scenes with superior photorealistic quality. Our representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions, in monocular and multi-view scenarios.

[85] 2312.00113

Event-based Continuous Color Video Decompression from Single Frames

We present ContinuityCam, a novel approach to generate a continuous video from a single static RGB image, using an event camera. Conventional cameras struggle with high-speed motion capture due to bandwidth and dynamic range limitations. Event cameras are ideal sensors to solve this problem because they encode compressed change information at high temporal resolution. In this work, we propose a novel task called event-based continuous color video decompression, pairing single static color frames and events to reconstruct temporally continuous videos. Our approach combines continuous long-range motion modeling with a feature-plane-based synthesis neural integration model, enabling frame prediction at arbitrary times within the events. Our method does not rely on additional frames except for the initial image, increasing, thus, the robustness to sudden light changes, minimizing the prediction latency, and decreasing the bandwidth requirement. We introduce a novel single objective beamsplitter setup that acquires aligned images and events and a novel and challenging Event Extreme Decompression Dataset (E2D2) that tests the method in various lighting and motion profiles. We thoroughly evaluate our method through benchmarking reconstruction as well as various downstream tasks. Our approach significantly outperforms the event- and image- based baselines in the proposed task.

[86] 2312.00114

Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well-suited for processing fast motions that require rapid reactions. Although event cameras have recently shown competitive performance in unsupervised optical flow estimation, performance in detecting independently moving objects (IMOs) is lacking behind, although event-based methods would be suited for this task based on their low latency and HDR properties. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. However, biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of not predetermined objects and is easily scalable to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.

[87] 2312.00115

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at

[88] 2312.00116

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks.

[89] 2312.00148

Pedaling, Fast and Slow: The Race Towards an Optimized Power Strategy

With the advent of power-meters allowing cyclists to precisely track their power outputs throughout the duration of a race, devising optimal power output strategies for races has become increasingly important in competitive cycling. To do so, the track, weather, and individual cyclist's abilities must all be considered. We propose differential equation models of fatigue and kinematics to simulate the performance of such strategies, and an innovative optimization algorithm to find the optimal strategy. Our model for fatigue translates a cyclist's power curve (obtained by fitting the Omni-Power Duration Model to power curve data) into a differential equation to capture which power output strategies are feasible. Our kinematics model calculates the forces on the rider, and with power output models the cyclist's velocity and position via a system of differential equations. Using track data, including the slope of the track and velocity of the wind, the model accurately computes race times given a power output strategy on the exact track being raced. To make power strategy optimization computationally tractable, we split the track into segments based on changes in slope and discretize the power output levels. As the space of possible strategies is large, we vectorize the differential equation model for efficient numerical integration of many simulations at once and develop a parallelized Tree Exploration with Monte-Carlo Evaluation algorithm. The algorithm is efficient, running in $O(ab\sqrt{n})$ time and $O(n)$ space where $n$ is the number of simulations done for each choice, $a$ is the number of segments, and $b$ is the number of discrete power output levels. We present results of this optimization for several different tracks and athletes. As an example, the model's time for Filippo Ganna in Tokyo 2020 differs from his real time by just 18%, supporting our model's efficacy.

[90] 2312.00151

Which way is `right'?: Uncovering limitations of Vision-and-Language Navigation model

The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g. `walk down the hallway and turn left at the piano'). For agents to complete this task successfully, they must be able to ground objects referenced into the instruction (e.g.`piano') into the visual scene as well as ground directional phrases (e.g.`turn left') into actions. In this work we ask the following question -- to what degree are spatial and directional language cues informing the navigation model's decisions? We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly we uncover that certain top performing models rely only on the noun tokens of the instructions. We propose two training methods to alleviate this concerning limitation.

[91] 2312.00155

A Threshold Greedy Algorithm for Noisy Submodular Maximization

We consider the optimization problem of cardinality constrained maximization of a monotone submodular set function $f:2^U\to\mathbb{R}_{\geq 0}$ (SM) with noisy evaluations of $f$. In particular, it is assumed that we do not have value oracle access to $f$, but instead for any $X\subseteq U$ and $u\in U$ we can take samples from a noisy distribution with expected value $f(X\cup\{u\})-f(X)$. Our goal is to develop algorithms in this setting that take as few samples as possible, and return a solution with an approximation guarantee relative to the optimal with high probability. We propose the algorithm Confident Threshold Greedy (CTG), which is based on the threshold greedy algorithm of Badanidiyuru and Vondrak [1] and samples adaptively in order to produce an approximate solution with high probability. We prove that CTG achieves an approximation ratio arbitrarily close to $1-1/e$, depending on input parameters. We provide an experimental evaluation on real instances of SM and demonstrate the sample efficiency of CTG.

[92] 2312.00157

Universal Backdoor Attacks

Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.

[93] 2312.00163

Just add $\textit{WATER}$: WebAssembly-based Circumvention Transports

As Internet censors rapidly evolve new blocking techniques, circumvention tools must also adapt and roll out new strategies to remain unblocked. But new strategies can be time consuming for circumventors to develop and deploy, and usually an update to one tool often requires significant additional effort to be ported to others. Moreover, distributing the updated application across different platforms poses its own set of challenges. In this paper, we introduce $\textit{WATER}$ (WebAssembly Transport Executables Runtime), a novel design that enables applications to use a WebAssembly-based application-layer to wrap network transports (e.g., TLS). Deploying a new circumvention technique with $\textit{WATER}$ only requires distributing the WebAssembly Transport Module(WATM) binary and any transport-specific configuration, allowing dynamic transport updates without any change to the application itself. WATMs are also designed to be generic such that different applications using $\textit{WATER}$ can use the same WATM to rapidly deploy successful circumvention techniques to their own users, facilitating rapid interoperability between independent circumvention tools.

[94] 2312.00164

Towards Accurate Differential Diagnosis with Large Language Models

An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.

[95] 2312.00168

Navigating News Narratives: A Media Bias Analysis Dataset

The proliferation of biased news narratives across various media platforms has become a prominent challenge, influencing public opinion on critical topics like politics, health, and climate change. This paper introduces the "Navigating News Narratives: A Media Bias Analysis Dataset", a comprehensive dataset to address the urgent need for tools to detect and analyze media bias. This dataset encompasses a broad spectrum of biases, making it a unique and valuable asset in the field of media studies and artificial intelligence. The dataset is available at

[96] 2312.00169

Integration of Swin UNETR and statistical shape modeling for a semi-automated segmentation of the knee and biomechanical modeling of articular cartilage

Simulation studies like finite element (FE) modeling provide insight into knee joint mechanics without patient experimentation. Generic FE models represent biomechanical behavior of the tissue by overlooking variations in geometry, loading, and material properties of a population. On the other hand, subject-specific models include these specifics, resulting in enhanced predictive precision. However, creating such models is laborious and time-intensive. The present study aimed to enhance subject-specific knee joint FE modeling by incorporating a semi-automated segmentation algorithm. This segmentation was a 3D Swin UNETR for an initial segmentation of the femur and tibia, followed by a statistical shape model (SSM) adjustment to improve surface roughness and continuity. Five hundred and seven magnetic resonance images (MRIs) from the Osteoarthritis Initiative (OAI) database were used to build and validate the segmentation model. A semi-automated FE model was developed using this semi-automated segmentation. On the other hand, a manual FE model was developed through manual segmentation (i.e., the gold standard approach). Both FE models were subjected to gait loading. The predicted mechanical response of manual and semi-automated FE models were compared. In the result, our semi-automated segmentation achieved Dice similarity coefficient (DSC) over 98% for both femur and tibia. The mechanical results (max principal stress, max principal strain, fluid pressure, fibril strain, and contact area) showed no significant differences between the manual and semi-automated FE models, indicating the effectiveness of the proposed semi-automated segmentation in creating accurate knee joint FE models. ( ).

[97] 2312.00170

Non-uniform Online Learning: Towards Understanding Induction

Can a physicist make only finite errors in the endless pursuit of the law of nature? This millennium-old question of inductive inference is a fundamental, yet mysterious problem in philosophy, lacking rigorous justifications. While classic online learning theory and inductive inference share a similar sequential decision-making spirit, the former's reliance on an adaptive adversary and worst-case error bounds limits its applicability to the latter. In this work, we introduce the concept of non-uniform online learning, which we argue aligns more closely with the principles of inductive reasoning. This setting assumes a predetermined ground-truth hypothesis and considers non-uniform, hypothesis-wise error bounds. In the realizable setting, we provide a complete characterization of learnability with finite error: a hypothesis class is non-uniform learnable if and only if it's a countable union of Littlestone classes, no matter the observations are adaptively chosen or iid sampled. Additionally, we propose a necessary condition for the weaker criterion of consistency which we conjecture to be tight. To further promote our theory, we extend our result to the more realistic agnostic setting, showing that any countable union of Littlestone classes can be learnt with regret $\tilde{O}(\sqrt{T})$. We hope this work could offer a new perspective of interpreting the power of induction from an online learning viewpoint.

[98] 2312.00171

Extending Rely-Guarantee thinking to handle Real-Time Scheduling

The reference point for developing any artefact is its specification; to develop software formally, a formal specification is required. For sequential programs, pre and post conditions (together with abstract objects) suffice; rely and guarantee conditions extend the scope of formal development approaches to tackle concurrency. In addition, real-time systems need ways of both requiring progress and relating that progress to some notion of time. This paper extends rely-guarantee ideas to cope with specifications of -- and assumptions about -- real-time schedulers. Furthermore it shows how the approach helps identify and specify fault-tolerance aspects of such schedulers by systematically challenging the assumptions.

[99] 2312.00172

Projected exponential methods for stiff dynamical low-rank approximation problems

The numerical integration of stiff equations is a challenging problem that needs to be approached by specialized numerical methods. Exponential integrators form a popular class of such methods since they are provably robust to stiffness and have been successfully applied to a variety of problems. The dynamical low- \rank approximation is a recent technique for solving high-dimensional differential equations by means of low-rank approximations. However, the domain is lacking numerical methods for stiff equations since existing methods are either not robust-to-stiffness or have unreasonably large hidden constants. In this paper, we focus on solving large-scale stiff matrix differential equations with a Sylvester-like structure, that admit good low-rank approximations. We propose two new methods that have good convergence properties, small memory footprint and that are fast to compute. The theoretical analysis shows that the new methods have order one and two, respectively. We also propose a practical implementation based on Krylov techniques. The approximation error is analyzed, leading to a priori error bounds and, therefore, a mean for choosing the size of the Krylov space. Numerical experiments are performed on several examples, confirming the theory and showing good speedup in comparison to existing techniques.

[100] 2312.00173

Fool the Hydra: Adversarial Attacks against Multi-view Object Detection Systems

Adversarial patches exemplify the tangible manifestation of the threat posed by adversarial attacks on Machine Learning (ML) models in real-world scenarios. Robustness against these attacks is of the utmost importance when designing computer vision applications, especially for safety-critical domains such as CCTV systems. In most practical situations, monitoring open spaces requires multi-view systems to overcome acquisition challenges such as occlusion handling. Multiview object systems are able to combine data from multiple views, and reach reliable detection results even in difficult environments. Despite its importance in real-world vision applications, the vulnerability of multiview systems to adversarial patches is not sufficiently investigated. In this paper, we raise the following question: Does the increased performance and information sharing across views offer as a by-product robustness to adversarial patches? We first conduct a preliminary analysis showing promising robustness against off-the-shelf adversarial patches, even in an extreme setting where we consider patches applied to all views by all persons in Wildtrack benchmark. However, we challenged this observation by proposing two new attacks: (i) In the first attack, targeting a multiview CNN, we maximize the global loss by proposing gradient projection to the different views and aggregating the obtained local gradients. (ii) In the second attack, we focus on a Transformer-based multiview framework. In addition to the focal loss, we also maximize the transformer-specific loss by dissipating its attention blocks. Our results show a large degradation in the detection performance of victim multiview systems with our first patch attack reaching an attack success rate of 73% , while our second proposed attack reduced the performance of its target detector by 62%

[101] 2312.00175

Advances in soft grasping in agriculture

Agricultural robotics and automation are facing some challenges rooted in the high variability 9 of products, task complexity, crop quality requirement, and dense vegetation. Such a set of 10 challenges demands a more versatile and safe robotic system. Soft robotics is a young yet 11 promising field of research aimed to enhance these aspects of current rigid robots which 12 makes it a good candidate solution for that challenge. In general, it aimed to provide robots 13 and machines with adaptive locomotion (Ansari et al., 2015), safe and adaptive manipulation 14 (Arleo et al., 2020) and versatile grasping (Langowski et al., 2020). But in agriculture, soft 15 robots have been mainly used in harvesting tasks and more specifically in grasping. In this 16 chapter, we review a candidate group of soft grippers that were used for handling and 17 harvesting crops regarding agricultural challenges i.e. safety in handling and adaptability to 18 the high variation of crops. The review is aimed to show why and to what extent soft grippers 19 have been successful in handling agricultural tasks. The analysis carried out on the results 20 provides future directions for the systematic design of soft robots in agricultural tasks.

[102] 2312.00176

Ellora: Exploring Low-Power OFDM-based Radar Processors using Approximate Computing

In recent times, orthogonal frequency-division multiplexing (OFDM)-based radar has gained wide acceptance given its applicability in joint radar-communication systems. However, realizing such a system on hardware poses a huge area and power bottleneck given its complexity. Therefore it has become ever-important to explore low-power OFDM-based radar processors in order to realize energy-efficient joint radar-communication systems targeting edge devices. This paper aims to address the aforementioned challenges by exploiting approximations on hardware for early design space exploration (DSE) of trade-offs between accuracy, area and power. We present Ellora, a DSE framework for incorporating approximations in an OFDM radar processing pipeline. Ellora uses pairs of approximate adders and multipliers to explore design points realizing energy-efficient radar processors. Particularly, we incorporate approximations into the block involving periodogram based estimation and report area, power and accuracy levels. Experimental results show that at an average accuracy loss of 0.063% in the positive SNR region, we save 22.9% of on-chip area and 26.2% of power. Towards achieving the area and power statistics, we design a fully parallel Inverse Fast Fourier Transform (IFFT) core which acts as a part of periodogram based estimation and approximate the addition and multiplication operations in it. The aforementioned results show that Ellora can be used in an integrated way with various other optimization methods for generating low-power and energy-efficient radar processors.

[103] 2312.00183

RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules

The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to the patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are continuously produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph encompassing biological knowledge about RNAs gathered from more than 50 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the bio-molecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried by a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be both directly explored and visualized, and/or analyzed by applying computational methods to infer bio-medical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the bio-medical problem to be studied.

[104] 2312.00184

Galaxy Classification: A machine learning approach for classifying shapes using numerical data

The classification of galaxies as spirals or ellipticals is a crucial task in understanding their formation and evolution. With the arrival of large-scale astronomical surveys, such as the Sloan Digital Sky Survey (SDSS), astronomers now have access to images of a vast number of galaxies. However, the visual inspection of these images is an impossible task for humans due to the sheer number of galaxies to be analyzed. To solve this problem, the Galaxy Zoo project was created to engage thousands of citizen scientists to classify the galaxies based on their visual features. In this paper, we present a machine learning model for galaxy classification using numerical data from the Galaxy Zoo[5] project. Our model utilizes a convolutional neural network architecture to extract features from galaxy images and classify them into spirals or ellipticals. We demonstrate the effectiveness of our model by comparing its performance with that of human classifiers using a subset of the Galaxy Zoo dataset. Our results show that our model achieves high accuracy in classifying galaxies and has the potential to significantly enhance our understanding of the formation and evolution of galaxies.

[105] 2312.00188

REACT: Recognize Every Action Everywhere All At Once

Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (\textbf{R}ecognize \textbf{E}very \textbf{Act}ion Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes, enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.

[106] 2312.00189

HeTriNet: Heterogeneous Graph Triplet Attention Network for Drug-Target-Disease Interaction

Modeling the interactions between drugs, targets, and diseases is paramount in drug discovery and has significant implications for precision medicine and personalized treatments. Current approaches frequently consider drug-target or drug-disease interactions individually, ignoring the interdependencies among all three entities. Within human metabolic systems, drugs interact with protein targets in cells, influencing target activities and subsequently impacting biological pathways to promote healthy functions and treat diseases. Moving beyond binary relationships and exploring tighter triple relationships is essential to understanding drugs' mechanism of action (MoAs). Moreover, identifying the heterogeneity of drugs, targets, and diseases, along with their distinct characteristics, is critical to model these complex interactions appropriately. To address these challenges, we effectively model the interconnectedness of all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (\texttt{HeTriNet}). \texttt{HeTriNet} introduces a novel triplet attention mechanism within this heterogeneous graph structure. Beyond pairwise attention as the importance of an entity for the other one, we define triplet attention to model the importance of pairs for entities in the drug-target-disease triplet prediction problem. Experimental results on real-world datasets show that \texttt{HeTriNet} outperforms several baselines, demonstrating its remarkable proficiency in uncovering novel drug-target-disease relationships.

[107] 2312.00192

Benchmarking and Enhancing Disentanglement in Concept-Residual Models

Concept bottleneck models (CBMs) are interpretable models that first predict a set of semantically meaningful features, i.e., concepts, from observations that are subsequently used to condition a downstream task. However, the model's performance strongly depends on the engineered features and can severely suffer from incomplete sets of concepts. Prior works have proposed a side channel -- a residual -- that allows for unconstrained information flow to the downstream task, thus improving model performance but simultaneously introducing information leakage, which is undesirable for interpretability. This work proposes three novel approaches to mitigate information leakage by disentangling concepts and residuals, investigating the critical balance between model performance and interpretability. Through extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets, we assess the performance of each disentanglement method and provide insights into when they work best. Further, we show how each method impacts the ability to intervene over the concepts and their subsequent impact on task performance.

[108] 2312.00193

SISO Decoding of Z4 Linear Kerdock and Preparata Codes

Some nonlinear codes, such as Kerdock and Preparata codes, can be represented as binary images under the Gray map of linear codes over rings. This paper introduces MAP decoding of Kerdock and Preparata codes by working with their quaternary representation (linear codes over Z4 ) with the complexity of O(N2log2N), where N is the code length in Z4. A sub-optimal bitwise APP decoder with good error-correcting performance and complexity of O(Nlog2N) that is constructed using the decoder lifting technique is also introduced. This APP decoder extends upon the original lifting decoder by working with likelihoods instead of hard decisions and is not limited to Kerdock and Preparata code families. Simulations show that our novel decoders significantly outperform several popular decoders in terms of error rate.

[109] 2312.00194

Robust Concept Erasure via Kernelized Rate-Distortion Maximization

Distributed representations provide a vector space that captures meaningful relationships between data instances. The distributed nature of these representations, however, entangles together multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author (age, gender, etc), etc). Recent work has proposed the task of concept erasure, in which rather than making a concept predictable, the goal is to remove an attribute from distributed representations while retaining other information from the original representation space as much as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts: categorical, continuous, and vector-valued variables from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score to evaluate their similarity with the original representation space. Additionally, we conduct experiments to showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to vector-valued variables in GPT-3 representations.

[110] 2312.00195

Raising the Bar of AI-generated Image Detection with CLIP

Aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, unlike previous belief, it is neither necessary nor convenient to use a large domain-specific dataset for training. On the contrary, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits a surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the SoTA on in-distribution data, and improve largely above it in terms of generalization to out-of-distribution data (+6% in terms of AUC) and robustness to impaired/laundered data (+13%). Our project is available at

[111] 2312.00198

Optimal Attack and Defense for Reinforcement Learning

To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.

[112] 2312.00200

Performance Analysis of Multi-Angle QAOA for p > 1

In this paper we consider the scalability of Multi-Angle QAOA with respect to the number of QAOA layers. We found that MA-QAOA is able to significantly reduce the depth of QAOA circuits, by a factor of up to 4 for the considered data sets. However, MA-QAOA is not optimal for minimization of the total QPU time. Different optimization initialization strategies are considered and compared for both QAOA and MA-QAOA. Among them, a new initialization strategy is suggested for MA-QAOA that is able to consistently and significantly outperform random initialization used in the previous studies.

[113] 2312.00201

An integrated framework for developing and evaluating an automated lecture style assessment system

The aim of the work presented in this paper is to develop and evaluate an integrated system that provides automated lecture style evaluation, allowing teachers to get instant feedback related to the goodness of their lecturing style. The proposed system aims to promote improvement of lecture quality, that could upgrade the overall student learning experience. The proposed application utilizes specific measurable biometric characteristics, such as facial expressions, body activity, speech rate and intonation, hand movement, and facial pose, extracted from a video showing the lecturer from the audience point of view. Measurable biometric features extracted during a lecture are combined to provide teachers with a score reflecting lecture style quality both at frame rate and by providing lecture quality metrics for the whole lecture. The acceptance of the proposed lecture style evaluation system was evaluated by chief education officers, teachers and students regarding the functionality, usefulness of the application, and possible improvements. The results indicate that participants found the application novel and useful in providing automated feedback regarding lecture quality. Furthermore, the performance evaluation of the proposed system was compared with the performance of humans in the task of lecture style evaluation. Results indicate that the proposed system not only achieves similar performance to human observers, but in some cases, it outperforms them.

[114] 2312.00204

DNS SLAM: Dense Neural Semantic-Informed SLAM

In recent years, coordinate-based neural implicit representations have shown promising results for the task of Simultaneous Localization and Mapping (SLAM). While achieving impressive performance on small synthetic scenes, these methods often suffer from oversmoothed reconstructions, especially for complex real-world scenes. In this work, we introduce DNS SLAM, a novel neural RGB-D semantic SLAM approach featuring a hybrid representation. Relying only on 2D semantic priors, we propose the first semantic neural SLAM method that trains class-wise scene representations while providing stable camera tracking at the same time. Our method integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and to output color, density, and semantic class information, enabling many downstream applications. To further enable real-time tracking, we introduce a lightweight coarse scene representation which is trained in a self-supervised manner in latent space. Our experimental results achieve state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. Further, our method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.

[115] 2312.00206

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views are reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.

[116] 2312.00207

EpiTESTER: Testing Autonomous Vehicles with Epigenetic Algorithm and Attention Mechanism

Testing autonomous vehicles (AVs) under various environmental scenarios that lead the vehicles to unsafe situations is known to be challenging. Given the infinite possible environmental scenarios, it is essential to find critical scenarios efficiently. To this end, we propose a novel testing method, named EpiTESTER, by taking inspiration from epigenetics, which enables species to adapt to sudden environmental changes. In particular, EpiTESTER adopts gene silencing as its epigenetic mechanism, which regulates gene expression to prevent the expression of a certain gene, and the probability of gene expression is dynamically computed as the environment changes. Given different data modalities (e.g., images, lidar point clouds) in the context of AV, EpiTESTER benefits from a multi-model fusion transformer to extract high-level feature representations from environmental factors and then calculates probabilities based on these features with the attention mechanism. To assess the cost-effectiveness of EpiTESTER, we compare it with a classical genetic algorithm (GA) (i.e., without any epigenetic mechanism implemented) and EpiTESTER with equal probability for each gene. We evaluate EpiTESTER with four initial environments from CARLA, an open-source simulator for autonomous driving research, and an end-to-end AV controller, Interfuser. Our results show that EpiTESTER achieved a promising performance in identifying critical scenarios compared to the baselines, showing that applying epigenetic mechanisms is a good option for solving practical problems.

[117] 2312.00209

On the Interplay Between Stepsize Tuning and Progressive Sharpening

Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Coehn et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to ever-increase the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming its Armijo and constant stepsizes counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.

[118] 2312.00210

DREAM: Diffusion Rectification and Estimation-Adaptive Models

We present DREAM, a novel training framework representing Diffusion Rectification and Estimation-Adaptive Models, requiring minimal code changes (just three lines) yet significantly enhancing the alignment of training with sampling in diffusion models. DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments demonstrate DREAM's superiority over standard diffusion-based SR methods, showing a $2$ to $3\times $ faster training convergence and a $10$ to $20\times$ reduction in necessary sampling steps to achieve comparable or superior results. We hope DREAM will inspire a rethinking of diffusion model training paradigms.

[119] 2312.00214

Relevance-guided Neural Machine Translation

With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of available monolingual and/or parallel data; hence, the need for methods addressing data scarcity in an efficient, and explainable way, is eminent. We propose an explainability-based training approach for NMT, applied in Unsupervised and Supervised model training, for translation of three languages of varying resources, French, Gujarati, Kazakh, to and from English. Our results show our method can be promising, particularly when training in low-resource conditions, outperforming simple training baselines; though the improvement is marginal, it sets the ground for further exploration of the approach and the parameters, and its extension to other languages.

[120] 2312.00215

Learning active tactile perception through belief-space control

Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.

[121] 2312.00216

Medication abortion via digital health in the United States: a systematic scoping review

Digital health, including telemedicine, has increased access to abortion care. The convenience, flexibility of appointment times, and ensured privacy to abortion users may make abortion services via telemedicine preferable. This scoping review systematically mapped studies conducted on abortion services via telemedicine, including their effectiveness and acceptability for abortion users and providers. All published papers included abortion services via telemedicine in the United States were considered. Articles were searched in PubMed, CINAHL, and Google Scholar databases in September 2022. The findings were synthesized narratively, and the PRISMA-ScR guidelines were used to report this study. Out of 757 retrieved articles, 33 articles were selected based on the inclusion criteria. These studies were published between 2011 and 2022, with 24 published in the last 3 years. The study found that telemedicine increased access to abortion care in the United States, especially for people in remote areas or those worried about stigma from in-person visits. The effectiveness of abortion services via telemedicine was comparable to in-clinic visits, with 6% or fewer abortions requiring surgical intervention. Both care providers and abortion seekers expressed positive perceptions of telemedicine-based abortion services. However, abortion users reported mixed emotions, with some preferring in-person visits. The most common reasons for choosing telemedicine included the distance to the abortion clinic, convenience, privacy, cost, flexibility of appointment times, and state laws imposing waiting periods or restrictive policies. Telemedicine offered a preferable option for abortion seekers and providers. The feasibility of accessing abortion services via telemedicine in low-resource settings needs further investigation.

[122] 2312.00218

Cascaded Channel Decoupling Based Solution for RIS Regulation Matrix

This article presents a novel solution for reconfigurable intelligent surfaces (RISs) based on cascaded channel decoupling. The proposed mechanism simplifies the RIS regulation matrix, by decomposing the electromagnetic wave regulation process into two sub-processes: virtual receiving response and virtual regular transmission, which leads to the decoupling of the RIS cascaded channel. This article further discusses the concrete implementation of the proposed channel decoupling mechanism in two scenarios of single-user access and multi-user access, and gives the corresponding detailed scheme. The numerical simulation results demonstrate that the proposed channel decoupling scheme is a low-complexity and effective solution for resolving the RIS regulation matrix.

[123] 2312.00220

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution, significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.

[124] 2312.00224

Unsupervised textile defect detection using convolutional neural networks

In this study, we propose a novel motif-based approach for unsupervised textile anomaly detection that combines the benefits of traditional convolutional neural networks with those of an unsupervised learning paradigm. It consists of five main steps: preprocessing, automatic pattern period extraction, patch extraction, features selection and anomaly detection. This proposed approach uses a new dynamic and heuristic method for feature selection which avoids the drawbacks of initialization of the number of filters (neurons) and their weights, and those of the backpropagation mechanism such as the vanishing gradients, which are common practice in the state-of-the-art methods. The design and training of the network are performed in a dynamic and input domain-based manner and, thus, no ad-hoc configurations are required. Before building the model, only the number of layers and the stride are defined. We do not initialize the weights randomly nor do we define the filter size or number of filters as conventionally done in CNN-based approaches. This reduces effort and time spent on hyperparameter initialization and fine-tuning. Only one defect-free sample is required for training and no further labeled data is needed. The trained network is then used to detect anomalies on defective fabric samples. We demonstrate the effectiveness of our approach on the Patterned Fabrics benchmark dataset. Our algorithm yields reliable and competitive results (on recall, precision, accuracy and f1- measure) compared to state-of-the-art unsupervised approaches, in less time, with efficient training in a single epoch and a lower computational cost.

[125] 2312.00228

Multi-Axis and Multi-Vector Gradient Estimations: Using Multi-Sampled Complex Unit Vectors to Estimate Gradients of Real Functions

In this preliminary study, we provide two methods for estimating the gradients of functions of real value. Both methods are built on derivative estimations that are calculated using the standard method or the Squire-Trapp method for any given direction. Gradients are computed as the average of derivatives in uniformly sampled directions. The first method uses a uniformly distributed set of axes that consists of orthogonal unit vectors that span the space. The second method only uses a uniformly distributed set of unit vectors. Both methods essentially minimize the error through an average of estimations to cancel error terms. Both methods are essentially a conceptual generalization of the method used to estimate normal fractal surfaces.

[126] 2312.00229

WENO based adaptive image zooming algorithm

Image zooming or upsampling is a widely used tool in image processing and an essential step in many algorithms. Upsampling increases the number of pixels and introduces new information into the image, which can lead to numerical effects such as ringing artifacts, aliasing effects, and blurring of the image. In this paper, we propose an efficient polynomial interpolation algorithm based on the WENO algorithm for image upsampling that provides high accuracy in smooth regions, preserves edges and reduces aliasing effects. Although this is not the first application of WENO interpolation for image resampling, it is designed to have comparable complexity and memory load with better image quality than the separable WENO algorithm. We show that the algorithm performs equally well on smooth 2D functions, artificial pixel art, and real digital images. Comparison with similar methods on test images shows good results on standard metrics and also provides visually satisfactory results. Moreover, the low complexity of the algorithm is ensured by a small local approximation stencil and the appropriate choice of smoothness indicators.

[127] 2312.00232

Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks

Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.

[128] 2312.00233

The role of interface design on prompt-mediated creativity in Generative AI

Generative AI models for the creation of images is becoming a staple in the toolkit of digital artists and visual designers. The interaction with these systems is mediated by prompting, a process in which users write a short text to describe the desired image's content and style. The study of prompts offers an unprecedented opportunity to gain insight into the process of human creativity, yet our understanding of how people use them remains limited. We analyze more than 145,000 prompts from the logs of two Generative AI platforms (Stable Diffusion and Pick-a-Pic) to shed light on how people explore new concepts over time, and how their exploration might be influenced by different design choices in human-computer interfaces to Generative AI. We find that users exhibit a tendency towards exploration of new topics over exploitation of concepts visited previously. However, a comparative analysis of the two platforms, which differ both in scope and functionalities, reveals that the introduction of features diverting user focus from prompting and providing instead shortcuts for generating new image variants with simple clicks is associated with a considerable reduction in both exploration of novel concepts and detail in the submitted prompts. These results carry direct implications for the design of human interfaces to Generative AI and raise new questions regarding how the process of prompting should be aided in ways that best support creativity.

[129] 2312.00234

Deep Equilibrium Based Neural Operators for Steady-State PDEs

Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.

[130] 2312.00236

Brainformer: Modeling MRI Brain Functions to Machine Vision

"Perception is reality". Human perception plays a vital role in forming beliefs and understanding reality. Exploring how the human brain works in the visual system facilitates bridging the gap between human visual perception and computer vision models. However, neuroscientists study the brain via Neuroimaging, i.e., Functional Magnetic Resonance Imaging (fMRI), to discover the brain's functions. These approaches face interpretation challenges where fMRI data can be complex and require expertise. Therefore, neuroscientists make inferences about cognitive processes based on patterns of brain activities, which can lead to potential misinterpretation or limited functional understanding. In this work, we first present a simple yet effective Brainformer approach, a novel Transformer-based framework, to analyze the patterns of fMRI in the human perception system from the machine learning perspective. Secondly, we introduce a novel mechanism incorporating fMRI, which represents the human brain activities, as the supervision for the machine vision model. This work also introduces a novel perspective on transferring knowledge from human perception to neural networks. Through our experiments, we demonstrated that by leveraging fMRI information, the machine vision model can achieve potential results compared to the current State-of-the-art methods in various image recognition tasks.

[131] 2312.00237

Negotiated Representations to Prevent Forgetting in Machine Learning Applications

Catastrophic forgetting is a significant challenge in the field of machine learning, particularly in neural networks. When a neural network learns to perform well on a new task, it often forgets its previously acquired knowledge or experiences. This phenomenon occurs because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt the representations that were crucial for the previous tasks. As a result, the the performance of the network on earlier tasks deteriorates, limiting its ability to learn and adapt to a sequence of tasks. In this paper, we propose a novel method for preventing catastrophic forgetting in machine learning applications, specifically focusing on neural networks. Our approach aims to preserve the knowledge of the network across multiple tasks while still allowing it to learn new information effectively. We demonstrate the effectiveness of our method by conducting experiments on various benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are created by dividing the original datasets into separate, non overlapping tasks, simulating a continual learning scenario where the model needs to learn multiple tasks sequentially without forgetting the previous ones. Our proposed method tackles the catastrophic forgetting problem by incorporating negotiated representations into the learning process, which allows the model to maintain a balance between retaining past experiences and adapting to new tasks. By evaluating our method on these challenging datasets, we aim to showcase its potential for addressing catastrophic forgetting and improving the performance of neural networks in continual learning settings.

[132] 2312.00238

Self-similarity of Communities of the ABCD Model

The Artificial Benchmark for Community Detection (ABCD) graph is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs similar to the well-known LFR model but it is faster and can be investigated analytically. In this paper, we show that the ABCD model exhibits some interesting self-similar behaviour, namely, the degree distribution of ground-truth communities is asymptotically the same as the degree distribution of the whole graph (appropriately normalized based on their sizes). As a result, we can not only estimate the number of edges induced by each community but also the number of self-loops and multi-edges generated during the process. Understanding these quantities is important as (a) rewiring self-loops and multi-edges to keep the graph simple is an expensive part of the algorithm, and (b) every rewiring causes the underlying configuration models to deviate slightly from uniform simple graphs on their corresponding degree sequences.

[133] 2312.00243

Low Revenue in Display Ad Auctions: Algorithmic Collusion vs. Non-Quasilinear Preferences

The transition of display ad exchanges from second-price to first-price auctions has raised questions about its impact on revenue. Evaluating this shift empirically proves challenging. One key factor is the behavior of automated bidding agents, who are unlikely to use static game-theoretical equilibrium strategies instead of favoring dynamic realms that continuously adapt and learn independently through the process of exploration and exploitation. Thus revenue equivalence between first- and second-price auctions might not hold. Research on algorithmic collusion in display ad auctions found revenue differences between second-price and first-price auctions. First-price auctions can induce Q-learning agents to tacitly collude below the Nash equilibrium in repeated complete-information auctions with payoff-maximizing agents (i.e., agents maximizing value minus price). Our analysis explores wide-spread online learning algorithms' convergence behavior in both complete and incomplete information models, but does not find a systematic deviance from equilibrium behavior. Convergence for Q-learning depends on hyperparameters and initializations, and algorithmic collusion vanishes when competing against other learning algorithms. Apart from their learning behavior, the objectives reported in the literature extend payoff maximization, often focusing on return-on-investment or return-on-spend. We derive equilibrium bid functions for such utility models, revealing that revenue equivalence doesn't hold. In low-competition scenarios, the first-price auction often yields lower revenue than the second-price counterpart. These insights offer an alternative rationale for the potential revenue decrease in first-price auctions. Understanding the intricate interplay of auction rules, learning algorithms, and utility models is crucial in maximizing revenue in the ever-evolving world of display ad exchanges.

[134] 2312.00245

SPAM: Secure & Private Aircraft Management

With the rising use of aircrafts for operations ranging from disaster-relief to warfare, there is a growing risk of adversarial attacks. Malicious entities often only require the location of the aircraft for these attacks. Current satellite-aircraft communication and tracking protocols put aircrafts at risk if the satellite is compromised, due to computation being done in plaintext. In this work, we present \texttt{SPAM}, a private, secure, and accurate system that allows satellites to efficiently manage and maintain tracking angles for aircraft fleets without learning aircrafts' locations. \texttt{SPAM} is built upon multi-party computation and zero-knowledge proofs to guarantee privacy and high efficiency. While catered towards aircrafts, \texttt{SPAM}'s zero-knowledge fleet management can be easily extended to the IoT, with very little overhead.

[135] 2312.00246

Curvature Explains Loss of Plasticity

Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.

[136] 2312.00250

Advancements and Trends in Ultra-High-Resolution Image Processing: An Overview

Currently, to further improve visual enjoyment, Ultra-High-Definition (UHD) images are catching wide attention. Here, UHD images are usually referred to as having a resolution greater than or equal to $3840 \times 2160$. However, since the imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, etc. To address these issues, a large number of algorithms for UHD image enhancement have been proposed. In this paper, we introduce the current state of UHD image enhancement from two perspectives, one is the application field and the other is the technology. In addition, we briefly explore its trends.

[137] 2312.00252

PyNeRF: Pyramidal Neural Radiance Fields

Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples but such approaches rely on positional encodings that are not readily compatible with grid methods. We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions. At render time, we simply use coarser grids to render samples that cover larger volumes. Our method can be easily applied to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20-90% across synthetic and unbounded real-world scenes) while incurring minimal performance overhead (as each model head is quick to evaluate). Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.

[138] 2312.00259

Scalable Cellular V2X Solutions: Large-Scale Deployment Challenges of Connected Vehicle Safety Networks

Vehicle-to-Everything (V2X) communication is expected to accomplish a long-standing goal of the Connected and Autonomous Vehicle (CAV) community to bring connected vehicles to roads on a large scale. A major challenge, and perhaps the biggest hurdle on the path towards this goal is the scalability issues associated with it, especially when vehicular safety is concerned. As a major stakeholder, 3rd Generation Partnership Project (3GPP) based Cellular V2X (C-V2X) community has long been trying to research on whether vehicular networks are able to support the safety-critical applications in high-density vehicular scenarios. This paper attempts to answer this by first presenting an overview on the scalability challenges faced by 3GPP Release 14 Long Term Evolution C-V2X (LTE-V2X) using the PC5 sidelink interface for low and heavy-density traffic scenarios. Next, it demonstrates a series of solutions that address network congestion, packet losses and other scalability issues associated with LTE-V2X to enable this communication technology for commercial deployment. In addition, a brief survey is provided into 3GPP Release 16 5G New Radio V2X (NR-V2X) that utilizes the NR sidelink interface and works as an evolution of C-V2X towards better performance for V2X communications including new enhanced V2X (eV2X) scenarios that possess ultra-low-latency and high-reliability requirements.

[139] 2312.00262

Augmented Kinesthetic Teaching: Enhancing Task Execution Efficiency through Intuitive Human Instructions

In this paper, we present a complete and efficient implementation of a knowledge-sharing augmented kinesthetic teaching approach for efficient task execution in robotics. Our augmented kinesthetic teaching method integrates intuitive human feedback, including verbal, gesture, gaze, and physical guidance, to facilitate the extraction of multiple layers of task information including control type, attention direction, input and output type, action state change trigger, etc., enhancing the adaptability and autonomy of robots during task execution. We propose an efficient Programming by Demonstration (PbD) framework for users with limited technical experience to teach the robot in an intuitive manner. The proposed framework provides an interface for such users to teach customized tasks using high-level commands, with the goal of achieving a smoother teaching experience and task execution. This is demonstrated with the sample task of pouring water.

[140] 2312.00265

RoboSync: OS for Social Robots with Customizable Behaviour

Traditional robotic systems require complex implementations that are not always accessible or easy to use for Human-Robot Interaction (HRI) application developers. With the aim of simplifying the implementation of HRI applications, this paper introduces a novel real-time operating system (RTOS) designed for customizable HRI - RoboSync. By creating multi-level abstraction layers, the system enables users to define complex emotional and behavioral models without needing deep technical expertise. The system's modular architecture comprises a behavior modeling layer, a machine learning plugin configuration layer, a sensor checks customization layer, a scheduler that fits the need of HRI, and a communication and synchronization layer. This approach not only promotes ease of use without highly specialized skills but also ensures real-time responsiveness and adaptability. The primary functionality of the RTOS has been implemented for proof of concept and was tested on a CortexM4 microcontroller, demonstrating its potential for a wide range of lightweight simple-to-implement social robotics applications.

[141] 2312.00267

Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration

Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. After, we extend the setting and methodology for practical use in RLHF training of large language models. Here, our method is able to reach better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.

[142] 2312.00268

Academic competitions

Academic challenges comprise effective means for (i) advancing the state of the art, (ii) putting in the spotlight of a scientific community specific topics and problems, as well as (iii) closing the gap for under represented communities in terms of accessing and participating in the shaping of research fields. Competitions can be traced back for centuries and their achievements have had great influence in our modern world. Recently, they (re)gained popularity, with the overwhelming amounts of data that is being generated in different domains, as well as the need of pushing the barriers of existing methods, and available tools to handle such data. This chapter provides a survey of academic challenges in the context of machine learning and related fields. We review the most influential competitions in the last few years and analyze challenges per area of knowledge. The aims of scientific challenges, their goals, major achievements and expectations for the next few years are reviewed.

[143] 2312.00269

Adaptability of Computer Vision at the Tactical Edge: Addressing Environmental Uncertainty

Computer Vision (CV) systems are increasingly being adopted into Command and Control (C2) systems to improve intelligence analysis on the battlefield, the tactical edge. CV systems leverage Artificial Intelligence (AI) algorithms to help visualize and interpret the environment, enhancing situational awareness. However, the adaptability of CV systems at the tactical edge remains challenging due to rapidly changing environments and objects which can confuse the deployed models. A CV model leveraged in this environment can become uncertain in its predictions, as the environment and the objects existing in the environment begin to change. Additionally, mission objectives can rapidly change leading to adjustments in technology, camera angles, and image resolutions. All of which can negatively affect the performance of and potentially introduce uncertainty into the system. When the training environment and/or technology differs from the deployment environment, CV models can perform unexpectedly. Unfortunately, most scenarios at the tactical edge do not incorporate Uncertainty Quantification (UQ) into their deployed C2 and CV systems. This concept paper explores the idea of synchronizing robust data operations and model fine-tuning driven by UQ all at the tactical edge. Specifically, curating datasets and training child models based on the residuals of predictions, using these child models to calculate prediction intervals (PI), and then using these PI to calibrate the deployed models. By incorporating UQ into the core operations surrounding C2 and CV systems at the tactical edge, we can help drive purposeful adaptability on the battlefield.

[144] 2312.00271

Towards Clinical Prediction with Transparency: An Explainable AI Approach to Survival Modelling in Residential Aged Care

Background: Accurate survival time estimates aid end-of-life medical decision-making. Objectives: Develop an interpretable survival model for elderly residential aged care residents using advanced machine learning. Setting: A major Australasian residential aged care provider. Participants: Residents aged 65+ admitted for long-term care from July 2017 to August 2023. Sample size: 11,944 residents across 40 facilities. Predictors: Factors include age, gender, health status, co-morbidities, cognitive function, mood, nutrition, mobility, smoking, sleep, skin integrity, and continence. Outcome: Probability of survival post-admission, specifically calibrated for 6-month survival estimates. Statistical Analysis: Tested CoxPH, EN, RR, Lasso, GB, XGB, and RF models in 20 experiments with a 90/10 train/test split. Evaluated accuracy using C-index, Harrell's C-index, dynamic AUROC, IBS, and calibrated ROC. Chose XGB for its performance and calibrated it for 1, 3, 6, and 12-month predictions using Platt scaling. Employed SHAP values to analyze predictor impacts. Results: GB, XGB, and RF models showed the highest C-Index values (0.714, 0.712, 0.712). The optimal XGB model demonstrated a 6-month survival prediction AUROC of 0.746 (95% CI 0.744-0.749). Key mortality predictors include age, male gender, mobility, health status, pressure ulcer risk, and appetite. Conclusions: The study successfully applies machine learning to create a survival model for aged care, aligning with clinical insights on mortality risk factors and enhancing model interpretability and clinical utility through explainable AI.

[145] 2312.00273

Mark My Words: Analyzing and Evaluating Language Model Watermarks

The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques - as opposed to image watermarks - and proposes a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g. the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: Kirchenbauer et al. can watermark Llama2-7B-chat with no perceivable loss in quality in under 100 tokens, and with good tamper-resistance to simple attacks, regardless of temperature. We argue that watermark indistinguishability is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark.

[146] 2312.00276

Automating Continual Learning

General-purpose learning systems should improve themselves in open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF) -- previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to meta-learn their own in-context continual (meta-)learning algorithms. ACL encodes all desiderata -- good performance on both old and new tasks -- into its meta-learning objectives. Our experiments demonstrate that ACL effectively solves "in-context catastrophic forgetting"; our ACL-learned algorithms outperform hand-crafted ones, e.g., on the Split-MNIST benchmark in the replay-free setting, and enables continual learning of diverse tasks consisting of multiple few-shot and standard image classification datasets.

[147] 2312.00277

Text Attribute Control via Closed-Loop Disentanglement

Changing an attribute of a text without changing the content usually requires to first disentangle the text into irrelevant attributes and content representations. After that, in the inference phase, the representation of one attribute is tuned to a different value, expecting that the corresponding attribute of the text can also be changed accordingly. The usual way of disentanglement is to add some constraints on the latent space of an encoder-decoder architecture, including adversarial-based constraints and mutual-information-based constraints. However, the previous semi-supervised processes of attribute change are usually not enough to guarantee the success of attribute change and content preservation. In this paper, we propose a novel approach to achieve a robust control of attributes while enhancing content preservation. In this approach, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces. Differently from previous works, we re-disentangle the reconstructed sentence and compare the re-disentangled latent space with the original latent space, which makes a closed-loop disentanglement process. This also helps content preservation. In addition, the contrastive learning method is also able to replace the role of minimizing mutual information and adversarial training in the disentanglement process, which alleviates the computation cost. We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset. The experimental results show the effectiveness of our model.

[148] 2312.00279

Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach

With the rapid development of Mobile Edge Computing (MEC), various real-time applications have been deployed to benefit people's daily lives. The performance of these applications relies heavily on the freshness of collected environmental information, which can be quantified by its Age of Information (AoI). In the traditional definition of AoI, it is assumed that the status information can be actively sampled and directly used. However, for many MEC-enabled applications, the desired status information is updated in an event-driven manner and necessitates data processing. To better serve these applications, we propose a new definition of AoI and, based on the redefined AoI, we formulate an online AoI minimization problem for MEC systems. Notably, the problem can be interpreted as a Markov Decision Process (MDP), thus enabling its solution through Reinforcement Learning (RL) algorithms. Nevertheless, the traditional RL algorithms are designed for MDPs with completely unknown system dynamics and hence usually suffer long convergence times. To accelerate the learning process, we introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics. We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness. Numerical results demonstrate that our algorithm outperforms the benchmarks under various scenarios.

[149] 2312.00281

A Scale-out Decentralized Blockchain Ledger System for Web3.0

The development of underlying technologies in blockchain mostly revolves around a difficult problem: how to enhance the performance of the system and reduce various costs of nodes (such as communication, storage and verification) without compromising the system's security and decentralization. Various layer-1 and layer-2 protocols have provided excellent solutions for this challenge. However, they cannot yet be considered as a ``silver bullet". This paper proposes EZchain -- a novel decentralized ``scale-out" ledger system designed for web3.0, aiming to enable blockchain technology to truly support ledger applications in large-scale fully decentralized networks. Without compromising security and decentralization, EZchain successfully accomplishes the following milestones: 1) Scalability: The theoretical throughput of EZchain can be infinitely expanded, nearly unaffected by bandwidth and other resource constraints. 2) Consumer-Grade Hardware Compatibility: EZchain is designed to be compatible with consumer-grade hardware, supporting storage, computation, and verification requirements. 3) Efficient Transaction Confirmation: EZchain strives to maintain transaction confirmation delays within one minute. Our prototype experiment demonstrates that under typical daily bandwidth network conditions, EZchain's performance in all aspects approaches that of the accounts in centralized payment systems. This provides a solid infrastructure for realizing mobile payments in web3.0.

[150] 2312.00289

Robust Generalized Proportional Integral Control for Trajectory Tracking of Soft Actuators in a Pediatric Wearable Assistive Device

Soft robotics hold promise in the development of safe yet powered assistive wearable devices for infants. Key to this is the development of closed-loop controllers that can help regulate pneumatic pressure in the device's actuators in an effort to induce controlled motion at the user's limbs and be able to track different types of trajectories. This work develops a controller for soft pneumatic actuators aimed to power a pediatric soft wearable robotic device prototype for upper extremity motion assistance. The controller tracks desired trajectories for a system of soft pneumatic actuators supporting two-degree-of-freedom shoulder joint motion on an infant-sized engineered mannequin. The degrees of freedom assisted by the actuators are equivalent to shoulder motion (abduction/adduction and flexion/extension). Embedded inertial measurement unit sensors provide real-time joint feedback. Experimental data from performing reaching tasks using the engineered mannequin are obtained and compared against ground truth to evaluate the performance of the developed controller. Results reveal the proposed controller leads to accurate trajectory tracking performance across a variety of shoulder joint motions.

[151] 2312.00290

Learning to forecast diagnostic parameters using pre-trained weather embedding

Data-driven weather prediction (DDWP) models are increasingly becoming popular for weather forecasting. However, while operational weather forecasts predict a wide variety of weather variables, DDWPs currently forecast a specific set of key prognostic variables. Non-prognostic ("diagnostic") variables are sometimes modeled separately as dependent variables of the prognostic variables (c.f. FourCastNet), or by including the diagnostic variable as a target in the DDWP. However, the cost of training and deploying bespoke models for each diagnostic variable can increase dramatically with more diagnostic variables, and limit the operational use of such models. Likewise, retraining an entire DDWP each time a new diagnostic variable is added is also cost-prohibitive. We present an two-stage approach that allows new diagnostic variables to be added to an end-to-end DDWP model without the expensive retraining. In the first stage, we train an autoencoder that learns to embed prognostic variables into a latent space. In the second stage, the autoencoder is frozen and "downstream" models are trained to predict diagnostic variables using only the latent representations of prognostic variables as input. Our experiments indicate that models trained using the two-stage approach offer accuracy comparable to training bespoke models, while leading to significant reduction in resource utilization during training and inference. This approach allows for new "downstream" models to be developed as needed, without affecting existing models and thus reducing the friction in operationalizing new models.

[152] 2312.00291

The Existence the Solution of Nonlinear Discrete Schemes and Convergence of a Linearized Iterative Method for time-dependent PNP Equations

We establish the existence theory of several commonly used finite element (FE) nonlinear fully discrete solutions, and the convergence theory of a linearized iteration. First, it is shown for standard FE, SUPG and edge-averaged method respectively that the stiffness matrix is a column M-matrix under certain conditions, and then the existence theory of these three FE nonlinear fully discrete solutions is presented by using Brouwer's fixed point theorem. Second, the contraction of a commonly used linearized iterative method-Gummel iteration is proven, and then the convergence theory is established for the iteration. At last, a numerical experiment is shown to verifies the theories.

[153] 2312.00292

SEPSIS: I Can Catch Your Lies -- A New Paradigm for Deception Detection

Deception is the intentional practice of twisting information. It is a nuanced societal practice deeply intertwined with human societal evolution, characterized by a multitude of facets. This research explores the problem of deception through the lens of psychology, employing a framework that categorizes deception into three forms: lies of omission, lies of commission, and lies of influence. The primary focus of this study is specifically on investigating only lies of omission. We propose a novel framework for deception detection leveraging NLP techniques. We curated an annotated dataset of 876,784 samples by amalgamating a popular large-scale fake news dataset and scraped news headlines from the Twitter handle of Times of India, a well-known Indian news media house. Each sample has been labeled with four layers, namely: (i) the type of omission (speculation, bias, distortion, sounds factual, and opinion), (ii) colors of lies(black, white, etc), and (iii) the intention of such lies (to influence, etc) (iv) topic of lies (political, educational, religious, etc). We present a novel multi-task learning pipeline that leverages the dataless merging of fine-tuned language models to address the deception detection task mentioned earlier. Our proposed model achieved an F1 score of 0.87, demonstrating strong performance across all layers including the type, color, intent, and topic aspects of deceptive content. Finally, our research explores the relationship between lies of omission and propaganda techniques. To accomplish this, we conducted an in-depth analysis, uncovering compelling findings. For instance, our analysis revealed a significant correlation between loaded language and opinion, shedding light on their interconnectedness. To encourage further research in this field, we will be making the models and dataset available with the MIT License, making it favorable for open-source research.

[154] 2312.00293

PsyAttention: Psychological Attention Model for Personality Detection

Work on personality detection has tended to incorporate psychological features from different personality models, such as BigFive and MBTI. There are more than 900 psychological features, each of which is helpful for personality detection. However, when used in combination, the application of different calculation standards among these features may result in interference between features calculated using distinct systems, thereby introducing noise and reducing performance. This paper adapts different psychological models in the proposed PsyAttention for personality detection, which can effectively encode psychological features, reducing their number by 85%. In experiments on the BigFive and MBTI models, PysAttention achieved average accuracy of 65.66% and 86.30%, respectively, outperforming state-of-the-art methods, indicating that it is effective at encoding psychological features.

[155] 2312.00296

Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results

Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding.

[156] 2312.00303

A Review of the In-Network Computing and Its Role in the Edge-Cloud Continuum

Future networks are anticipated to enable exciting applications and industrial services ranging from Multisensory Extended Reality to Holographic and Haptic communication. These services are accompanied by high bandwidth requirements and/or require low latency and low reliability, which leads to the need for scarce and expensive resources. Cloud and edge computing offer different functionalities to these applications that require communication, computing, and caching (3C) resources working collectively. Hence, a paradigm shift is necessary to enable the joint management of the 3Cs in the edge-cloud continuum. We argue that In-Network Computing (INC) is the missing element that completes the edge-cloud continuum. This paper provides a detailed analysis of the driving use-cases, explores the synergy between INC and 3C, and emphasizes the crucial role of INC. A discussion on the opportunities and challenges posed by INC is held from various perspectives, including hardware implementation, architectural design, and regulatory and commercial aspects.

[157] 2312.00304

Developmental Pretraining (DPT) for Image Classification Networks

In the backdrop of increasing data requirements of Deep Neural Networks for object recognition that is growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is designed as a curriculum-based pre-training approach designed to rival traditional pre-training techniques that are data-hungry. These training approaches also introduce unnecessary features that could be misleading when the network is employed in a downstream classification task where the data is sufficiently different from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach where carefully-selected primitive and universal features like edges and shapes are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.

[158] 2312.00308

A knowledge-based data-driven (KBDD) framework for all-day identification of cloud types using satellite remote sensing

Cloud types, as a type of meteorological data, are of particular significance for evaluating changes in rainfall, heatwaves, water resources, floods and droughts, food security and vegetation cover, as well as land use. In order to effectively utilize high-resolution geostationary observations, a knowledge-based data-driven (KBDD) framework for all-day identification of cloud types based on spectral information from Himawari-8/9 satellite sensors is designed. And a novel, simple and efficient network, named CldNet, is proposed. Compared with widely used semantic segmentation networks, including SegNet, PSPNet, DeepLabV3+, UNet, and ResUnet, our proposed model CldNet with an accuracy of 80.89+-2.18% is state-of-the-art in identifying cloud types and has increased by 32%, 46%, 22%, 2%, and 39%, respectively. With the assistance of auxiliary information (e.g., satellite zenith/azimuth angle, solar zenith/azimuth angle), the accuracy of CldNet-W using visible and near-infrared bands and CldNet-O not using visible and near-infrared bands on the test dataset is 82.23+-2.14% and 73.21+-2.02%, respectively. Meanwhile, the total parameters of CldNet are only 0.46M, making it easy for edge deployment. More importantly, the trained CldNet without any fine-tuning can predict cloud types with higher spatial resolution using satellite spectral data with spatial resolution 0.02{\deg}*0.02{\deg}, which indicates that CldNet possesses a strong generalization ability. In aggregate, the KBDD framework using CldNet is a highly effective cloud-type identification system capable of providing a high-fidelity, all-day, spatiotemporal cloud-type database for many climate assessment fields.

[159] 2312.00311

3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation

3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However, existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals, such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation, which is prone to issues like local optima and gradient instability. In this paper, we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically, PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently, by introducing grid anchors and computing different statistical distances from these anchors to the point sets, PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. The project will be publicly available.

[160] 2312.00312

Segment Anything Model-guided Collaborative Learning Network for Scribble-supervised Polyp Segmentation

Polyp segmentation plays a vital role in accurately locating polyps at an early stage, which holds significant clinical importance for the prevention of colorectal cancer. Various polyp segmentation methods have been developed using fully-supervised deep learning techniques. However, pixel-wise annotation for polyp images by physicians during the diagnosis is both time-consuming and expensive. Moreover, visual foundation models such as the Segment Anything Model (SAM) have shown remarkable performance. Nevertheless, directly applying SAM to medical segmentation may not produce satisfactory results due to the inherent absence of medical knowledge. In this paper, we propose a novel SAM-guided Collaborative Learning Network (SAM-CLNet) for scribble-supervised polyp segmentation, enabling a collaborative learning process between our segmentation network and SAM to boost the model performance. Specifically, we first propose a Cross-level Enhancement and Aggregation Network (CEA-Net) for weakly-supervised polyp segmentation. Within CEA-Net, we propose a Cross-level Enhancement Module (CEM) that integrates the adjacent features to enhance the representation capabilities of different resolution features. Additionally, a Feature Aggregation Module (FAM) is employed to capture richer features across multiple levels. Moreover, we present a box-augmentation strategy that combines the segmentation maps generated by CEA-Net with scribble annotations to create more precise prompts. These prompts are then fed into SAM, generating segmentation SAM-guided masks, which can provide additional supervision to train CEA-Net effectively. Furthermore, we present an Image-level Filtering Mechanism to filter out unreliable SAM-guided masks. Extensive experimental results show that our SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods.

[161] 2312.00313

Improving Normalization with the James-Stein Estimator

Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.

[162] 2312.00314

A bilevel optimal motion planning (BOMP) model with application to autonomous parking

In this paper, we present a bilevel optimal motion planning (BOMP) model for autonomous parking. The BOMP model treats motion planning as an optimal control problem, in which the upper level is designed for vehicle nonlinear dynamics, and the lower level is for geometry collision-free constraints. The significant feature of the BOMP model is that the lower level is a linear programming problem that serves as a constraint for the upper-level problem. That is, an optimal control problem contains an embedded optimization problem as constraints. Traditional optimal control methods cannot solve the BOMP problem directly. Therefore, the modified approximate Karush-Kuhn-Tucker theory is applied to generate a general nonlinear optimal control problem. Then the pseudospectral optimal control method solves the converted problem. Particularly, the lower level is the $J_2$-function that acts as a distance function between convex polyhedron objects. Polyhedrons can approximate vehicles in higher precision than spheres or ellipsoids. Besides, the modified $J_2$-function (MJ) and the active-points based modified $J_2$-function (APMJ) are proposed to reduce the variables number and time complexity. As a result, an iteirative two-stage BOMP algorithm for autonomous parking concerning dynamical feasibility and collision-free property is proposed. The MJ function is used in the initial stage to find an initial collision-free approximate optimal trajectory and the active points, then the APMJ function in the final stage finds out the optimal trajectory. Simulation results and experiment on Turtlebot3 validate the BOMP model, and demonstrate that the computation speed increases almost two orders of magnitude compared with the area criterion based collision avoidance method.

[163] 2312.00315

Multiple Control Functionals for Interconnected Time-Delay Systems

Safety is essential for autonomous systems, in particular for interconnected systems in which the interactions among subsystems are involved. Motivated by the recent interest in cyber-physical and interconnected autonomous systems, we address the safe stabilization problem of interconnected systems with time delays. We propose multiple control Lyapunov and barrier functionals for the stabilization and safety control problems, respectively. In order to investigate the safe stabilization control problem, the proposed multiple control functionals are combined together via two methods: the optimization-based method and the sliding mode based method. The resulting controllers can be of either explicit or implicit forms, both of which ensure the safe stabilization objective of the whole system. The derived results are illustrated via a reach-avoid problem of multi-robot systems.

[164] 2312.00316

Improving Efficiency of DNN-based Relocalization Module for Autonomous Driving with Server-side Computing

In this work, we present a novel framework for camera relocation in autonomous vehicles, leveraging deep neural networks (DNN). While existing literature offers various DNN-based camera relocation methods, their deployment is hindered by their high computational demands during inference. In contrast, our approach addresses this challenge through edge cloud collaboration. Specifically, we strategically offload certain modules of the neural network to the server and evaluate the inference time of data frames under different network segmentation schemes to guide our offloading decisions. Our findings highlight the vital role of server-side offloading in DNN-based camera relocation for autonomous vehicles, and we also discuss the results of data fusion. Finally, we validate the effectiveness of our proposed framework through experimental evaluation.

[165] 2312.00320

On Multi-step Fuzzy Inference in Goedel Logic

This paper addresses the logical and computational foundations of multi-step fuzzy inference using the Mamdani-Assilian type of fuzzy rules by implementing such inference in Goedel logic with truth constants. We apply the results achieved in the development of a hyperresolution calculus for this logic. We pose three fundamental problems: reachability, stability, the existence of a k-cycle in multi-step fuzzy inference and reduce them to certain deduction and unsatisfiability problems. The corresponding unsatisfiability problems may be solved using hyperresolution.

[166] 2312.00324

Machine Learning for Actionable Warning Identification: A Comprehensive Survey

Actionable Warning Identification (AWI) plays a crucial role in improving the usability of static code analyzers. With recent advances in Machine Learning (ML), various approaches have been proposed to incorporate ML techniques into AWI. These ML-based AWI approaches, benefiting from ML's strong ability to learn subtle and previously unseen patterns from historical data, have demonstrated superior performance. However, a comprehensive overview of these approaches is missing, which could hinder researchers/practitioners from understanding the current process and discovering potential for future improvement in the ML-based AWI community. In this paper, we systematically review the state-of-the-art ML-based AWI approaches. First, we employ a meticulous survey methodology and gather 50 primary studies from 2000/01/01 to 2023/09/01. Then, we outline the typical ML-based AWI workflow, including warning dataset preparation, preprocessing, AWI model construction, and evaluation stages. In such a workflow, we categorize ML-based AWI approaches based on the warning output format. Besides, we analyze the techniques used in each stage, along with their strengths, weaknesses, and distribution. Finally, we provide practical research directions for future ML-based AWI approaches, focusing on aspects like data improvement (e.g., enhancing the warning labeling strategy) and model exploration (e.g., exploring large language models for AWI).

[167] 2312.00326

Agent-OM: Leveraging Large Language Models for Ontology Matching

Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM-based agents have become revolutionary in data engineering and have been applied creatively in various domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With thoughtful consideration of several specific challenges to leverage LLMs for OM, we propose a generic framework, namely Agent-OM, consisting of two Siamese agents for retrieval and matching, with a set of simple prompt-based OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve very close results to the best long-standing performance on simple OM tasks and significantly improve the performance on complex and few-shot OM tasks.

[168] 2312.00327

A Framework for Solving Parabolic Partial Differential Equations on Discrete Domains

We introduce a framework for solving a class of parabolic partial differential equations on triangle mesh surfaces, including the Hamilton-Jacobi equation and the Fokker-Planck equation. PDE in this class often have nonlinear or stiff terms that cannot be resolved with standard methods on curved triangle meshes. To address this challenge, we leverage a splitting integrator combined with a convex optimization step to solve these PDE. Our machinery can be used to compute entropic approximation of optimal transport distances on geometric domains, overcoming the numerical limitations of the state-of-the-art method. In addition, we demonstrate the versatility of our method on a number of linear and nonlinear PDE that appear in diffusion and front propagation tasks in geometry processing.

[169] 2312.00328

On Novel Fixed-Point-Type Iterations with Structure-Preserving Doubling Algorithms for Stochastic Continuous-time Algebraic Riccati equations

In this paper we mainly propose efficient and reliable numerical algorithms for solving stochastic continuous-time algebraic Riccati equations (SCARE) typically arising from the differential statedependent Riccati equation technique from the 3D missile/target engagement, the F16 aircraft flight control and the quadrotor optimal control etc. To this end, we develop a fixed point (FP)-type iteration with solving a CARE by the structure-preserving doubling algorithm (SDA) at each iterative step, called FP-CARE SDA. We prove that either the FP-CARE SDA is monotonically nondecreasing or nonincreasing, and is R-linearly convergent, with the zero initial matrix or a special initial matrix satisfying some assumptions. The FP-CARE SDA (FPC) algorithm can be regarded as a robust initial step to produce a good initial matrix, and then the modified Newton (mNT) method can be used by solving the corresponding Lyapunov equation with SDA (FPC-mNT-Lyap SDA). Numerical experiments show that the FPC-mNT-Lyap SDA algorithm outperforms the other existing algorithms.

[170] 2312.00330

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.

[171] 2312.00332

Matching Weak Informative Ontologies

Most existing ontology matching methods utilize the literal information to discover alignments. However, some literal information in ontologies may be opaque and some ontologies may not have sufficient literal information. In this paper, these ontologies are named as weak informative ontologies (WIOs) and it is challenging for existing methods to matching WIOs. On one hand, string-based and linguistic-based matching methods cannot work well for WIOs. On the other hand, some matching methods use external resources to improve their performance, but collecting and processing external resources is still time-consuming. To address this issue, this paper proposes a practical method for matching WIOs by employing the ontology structure information to discover alignments. First, the semantic subgraphs are extracted from the ontology graph to capture the precise meanings of ontology elements. Then, a new similarity propagation model is designed for matching WIOs. Meanwhile, in order to avoid meaningless propagation, the similarity propagation is constrained by semantic subgraphs and other conditions. Consequently, the similarity propagation model ensures a balance between efficiency and quality during matching. Finally, the similarity propagation model uses a few credible alignments as seeds to find more alignments, and some useful strategies are adopted to improve the performance. This matching method for WIOs has been implemented in the ontology matching system Lily. Experimental results on public OAEI benchmark datasets demonstrate that Lily significantly outperforms most of the state-of-the-art works in both WIO matching tasks and general ontology matching tasks. In particular, Lily increases the recall by a large margin, while it still obtains high precision of matching results.

[172] 2312.00333

Green Edge AI: A Contemporary Survey

Artificial intelligence (AI) technologies have emerged as pivotal enablers across a multitude of industries, including consumer electronics, healthcare, and manufacturing, largely due to their resurgence over the past decade. The transformative power of AI is primarily derived from the utilization of deep neural networks (DNNs), which require extensive data for training and substantial computational resources for processing. Consequently, DNN models are typically trained and deployed on resource-rich cloud servers. However, due to potential latency issues associated with cloud communications, deep learning (DL) workflows are increasingly being transitioned to wireless edge networks near end-user devices (EUDs). This shift is designed to support latency-sensitive applications and has given rise to a new paradigm of edge AI, which will play a critical role in upcoming 6G networks to support ubiquitous AI applications. Despite its potential, edge AI faces substantial challenges, mostly due to the dichotomy between the resource limitations of wireless edge networks and the resource-intensive nature of DL. Specifically, the acquisition of large-scale data, as well as the training and inference processes of DNNs, can rapidly deplete the battery energy of EUDs. This necessitates an energy-conscious approach to edge AI to ensure both optimal and sustainable performance. In this paper, we present a contemporary survey on green edge AI. We commence by analyzing the principal energy consumption components of edge AI systems to identify the fundamental design principles of green edge AI. Guided by these principles, we then explore energy-efficient design methodologies for the three critical tasks in edge AI systems, including training data acquisition, edge training, and edge inference. Finally, we underscore potential future research directions to further enhance the energy efficiency of edge AI.

[173] 2312.00334

UAV-Aided Lifelong Learning for AoI and Energy Optimization in Non-Stationary IoT Networks

In this paper, a novel joint energy and age of information (AoI) optimization framework for IoT devices in a non-stationary environment is presented. In particular, IoT devices that are distributed in the real-world are required to efficiently utilize their computing resources so as to balance the freshness of their data and their energy consumption. To optimize the performance of IoT devices in such a dynamic setting, a novel lifelong reinforcement learning (RL) solution that enables IoT devices to continuously adapt their policies to each newly encountered environment is proposed. Given that IoT devices have limited energy and computing resources, an unmanned aerial vehicle (UAV) is leveraged to visit the IoT devices and update the policy of each device sequentially. As such, the UAV is exploited as a mobile learning agent that can learn a shared knowledge base with a feature base in its training phase, and feature sets of a zero-shot learning method in its testing phase, to generalize between the environments. To optimize the trajectory and flying velocity of the UAV, an actor-critic network is leveraged so as to minimize the UAV energy consumption. Simulation results show that the proposed lifelong RL solution can outperform the state-of-art benchmarks by enhancing the balanced cost of IoT devices by $8.3\%$ when incorporating warm-start policies for unseen environments. In addition, our solution achieves up to $49.38\%$ reduction in terms of energy consumption by the UAV in comparison to the random flying strategy.

[174] 2312.00335

Learning Anatomically Consistent Embedding for Chest Radiography

Self-supervised learning (SSL) approaches have recently shown substantial success in learning visual representations from unannotated images. Compared with photographic images, medical images acquired with the same imaging protocol exhibit high consistency in anatomy. To exploit this anatomical consistency, this paper introduces a novel SSL approach, called PEAC (patch embedding of anatomical consistency), for medical image analysis. Specifically, in this paper, we propose to learn global and local consistencies via stable grid-based matching, transfer pre-trained PEAC models to diverse downstream tasks, and extensively demonstrate that (1) PEAC achieves significantly better performance than the existing state-of-the-art fully/self-supervised methods, and (2) PEAC captures the anatomical structure consistency across views of the same patient and across patients of different genders, weights, and healthy statuses, which enhances the interpretability of our method for medical image analysis.

[175] 2312.00336

Hypergraph Node Representation Learning with One-Stage Message Passing

Hypergraphs as an expressive and general structure have attracted considerable attention from various research domains. Most existing hypergraph node representation learning techniques are based on graph neural networks, and thus adopt the two-stage message passing paradigm (i.e. node -> hyperedge -> node). This paradigm only focuses on local information propagation and does not effectively take into account global information, resulting in less optimal representations. Our theoretical analysis of representative two-stage message passing methods shows that, mathematically, they model different ways of local message passing through hyperedges, and can be unified into one-stage message passing (i.e. node -> node). However, they still only model local information. Motivated by this theoretical analysis, we propose a novel one-stage message passing paradigm to model both global and local information propagation for hypergraphs. We integrate this paradigm into HGraphormer, a Transformer-based framework for hypergraph node representation learning. HGraphormer injects the hypergraph structure information (local information) into Transformers (global information) by combining the attention matrix and hypergraph Laplacian. Extensive experiments demonstrate that HGraphormer outperforms recent hypergraph learning methods on five representative benchmark datasets on the semi-supervised hypernode classification task, setting new state-of-the-art performance, with accuracy improvements between 2.52% and 6.70%. Our code and datasets are available.

[176] 2312.00337

Dynamic Matrix of Extremisms and Terrorism (DMET): A Continuum Approach Towards Identifying Different Degrees of Extremisms

We propose to extend the current binary understanding of terrorism (versus non-terrorism) with a Dynamic Matrix of Extremisms and Terrorism (DMET). DMET considers the whole ecosystem of content and actors that can contribute to a continuum of extremism (e.g., right-wing, left-wing, religious, separatist, single-issue). It organizes levels of extremisms by varying degrees of ideological engagement and the presence of violence identified (e.g., partisan, fringe, violent extremism, terrorism) based on cognitive and behavioral cues and group dynamics. DMET is globally applicable due to its comprehensive conceptualization of the levels of extremisms. It is also dynamic, enabling iterative mapping with the region- and time-specific classifications of extremist actors. Once global actors recognize DMET types and their distinct characteristics, they can comprehensively analyze the profiles of extremist actors (e.g., individuals, groups, movements), track these respective actors and their activities (e.g., social media content) over time, and launch targeted counter activities (e.g. de-platforming, content moderation, or redirects to targeted CVE narratives).

[177] 2312.00342

Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk

This paper aims to solve a safe reinforcement learning (RL) problem with risk measure-based constraints. As risk measures, such as conditional value at risk (CVaR), focus on the tail distribution of cost signals, constraining risk measures can effectively prevent a failure in the worst case. An on-policy safe RL method, called TRC, deals with a CVaR-constrained RL problem using a trust region method and can generate policies with almost zero constraint violations with high returns. However, to achieve outstanding performance in complex environments and satisfy safety constraints quickly, RL methods are required to be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is directly used to train TRC, the estimation error caused by the distributional shift results in performance degradation. To resolve this issue, we propose novel surrogate functions, in which the effect of the distributional shift can be reduced, and introduce an adaptive trust-region constraint to ensure a policy not to deviate far from replay buffers. The proposed method has been evaluated in simulation and real-world environments and satisfied safety constraints within a few steps while achieving high returns even in complex robotic tasks.

[178] 2312.00343

OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline

Stereo matching, a pivotal technique in computer vision, plays a crucial role in robotics, autonomous navigation, and augmented reality. Despite the development of numerous impressive methods in recent years, replicating their results and determining the most suitable architecture for practical application remains challenging. Addressing this gap, our paper introduces a comprehensive benchmark focusing on practical applicability rather than solely on performance enhancement. Specifically, we develop a flexible and efficient stereo matching codebase, called OpenStereo. OpenStereo includes training and inference codes of more than 12 network models, making it, to our knowledge, the most complete stereo matching toolbox available. Based on OpenStereo, we conducted experiments on the SceneFlow dataset and have achieved or surpassed the performance metrics reported in the original paper. Additionally, we conduct an in-depth revisitation of recent developments in stereo matching through ablative experiments. These investigations inspired the creation of StereoBase, a simple yet strong baseline model. Our extensive comparative analyses of StereoBase against numerous contemporary stereo matching methods on the SceneFlow dataset demonstrate its remarkably strong performance. The source code is available at

[179] 2312.00344

TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning

As safety is of paramount importance in robotics, reinforcement learning that reflects safety, called safe RL, has been studied extensively. In safe RL, we aim to find a policy which maximizes the desired return while satisfying the defined safety constraints. There are various types of constraints, among which constraints on conditional value at risk (CVaR) effectively lower the probability of failures caused by high costs since CVaR is a conditional expectation obtained above a certain percentile. In this paper, we propose a trust region-based safe RL method with CVaR constraints, called TRC. We first derive the upper bound on CVaR and then approximate the upper bound in a differentiable form in a trust region. Using this approximation, a subproblem to get policy gradients is formulated, and policies are trained by iteratively solving the subproblem. TRC is evaluated through safe navigation tasks in simulations with various robots and a sim-to-real environment with a Jackal robot from Clearpath. Compared to other safe RL methods, the performance is improved by 1.93 times while the constraints are satisfied in all experiments.

[180] 2312.00345

IEEE 802.11be Network Throughput Optimization with Multi-Link Operation and AP Coordination

IEEE 802.11be (Wi-Fi 7) introduces a new concept called multi-link operation (MLO) which allows multiple Wi-Fi interfaces in different bands (2.4, 5, and 6 GHz) to work together to increase network throughput, reduce latency, and improve spectrum reuse efficiency in dense overlapping networks. To make the most of MLO, a new intelligent resource allocation is needed. This paper proposes a model to align MLO and access point (AP) coordination in 11be. To maximize network throughput, a network topology optimization problem is formulated for MLO with AP coordination, which is solved by exploiting the totally unimodular property of the bipartite graph formed by the connection between AP and station (STA) in Wi-Fi networks. Subsequently, a proportional fairness algorithm is applied for radio link allocation, network throughput optimization considering the channel condition, and the fairness of the multi-link device (MLD) data rate. The performance of the proposed algorithm on two main MLO implementations - multi-mink multi-radio (MLMR) with simultaneous transmission and reception (STR), and the interplay between multiple nodes employing them are evaluated through cross-layer (PHY-MAC) data rate simulation with PHY abstraction.

[181] 2312.00347

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.

[182] 2312.00348

Student Activity Recognition in Classroom Environments using Transfer Learning

The recent advances in artificial intelligence and deep learning facilitate automation in various applications including home automation, smart surveillance systems, and healthcare among others. Human Activity Recognition is one of its emerging applications, which can be implemented in a classroom environment to enhance safety, efficiency, and overall educational quality. This paper proposes a system for detecting and recognizing the activities of students in a classroom environment. The dataset has been structured and recorded by the authors since a standard dataset for this task was not available at the time of this study. Transfer learning, a widely adopted method within the field of deep learning, has proven to be helpful in complex tasks like image and video processing. Pretrained models including VGG-16, ResNet-50, InceptionV3, and Xception are used for feature extraction and classification tasks. Xception achieved an accuracy of 93%, on the novel classroom dataset, outperforming the other three models in consideration. The system proposed in this study aims to introduce a safer and more productive learning environment for students and educators.

[183] 2312.00349

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP

I propose a paradigm for scientific progress in NLP centered around developing scalable, data-driven theories of linguistic structure. The idea is to collect data in tightly scoped, carefully defined ways which allow for exhaustive annotation of behavioral phenomena of interest, and then use machine learning to construct explanatory theories of these phenomena which can form building blocks for intelligible AI systems. After laying some conceptual groundwork, I describe several investigations into data-driven theories of shallow semantic structure using Question-Answer driven Semantic Role Labeling (QA-SRL), a schema for annotating verbal predicate-argument relations using highly constrained question-answer pairs. While this only scratches the surface of the complex language behaviors of interest in AI, I outline principles for data collection and theoretical modeling which can inform future scientific progress. This note summarizes and draws heavily on my PhD thesis.

[184] 2312.00351

Manipulating the Label Space for In-Context Classification

After pre-training by generating the next word conditional on previous words, the Language Model (LM) acquires the ability of In-Context Learning (ICL) that can learn a new task conditional on the context of the given in-context examples (ICEs). Similarly, visually-conditioned Language Modelling is also used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the Language Modelling objective does not directly contrast whether an object is paired with a text. To improve the ICL of classification, using more ICEs to provide more knowledge is a straightforward way. However, this may largely increase the selection time, and more importantly, the inclusion of additional in-context images tends to extend the length of the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing for fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies which are Label Distribution Enhancement and Visual Descriptions Enhancement to improve In-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. Specifically, using our approach on ImageNet, we increase accuracy from 74.70\% in a 4-shot setting to 76.21\% with just 2 shots. surpassing CLIP by 0.67\%. On CUB-200, our method raises 1-shot accuracy from 48.86\% to 69.05\%, 12.15\% higher than CLIP. The code is given in

[185] 2312.00353

On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs

This paper examines the capacity of LLMs to reason with knowledge graphs using their internal knowledge graph, i.e., the knowledge graph they learned during pre-training. Two research questions are formulated to investigate the accuracy of LLMs in recalling information from pre-training knowledge graphs and their ability to infer knowledge graph relations from context. To address these questions, we employ LLMs to perform four distinct knowledge graph reasoning tasks. Furthermore, we identify two types of hallucinations that may occur during knowledge reasoning with LLMs: content and ontology hallucination. Our experimental results demonstrate that LLMs can successfully tackle both simple and complex knowledge graph reasoning tasks from their own memory, as well as infer from input context.

[186] 2312.00359

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.

[187] 2312.00360

Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning

Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy for achieving multimodal semantic segmentation, which is training-costly due to the massive parameter updates in feature extraction and fusion. To address this issue, we propose a surprisingly simple yet effective dual-prompt learning network (dubbed DPLNet) for training-efficient multimodal (e.g., RGB-D/T) semantic segmentation. The core of DPLNet is to directly adapt a frozen pre-trained RGB model to multimodal semantic segmentation, reducing parameter updates. For this purpose, we present two prompt learning modules, comprising multimodal prompt generator (MPG) and multimodal feature adapter (MFA). MPG works to fuse the features from different modalities in a compact manner and is inserted from shadow to deep stages to generate the multi-level multimodal prompts that are injected into the frozen backbone, while MPG adapts prompted multimodal features in the frozen backbone for better multimodal semantic segmentation. Since both the MPG and MFA are lightweight, only a few trainable parameters (3.88M, 4.4% of the pre-trained backbone parameters) are introduced for multimodal feature fusion and learning. Using a simple decoder (3.27M parameters), DPLNet achieves new state-of-the-art performance or is on a par with other complex approaches on four RGB-D/T semantic segmentation datasets while satisfying parameter efficiency. Moreover, we show that DPLNet is general and applicable to other multimodal tasks such as salient object detection and video semantic segmentation. Without special design, DPLNet outperforms many complicated models. Our code will be available at

[188] 2312.00362

Dancing with Images: Video Distillation via Static-Dynamic Disentanglement

Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation , and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.

[189] 2312.00364

Benchmarking Multi-Domain Active Learning on Image Classification

Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.

[190] 2312.00372

Event-driven Real-time Retrieval in Web Search

Information retrieval in real-time search presents unique challenges distinct from those encountered in classical web search. These challenges are particularly pronounced due to the rapid change of user search intent, which is influenced by the occurrence and evolution of breaking news events, such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focused on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The Event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline to address this issue, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share the training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and deploy an A/B testing in a real online system to verify the performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.

[191] 2312.00373

Streaming Bayesian Modeling for predicting Fat-Tailed Customer Lifetime Value

We develop an online learning MCMC approach applicable for hierarchical bayesian models and GLMS. We also develop a fat-tailed LTV model that generalizes over several kinds of fat and thin tails. We demonstrate both developments on commercial LTV data from a large mobile app.

[192] 2312.00374

Unleashing Cheapfakes through Trojan Plugins of Large Language Models

Open-source Large Language Models (LLMs) have recently gained popularity because of their comparable performance to proprietary LLMs. To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters. However, it is still unknown whether low-rank adapters can be exploited to control LLMs. To address this gap, we demonstrate that an infected adapter can induce, on specific triggers, an LLM to output content defined by an adversary and to even maliciously use tools. To train a Trojan adapter, we propose two novel attacks, POLISHED and FUSION, that improve over prior approaches. POLISHED uses LLM-enhanced paraphrasing to polish benchmark poisoned datasets. In contrast, in the absence of a dataset, FUSION leverages an over-poisoning procedure to transform a benign adaptor. Our experiments validate that our attacks provide higher attack effectiveness than the baseline and, for the purpose of attracting downloads, preserves or improves the adapter's utility. Finally, we provide two case studies to demonstrate that the Trojan adapter can lead a LLM-powered autonomous agent to execute unintended scripts or send phishing emails. Our novel attacks represent the first study of supply chain threats for LLMs through the lens of Trojan plugins.

[193] 2312.00375

Text-Guided 3D Face Synthesis -- From Generation to Editing

Text-guided 3D face synthesis has achieved remarkable results by leveraging text-to-image (T2I) diffusion models. However, most existing works focus solely on the direct generation, ignoring the editing, restricting them from synthesizing customized 3D faces through iterative adjustments. In this paper, we propose a unified text-guided framework from face generation to editing. In the generation stage, we propose a geometry-texture decoupled generation to mitigate the loss of geometric details caused by coupling. Besides, decoupling enables us to utilize the generated geometry as a condition for texture generation, yielding highly geometry-texture aligned results. We further employ a fine-tuned texture diffusion model to enhance texture quality in both RGB and YUV space. In the editing stage, we first employ a pre-trained diffusion model to update facial geometry or texture based on the texts. To enable sequential editing, we introduce a UV domain consistency preservation regularization, preventing unintentional changes to irrelevant facial attributes. Besides, we propose a self-guided consistency weight strategy to improve editing efficacy while preserving consistency. Through comprehensive experiments, we showcase our method's superiority in face synthesis. Project page:

[194] 2312.00377

SynFundus: Generating a synthetic fundus images dataset with millions of samples and multi-disease annotations

In the field of medical imaging, the scarcity of large-scale datasets due to privacy restrictions stands as a significant barrier to develop large models for medical. To address this issue, we introduce SynFundus-1M, a high-quality synthetic dataset with over 1 million retinal fundus images and extensive disease and pathologies annotations, which is generated by a Denoising Diffusion Probabilistic Model. The SynFundus-Generator and SynFundus-1M achieve superior Frechet Inception Distance (FID) scores compared to existing methods on main-stream public real datasets. Furthermore, the ophthalmologists evaluation validate the difficulty in discerning these synthetic images from real ones, confirming the SynFundus-1M's authenticity. Through extensive experiments, we demonstrate that both CNN and ViT can benifit from SynFundus-1M by pretraining or training directly. Compared to datasets like ImageNet or EyePACS, models train on SynFundus-1M not only achieve better performance but also faster convergence on various downstream tasks.

[195] 2312.00379

Optimal Sample Complexity of Contrastive Learning

Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, both general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.

[196] 2312.00380

Enhancing Explainability in Mobility Data Science through a combination of methods

In the domain of Mobility Data Science, the intricate task of interpreting models trained on trajectory data, and elucidating the spatio-temporal movement of entities, has persistently posed significant challenges. Conventional XAI techniques, although brimming with potential, frequently overlook the distinct structure and nuances inherent within trajectory data. Observing this deficiency, we introduced a comprehensive framework that harmonizes pivotal XAI techniques: LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), Saliency maps, attention mechanisms, direct trajectory visualization, and Permutation Feature Importance (PFI). Unlike conventional strategies that deploy these methods singularly, our unified approach capitalizes on the collective efficacy of these techniques, yielding deeper and more granular insights for models reliant on trajectory data. In crafting this synthesis, we effectively address the multifaceted essence of trajectories, achieving not only amplified interpretability but also a nuanced, contextually rich comprehension of model decisions. To validate and enhance our framework, we undertook a survey to gauge preferences and reception among various user demographics. Our findings underscored a dichotomy: professionals with academic orientations, particularly those in roles like Data Scientist, IT Expert, and ML Engineer, showcased a profound, technical understanding and often exhibited a predilection for amalgamated methods for interpretability. Conversely, end-users or individuals less acquainted with AI and Data Science showcased simpler inclinations, such as bar plots indicating timestep significance or visual depictions pinpointing pivotal segments of a vessel's trajectory.

[197] 2312.00388

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Deploying Large Language Models (LLMs) locally on mobile devices presents a significant challenge due to their extensive memory requirements. In this paper, we introduce LinguaLinked, a system for decentralized, distributed LLM inference on mobile devices. LinguaLinked enables collaborative execution of the inference task across multiple trusted devices. LinguaLinked ensures data privacy by processing information locally. LinguaLinked uses three key strategies. First, an optimized model assignment technique segments LLMs and uses linear optimization to align segments with each device's capabilities. Second, an optimized data transmission mechanism ensures efficient and structured data flow between model segments while also maintaining the integrity of the original model structure. Finally, LinguaLinked incorporates a runtime load balancer that actively monitors and redistributes tasks among mobile devices to prevent bottlenecks, enhancing the system's overall efficiency and responsiveness. We demonstrate that LinguaLinked facilitates efficient LLM inference while maintaining consistent throughput and minimal latency through extensive testing across various mobile devices, from high-end to low-end Android devices. In our evaluations, compared to the baseline, LinguaLinked achieves an inference performance acceleration of $1.11\times$ to $1.61\times$ in single-threaded settings, $1.73\times$ to $2.65\times$ with multi-threading. Additionally, runtime load balancing yields an overall inference acceleration of $1.29\times$ to $1.32\times$.

[198] 2312.00392

Study and Survey on Gesture Recognition Systems

In recent years, there has been a considerable amount of research in the Gesture Recognition domain, mainly owing to the technological advancements in Computer Vision. Various new applications have been conceptualised and developed in this field. This paper discusses the implementation of gesture recognition systems in multiple sectors such as gaming, healthcare, home appliances, industrial robots, and virtual reality. Different methodologies for capturing gestures are compared and contrasted throughout this survey. Various data sources and data acquisition techniques have been discussed. The role of gestures in sign language has been studied and existing approaches have been reviewed. Common challenges faced while building gesture recognition systems have also been explored.

[199] 2312.00394

Optimal Reverse-Complement-Duplication Error-Correcting Codes

Motivated by DNA storage in living organisms and inspired by biological mutation processes, this study explores the reverse-complement string-duplication system. We commence our investigation by introducing an optimal $q$-ary reverse-complement-duplication code construction for duplication length $1$ and any number of duplications, achieving a size of $\Theta(q^n)$. Subsequently, we establish a fundamental limitation, proving that for duplication lengths greater than $1$, all reverse-complement-duplication codes correcting any number of duplications possess a size of $o(q^n)$. Further, we present a construction of reverse-complement-duplication codes with a duplication length of $2$, demonstrating a redundancy of at most $\log_q(n/2) + \log_q(\log_q(n)+1) + 2 + \log_q(3)$. Finally, we contribute an explicit construction for $q$-ary codes addressing a single classical tandem duplication for any $k$. The redundancy of these codes is $\log_q(n/k) + 1 + (k-1)\log_q(\log_q(2n/k)+1)$.

[200] 2312.00396

GFN-SR: Symbolic Regression with Generative Flow Networks

Symbolic regression (SR) is an area of interpretable machine learning that aims to identify mathematical expressions, often composed of simple functions, that best fit in a given set of covariates $X$ and response $y$. In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field by leveraging deep reinforcement learning to solve the complicated combinatorial search problem. In this work, we propose an alternative framework (GFN-SR) to approach SR with deep learning. We model the construction of an expression tree as traversing through a directed acyclic graph (DAG) so that GFlowNet can learn a stochastic policy to generate such trees sequentially. Enhanced with an adaptive reward baseline, our method is capable of generating a diverse set of best-fitting expressions. Notably, we observe that GFN-SR outperforms other SR algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.

[201] 2312.00398

Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network

Musculoskeletal diseases and cognitive impairments in patients lead to difficulties in movement as well as negative effects on their psychological health. Clinical gait analysis, a vital tool for early diagnosis and treatment, traditionally relies on expensive optical motion capture systems. Recent advances in computer vision and deep learning have opened the door to more accessible and cost-effective alternatives. This paper introduces a novel spatio-temporal Transformer network to estimate critical gait parameters from RGB videos captured by a single-view camera. Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and show significant improvements in predicting general gait parameters (including Walking Speed, Gait Deviation Index - GDI, and Knee Flexion Angle at Maximum Extension), while utilizing fewer parameters and alleviating the need for manual feature extraction.

[202] 2312.00401

VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things

Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. Learning to schedule perceiving models and analyzing the collected videos intelligently will be potential sparks for VIoT. In this paper, to address the challenges posed by the fine-grained and interrelated vision tool usage of VIoT, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to accomplish complicated tasks. To support VIoTGPT and related future works, we meticulously crafted the training dataset and established benchmarks involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to ReAct instruction tuning based on the collected VIoT dataset to learn the tool capability. Quantitative and qualitative experimental results and analyses demonstrate the effectiveness of VIoTGPT.

[203] 2312.00404

A Causality-Aware Pattern Mining Scheme for Group Activity Recognition in a Pervasive Sensor Space

Human activity recognition (HAR) is a key challenge in pervasive computing and its solutions have been presented based on various disciplines. Specifically, for HAR in a smart space without privacy and accessibility issues, data streams generated by deployed pervasive sensors are leveraged. In this paper, we focus on a group activity by which a group of users perform a collaborative task without user identification and propose an efficient group activity recognition scheme which extracts causality patterns from pervasive sensor event sequences generated by a group of users to support as good recognition accuracy as the state-of-the-art graphical model. To filter out irrelevant noise events from a given data stream, a set of rules is leveraged to highlight causally related events. Then, a pattern-tree algorithm extracts frequent causal patterns by means of a growing tree structure. Based on the extracted patterns, a weighted sum-based pattern matching algorithm computes the likelihoods of stored group activities to the given test event sequence by means of matched event pattern counts for group activity recognition. We evaluate the proposed scheme using the data collected from our testbed and CASAS datasets where users perform their tasks on a daily basis and validate its effectiveness in a real environment. Experiment results show that the proposed scheme performs higher recognition accuracy and with a small amount of runtime overhead than the existing schemes.

[204] 2312.00407

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

Large language models (LLMs) are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of large language models using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO and AdaLomo. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has proven superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at

[205] 2312.00408

Beyond the Screen: Reshaping the Workplace with Virtual and Augmented Reality

Although extended reality technologies have enjoyed an explosion in popularity in recent years, few applications are effectively used outside the entertainment or academic contexts. This work consists of a literature review regarding the effective integration of such technologies in the workplace. It aims to provide an updated view of how they are being used in that context. First, we examine existing research concerning virtual, augmented, and mixed-reality applications. We also analyze which have made their way to the workflows of companies and institutions. Furthermore, we circumscribe the aspects of extended reality technologies that determined this applicability.

[206] 2312.00411

A framework for mining lifestyle profiles through multi-dimensional and high-order mobility feature clustering

Human mobility demonstrates a high degree of regularity, which facilitates the discovery of lifestyle profiles. Existing research has yet to fully utilize the regularities embedded in high-order features extracted from human mobility records in such profiling. This study proposes a progressive feature extraction strategy that mines high-order mobility features from users' moving trajectory records from the spatial, temporal, and semantic dimensions. Specific features are extracted such as travel motifs, rhythms decomposed by discrete Fourier transform (DFT) of mobility time series, and vectorized place semantics by word2vec, respectively to the three dimensions, and they are further clustered to reveal the users' lifestyle characteristics. An experiment using a trajectory dataset of over 500k users in Shenzhen, China yields seven user clusters with different lifestyle profiles that can be well interpreted by common sense. The results suggest the possibility of fine-grained user profiling through cross-order trajectory feature engineering and clustering.

[207] 2312.00412

SCHEME: Scalable Channer Mixer for Vision Transformers

Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth albeit it accounts for a bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.

[208] 2312.00413

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.

[209] 2312.00414

Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one relevant moment to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities of learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in a $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images effectively and demonstrate promising zero-shot performance against SOTA methods efficiently. In addition, we propose a fine-tuning approach by incorporating a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.

[210] 2312.00416

Towards Explaining Satellite Based Poverty Predictions with Convolutional Neural Networks

Deep convolutional neural networks (CNNs) have been shown to predict poverty and development indicators from satellite images with surprising accuracy. This paper presents a first attempt at analyzing the CNNs responses in detail and explaining the basis for the predictions. The CNN model, while trained on relatively low resolution day- and night-time satellite images, is able to outperform human subjects who look at high-resolution images in ranking the Wealth Index categories. Multiple explainability experiments performed on the model indicate the importance of the sizes of the objects, pixel colors in the image, and provide a visualization of the importance of different structures in input images. A visualization is also provided of type images that maximize the network prediction of Wealth Index, which provides clues on what the CNN prediction is based on.

[211] 2312.00421

A Semi-Tensor Product based Circuit Simulation for SAT-sweeping

In recent years, circuit simulators and Boolean satisfiability (SAT) solvers have been tightly integrated to provide efficient logic synthesis and verification. Circuit simulation can generate highly expressive simulation patterns that can either enumerate or filter out most candidates for synthesis. Subsequently, SAT solvers are employed to check those that remain, thereby making the logic synthesis process more efficient. This paper introduces a novel circuit simulator of k-input lookup table (k-LUT) networks, based on semi-tensor product (STP). STP-based simulators use computation of logic matrices, the primitives of logic networks, as opposed to relying on bitwise logic operations for simulation of k-LUT networks. Experimental results show that our STP-based simulator reduces the runtime by an average of 7.2x. Furthermore, we integrate this proposed simulator into a SAT-sweeping engine known as SAT sweeper. Through a combination of structural hashing, simulation, and SAT queries, SAT sweeper simplifies logic networks by systematically merging graph vertices from input to output. To enhance the efficiency, we used STP-based exhaustive simulation, which significantly reduces the number of false equivalence class candidates, thereby improving the computational efficiency by reducing the number of SAT calls required. When compared to the SOTA SAT sweeper, our method demonstrates an average 35% runtime reduction.

[212] 2312.00425

A Low-Power Neuromorphic Approach for Efficient Eye-Tracking

This paper introduces a neuromorphic methodology for eye tracking, harnessing pure event data captured by a Dynamic Vision Sensor (DVS) camera. The framework integrates a directly trained Spiking Neuron Network (SNN) regression model and leverages a state-of-the-art low power edge neuromorphic processor - Speck, collectively aiming to advance the precision and efficiency of eye-tracking systems. First, we introduce a representative event-based eye-tracking dataset, "Ini-30", which was collected with two glass-mounted DVS cameras from thirty volunteers. Then,a SNN model, based on Integrate And Fire (IAF) neurons, named "Retina", is described , featuring only 64k parameters (6.63x fewer than the latest) and achieving pupil tracking error of only 3.24 pixels in a 64x64 DVS input. The continous regression output is obtained by means of convolution using a non-spiking temporal 1D filter slided across the output spiking layer. Finally, we evaluate Retina on the neuromorphic processor, showing an end-to-end power between 2.89-4.8 mW and a latency of 5.57-8.01 mS dependent on the time window. We also benchmark our model against the latest event-based eye-tracking method, "3ET", which was built upon event frames. Results show that Retina achieves superior precision with 1.24px less pupil centroid error and reduced computational complexity with 35 times fewer MAC operations. We hope this work will open avenues for further investigation of close-loop neuromorphic solutions and true event-based training pursuing edge performance.

[213] 2312.00434

PEFTDebias : Capturing debiasing information using PEFTs

The increasing use of foundation models highlights the urgent need to address and eliminate implicit biases present in them that arise during pretraining. In this paper, we introduce PEFTDebias, a novel approach that employs parameter-efficient fine-tuning (PEFT) to mitigate the biases within foundation models. PEFTDebias consists of two main phases: an upstream phase for acquiring debiasing parameters along a specific bias axis, and a downstream phase where these parameters are incorporated into the model and frozen during the fine-tuning process. By evaluating on four datasets across two bias axes namely gender and race, we find that downstream biases can be effectively reduced with PEFTs. In addition, we show that these parameters possess axis-specific debiasing characteristics, enabling their effective transferability in mitigating biases in various downstream tasks. To ensure reproducibility, we release the code to do our experiments.

[214] 2312.00435

Enhancing Image Captioning with Neural Models

This research explores the realm of neural image captioning using deep learning models. The study investigates the performance of different neural architecture configurations, focusing on the inject architecture, and proposes a novel quality metric for evaluating caption generation. Through extensive experimentation and analysis, this work sheds light on the challenges and opportunities in image captioning, providing insights into model behavior and overfitting. The results reveal that while the merge models exhibit a larger vocabulary and higher ROUGE scores, the inject architecture generates relevant and concise image captions. The study also highlights the importance of refining training data and optimizing hyperparameters for improved model performance. This research contributes to the growing body of knowledge in neural image captioning and encourages further exploration in the field, emphasizing the democratization of artificial intelligence.

[215] 2312.00436

Consensus group decision making under model uncertainty with a view towards environmental policy making

In this paper we propose a consensus group decision making scheme under model uncertainty consisting of an iterative two-stage procedure and based on the concept of Fr\'echet barycenter. Each step consists of two stages: the agents first update their position in the opinion metric space by a local barycenter characterized by the agents' immediate interactions and then a moderator makes a proposal in terms of a global barycenter, checking for consensus at each step. In cases of large heterogeneous groups the procedure can be complemented by an auxiliary initial homogenization step, consisting of a clustering procedure in opinion space, leading to large homogeneous groups for which the aforementioned procedure will be applied. The scheme is illustrated in examples motivated from environmental economics.

[216] 2312.00438

Dolphins: Multimodal Language Model for Driving

The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins's reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process. Then we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.

[217] 2312.00451

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website:

[218] 2312.00452

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. Despite the overwhelming progress, it still remains challenging for current approaches to perform well on cases with various text expressions or with unseen visual entities, limiting its further application. In this paper, we present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above. Specially, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherences to handle the incomplete target masks and false positive irregular clumps which often appear on unseen visual entities. Extensive experiments are conducted in the zero-shot cross-dataset settings and the proposed approach achieves consistent gains compared to the state-of-the-art, e.g., 4.15\%, 5.45\%, and 4.64\% mIoU increase on RefCOCO, RefCOCO+ and ReferIt respectively, demonstrating its effectiveness. Additionally, the results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.

[219] 2312.00454

An Encoding Framework for Binarized Images using HyperDimensional Computing

Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.

[220] 2312.00455

Meta-Diversity Search in Complex Systems, A Recipe for Artificial Open-Endedness ?

Can we build an artificial system that would be able to generate endless surprises if ran "forever" in Minecraft? While there is not a single path toward solving that grand challenge, this article presents what we believe to be some working ingredients for the endless generation of novel increasingly complex artifacts in Minecraft. Our framework for an open-ended system includes two components: a complex system used to recursively grow and complexify artifacts over time, and a discovery algorithm that leverages the concept of meta-diversity search. Since complex systems have shown to enable the emergence of considerable complexity from set of simple rules, we believe them to be great candidates to generate all sort of artifacts in Minecraft. Yet, the space of possible artifacts that can be generated by these systems is often unknown, challenging to characterize and explore. Therefore automating the long-term discovery of novel and increasingly complex artifacts in these systems is an exciting research field. To approach these challenges, we formulate the problem of meta-diversity search where an artificial "discovery assistant" incrementally learns a diverse set of representations to characterize behaviors and searches to discover diverse patterns within each of them. A successful discovery assistant should continuously seek for novel sources of diversities while being able to quickly specialize the search toward a new unknown type of diversity. To implement those ideas in the Minecraft environment, we simulate an artificial "chemistry" system based on Lenia continuous cellular automaton for generating artifacts, as well as an artificial "discovery assistant" (called Holmes) for the artifact-discovery process. Holmes incrementally learns a hierarchy of modular representations to characterize divergent sources of diversity and uses a goal-based intrinsically-motivated exploration as the diversity search strategy.

[221] 2312.00458

Semantics of Attack-Defense Trees for Dynamic Countermeasures and a New Hierarchy of Star-free Languages

We present a mathematical setting for attack-defense trees, a classic graphical model to specify attacks and countermeasures. We equip attack-defense trees with (trace) language semantics allowing to have an original dynamic interpretation of countermeasures. Interestingly, the expressiveness of attack-defense trees coincides with star-free languages, and the nested countermeasures impact the expressiveness of attack-defense trees. With an adequate notion of countermeasure-depth, we exhibit a strict hierarchy of the star-free languages that does not coincides with the classic one. Additionally, driven by the use of attack-defense trees in practice, we address the decision problems of trace membership and of non-emptiness, and study their computational complexities parameterized by the countermeasure-depth.

[222] 2312.00462

Learning Unorthogonalized Matrices for Rotation Estimation

Estimating 3D rotations is a common procedure for 3D computer vision. The accuracy depends heavily on the rotation representation. One form of representation -- rotation matrices -- is popular due to its continuity, especially for pose estimation tasks. The learning process usually incorporates orthogonalization to ensure orthonormal matrices. Our work reveals, through gradient analysis, that common orthogonalization procedures based on the Gram-Schmidt process and singular value decomposition will slow down training efficiency. To this end, we advocate removing orthogonalization from the learning process and learning unorthogonalized `Pseudo' Rotation Matrices (PRoM). An optimization analysis shows that PRoM converges faster and to a better solution. By replacing the orthogonalization incorporated representation with our proposed PRoM in various rotation-related tasks, we achieve state-of-the-art results on large-scale benchmarks for human pose estimation.

[223] 2312.00463

Low-rank-modified Galerkin methods for the Lyapunov equation

Of all the possible projection methods for solving large-scale Lyapunov matrix equations, Galerkin approaches remain much more popular than Petrov-Galerkin ones. This is mainly due to the different nature of the projected problems stemming from these two families of methods. While a Galerkin approach leads to the solution of a low-dimensional matrix equation per iteration, a matrix least-squares problem needs to be solved per iteration in a Petrov-Galerkin setting. The significant computational cost of these least-squares problems has steered researchers towards Galerkin methods in spite of the appealing minimization properties of Petrov-Galerkin schemes. In this paper we introduce a framework that allows for modifying the Galerkin approach by low-rank, additive corrections to the projected matrix equation problem with the two-fold goal of attaining monotonic convergence rates similar to those of Petrov-Galerkin schemes while maintaining essentially the same computational cost of the original Galerkin method. We analyze the well-posedness of our framework and determine possible scenarios where we expect the residual norm attained by two low-rank-modified variants to behave similarly to the one computed by a Petrov-Galerkin technique. A panel of diverse numerical examples shows the behavior and potential of our new approach.

[224] 2312.00467

Unfolder: Fast localization and image rectification of a document with a crease from folding in half

Presentation of folded documents is not an uncommon case in modern society. Digitizing such documents by capturing them with a smartphone camera can be tricky since a crease can divide the document contents into separate planes. To unfold the document, one could hold the edges potentially obscuring it in a captured image. While there are many geometrical rectification methods, they were usually developed for arbitrary bends and folds. We consider such algorithms and propose a novel approach Unfolder developed specifically for images of documents with a crease from folding in half. Unfolder is robust to projective distortions of the document image and does not fragment the image in the vicinity of a crease after rectification. A new Folded Document Images dataset was created to investigate the rectification accuracy of folded (2, 3, 4, and 8 folds) documents. The dataset includes 1600 images captured when document placed on a table and when held in hand. The Unfolder algorithm allowed for a recognition error rate of 0.33, which is better than the advanced neural network methods DocTr (0.44) and DewarpNet (0.57). The average runtime for Unfolder was only 0.25 s/image on an iPhone XR.

[225] 2312.00471

A Bayesian approach for prompt optimization in pre-trained language models

A prompt is a sequence of symbol or tokens, selected from a vocabulary according to some rule, which is prepended/concatenated to a textual query. A key problem is how to select the sequence of tokens: in this paper we formulate it as a combinatorial optimization problem. The high dimensionality of the token space com-pounded by the length of the prompt sequence requires a very efficient solution. In this paper we propose a Bayesian optimization method, executed in a continuous em-bedding of the combinatorial space. In this paper we focus on hard prompt tuning (HPT) which directly searches for discrete tokens to be added to the text input with-out requiring access to the large language model (LLM) and can be used also when LLM is available only as a black-box. This is critically important if LLMs are made available in the Model as a Service (MaaS) manner as in GPT-4. The current manu-script is focused on the optimization of discrete prompts for classification tasks. The discrete prompts give rise to difficult combinatorial optimization problem which easily become intractable given the dimension of the token space in realistic applications. The optimization method considered in this paper is Bayesian optimization (BO) which has become the dominant approach in black-box optimization for its sample efficiency along with its modular structure and versatility. In this paper we use BoTorch, a library for Bayesian optimization research built on top of pyTorch. Albeit preliminary and obtained using a 'vanilla' version of BO, the experiments on RoB-ERTa on six benchmarks, show a good performance across a variety of tasks and enable an analysis of the tradeoff between size of the search space, accuracy and wall clock time.

[226] 2312.00475

Pathologists light level preferences using the microscope -- a study to guide digital pathology display use

There is a paucity of guidelines relating to displays in digital pathology making procurement decisions, and display configuration challenging. Experience suggests pathologists have personal preferences for brightness when using a microscope which we hypothesised could be used as a predictor for display setup. We conducted an online survey across 6 NHS hospitals to capture brightness adjustment habits on both microscopes and screens. A subsample of respondents took part in a practical task to determine microscope brightness and display luminance preferences. The survey indicates 81% of respondents adjust the brightness on their microscope, compared with 11% adjusting their digital display. Display adjustments are more likely for visual comfort and ambient light compensation rather than for tissue factors, common for microscope adjustments. Twenty consultants took part in the practical brightness assessment. Light preferences on the microscope showed no correlation with screen preferences, except where a pathologist has a markedly brighter microscope preference. All of the preferences in this cohort were for a display luminance of less than 500cd/m$^2$, with 90% preferring 350cd/m$^2$ or less. There was no correlation between these preferences and the ambient lighting in the room. We conclude that microscope preferences can only be used to predict screen luminance requirements where the microscope is being used at very high brightness levels. A display capable of a brightness of 500cd/m$^2$ should be suitable for almost all pathologists with 300cd/m$^2$ suitable for the majority. The ability to adjust display luminance was felt to be important by the majority of respondents. Further work needs to be undertaken to establish the relationship between diagnostic performance, preferences and ambient lighting levels.

[227] 2312.00476

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.

[228] 2312.00477

Interpretable Meta-Learning of Physical Systems

Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an affine structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpreable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control.

[229] 2312.00480

Japanese Tort-case Dataset for Rationale-supported Legal Judgment Prediction

This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the court's accepting arguments from alleged arguments by plaintiffs and defendants, which is a novel task in the field. JTD is constructed based on annotated 3,477 Japanese Civil Code judgments by 41 legal experts, resulting in 7,978 instances with 59,697 of their alleged arguments from the involved parties. Our baseline experiments show the feasibility of the proposed two tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions of the LJP research.

[230] 2312.00481

Glued lattices are better quantizers than $K_{12}$

40 years ago, Conway and Sloane proposed using the highly symmetrical Coxeter-Todd lattice $K_{12}$ for quantization, and estimated its second moment. Since then, all published lists identify $K_{12}$ as the best 12-dimensional lattice quantizer. Surprisingly, $K_{12}$ is not optimal: we construct two new 12-dimensional lattices with lower normalized second moments. The new lattices are obtained by gluing together 6-dimensional lattices.

[231] 2312.00483

MalDicom: A Memory Forensic Framework for Detecting Malicious Payload in DICOM Files

Digital Imaging and Communication System (DICOM) is widely used throughout the public health sector for portability in medical imaging. However, these DICOM files have vulnerabilities present in the preamble section. Successful exploitation of these vulnerabilities can allow attackers to embed executable codes in the 128-Byte preamble of DICOM files. Embedding the malicious executable will not interfere with the readability or functionality of DICOM imagery. However, it will affect the underline system silently upon viewing these files. This paper shows the infiltration of Windows malware executables into DICOM files. On viewing the files, the malicious DICOM will get executed and eventually infect the entire hospital network through the radiologist's workstation. The code injection process of executing malware in DICOM files affects the hospital networks and workstations' memory. Memory forensics for the infected radiologist's workstation is crucial as it can detect which malware disrupts the hospital environment, and future detection methods can be deployed. In this paper, we consider the machine learning (ML) algorithms to conduct memory forensics on three memory dump categories: Trojan, Spyware, and Ransomware, taken from the CIC-MalMem-2022 dataset. We obtain the highest accuracy of 75\% with the Random Forest model. For estimating the feature importance for ML model prediction, we leveraged the concept of Shapley values.

[232] 2312.00484

MultiView Independent Component Analysis with Delays

Linear Independent Component Analysis (ICA) is a blind source separation technique that has been used in various domains to identify independent latent sources from observed signals. In order to obtain a higher signal-to-noise ratio, the presence of multiple views of the same sources can be used. In this work, we present MultiView Independent Component Analysis with Delays (MVICAD). This algorithm builds on the MultiView ICA model by allowing sources to be delayed versions of some shared sources: sources are shared across views up to some unknown latencies that are view- and source-specific. Using simulations, we demonstrate that MVICAD leads to better unmixing of the sources. Moreover, as ICA is often used in neuroscience, we show that latencies are age-related when applied to Cam-CAN, a large-scale magnetoencephalography (MEG) dataset. These results demonstrate that the MVICAD model can reveal rich effects on neural signals without human supervision.

[233] 2312.00485

Backbone-based Dynamic Graph Spatio-Temporal Network for Epidemic Forecasting

Accurate epidemic forecasting is a critical task in controlling disease transmission. Many deep learning-based models focus only on static or dynamic graphs when constructing spatial information, ignoring their relationship. Additionally, these models often rely on recurrent structures, which can lead to error accumulation and computational time consumption. To address the aforementioned problems, we propose a novel model called Backbone-based Dynamic Graph Spatio-Temporal Network (BDGSTN). Intuitively, the continuous and smooth changes in graph structure, make adjacent graph structures share a basic pattern. To capture this property, we use adaptive methods to generate static backbone graphs containing the primary information and temporal models to generate dynamic temporal graphs of epidemic data, fusing them to generate a backbone-based dynamic graph. To overcome potential limitations associated with recurrent structures, we introduce a linear model DLinear to handle temporal dependencies and combine it with dynamic graph convolution for epidemic forecasting. Extensive experiments on two datasets demonstrate that BDGSTN outperforms baseline models and ablation comparison further verifies the effectiveness of model components. Furthermore, we analyze and measure the significance of backbone and temporal graphs by using information metrics from different aspects. Finally, we compare model parameter volume and training time to confirm the superior complexity and efficiency of BDGSTN.

[234] 2312.00486

REDUCR: Robust Data Downsampling Using Class Priority Reweighting

Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.

[235] 2312.00487

Explainable AI in Diagnosing and Anticipating Leukemia Using Transfer Learning Method

This research paper focuses on Acute Lymphoblastic Leukemia (ALL), a form of blood cancer prevalent in children and teenagers, characterized by the rapid proliferation of immature white blood cells (WBCs). These atypical cells can overwhelm healthy cells, leading to severe health consequences. Early and accurate detection of ALL is vital for effective treatment and improving survival rates. Traditional diagnostic methods are time-consuming, costly, and prone to errors. The paper proposes an automated detection approach using computer-aided diagnostic (CAD) models, leveraging deep learning techniques to enhance the accuracy and efficiency of leukemia diagnosis. The study utilizes various transfer learning models like ResNet101V2, VGG19, InceptionV3, and InceptionResNetV2 for classifying ALL. The methodology includes using the Local Interpretable Model-Agnostic Explanations (LIME) for ensuring the validity and reliability of the AI system's predictions. This approach is critical for overcoming the "black box" nature of AI, where decisions made by models are often opaque and unaccountable. The paper highlights that the proposed method using the InceptionV3 model achieved an impressive 98.38% accuracy, outperforming other tested models. The results, verified by the LIME algorithm, showcase the potential of this method in accurately identifying ALL, providing a valuable tool for medical practitioners. The research underscores the impact of explainable artificial intelligence (XAI) in medical diagnostics, paving the way for more transparent and trustworthy AI applications in healthcare.

[236] 2312.00489

Optimal complexity of goal-oriented adaptive FEM for nonsymmetric linear elliptic PDEs

We analyze a goal-oriented adaptive algorithm that aims to efficiently compute the quantity of interest $G(u^\star)$ with a linear goal functional $G$ and the solution $u^\star$ to a general second-order nonsymmetric linear elliptic partial differential equation. The current state of the analysis of iterative algebraic solvers for nonsymmetric systems lacks the contraction property in the norms that are prescribed by the functional analytic setting. This seemingly prevents their application in the optimality analysis of goal-oriented adaptivity. As a remedy, this paper proposes a goal-oriented adaptive iteratively symmetrized finite element method (GOAISFEM). It employs a nested loop with a contractive symmetrization procedure, e.g., the Zarantonello iteration, and a contractive algebraic solver, e.g., an optimal multigrid solver. The various iterative procedures require well-designed stopping criteria such that the adaptive algorithm can effectively steer the local mesh refinement and the computation of the inexact discrete approximations. The main results consist of full linear convergence of the proposed adaptive algorithm and the proof of optimal convergence rates with respect to both degrees of freedom and total computational cost (i.e., optimal complexity). Numerical experiments confirm the theoretical results and investigate the selection of the parameters.

[237] 2312.00491

Slotted Aloha for Optical Wireless Communications in Internet of Underwater Things

In this work, we design and analyse a Slotted ALOHA (SA) solution for Optical Wireless Communication (OWC)-based Internet of Underwater Things (IoUT). In the proposed system, user devices exchange data with an access point (AP) which exploits the capture effect. The space spanned by the IoUT nodes is three-dimensional, i.e., users are located in half-sphere centered at the AP placed at the bottom of a floating object at the water surface level. The analytical expressions for the system throughput and reliability expressed in terms of the outage probability are derived. Based on the simulated signal-to-noise-and-interference-ratio statistics and derived analytical expressions, we present numerical results that investigate the trade-off between the system performance and the IoUT system parameters, such as the number of users, activation probability and type of water medium. The presented conclusions provide valuable insights into the design of an SA-based solution for IoUT communications.

[238] 2312.00499

Unveiling the Landscape of Smart Contract Vulnerabilities: A Detailed Examination and Codification of Vulnerabilities in Prominent Blockchains

With the rise in using immature smart contract programming languages to build a decentralized application, more vulnerabilities have been introduced to the Blockchain and were the main reasons behind critical financial losses. Moreover, the immutability of Blockchain technology makes deployed smart contracts unfixable for the whole life of the Blockchain itself. The lack of complete and up-to-date resources that explain those vulnerabilities in detail has also contributed to increasing the number of vulnerabilities in Blockchain. In addition, the lack of a standardized nomination of the existing vulnerabilities has made redundant research and made developers more confused. Therefore, in this paper, we propose the most complete list of smart contract vulnerabilities that exist in the most popular Blockchains with a detailed explanation of each one of them. In addition, we propose a new codification system that facilitates the communication of those vulnerabilities between developers and researchers. This codification, help identify the most uncovered vulnerabilities to focus on in future research. Moreover, the discussed list of vulnerabilities covers multiple Blockchain and could be used for even future built Blockchains.

[239] 2312.00500

Global Localization: Utilizing Relative Spatio-Temporal Geometric Constraints from Adjacent and Distant Cameras

Re-localizing a camera from a single image in a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality. In this work, we address the problem of estimating the 6 DoF camera pose relative to a global frame from a single image. We propose to leverage a novel network of relative spatial and temporal geometric constraints to guide the training of a Deep Network for localization. We employ simultaneously spatial and temporal relative pose constraints that are obtained not only from adjacent camera frames but also from camera frames that are distant in the spatio-temporal space of the scene. We show that our method, through these constraints, is capable of learning to localize when little or very sparse ground-truth 3D coordinates are available. In our experiments, this is less than 1% of available ground-truth data. We evaluate our method on 3 common visual localization datasets and show that it outperforms other direct pose estimation methods.

[240] 2312.00502

On the Out-Of-Distribution Robustness of Self-Supervised Representation Learning for Phonocardiogram Signals

Objective: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models, which do not suffer from degraded effectiveness when presented with newly-collected, out-of-distribution (OOD) datasets. Methods: Contrastive Self-Supervised Learning (SSL) offers a potential solution to the scarcity of labeled data as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations and evaluate trained classifiers on multiple datasets across different downstream tasks. Results: We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade up to 32% when evaluated on unseen data, while SSL models only lose up to 10% or even improve in some cases. Conclusions: Contrastive SSL pretraining can assist in providing robust classifiers which can generalize to unseen, OOD data, without relying on time- and labor-intensive annotation processes by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing. Significance: We provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.

[241] 2312.00506

Generative artificial intelligence enhances individual creativity but reduces the collective diversity of novel content

Creativity is core to being human. Generative artificial intelligence (GenAI) holds promise for humans to be more creative by offering new ideas, or less creative by anchoring on GenAI ideas. We study the causal impact of GenAI ideas on the production of an unstructured creative output in an online experimental study where some writers could obtain ideas for a story from a GenAI platform. We find that access to GenAI ideas causes stories to be evaluated as more creative, better written and more enjoyable, especially among less creative writers. However, objective measures of story similarity within each condition reveal that GenAI-enabled stories are more similar to each other than stories by humans alone. These results point to an increase in individual creativity, but at the same time there is a risk of losing collective novelty: this dynamic resembles a social dilemma where individual writers are better off using GenAI to improve their own writing, but collectively a narrower scope of novel content may be produced with GenAI. Our results have implications for researchers, policy-makers and practitioners interested in bolstering creativity, but point to potential downstream consequences from over-reliance.

[242] 2312.00507

VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

We propose VEXIR2Vec, a code embedding framework for finding similar functions in binaries. Our representations rely on VEX IR, the intermediate representation used by binary analysis tools like Valgrind and angr. Our proposed embeddings encode both syntactic and semantic information to represent a function, and is both application and architecture independent. We also propose POV, a custom Peephole Optimization engine that normalizes the VEX IR for effective similarity analysis. We design several optimizations like copy/constant propagation, constant folding, common subexpression elimination and load-store elimination in POV. We evaluate our framework on two experiments -- diffing and searching -- involving binaries targeting different architectures, compiled using different compilers and versions, optimization sequences, and obfuscations. We show results on several standard projects and on real-world vulnerabilities. Our results show that VEXIR2Vec achieves superior precision and recall values compared to the state-of-the-art works. Our framework is highly scalable and is built as a multi-threaded, parallel library by only using open-source tools. VEXIR2Vec achieves about $3.2 \times$ speedup on the closest competitor, and orders-of-magnitude speedup on other tools.

[243] 2312.00508

PyraTrans: Learning Attention-Enriched Multi-Scale Pyramid Network from Pre-Trained Transformers for Effective Malicious URL Detection

Detecting malicious URLs is a crucial aspect of web search and mining, significantly impacting internet security. Though advancements in machine learning have improved the effectiveness of detection methods, these methods still face significant challenges in their capacity to generalize and their resilience against evolving threats. In this paper, we propose PyraTrans, an approach that combines the strengths of pretrained Transformers and pyramid feature learning for improving malicious URL detection. We implement PyraTrans by leveraging a pretrained CharBERT as the base and augmenting it with 3 connected feature modules: 1) The Encoder Feature Extraction module, which extracts representations from each encoder layer of CharBERT to obtain multi-order features; 2) The Multi-Scale Feature Learning Module, which captures multi-scale local contextual insights and aggregate information across different layer-levels; and 3) The Pyramid Spatial Attention Module, which learns hierarchical and spatial feature attentions, highlighting critical classification signals while reducing noise. The proposed approach addresses the limitations of the Transformer in local feature learning and spatial awareness, and enabling us to extract multi-order, multi-scale URL feature representations with enhanced attentional focus. PyraTrans is evaluated using 4 benchmark datasets, where it demonstrated significant advancements over prior baseline methods. Particularly, on the imbalanced dataset, our method, with just 10% of the data for training, the TPR is 3.3-6.5 times and the F1-score is 2.9-4.5 times that of the baseline. Our approach also demonstrates robustness against adversarial attacks. Codes and data are available at

[244] 2312.00512

Attack Detection Using Item Vector Shift in Matrix Factorisation Recommenders

This paper proposes a novel method for detecting shilling attacks in Matrix Factorization (MF)-based Recommender Systems (RS), in which attackers use false user-item feedback to promote a specific item. Unlike existing methods that use either use supervised learning to distinguish between attack and genuine profiles or analyse target item rating distributions to detect false ratings, our method uses an unsupervised technique to detect false ratings by examining shifts in item preference vectors that exploit rating deviations and user characteristics, making it a promising new direction. The experimental results demonstrate the effectiveness of our approach in various attack scenarios, including those involving obfuscation techniques.

[245] 2312.00513

Summarization-based Data Augmentation for Document Classification

Despite the prevalence of pretrained language models in natural language understanding tasks, understanding lengthy text such as document is still challenging due to the data sparseness problem. Inspired by that humans develop their ability of understanding lengthy text from reading shorter text, we propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification. We first obtain easy-to-learn examples for the target document classification task by summarizing the input of the original training examples, while optionally merging the original labels to conform to the summarized input. We then use the generated pseudo examples to perform curriculum learning. Experimental results on two datasets confirmed the advantage of our method compared to existing baseline methods in terms of robustness and accuracy. We release our code and data at

[246] 2312.00516

Spatio-Temporal-Decoupled Masked Pre-training for Traffic Forecasting

Accurate forecasting of multivariate traffic flow time series remains challenging due to substantial spatio-temporal heterogeneity and complex long-range correlative patterns. To address this, we propose Spatio-Temporal-Decoupled Masked Pre-training (STD-MAE), a novel framework that employs masked autoencoders to learn and encode complex spatio-temporal dependencies via pre-training. Specifically, we use two decoupled masked autoencoders to reconstruct the traffic data along spatial and temporal axes using a self-supervised pre-training approach. These mask reconstruction mechanisms capture the long-range correlations in space and time separately. The learned hidden representations are then used to augment the downstream spatio-temporal traffic predictor. A series of quantitative and qualitative evaluations on four widely-used traffic benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) are conducted to verify the state-of-the-art performance, with STD-MAE explicitly enhancing the downstream spatio-temporal models' ability to capture long-range intricate spatial and temporal patterns. Codes are available at

[247] 2312.00518

Preprocess your Paths -- Speeding up Linear Programming-based Optimization for Segment Routing Traffic Engineering

Many state-of-the-art Segment Routing (SR) Traffic Engineering (TE) algorithms rely on Linear Program (LP)-based optimization. However, the poor scalability of the latter and the resulting high computation times impose severe restrictions on the practical usability of such approaches for many use cases. To tackle this problem, a variety of preprocessing approaches have been proposed that aim to reduce computational complexity by preemtively limiting the number of SR paths to consider during optimization. In this paper, we provide the first extensive literature review of existing preprocessing approaches for SR. Based on this, we conduct a large scale comparative study using various real-world topologies, including recent data from a Tier-1 Internet Service Provider (ISP) backbone. Based on the insights obtained from this evaluation, we finally propose a combination of multiple preprocessing approaches and show that this can reliably reduce computation times by around a factor of 10 or more, without resulting in relevant deterioration of the solution quality. This is a major improvement over the current state-of-the-art and facilitates the reliable usability of LP-based optimization for large segment-routed networks.

[248] 2312.00519

The Impact of Privacy and Security Attitudes and Concerns of Travellers on Their Willingness to Use Mobility-as-a-Service Systems

This paper reports results from an online survey on the impact of travellers' privacy and security attitudes and concerns on their willingness to use mobility-as-a-service (MaaS) systems. This study is part of a larger project that aims at investigating barriers to potential MaaS uptake. The online survey was designed to cover data privacy and security attitudes and concerns as well as a variety of socio-psychological and socio-demographic variables associated with travellers' intentions to use MaaS systems. The study involved $n=320$ UK participants recruited via the Prolific survey platform. Overall, correlation analysis and a multiple regression model indicated that, neither attitudes nor concerns of participants over the privacy and security of personal data would significantly impact their decisions to use MaaS systems, which was an unexpected result, however, their trust in (commercial and governmental) websites would. Another surprising result is that, having been a victim of improper invasion of privacy did not appear to affect individuals' intentions to use MaaS systems, whereas frequency with which one heard about misuse of personal data did. Implications of the results and future directions are also discussed, e.g., MaaS providers are encouraged to work on improving the trustworthiness of their corporate image.

[249] 2312.00522

No Ascending Auction can find Equilibrium for SubModular valuations

We show that no efficient ascending auction can guarantee to find even a minimal envy-free price vector if all valuations are submodular, assuming a basic complexity theory's assumption.

[250] 2312.00525

SurreyAI 2023 Submission for the Quality Estimation Shared Task

Quality Estimation (QE) systems are important in situations where it is necessary to assess the quality of translations, but there is no reference available. This paper describes the approach adopted by the SurreyAI team for addressing the Sentence-Level Direct Assessment shared task in WMT23. The proposed approach builds upon the TransQuest framework, exploring various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings. The autoencoder pre-trained language models employed in the proposed systems are XLMV, InfoXLM-large, and XLMR-large. The evaluation utilizes Spearman and Pearson correlation coefficients, assessing the relationship between machine-predicted quality scores and human judgments for 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil and English-Telugu). The MonoTQ-InfoXLM-large approach emerges as a robust strategy, surpassing all other individual models proposed in this study by significantly improving over the baseline for the majority of the language pairs.

[251] 2312.00526

Automated design space exploration for poultry processing systems using discrete-event simulation

The poultry processing industry struggles to keep up with new developments in meat consumption and livestock breeding. Designing poultry processing systems is becoming increasingly more complex, and an increasing number of iterations of (re)design are required to optimize the product flow in these systems. To address this issue, this study presents a discrete-event simulation-based method for design space exploration of production systems. This method is mostly automated, greatly reducing the time and effort required in the design process. The steps that are automated are iterating on the design, model construction, performing simulation experiments, and interpreting the simulation results. An industrial case study in which a poultry processing system is redesigned is used to validate the effectiveness of the proposed method. A detailed description of this case study is given to showcase the different ways in which this design space exploration method can be used.

[252] 2312.00532

DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality

Diminished reality (DR) refers to the removal of real objects from the environment by virtually replacing them with their background. Modern DR frameworks use inpainting to hallucinate unobserved regions. While recent deep learning-based inpainting is promising, the DR use case is complicated by the need to generate coherent structure and 3D geometry (i.e., depth), in particular for advanced applications, such as 3D scene editing. In this paper, we propose DeepDR, a first RGB-D inpainting framework fulfilling all requirements of DR: Plausible image and geometry inpainting with coherent structure, running at real-time frame rates, with minimal temporal artifacts. Our structure-aware generative network allows us to explicitly condition color and depth outputs on the scene semantics, overcoming the difficulty of reconstructing sharp and consistent boundaries in regions with complex backgrounds. Experimental results show that the proposed framework can outperform related work qualitatively and quantitatively.

[253] 2312.00534

LiDAR-based curb detection for ground truth annotation in automated driving validation

Curb detection is essential for environmental awareness in Automated Driving (AD), as it typically limits drivable and non-drivable areas. Annotated data are necessary for developing and validating an AD function. However, the number of public datasets with annotated point cloud curbs is scarce. This paper presents a method for detecting 3D curbs in a sequence of point clouds captured from a LiDAR sensor, which consists of two main steps. First, our approach detects the curbs at each scan using a segmentation deep neural network. Then, a sequence-level processing step estimates the 3D curbs in the reconstructed point cloud using the odometry of the vehicle. From these 3D points of the curb, we obtain polylines structured following ASAM OpenLABEL standard. These detections can be used as pre-annotations in labelling pipelines to efficiently generate curb-related ground truth data. We validate our approach through an experiment in which different human annotators were required to annotate curbs in a group of LiDAR-based sequences with and without our automatically generated pre-annotations. The results show that the manual annotation time is reduced by 50.99% thanks to our detections, keeping the data quality level.

[254] 2312.00536

Trained MT Metrics Learn to Cope with Machine-translated References

Neural metrics trained on human evaluations of MT tend to correlate well with human judgments, but their behavior is not fully understood. In this paper, we perform a controlled experiment and compare a baseline metric that has not been trained on human evaluations (Prism) to a trained version of the same metric (Prism+FT). Surprisingly, we find that Prism+FT becomes more robust to machine-translated references, which are a notorious problem in MT evaluation. This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.

[255] 2312.00538

A Preconditioned Interior Point Method for Support Vector Machines Using an ANOVA-Decomposition and NFFT-Based Matrix-Vector Products

In this paper we consider the numerical solution to the soft-margin support vector machine optimization problem. This problem is typically solved using the SMO algorithm, given the high computational complexity of traditional optimization algorithms when dealing with large-scale kernel matrices. In this work, we propose employing an NFFT-accelerated matrix-vector product using an ANOVA decomposition for the feature space that is used within an interior point method for the overall optimization problem. As this method requires the solution of a linear system of saddle point form we suggest a preconditioning approach that is based on low-rank approximations of the kernel matrix together with a Krylov subspace solver. We compare the accuracy of the ANOVA-based kernel with the default LIBSVM implementation. We investigate the performance of the different preconditioners as well as the accuracy of the ANOVA kernel on several large-scale datasets.

[256] 2312.00540

Target-agnostic Source-free Domain Adaptation for Regression Tasks

Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of domain gap distribution, and hence is limited to either target-aware or classification task. To overcome it, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches by averagely reducing 22% errors for the four tasks and achieve notably comparable accuracy as source-based UDA without using source data.

[257] 2312.00545

Hiding in text/plain sight: Security defences of Tor Onion Services

Tor Onion Services are a way to host websites and other internet services anonymously. Onion Services are often used to bypass internet censorship and provide information services to users in oppressive regimes. This paper presents an analysis of the security defences deployed on these Onion Services. Onion Services tend to have better security policy than sites on the clear web. However they lag behind in the deployment of HTTPS, a key defence to ensuring the security of users of such services.

[258] 2312.00548

Domain Adaptive Imitation Learning with Visual Observation

In this paper, we consider domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain. Domain adaptive imitation learning arises in practical scenarios where a robot, receiving visual sensory data, needs to mimic movements by visually observing other robots from different angles or observing robots of different shapes. To overcome the domain shift in cross-domain imitation learning with visual observation, we propose a novel framework for extracting domain-independent behavioral features from input observations that can be used to train the learner, based on dual feature extraction and image reconstruction. Empirical results demonstrate that our approach outperforms previous algorithms for imitation learning from visual observation with domain shift.

[259] 2312.00552

Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Unsupervised relation extraction (URE) aims to extract relations between named entities from raw text without requiring manual annotations or pre-existing knowledge bases. In recent studies of URE, researchers put a notable emphasis on contrastive learning strategies for acquiring relation representations. However, these studies often overlook two important aspects: the inclusion of diverse positive pairs for contrastive learning and the exploration of appropriate loss functions. In this paper, we propose AugURE with both within-sentence pairs augmentation and augmentation through cross-sentence pairs extraction to increase the diversity of positive pairs and strengthen the discriminative power of contrastive learning. We also identify the limitation of noise-contrastive estimation (NCE) loss for relation representation learning and propose to apply margin loss for sentence pairs. Experiments on NYT-FB and TACRED datasets demonstrate that the proposed relation representation learning and a simple K-Means clustering achieves state-of-the-art performance.

[260] 2312.00553

A Spatio-Temporal Graph Convolutional Network for Gesture Recognition from High-Density Electromyography

Accurate hand gesture prediction is crucial for effective upper-limb prosthetic limbs control. As the high flexibility and multiple degrees of freedom exhibited by human hands, there has been a growing interest in integrating deep networks with high-density surface electromyography (HD-sEMG) grids to enhance gesture recognition capabilities. However, many existing methods fall short in fully exploit the specific spatial topology and temporal dependencies present in HD-sEMG data. Additionally, these studies are often limited number of gestures and lack generality. Hence, this study introduces a novel gesture recognition method, named STGCN-GR, which leverages spatio-temporal graph convolution networks for HD-sEMG-based human-machine interfaces. Firstly, we construct muscle networks based on functional connectivity between channels, creating a graph representation of HD-sEMG recordings. Subsequently, a temporal convolution module is applied to capture the temporal dependences in the HD-sEMG series and a spatial graph convolution module is employed to effectively learn the intrinsic spatial topology information among distinct HD-sEMG channels. We evaluate our proposed model on a public HD-sEMG dataset comprising a substantial number of gestures (i.e., 65). Our results demonstrate the remarkable capability of the STGCN-GR method, achieving an impressive accuracy of 91.07% in predicting gestures, which surpasses state-of-the-art deep learning methods applied to the same dataset.

[261] 2312.00554

Questioning Biases in Case Judgment Summaries: Legal Datasets or Large Language Models?

The evolution of legal datasets and the advent of large language models (LLMs) have significantly transformed the legal field, particularly in the generation of case judgment summaries. However, a critical concern arises regarding the potential biases embedded within these summaries. This study scrutinizes the biases present in case judgment summaries produced by legal datasets and large language models. The research aims to analyze the impact of biases on legal decision making. By interrogating the accuracy, fairness, and implications of biases in these summaries, this study contributes to a better understanding of the role of technology in legal contexts and the implications for justice systems worldwide. In this study, we investigate biases wrt Gender-related keywords, Race-related keywords, Keywords related to crime against women, Country names and religious keywords. The study shows interesting evidences of biases in the outputs generated by the large language models and pre-trained abstractive summarization models. The reasoning behind these biases needs further studies.

[262] 2312.00561

Interior Point Constrained Reinforcement Learning with Global Convergence Guarantees

We consider discounted infinite horizon constrained Markov decision processes (CMDPs) where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing an algorithm that ensures constraint satisfaction during learning. To this end, we develop a zeroth-order interior point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees feasibility of the policies during the learning process and converges to the optimal policy with a sample complexity of $O(\varepsilon^{-6})$. In comparison to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $O(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with same Fisher-non-degenerate parameterization.

[263] 2312.00564

The Discontinuous Strain Method: accurately representing fatigue and failure

Fatigue simulation requires accurate modeling of unloading and reloading. However, classical ductile damage models treat deformations after complete failure as irrecoverable -- which leads to unphysical behavior during unloading. This unphysical behavior stems from the continued accumulation of plastic strains after failure, resulting in an incorrect stress state at crack closure. As a remedy, we introduce a \textit{discontinuity strain} in the additive elasto-plastic strain decomposition, which absorbs the excess strain after failure. This allows representing pre- and post-cracking regimes in a fully continuous setting, wherein the transition from the elasto-plastic response to cracking can be triggered at any arbitrary stage in a completely smooth manner. Moreover, the presented methodology does not exhibit the spurious energy release observed in hybrid approaches. In addition, our approach guarantees mesh-independent results by relying on a characteristic length scale -- based on the discretization's resolution. We name this new methodology the \textit{discontinuous strain method}. The proposed approach requires only minor modifications of conventional plastic-damage routines. To convey the method in a didactic manner, the algorithmic modifications are first discussed for one- and subsequently for two-/three-dimensional implementations. Using a simple ductile constitutive model, the discontinuous strain method is validated against established two-dimensional benchmarks. The method is, however, independent of the employed constitutive model. Elastic, plastic, and damage models may thus be chosen arbitrarily. Furthermore, computational efforts associated with the method are minimal, rendering it advantageous for accurately representing low-cycle fatigue but potentially also for other scenarios requiring a discontinuity representation within a plastic-damage framework.

[264] 2312.00567

Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

Developing the required technology to assist medical experts in their everyday activities is currently a hot topic in the Artificial Intelligence research field. Thus, a number of large language models (LLMs) and automated benchmarks have recently been proposed with the aim of facilitating information extraction in Evidence-Based Medicine (EBM) using natural language as a tool for mediating in human-AI interaction. The most representative benchmarks are limited to either multiple-choice or long-form answers and are available only in English. In order to address these shortcomings, in this paper we present a new dataset which, unlike previous work: (i) includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct; (ii) the explanations are written originally by medical doctors to answer questions from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to setup a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors. An additional benefit of our setting is that we can leverage the extractive QA paradigm to automatically evaluate performance of LLMs without resorting to costly manual evaluation by medical experts. Comprehensive experimentation with language models for Spanish shows that sometimes multilingual models fare better than monolingual ones, even outperforming models which have been adapted to the medical domain. Furthermore, results across the monolingual models are mixed, with supposedly smaller and inferior models performing competitively. In any case, the obtained results show that our novel dataset and approach can be an effective technique to help medical practitioners in identifying relevant evidence-based explanations for medical questions.

[265] 2312.00570

Generative models for visualising abstract social processes: Guiding streetview image synthesis of StyleGAN2 with indices of deprivation

This paper presents a novel application of Generative Adverserial Networks (GANs) to study visual aspects of social processes. I train a a StyleGAN2-model on a custom dataset of 14,564 images of London, sourced from Google Streetview taken in London. After training, I invert the images in the training set, finding points in the model's latent space that correspond to them, and compare results from three inversion techniques. I connect each data point with metadata from the Indices of Multiple Deprivation, describing income, health and environmental quality in the area where the photographs were taken. It is then possible to map which parts of the model's latent space encode visual features that are distinctive for health, income and environmental quality, and condition the synthesis of new images based on these factors. The synthetic images created reflect visual features of social processes that were previously unknown and difficult to study, describing recurring visual differences between deprived and privileged areas in London. GANs are known for their capability to produce a continuous range of images that exhibit visual differences. The paper tests how to exploit this ability through visual comparisons in still images as well as through an interactive website where users can guide image synthesis with sliders. Though conditioned synthesis has its limitations and the results are difficult to validate, the paper points to the potential for generative models to be repurposed to be parts of social scientific methods.

[266] 2312.00575

Instruction-tuning Aligns LLMs to the Human Brain

Instruction-tuning is a widely adopted method of finetuning that enables large language models (LLMs) to generate output that more closely resembles human responses to natural language queries, in many cases leading to human-level performance on diverse testbeds. However, it remains unclear whether instruction-tuning truly makes LLMs more similar to how humans process language. We investigate the effect of instruction-tuning on LLM-human similarity in two ways: (1) brain alignment, the similarity of LLM internal representations to neural activity in the human language system, and (2) behavioral alignment, the similarity of LLM and human behavior on a reading task. We assess 25 vanilla and instruction-tuned LLMs across three datasets involving humans reading naturalistic stories and sentences. We discover that instruction-tuning generally enhances brain alignment by an average of 6%, but does not have a similar effect on behavioral alignment. To identify the factors underlying LLM-brain alignment, we compute correlations between the brain alignment of LLMs and various model properties, such as model size, various problem-solving abilities, and performance on tasks requiring world knowledge spanning various domains. Notably, we find a strong positive correlation between brain alignment and model size (r = 0.95), as well as performance on tasks requiring world knowledge (r = 0.81). Our results demonstrate that instruction-tuning LLMs improves both world knowledge representations and brain alignment, suggesting that mechanisms that encode world knowledge in LLMs also improve representational alignment to the human brain.

[267] 2312.00580

Using Honeybuckets to Characterize Cloud Storage Scanning in the Wild

In this work, we analyze to what extent actors target poorly-secured cloud storage buckets for attack. We deployed hundreds of AWS S3 honeybuckets with different names and content to lure and measure different scanning strategies. Actors exhibited clear preferences for scanning buckets that appeared to belong to organizations, especially commercial entities in the technology sector with a vulnerability disclosure program. Actors continuously engaged with the content of buckets by downloading, uploading, and deleting files. Most alarmingly, we recorded multiple instances in which malicious actors downloaded, read, and understood a document from our honeybucket, leading them to attempt to gain unauthorized server access.

[268] 2312.00581

Pathway to a fully data-driven geotechnics: lessons from materials informatics

This paper elucidates the challenges and opportunities inherent in integrating data-driven methodologies into geotechnics, drawing inspiration from the success of materials informatics. Highlighting the intricacies of soil complexity, heterogeneity, and the lack of comprehensive data, the discussion underscores the pressing need for community-driven database initiatives and open science movements. By leveraging the transformative power of deep learning, particularly in feature extraction from high-dimensional data and the potential of transfer learning, we envision a paradigm shift towards a more collaborative and innovative geotechnics field. The paper concludes with a forward-looking stance, emphasizing the revolutionary potential brought about by advanced computational tools like large language models in reshaping geotechnics informatics.

[269] 2312.00582

Design Patterns for Machine Learning Based Systems with Human-in-the-Loop

The development and deployment of systems using supervised machine learning (ML) remain challenging: mainly due to the limited reliability of prediction models and the lack of knowledge on how to effectively integrate human intelligence into automated decision-making. Humans involvement in the ML process is a promising and powerful paradigm to overcome the limitations of pure automated predictions and improve the applicability of ML in practice. We compile a catalog of design patterns to guide developers select and implement suitable human-in-the-loop (HiL) solutions. Our catalog takes into consideration key requirements as the cost of human involvement and model retraining. It includes four training patterns, four deployment patterns, and two orthogonal cooperation patterns.

[270] 2312.00583

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes

Accurate 3D tracking in highly deformable scenes with occlusions and shadows can facilitate new applications in robotics, augmented reality, and generative AI. However, tracking under these conditions is extremely challenging due to the ambiguity that arises with large deformations, shadows, and occlusions. We introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view synthesis, using video captures of a dynamic scene from various camera poses. MD-Splatting builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel view synthesis. MD-Splatting learns a deformation function to project a set of Gaussians with non-metric, thus canonical, properties into metric space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on local rigidity, conservation of momentum, and isometry, which leads to trajectories with smaller trajectory errors. MD-Splatting achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. Compared to state-of-the-art, we improve 3D tracking by an average of 23.9 %, while simultaneously achieving high-quality novel view synthesis. With sufficient texture such as in scene 6, MD-Splatting achieves a median tracking error of 3.39 mm on a cloth of 1 x 1 meters in size. Project website:

[271] 2312.00584

The Ethics of Automating Legal Actors

The introduction of large public legal datasets has brought about a renaissance in legal NLP. Many of these datasets are comprised of legal judgements - the product of judges deciding cases. This fact, together with the way machine learning works, means that several legal NLP models are models of judges. While some have argued for the automation of judges, in this position piece, we argue that automating the role of the judge raises difficult ethical challenges, in particular for common law legal systems. Our argument follows from the social role of the judge in actively shaping the law, rather than merely applying it. Since current NLP models come nowhere close to having the facilities necessary for this task, they should not be used to automate judges. Furthermore, even in the case the models could achieve human-level capabilities, there would still be remaining ethical concerns inherent in the automation of the legal process.

[272] 2312.00586

Explainable Fraud Detection with Deep Symbolic Classification

There is a growing demand for explainable, transparent, and data-driven models within the domain of fraud detection. Decisions made by fraud detection models need to be explainable in the event of a customer dispute. Additionally, the decision-making process in the model must be transparent to win the trust of regulators and business stakeholders. At the same time, fraud detection solutions can benefit from data due to the noisy, dynamic nature of fraud and the availability of large historical data sets. Finally, fraud detection is notorious for its class imbalance: there are typically several orders of magnitude more legitimate transactions than fraudulent ones. In this paper, we present Deep Symbolic Classification (DSC), an extension of the Deep Symbolic Regression framework to classification problems. DSC casts classification as a search problem in the space of all analytic functions composed of a vocabulary of variables, constants, and operations and optimizes for an arbitrary evaluation metric directly. The search is guided by a deep neural network trained with reinforcement learning. Because the functions are mathematical expressions that are in closed-form and concise, the model is inherently explainable both at the level of a single classification decision and the model's decision process. Furthermore, the class imbalance problem is successfully addressed by optimizing for metrics that are robust to class imbalance such as the F1 score. This eliminates the need for oversampling and undersampling techniques that plague traditional approaches. Finally, the model allows to explicitly balance between the prediction accuracy and the explainability. An evaluation on the PaySim data set demonstrates competitive predictive performance with state-of-the-art models, while surpassing them in terms of explainability. This establishes DSC as a promising model for fraud detection systems.

[273] 2312.00587

Efficient and Secure Energy Trading with Electric Vehicles and Distributed Ledger Technology

Efficient energy management of Distributed Renewable Energy Resources (DRER) enables a more sustainable and efficient energy ecosystem. Therefore, we propose a holistic Energy Management System (EMS), utilising the computational and energy storage capabilities of nearby Electric Vehicles (EVs), providing a low-latency and efficient management platform for DRER. Through leveraging the inherent, immutable features of Distributed Ledger Technology (DLT) and smart contracts, we create a secure management environment, facilitating interactions between multiple EVs and energy resources. Using a privacy-preserving load forecasting method powered by Vehicular Fog Computing (VFC), we integrate the computational resources of the EVs. Using DLT and our forecasting framework, we accommodate efficient management algorithms in a secure and low-latency manner enabling greater utilisation of the energy storage resources. Finally, we assess our proposed EMS in terms of monetary and energy utility metrics, establishing the increased benefits of multiple interacting EVs and load forecasting. Through the proposed system, we have established the potential of our framework to create a more sustainable and efficient energy ecosystem whilst providing measurable benefits to participating agents.

[274] 2312.00588

LucidDreaming: Controllable Object-Centric 3D Generation

With the recent development of generative models, Text-to-3D generations have also seen significant growth. Nonetheless, achieving precise control over 3D generation continues to be an arduous task, as using text to control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often entail the introduction of additional parameters, such as customized diffusion models. This often induces hardness in adapting to different diffusion models or creating distinct objects. In this paper, we present LucidDreaming as an effective pipeline capable of fine-grained control over 3D generation. It requires only minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects with user specifications. We also introduce object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimizing of objects, our method excels not only in controlled content generation from scratch but also within the pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and achieves superior alignment of 3D content when compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes, benchmarking 3D spatial controllability.

[275] 2312.00589

Merlin:Empowering Multimodal LLMs with Foresight Minds

Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term as foresight minds. However, this capability remains largely under explored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of LLMs. Specifically, FPT jointly training various tasks centered on trajectories, enabling MLLMs to learn how to attend and predict entire trajectories from a given initial observation. Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them. Aided by FPT and FIT, we build a novel and unified MLLM named Merlin that supports multi-images input and analysis about potential actions of multiple objects for the future reasoning. Experimental results show Merlin powerful foresight minds with impressive performance on both future reasoning and visual comprehension tasks.

[276] 2312.00591

Less is More: Learning Reference Knowledge Using No-Reference Image Quality Assessment

Image Quality Assessment (IQA) with reference images have achieved great success by imitating the human vision system, in which the image quality is effectively assessed by comparing the query image with its pristine reference image. However, for the images in the wild, it is quite difficult to access accurate reference images. We argue that it is possible to learn reference knowledge under the No-Reference Image Quality Assessment (NR-IQA) setting, which is effective and efficient empirically. Concretely, by innovatively introducing a novel feature distillation method in IQA, we propose a new framework to learn comparative knowledge from non-aligned reference images. And then, to achieve fast convergence and avoid overfitting, we further propose an inductive bias regularization. Such a framework not only solves the congenital defects of NR-IQA but also improves the feature extraction framework, enabling it to express more abundant quality information. Surprisingly, our method utilizes less input while obtaining a more significant improvement compared to the teacher models. Extensive experiments on eight standard NR-IQA datasets demonstrate the superior performance to the state-of-the-art NR-IQA methods, i.e., achieving the PLCC values of 0.917 (vs. 0.884 in LIVEC) and 0.686 (vs. 0.661 in LIVEFB).

[277] 2312.00592

Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)

Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at

[278] 2312.00593

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers

Analyzing laparoscopic surgery videos presents a complex and multifaceted challenge, with applications including surgical training, intra-operative surgical complication prediction, and post-operative surgical assessment. Identifying crucial events within these videos is a significant prerequisite in a majority of these applications. In this paper, we introduce a comprehensive dataset tailored for relevant event recognition in laparoscopic gynecology videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. To validate the precision of our annotations, we assess event recognition performance using several CNN-RNN architectures. Furthermore, we introduce and evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos. Leveraging the Transformer networks, our proposed architecture harnesses inter-frame dependencies to counteract the adverse effects of relevant content occlusion, motion blur, and surgical scene variation, thus significantly enhancing event recognition accuracy. Moreover, we present a frame sampling strategy designed to manage variations in surgical scenes and the surgeons' skill level, resulting in event recognition with high temporal resolution. We empirically demonstrate the superiority of our proposed methodology in event recognition compared to conventional CNN-RNN architectures through a series of extensive experiments.

[279] 2312.00596

BCN: Batch Channel Normalization for Image Classification

Normalization techniques have been widely used in the field of deep learning due to their capability of enabling higher learning rates and are less careful in initialization. However, the effectiveness of popular normalization technologies is typically limited to specific areas. Unlike the standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N,H,W) dimensions and LN computes the mean and variance along the (C,H,W) dimensions (N, C, H and W are the batch, channel, spatial height and width dimension, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both the channel and batch dependence and adaptively and combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architecture. The code is publicly available at

[280] 2312.00597

UAVs and Birds: Enhancing Short-Range Navigation through Budgerigar Flight Studies

This study delves into the flight behaviors of Budgerigars (Melopsittacus undulatus) to gain insights into their flight trajectories and movements. Using 3D reconstruction from stereo video camera recordings, we closely examine the velocity and acceleration patterns during three flight motion takeoff, flying and landing. The findings not only contribute to our understanding of bird behaviors but also hold significant implications for the advancement of algorithms in Unmanned Aerial Vehicles (UAVs). The research aims to bridge the gap between biological principles observed in birds and the application of these insights in developing more efficient and autonomous UAVs. In the context of the increasing use of drones, this study focuses on the biologically inspired principles drawn from bird behaviors, particularly during takeoff, flying and landing flight, to enhance UAV capabilities. The dataset created for this research sheds light on Budgerigars' takeoff, flying, and landing techniques, emphasizing their ability to control speed across different situations and surfaces. The study underscores the potential of incorporating these principles into UAV algorithms, addressing challenges related to short-range navigation, takeoff, flying, and landing.

[281] 2312.00598

Learning from One Continuous Video Stream

We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.

[282] 2312.00600

Improving Plasticity in Online Continual Learning via Collaborative Learning

Online Continual Learning (CL) solves the problem of learning the ever-emerging new classification tasks from a continuous data stream. Unlike its offline counterpart, in online CL, the training data can only be seen once. Most existing online CL research regards catastrophic forgetting (i.e., model stability) as almost the only challenge. In this paper, we argue that the model's capability to acquire new knowledge (i.e., model plasticity) is another challenge in online CL. While replay-based strategies have been shown to be effective in alleviating catastrophic forgetting, there is a notable gap in research attention toward improving model plasticity. To this end, we propose Collaborative Continual Learning (CCL), a collaborative learning based strategy to improve the model's capability in acquiring new concepts. Additionally, we introduce Distillation Chain (DC), a novel collaborative learning scheme to boost the training of the models. We adapted CCL-DC to existing representative online CL works. Extensive experiments demonstrate that even if the learners are well-trained with state-of-the-art online CL methods, our strategy can still improve model plasticity dramatically, and thereby improve the overall performance by a large margin.

[283] 2312.00601

Online Graph Coloring with Predictions

We introduce learning augmented algorithms to the online graph coloring problem. Although the simple greedy algorithm FirstFit is known to perform poorly in the worst case, we are able to establish a relationship between the structure of any input graph $G$ that is revealed online and the number of colors that FirstFit uses for $G$. Based on this relationship, we propose an online coloring algorithm FirstFitPredictions that extends FirstFit while making use of machine learned predictions. We show that FirstFitPredictions is both \emph{consistent} and \emph{smooth}. Moreover, we develop a novel framework for combining online algorithms at runtime specifically for the online graph coloring problem. Finally, we show how this framework can be used to robustify by combining it with any classical online coloring algorithm (that disregards the predictions).

[284] 2312.00610

Experiment on Gender and Racial/Ethnic Bias Against Video Game Streamers: Comparing Perceived Gameplay Skill and Viewer Engagement

Research suggests there is a perception that females and underrepresented racial/ethnic minorities have worse gameplay skills and produce less engaging video game streaming content. This bias might impact streamers' audience size, viewers' financial patronage of a streamer, streamers' sponsorship offers, etc. However, few studies on this topic use experimental methods. To fill this gap, we conducted a between-subjects survey experiment to examine if viewers are biased against video game streamers based on the streamer's gender or race/ethnicity. 200 survey participants rated the gameplay skill and viewer engagement of an identical gameplay recording. The only change between experimental conditions was the streamer's name who purportedly created the recording. The Dunnett's test found no statistically significant differences in viewer engagement ratings when comparing White male streamers to either White female (p = 0.37), Latino male (p = 0.66), or Asian male (p = 0.09) streamers. Similarly, there were no statistically significant differences in gameplay skill ratings when comparing White male streamers to either White female (p = 0.10), Latino male (p = 1.00), or Asian male (p = 0.59) streamers. Potential contributors to statistically non-significant results and counter-intuitive results (i.e., White females received non-significantly higher ratings than White males) are discussed.

[285] 2312.00616

Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry

In a longitudinal clinical registry, different measurement instruments might have been used for assessing individuals at different time points. To combine them, we investigate deep learning techniques for obtaining a joint latent representation, to which the items of different measurement instruments are mapped. This corresponds to domain adaptation, an established concept in computer science for image data. Using the proposed approach as an example, we evaluate the potential of domain adaptation in a longitudinal cohort setting with a rather small number of time points, motivated by an application with different motor function measurement instruments in a registry of spinal muscular atrophy (SMA) patients. There, we model trajectories in the latent representation by ordinary differential equations (ODEs), where person-specific ODE parameters are inferred from baseline characteristics. The goodness of fit and complexity of the ODE solutions then allows to judge the measurement instrument mappings. We subsequently explore how alignment can be improved by incorporating corresponding penalty terms into model fitting. To systematically investigate the effect of differences between measurement instruments, we consider several scenarios based on modified SMA data, including scenarios where a mapping should be feasible in principle and scenarios where no perfect mapping is available. While misalignment increases in more complex scenarios, some structure is still recovered, even if the availability of measurement instruments depends on patient state. A reasonable mapping is feasible also in the more complex real SMA dataset. These results indicate that domain adaptation might be more generally useful in statistical modeling for longitudinal registry data.

[286] 2312.00622

Practical Path-based Bayesian Optimization

There has been a surge in interest in data-driven experimental design with applications to chemical engineering and drug manufacturing. Bayesian optimization (BO) has proven to be adaptable to such cases, since we can model the reactions of interest as expensive black-box functions. Sometimes, the cost of this black-box functions can be separated into two parts: (a) the cost of the experiment itself, and (b) the cost of changing the input parameters. In this short paper, we extend the SnAKe algorithm to deal with both types of costs simultaneously. We further propose extensions to the case of a maximum allowable input change, as well as to the multi-objective setting.

[287] 2312.00626

Forecasting Trends in Food Security: a Reservoir Computing Approach

Early warning systems are an essential tool for effective humanitarian action. Advance warnings on impending disasters facilitate timely and targeted response which help save lives, livelihoods, and scarce financial resources. In this work we present a new quantitative methodology to forecast levels of food consumption for 60 consecutive days, at the sub-national level, in four countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on publicly available data from the World Food Programme's integrated global hunger monitoring system which collects, processes, and displays daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity across 90 countries ( In this study, we assessed the performance of various models including ARIMA, XGBoost, LSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared Error (RMSE) metrics. This comprehensive analysis spanned classical statistical, machine learning, and deep learning approaches. Our findings highlight Reservoir Computing as a particularly well-suited model in the field of food security given both its notable resistance to over-fitting on limited data samples and its efficient training capabilities. The methodology we introduce establishes the groundwork for a global, data-driven early warning system designed to anticipate and detect food insecurity.

[288] 2312.00627

Rethinking the Domain Gap in Near-infrared Face Recognition

Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visual domains of visible (VIS) and near-infrared (NIR). While much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level, our work deviates from this trend. We observe that large neural networks, unlike their smaller counterparts, when pre-trained on large scale homogeneous VIS data, demonstrate exceptional zero-shot performance in HFR, suggesting that the domain gap might be less pronounced than previously believed. By approaching the HFR problem as one of low-data fine-tuning, we introduce a straightforward framework: comprehensive pre-training, succeeded by a regularized fine-tuning strategy, that matches or surpasses the current state-of-the-art on four publicly available benchmarks. Corresponding codes can be found at

[289] 2312.00629

The Ecosystem of Trust (EoT): Enabling effective deployment of autonomous systems through collaborative and trusted ecosystems

Ecosystems are ubiquitous but trust within them is not guaranteed. Trust is paramount because stakeholders within an ecosystem must collaborate to achieve their objectives. With the twin transitions, digital transformation to go in parallel with green transition, accelerating the deployment of autonomous systems, trust has become even more critical to ensure that the deployed technology creates value. To address this need, we propose an ecosystem of trust approach to support deployment of technology by enabling trust among and between stakeholders, technologies and infrastructures, institutions and governance, and the artificial and natural environments in an ecosystem. The approach can help the stakeholders in the ecosystem to create, deliver, and receive value by addressing their concerns and aligning their objectives. We present an autonomous, zero emission ferry as a real world use case to demonstrate the approach from a stakeholder perspective. We argue that assurance, defined as grounds for justified confidence originated from evidence and knowledge, is a prerequisite to enable the approach. Assurance provides evidence and knowledge that are collected, analysed, and communicated in a systematic, targeted, and meaningful way. Assurance can enable the approach to help successfully deploy technology by ensuring that risk is managed, trust is shared, and value is created.

[290] 2312.00630

Numerical computation of the stress concentration between closely located stiff inclusions of general shapes

When two stiff inclusions are closely located, the gradient of the solution may become arbitrarily large as the distance between two inclusions tends to zero. Since blow-up of the gradient occurs in the narrow region, fine meshes should be required to compute the gradient. Thus, it is a challenging problem to numerically compute the gradient. Recent studies have shown that the major singularity can be extracted in an explicit way, so it suffices to compute the residual term for which only regular meshes are required. In this paper, we show through numerical simulations that the characterization of the singular term method can be efficiently used for the computation of the gradient when two strongly convex stiff domains of general shapes are closely located.

[291] 2312.00633

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a prevalent approach in the field of autonomous driving. Despite the demonstrated improvements in accuracy and velocity estimation compared to perspective view methods, the deployment of BEV-based techniques in real-world autonomous vehicles remains challenging. This is primarily due to their reliance on vision-transformer (ViT) based architectures, which introduce quadratic complexity with respect to the input resolution. To address this issue, we propose an efficient BEV-based 3D detection framework called BEVENet, which leverages a convolutional-only architectural design to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3$\times$ faster than contemporary state-of-the-art (SOTA) approaches on the NuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes detection score (NDS) of 0.555 on the NuScenes validation dataset, with an inference speed of 47.6 frames per second. To the best of our knowledge, this study stands as the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their enhanced feasibility for real-world autonomous driving applications.

[292] 2312.00638

What if an SQL Statement Returned a Database?

Every SQL statement is limited to return a single, possibly denormalized, table. This design decision has far reaching consequences. (1.) for databases users in terms of slow query performance, long query result transfer times, usability-issues of SQL in web applications and object-relational mappers. In addition, (2.) for database architects it has consequences when designing query optimizers leading to logical (algebraic) join enumeration effort, memory consumption for intermediate result materialization, and physical operator selection effort. So basically, the entire query optimization stack is shaped by that design decision. In this paper, we argue that the single-table limitation should be dropped. We extend the SELECT-clause of SQL by a keyword 'RESULTDB' to support returning a result database. Our approach has clear semantics, i.e. our extended SQL returns subsets of all tables with only those tuples that would be part of the traditional (single-table) query result set, however without performing any denormalization through joins. Our SQL-extension is downward compatible. Moreover, we discuss the surprisingly long list of benefits of our approach. First, for database users: far simpler and more readable application code, better query performance, smaller query results, better query result transfer times. Second, for database architects, we present how to leverage existing closed source systems as well as change open source database systems to support our feature. We propose a couple of algorithms to integrate our feature into both closed-source as well as open source database systems. We present an initial experimental study with promising results.

[293] 2312.00639

EvE: Exploiting Generative Priors for Radiance Field Enrichment

Modeling large-scale scenes from unconstrained image collections in-the-wild has proven to be a major challenge in computer vision. Existing methods tackling in-the-wild neural rendering operate in a closed-world setting, where knowledge is limited to a scene's captured images within a training set. We propose EvE, which is, to the best of our knowledge, the first method leveraging generative priors to improve in-the-wild scene modeling. We employ pre-trained generative networks to enrich K-Planes representations with extrinsic knowledge. To this end, we define an alternating training procedure to conduct optimization guidance of K-Planes trained on the training set. We carry out extensive experiments and verify the merit of our method on synthetic data as well as real tourism photo collections. EvE enhances rendered scenes with richer details and outperforms the state of the art on the task of novel view synthesis in-the-wild. Our project page can be found at .

[294] 2312.00644

Neural networks for the approximation of Euler's elastica

Euler's elastica is a classical model of flexible slender structures, relevant in many industrial applications. Static equilibrium equations can be derived via a variational principle. The accurate approximation of solutions of this problem can be challenging due to nonlinearity and constraints. We here present two neural network based approaches for the simulation of this Euler's elastica. Starting from a data set of solutions of the discretised static equilibria, we train the neural networks to produce solutions for unseen boundary conditions. We present a $\textit{discrete}$ approach learning discrete solutions from the discrete data. We then consider a $\textit{continuous}$ approach using the same training data set, but learning continuous solutions to the problem. We present numerical evidence that the proposed neural networks can effectively approximate configurations of the planar Euler's elastica for a range of different boundary conditions.

[295] 2312.00645

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

There is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.

[296] 2312.00647

MaxMem: Colocation and Performance for Big Data Applications on Tiered Main Memory Servers

We present MaxMem, a tiered main memory management system that aims to maximize Big Data application colocation and performance. MaxMem uses an application-agnostic and lightweight memory occupancy control mechanism based on fast memory miss ratios to provide application QoS under increasing colocation. By relying on memory access sampling and binning to quickly identify per-process memory heat gradients, MaxMem maximizes performance for many applications sharing tiered main memory simultaneously. MaxMem is designed as a user-space memory manager to be easily modifiable and extensible, without complex kernel code development. On a system with tiered main memory consisting of DRAM and Intel Optane persistent memory modules, our evaluation confirms that MaxMem provides 11% and 38% better throughput and up to 80% and an order of magnitude lower 99th percentile latency than HeMem and Linux AutoNUMA, respectively, with a Big Data key-value store in dynamic colocation scenarios.

[297] 2312.00648

SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers

Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at .

[298] 2312.00651

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, the potential in generating high-quality tracking sequences, a crucial aspect in the field of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from the tracklets. TrackDiffusion represents a significant departure from the traditional layout-to-image (L2I) generation and copy-paste synthesis focusing on static image elements like bounding boxes by empowering image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.

[299] 2312.00655

Machine Learning for Health symposium 2023 -- Findings track

A collection of the accepted Findings papers that were presented at the 3rd Machine Learning for Health symposium (ML4H 2023), which was held on December 10, 2023, in New Orleans, Louisiana, USA. ML4H 2023 invited high-quality submissions on relevant problems in a variety of health-related disciplines including healthcare, biomedicine, and public health. Two submission tracks were offered: the archival Proceedings track, and the non-archival Findings track. Proceedings were targeted at mature work with strong technical sophistication and a high impact to health. The Findings track looked for new ideas that could spark insightful discussion, serve as valuable resources for the community, or could enable new collaborations. Submissions to the Proceedings track, if not accepted, were automatically considered for the Findings track. All the manuscripts submitted to ML4H Symposium underwent a double-blind peer-review process.

[300] 2312.00656

Simple Transferability Estimation for Regression Tasks

We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.

[301] 2312.00658

A Data-Driven Safety Preserving Control Architecture for Constrained Cyber-Physical Systems

In this paper, we propose a data-driven networked control architecture for unknown and constrained cyber-physical systems capable of detecting networked false-data-injection attacks and ensuring plant's safety. In particular, on the controller's side, we design a novel robust anomaly detector that can discover the presence of network attacks using a data-driven outer approximation of the expected robust one-step reachable set. On the other hand, on the plant's side, we design a data-driven safety verification module, which resorts to worst-case arguments to determine if the received control input is safe for the plant's evolution. Whenever necessary, the same module is in charge of replacing the networked controller with a local data-driven set-theoretic model predictive controller, whose objective is to keep the plant's trajectory in a pre-established safe configuration until an attack-free condition is recovered. Numerical simulations involving a two-tank water system illustrate the features and capabilities of the proposed control architecture.

[302] 2312.00660

Resource-constrained knowledge diffusion processes inspired by human peer learning

We consider a setting where a population of artificial learners is given, and the objective is to optimize aggregate measures of performance, under constraints on training resources. The problem is motivated by the study of peer learning in human educational systems. In this context, we study natural knowledge diffusion processes in networks of interacting artificial learners. By `natural', we mean processes that reflect human peer learning where the students' internal state and learning process is mostly opaque, and the main degree of freedom lies in the formation of peer learning groups by a coordinator who can potentially evaluate the learners before assigning them to peer groups. Among else, we empirically show that such processes indeed make effective use of the training resources, and enable the design of modular neural models that have the capacity to generalize without being prone to overfitting noisy labels.

[303] 2312.00662

Nonparametric Variational Regularisation of Pretrained Transformers

The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has lead to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result the models perform poorly when the domain changes. Also, due to the model's scale, the cost of fine-tuning the model to the new domain is large. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.

[304] 2312.00663

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at:

[305] 2312.00664

Model bias identification for Bayesian calibration of stochastic digital twins of bridges

Simulation-based digital twins must provide accurate, robust and reliable digital representations of their physical counterparts. Quantifying the uncertainty in their predictions plays, therefore, a key role in making better-informed decisions that impact the actual system. The update of the simulation model based on data must be then carefully implemented. When applied to complex standing structures such as bridges, discrepancies between the computational model and the real system appear as model bias, which hinders the trustworthiness of the digital twin and increases its uncertainty. Classical Bayesian updating approaches aiming to infer the model parameters often fail at compensating for such model bias, leading to overconfident and unreliable predictions. In this paper, two alternative model bias identification approaches are evaluated in the context of their applicability to digital twins of bridges. A modularized version of Kennedy and O'Hagan's approach and another one based on Orthogonal Gaussian Processes are compared with the classical Bayesian inference framework in a set of representative benchmarks. Additionally, two novel extensions are proposed for such models: the inclusion of noise-aware kernels and the introduction of additional variables not present in the computational model through the bias term. The integration of such approaches in the digital twin corrects the predictions, quantifies their uncertainty, estimates noise from unknown physical sources of error and provides further insight into the system by including additional pre-existing information without modifying the computational model.

[306] 2312.00671

CellMixer: Annotation-free Semantic Cell Segmentation of Heterogeneous Cell Populations

In recent years, several unsupervised cell segmentation methods have been presented, trying to omit the requirement of laborious pixel-level annotations for the training of a cell segmentation model. Most if not all of these methods handle the instance segmentation task by focusing on the detection of different cell instances ignoring their type. While such models prove adequate for certain tasks, like cell counting, other applications require the identification of each cell's type. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method enables the training of a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer can achieve competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's scalability and potential for broader applications in medical imaging, cellular biology, and diagnostics.

[307] 2312.00674

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not strictly one-to-one correspondence, we improve the conventional global instance-level alignment objective by softening the label of negative samples progressively. Secondly, a relaxed bipartite matching based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not increase correspondingly as the parameters of text encoder increase, an extra objective of masked language modeling (MLM) is leveraged for maximizing the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embedding into masked text embedding at different network stages is proposed for enhancing the MLM. Extensive experiments show that without introducing additional computational cost during inference, the proposed method achieves a higher performance on multiple downstream tasks.

[308] 2312.00676

Minimal rank factorizations of polynomial matrices

We investigate rank revealing factorizations of rank deficient $m \times n$ polynomial matrices $P(\lambda)$ into products of three, $P(\lambda) = L(\lambda) E(\lambda) R(\lambda)$, or two, $P(\lambda) = L(\lambda) R(\lambda)$, polynomial matrices. Among all possible factorizations of these types, we focus on those for which $L(\lambda)$ and/or $R(\lambda)$ is a minimal basis, since they allow us to relate easily the degree of $P(\lambda)$ with some degree properties of the factors. We call these factorizations minimal rank factorizations. Motivated by the well-known fact that, generically, rank deficient polynomial matrices over the complex field do not have eigenvalues, we pay particular attention to the properties of the minimal rank factorizations of polynomial matrices without eigenvalues. We carefully analyze the degree properties of generic minimal rank factorizations in the set of complex $m \times n$ polynomial matrices with normal rank at most $r$ and degree at most $d$, and we prove that they are of the form $L(\lambda) R(\lambda)$, where the degrees of the $r$ columns of $L(\lambda)$ differ at most by one, the degrees of the $r$ rows of $R(\lambda)$ differ at most by one, and, for each $i=1, \ldots, r$, the sum of the degrees of the $i$th column of $L(\lambda)$ and of the $i$th row of $R(\lambda)$ is equal to $d$. Finally, we show how these sets of polynomial matrices with generic factorizations are related to the sets of polynomial matrices with generic eigenstructures.

[309] 2312.00678

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains, reshaping the artificial general intelligence landscape. However, the increasing computational and memory demands of these models present substantial challenges, hindering both academic research and practical applications. To address these issues, a wide array of methods, including both algorithmic and hardware solutions, have been developed to enhance the efficiency of LLMs. This survey delivers a comprehensive review of algorithmic advancements aimed at improving LLM efficiency. Unlike other surveys that typically focus on specific areas such as training or model compression, this paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs. Specifically, it covers various topics related to efficiency, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. This paper aims to serve as a valuable resource for researchers and practitioners, laying the groundwork for future innovations in this critical research area. Our repository of relevant references is maintained at url{}.

[310] 2312.00680

Contextualized word senses: from attention to compositionality

The neural architectures of language models are becoming increasingly complex, especially that of Transformers, based on the attention mechanism. Although their application to numerous natural language processing tasks has proven to be very fruitful, they continue to be models with little or no interpretability and explainability. One of the tasks for which they are best suited is the encoding of the contextual sense of words using contextualized embeddings. In this paper we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures for a given semantic task, namely the similarity calculation of word senses in context. The results obtained show that it is possible to be competitive with linguistically motivated models instead of using the black boxes underlying complex neural architectures.

[311] 2312.00681

Applicability of Blockchain Technology in Avionics Systems

Blockchain technology, within its fast widespread and superiority demonstrated by recent studies, can be also used as an informatic tool for solving various aviation problems. Aviation electronics (avionics) systems stand out as the application area of informatics methods in solving aviation problems or providing different capabilities to aircrafts. Avionics systems are electronic systems used in air and space vehicles for many purposes such as surveillance, navigation and communication. In this study, the applicability of blockchain technology as a new approach in the development of avionics systems is discussed, and in this regard, a method inspired by the previously implemented applications in electronic flight systems is proposed to help evaluate the applicability of this technology in new avionics system designs. The potential of blockchain for solving the problems especially in basic services, communication, navigation and flight management systems; the problem structures for which application of this technology would be a reliable solution; and the superiority and inferiority of its use in avionic systems are explained. A guiding paper is proposed for aviation engineers/experts to make a decision on applying blockchain into avionics systems.

[312] 2312.00685

How to Tune Autofocals: A Comparative Study of Advanced Tuning Methods

This study comprehensively evaluates tuning methods for autofocal glasses using virtual reality (VR), addressing the challenge of presbyopia. With aging, presbyopia diminishes the eye's ability to focus on nearby objects, impacting the quality of life for billions. Autofocals, employing focus-tunable lenses, dynamically adjust optical power for each fixation, promising a more natural visual experience than traditional bifocal or multifocal lenses. Our research contrasts the most common tuning methods - manual, gaze-based, and vergence - within a VR setup to mimic real-world scenarios. Utilizing the XTAL VR headset equipped with eye-tracking, the study replicated autofocal scenarios, measuring performance and usability through psychophysical tasks and NASA TLX surveys. Results show varying strengths and weaknesses across methods, with gaze control excelling in precision but not necessarily comfort and manual control providing stability and predictability. The findings guide the selection of tuning methods based on task requirements and user preferences, highlighting a balance between precision and ease of use.

[313] 2312.00686

Classification of cyber attacks on IoT and ubiquitous computing devices

As the Internet of Things (IoT) has become truly ubiquitous, so has the surrounding threat landscape. However, while the security of classical computing systems has significantly matured in the last decades, IoT cybersecurity is still typically low or fully neglected. This paper provides a classification of IoT malware. Major targets and used exploits for attacks are identified and referred to the specific malware. The lack of standard definitions of IoT devices and, therefore, security goals has been identified during this research as a profound barrier in advancing IoT cybersecurity. Furthermore, standardized reporting of IoT malware by trustworthy sources is required in the field. The majority of current IoT attacks continue to be of comparably low effort and level of sophistication and could be mitigated by existing technical measures.

[314] 2312.00688

Towards Transparency in Coreference Resolution: A Quantum-Inspired Approach

Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and neared state-of-the-art SpanBERT. A mixed quantum-classical model yet improved these results with an F1 score increase of around 6%.

[315] 2312.00690

Open-vocabulary object 6D pose estimation

We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, (iii) the object is imaged from two different viewpoints of two different scenes, and (iv) the object was not observed during the training phase. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from two distinct scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 39 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project website:

[316] 2312.00692

VisionaryVR: An Optical Simulation Tool for Evaluating and Optimizing Vision Correction Solutions in Virtual Reality

Developing and evaluating vision science methods require robust and efficient tools for assessing their performance in various real-world scenarios. This study presents a novel virtual reality (VR) simulation tool that simulates real-world optical methods while giving high experimental control to the experiment. The tool incorporates an experiment controller, to smoothly and easily handle multiple conditions, a generic eye-tracking controller, that works with most common VR eye-trackers, a configurable defocus simulator, and a generic VR questionnaire loader to assess participants' behavior in virtual reality. This VR-based simulation tool bridges the gap between theoretical and applied research on new optical methods, corrections, and therapies. It enables vision scientists to increase their research tools with a robust, realistic, and fast research environment.

[317] 2312.00694

Object Detector Differences when using Synthetic and Real Training Data

To train well-performing generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis we want to give insights on how training synthetic data affects each layer and to give a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part. The results also show that no major difference in performance or similarity could be seen between frozen and unfrozen backbone.

[318] 2312.00699

Rethinking Detection Based Table Structure Recognition for Visually Rich Documents

Table Structure Recognition (TSR) aims at transforming unstructured table images into structured formats, such as HTML sequences. One type of popular solution is using detection models to detect components of a table, such as columns and rows, then applying a rule-based post-processing method to convert detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, these studies usually pay more attention to improving the detection performance, which does not necessarily lead to better performance regarding cell-level metrics, such as TEDS. Second, some solutions over-simplify the problem and can miss some critical information. Lastly, even though some studies defined the problem to detect more components to provide as much information as other types of solutions, these studies ignore the fact this problem definition is a multi-label detection because row, projected row header and column header can share identical bounding boxes. Besides, there is often a performance gap between two-stage and transformer-based detection models regarding the structure-only TEDS, even though they have similar performance regarding the COCO metrics. Therefore, we revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects for the success of a two-stage detection model for the TSR task, including the multi-class problem definition, the aspect ratio for anchor box generation, and the feature generation of the backbone network. We applied simple methods to improve these aspects of the Cascade R-CNN model, achieved state-of-the-art performance, and improved the baseline Cascade R-CNN model by 19.32%, 11.56% and 14.77% regarding the structure-only TEDS on SciTSR, FinTabNet, and PubTables1M datasets.

[319] 2312.00700

GIFT: Generative Interpretable Fine-Tuning Transformers

We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models at downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: Where to apply the parameter-efficient fine-tuning (PEFT) to be extremely lightweight yet sufficiently expressive, and How to learn the PEFT to better exploit the knowledge of the pretrained model in a direct way? For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to the prior art that directly introduce new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer which take as input the pretrained parameters of the projection layer to generate its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at

[320] 2312.00702

A Holistic Approach for Trustworthy Distributed Systems with WebAssembly and TEEs

Publish/subscribe systems play a key role in enabling communication between numerous devices in distributed and large-scale architectures. While widely adopted, securing such systems often trades portability for additional integrity and attestation guarantees. Trusted Execution Environments (TEEs) offer a potential solution with enclaves to enhance security and trust. However, application development for TEEs is complex, and many existing solutions are tied to specific TEE architectures, limiting adaptability. Current communication protocols also inadequately manage attestation proofs or expose essential attestation information. This paper introduces a novel approach using WebAssembly to address these issues, a key enabling technology nowadays capturing academia and industry attention. We present the design of a portable and fully attested publish/subscribe middleware system as a holistic approach for trustworthy and distributed communication between various systems. Based on this proposal, we have implemented and evaluated in-depth a fully-fledged publish/subscribe broker running within Intel SGX, compiled in WebAssembly, and built on top of industry-battled frameworks and standards, i.e., MQTT and TLS protocols. Our extended TLS protocol preserves the privacy of attestation information, among other benefits. Our experimental results showcase most overheads, revealing a 1.55x decrease in message throughput when using a trusted broker. We open-source the contributions of this work to the research community to facilitate experimental reproducibility.

[321] 2312.00703

PointBeV: A Sparse Approach to BeV Predictions

Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at

[322] 2312.00707

On the Periodicity of Singular Vectors and the Holomorphic Block-Circulant SVD on the Unit Circumference

We investigate the singular value decomposition of a rectangular matrix that is analytic on the complex unit circumference, which occurs, e.g., with the matrix of transfer functions representing a broadband multiple-input multiple-output channel. Our analysis is based on the Puiseux series expansion of the eigenvalue decomposition of analytic para-Hermitian matrices on the complex unit circumference. We study the case in which the rectangular matrix does not admit a full analytic singular value factorization, either due to partly multiplexed systems or to sign ambiguity. We show how to find an SVD factorization in the ring of Puiseux series where each singular value and the associated singular vectors present the same period and multiplexing structure, and we prove that it is always possible to find an analytic pseudo-circulant factorization, meaning that any arbitrary arrangements of multiplexed systems can be converted into a parallel form. In particular, one can show that the sign ambiguity can be overcome by allowing non-real holomorphic singular values.

[323] 2312.00708

Message-Passing on Hypergraphs: Detectability, Phase Transitions and Higher-Order Information

Hypergraphs are widely adopted tools to examine systems with higher-order interactions. Despite recent advancements in methods for community detection in these systems, we still lack a theoretical analysis of their detectability limits. Here, we derive closed-form bounds for community detection in hypergraphs. Using a Message-Passing formulation, we demonstrate that detectability depends on hypergraphs' structural properties, such as the distribution of hyperedge sizes or their assortativity. Our formulation enables a characterization of the entropy of a hypergraph in relation to that of its clique expansion, showing that community detection is enhanced when hyperedges highly overlap on pairs of nodes. We develop an efficient Message-Passing algorithm to learn communities and model parameters on large systems. Additionally, we devise an exact sampling routine to generate synthetic data from our probabilistic model. With these methods, we numerically investigate the boundaries of community detection in synthetic datasets, and extract communities from real systems. Our results extend the understanding of the limits of community detection in hypergraphs and introduce flexible mathematical tools to study systems with higher-order interactions.

[324] 2312.00710

SpaCE: The Spatial Confounding Environment

Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.

[325] 2312.00713

Nonlinear-manifold reduced order models with domain decomposition

A nonlinear-manifold reduced order model (NM-ROM) is a great way of incorporating underlying physics principles into a neural network-based data-driven approach. We combine NM-ROMs with domain decomposition (DD) for efficient computation. NM-ROMs offer benefits over linear-subspace ROMs (LS-ROMs) but can be costly to train due to parameter scaling with the full-order model (FOM) size. To address this, we employ DD on the FOM, compute subdomain NM-ROMs, and then merge them into a global NM-ROM. This approach has multiple advantages: parallel training of subdomain NM-ROMs, fewer parameters than global NM-ROMs, and adaptability to subdomain-specific FOM features. Each subdomain NM-ROM uses a shallow, sparse autoencoder, enabling hyper-reduction (HR) for improved computational speed. In this paper, we detail an algebraic DD formulation for the FOM, train HR-equipped NM-ROMs for subdomains, and numerically compare them to DD LS-ROMs with HR. Results show a significant accuracy boost, on the order of magnitude, for the proposed DD NM-ROMs over DD LS-ROMs in solving the 2D steady-state Burgers' equation.

[326] 2312.00714

Zipr: A High-Impact, Robust, Open-source, Multi-platform, Static Binary Rewriter

Zipr is a tool for static binary rewriting, first published in 2016. Zipr was engineered to support arbitrary program modification with an emphasis on low overhead, robustness, and flexibility to perform security enhancements and instrumentation. Originally targeted to Linux x86-32 binaries, Zipr now supports 32- and 64-bit binaries for X86, ARM, and MIPS architectures, as well as preliminary support for Windows programs. These features have helped Zipr make a dramatic impact on research. It was first used in the DARPA Cyber Grand Challenge to take second place overall, with the best security score of any participant, Zipr has now been used in a variety of research areas by both the original authors as well as third parties. Zipr has also led to publications in artificial diversity, program instrumentation, program repair, fuzzing, autonomous vehicle security, research computing security, as well as directly contributing to two student dissertations. The open-source repository has accepted accepted patches from several external authors, demonstrating the impact of Zipr beyond the original authors.

[327] 2312.00718

Removing Biases from Molecular Representations via Information Maximization

High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at

[328] 2312.00720

Efficiently Processing Large Relational Joins on GPUs

With the growing interest in Machine Learning (ML), Graphic Processing Units (GPUs) have become key elements of any computing infrastructure. Their widespread deployment in data centers and the cloud raises the question of how to use them beyond ML use cases, with growing interest in employing them in a database context. In this paper, we explore and analyze the implementation of relational joins on GPUs from an end-to-end perspective, meaning that we take result materialization into account. We conduct a comprehensive performance study of state-of-the-art GPU-based join algorithms over diverse synthetic workloads and TPC-H/TPC-DS benchmarks. Without being restricted to the conventional setting where each input relation has only one key and one non-key with all attributes being 4-bytes long, we investigate the effect of various factors (e.g., input sizes, number of non-key columns, skewness, data types, match ratios, and number of joins) on the end-to-end throughput. Furthermore, we propose a technique called "Gather-from-Transformed-Relations" (GFTR) to reduce the long-ignored yet high materialization cost in GPU-based joins. The experimental evaluation shows significant performance improvements from GFTR, with throughput gains of up to 2.3 times over previous work. The insights gained from the performance study not only advance the understanding of GPU-based joins but also introduce a structured approach to selecting the most efficient GPU join algorithm based on the input relation characteristics.

[329] 2312.00724

Approximation Bounds for Model Reduction on Polynomially Mapped Manifolds

For projection-based linear-subspace model order reduction (MOR), it is well known that the Kolmogorov n-width describes the best-possible error for a reduced order model (ROM) of size n. In this paper, we provide approximation bounds for ROMs on polynomially mapped manifolds. In particular, we show that the approximation bounds depend on the polynomial degree p of the mapping function as well as on the linear Kolmogorov n-width for the underlying problem. This results in a Kolmogorov (n, p)-width, which describes a lower bound for the best-possible error for a ROM on polynomially mapped manifolds of polynomial degree p and reduced size n.

[330] 2312.00727

Safe Reinforcement Learning in Tensor Reproducing Kernel Hilbert Space

This paper delves into the problem of safe reinforcement learning (RL) in a partially observable environment with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief in latent states. However, accurately estimating an optimal Bayesian filter in POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely in the face of unknown system dynamics and partial observation environments. We leveraged the Predictive State Representation (PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this context are provable. Furthermore, we derived essential operators from the kernel Bayes' rule, enabling the recursive estimation of future observations using various operators. Under the assumption of \textit{undercompleness}, a polynomial sample complexity is established for the RL algorithm for the infinite size of observation and action spaces, ensuring an $\epsilon-$suboptimal safe policy guarantee.

[331] 2312.00732

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of the 3D scenes. However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during the differentiable rendering by leveraging the 2D mask predictions by SAM, along with introduced 3D spatial consistency regularization. Comparing to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization and scene recomposition. Our code and models will be at

[332] 2312.00737

How do correlations shape the landscape of information?

We explore a few common models on how correlations affect information. The main model considered is the Shannon mutual information $I(S:R_1,\cdots, R_i)$ over distributions with marginals $P_{S,R_i}$ fixed for each $i$, with the analogy in which $S$ is the stimulus and $R_i$'s are neurons. We work out basic models in details, using algebro-geometric tools to write down discriminants that separate distributions with distinct qualitative behaviours in the probability simplex into toric chambers and evaluate the volumes of them algebraically. As a byproduct, we provide direct translation between a decomposition of mutual information inspired by a series expansion and one from partial information decomposition (PID) problems, characterising the synergistic terms of the former. We hope this paper serves for communication between communities especially mathematics and theoretical neuroscience on the topic. KEYWORDS: information theory, algebraic statistics, mathematical neuroscience, partial information decomposition

[333] 2312.00738

SeaLLMs -- Large Language Models for Southeast Asia

Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.

[334] 2312.00739

Adversarial Score Distillation: When score distillation meets GAN

Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale: manifested as over-smoothness or instability at small CFG scales, while over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm, we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization, resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to the image editing task, which achieves competitive results. The project page and code are at

[335] 2312.00740

Computing Networks Enabled Semantic Communications

Semantic communication has shown great potential in boosting the effectiveness and reliability of communications. However, its systems to date are mostly enabled by deep learning, which requires demanding computing resources. This article proposes a framework for the computing networks enabled semantic communication system, aiming to offer sufficient computing resources for semantic processing and transmission. Key techniques including semantic sampling and reconstruction, semantic-channel coding, semantic-aware resource allocation and optimization are introduced based on the cloud-edge-end computing coordination. Two use cases are demonstrated to show advantages of the proposed framework. The article concludes with several future research directions.

[336] 2312.00741

Crystal: Enhancing Blockchain Mining Transparency with Quorum Certificate

Researchers have discovered a series of theoretical attacks against Bitcoin's Nakamoto consensus; the most damaging ones are selfish mining, double-spending, and consistency delay attacks. These attacks have one common cause: block withholding. This paper proposes Crystal, which leverages quorum certificates to resist block withholding misbehavior. Crystal continuously elects committees from miners and requires each block to have a quorum certificate, i.e., a set of signatures issued by members of its committee. Consequently, an attacker has to publish its blocks to obtain quorum certificates, rendering block withholding impossible. To build Crystal, we design a novel two-round committee election in a Sybil-resistant, unpredictable and non-interactive way, and a reward mechanism to incentivize miners to follow the protocol. Our analysis and evaluations show that Crystal can significantly mitigate selfish mining and double-spending attacks. For example, in Bitcoin, an attacker with 30% of the total computation power will succeed in double-spending attacks with a probability of 15.6% to break the 6-confirmation rule; however, in Crystal, the success probability for the same attacker falls to 0.62%. We provide formal end-to-end safety proofs for Crystal, ensuring no unknown attacks will be introduced. To the best of our knowledge, Crystal is the first protocol that prevents selfish mining and double-spending attacks while providing safety proof.

[337] 2312.00746

Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games

In this study, we explore the application of Large Language Models (LLMs) in "Jubensha" (Chinese murder mystery role-playing games), a novel area in AI-driven gaming. We introduce the first Chinese dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in the game, enhancing the dynamics of Jubensha gameplay. To evaluate these AI agents, we developed specialized methods targeting their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in in-context learning to improve the agents' performance in critical aspects like information gathering, murderer detection, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a fresh perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents to researchers in the field.

[338] 2312.00747

Reduction from sparse LPN to LPN, Dual Attack 3.0

The security of code-based cryptography relies primarily on the hardness of decoding generic linear codes. Until very recently, all the best algorithms for solving the decoding problem were information set decoders (ISD). However, recently a new algorithm called RLPN-decoding which relies on a completely different approach was introduced and it has been shown that RLPN outperforms significantly ISD decoders for a rather large range of rates. This RLPN decoder relies on two ingredients, first reducing decoding to some underlying LPN problem, and then computing efficiently many parity-checks of small weight when restricted to some positions. We revisit RLPN-decoding by noticing that, in this algorithm, decoding is in fact reduced to a sparse-LPN problem, namely with a secret whose Hamming weight is small. Our new approach consists this time in making an additional reduction from sparse-LPN to plain-LPN with a coding approach inspired by coded-BKW. It outperforms significantly the ISD's and RLPN for code rates smaller than 0.42. This algorithm can be viewed as the code-based cryptography cousin of recent dual attacks in lattice-based cryptography. We depart completely from the traditional analysis of this kind of algorithm which uses a certain number of independence assumptions that have been strongly questioned recently in the latter domain. We give instead a formula for the LPNs noise relying on duality which allows to analyze the behavior of the algorithm by relying only on the analysis of a certain weight distribution. By using only a minimal assumption whose validity has been verified experimentally we are able to justify the correctness of our algorithm. This key tool, namely the duality formula, can be readily adapted to the lattice setting and is shown to give a simple explanation for some phenomena observed on dual attacks in lattices in [DP23].

[339] 2312.00751

Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.

[340] 2312.00752

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

[341] 2312.00760

${L}^{\infty}$-norm computation for linear time-invariant systems depending on parameter

This paper focuses on representing the $L^{\infty}$-norm of finite-dimensional linear time-invariant systems with parameter-dependent coefficients. Previous studies tackled the problem in a non-parametric scenario by simplifying it to finding the maximum $y$-projection of real solutions $(x, y)$ of a system of the form $\Sigma=\{P=0, \, \partial P/\partial x=0\}$, where $P \in \Z[x, y]$. To solve this problem, standard computer algebra methods were employed and analyzed \cite{bouzidi2021computation}. In this paper, we extend our approach to address the parametric case. We aim to represent the "maximal" $y$-projection of real solutions of $\Sigma$ as a function of the given parameters. %a set of parameters $\alpha$. To accomplish this, we utilize cylindrical algebraic decomposition. This method allows us to determine the desired value as a function of the parameters within specific regions of parameter space.

[342] 2312.00761

Deep Unlearning: Fast and Efficient Training-free Approach to Controlled Forgetting

Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks showing 7.8% improvement on average across a variety of image classification datasets and network architectures, as compared to other baselines while being $\sim$6x more computationally efficient.

[343] 2312.00763

Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses

Large language model (LLM) powered chatbots are primarily text-based today, and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the interaction is textual, users have little scaffolding in the way of structure, informational "scent", or ability to specify high-level preferences or goals. We introduce ExploreLLM that allows users to structure thoughts, help explore different options, navigate through the choices and recommendations, and to more easily steer models to generate more personalized responses. We conduct a user study and show that users find it helpful to use ExploreLLM for exploratory or planning tasks, because it provides a useful schema-like structure to the task, and guides users in planning. The study also suggests that users can more easily personalize responses with high-level preferences with ExploreLLM. Together, ExploreLLM points to a future where users interact with LLMs beyond the form of chatbots, and instead designed to support complex user tasks with a tighter integration between natural language and graphical user interfaces.

[344] 2312.00765

Explaining Knock-on Effects of Bias Mitigation

In machine learning systems, bias mitigation approaches aim to make outcomes fairer across privileged and unprivileged groups. Bias mitigation methods work in different ways and have known "waterfall" effects, e.g., mitigating bias at one place may manifest bias elsewhere. In this paper, we aim to characterise impacted cohorts when mitigation interventions are applied. To do so, we treat intervention effects as a classification task and learn an explainable meta-classifier to identify cohorts that have altered outcomes. We examine a range of bias mitigation strategies that work at various stages of the model life cycle. We empirically demonstrate that our meta-classifier is able to uncover impacted cohorts. Further, we show that all tested mitigation strategies negatively impact a non-trivial fraction of cases, i.e., people who receive unfavourable outcomes solely on account of mitigation efforts. This is despite improvement in fairness metrics. We use these results as a basis to argue for more careful audits of static mitigation interventions that go beyond aggregate metrics.

[345] 2312.00766

Automated Material Properties Extraction For Enhanced Beauty Product Discovery and Makeup Virtual Try-on

The multitude of makeup products available can make it challenging to find the ideal match for desired attributes. An intelligent approach for product discovery is required to enhance the makeup shopping experience to make it more convenient and satisfying. However, enabling accurate and efficient product discovery requires extracting detailed attributes like color and finish type. Our work introduces an automated pipeline that utilizes multiple customized machine learning models to extract essential material attributes from makeup product images. Our pipeline is versatile and capable of handling various makeup products. To showcase the efficacy of our pipeline, we conduct extensive experiments on eyeshadow products (both single and multi-shade ones), a challenging makeup product known for its diverse range of shapes, colors, and finish types. Furthermore, we demonstrate the applicability of our approach by successfully extending it to other makeup categories like lipstick and foundation, showcasing its adaptability and effectiveness across different beauty products. Additionally, we conduct ablation experiments to demonstrate the superiority of our machine learning pipeline over human labeling methods in terms of reliability. Our proposed method showcases its effectiveness in cross-category product discovery, specifically in recommending makeup products that perfectly match a specified outfit. Lastly, we also demonstrate the application of these material attributes in enabling virtual-try-on experiences which makes makeup shopping experience significantly more engaging.

[346] 2312.00774

Context Retrieval via Normalized Contextual Latent Interaction for Conversational Agent

Conversational agents leveraging AI, particularly deep learning, are emerging in both academic research and real-world applications. However, these applications still face challenges, including disrespecting knowledge and facts, not personalizing to user preferences, and enormous demand for computational resources during training and inference. Recent research efforts have been focused on addressing these challenges from various aspects, including supplementing various types of auxiliary information to the conversational agents. However, existing methods are still not able to effectively and efficiently exploit relevant information from these auxiliary supplements to further unleash the power of the conversational agents and the language models they use. In this paper, we present a novel method, PK-NCLI, that is able to accurately and efficiently identify relevant auxiliary information to improve the quality of conversational responses by learning the relevance among persona, chat history, and knowledge background through low-level normalized contextual latent interaction. Our experimental results indicate that PK-NCLI outperforms the state-of-the-art method, PK-FoCus, by 47.80%/30.61%/24.14% in terms of perplexity, knowledge grounding, and training efficiency, respectively, and maintained the same level of persona grounding performance. We also provide a detailed analysis of how different factors, including language model choices and trade-offs on training weights, would affect the performance of PK-NCLI.

[347] 2312.00775

Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robots embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following humans plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation.

[348] 2312.00777

VideoBooth: Diffusion-based Video Generation with Image Prompts

Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.

[349] 2312.00778

MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360{\deg} surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360{\deg} surface reconstruction of a deformable object from a monocular RGB-D video.

[350] 2312.00781

Modeling False Data Injection Attacks on Integrated Electricity-Gas Systems

This work studies the modeling of false data injection attacks (FDIAs) on IEGSs. First, we introduce a static state estimation model and bad data detection method for IEGSs. Then, we develop FDIAs on IEGSs with complete network topology and parameter information. Next, we develop FDIAs on IEGSs when intruders have only local network topology and parameter information of an IEGS. Lastly, we explore FDIAs on IEGSs when intruders have only local network topology information of an IEGS.

[351] 2312.00784

Making Large Multimodal Models Understand Arbitrary Visual Prompts

While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.

[352] 2312.00785

Sequential Modeling Enables Scalable Learning for Large Vision Models

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.

[353] 2312.00786

Dense Optical Tracking: Connecting the Dots

Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available in the project webpage: .

[354] 1910.10512

A Stochastic Block Model Approach for the Analysis of Multilevel Networks: an Application to the Sociology of Organizations

A multilevel network is defined as the junction of two interaction networks, one level representing the interactions between individuals and the other the interactions between organizations. The levels are linked by an affiliation relationship, each individual belonging to a unique organization. A new Stochastic Block Model is proposed as a unified probalistic framework tailored for multilevel networks. This model contains latent blocks accounting for heterogeneity in the patterns of connection within each level and introducing dependencies between the levels. The sought connection patterns are not specified a priori which makes this approach flexible. Variational methods are used for the model inference and an Integrated Classified Likelihood criterion is developed for choosing the number of blocks and also for deciding whether the two levels are dependent or not. A comprehensive simulation study exhibits the benefit of considering this approach, illustrates the robustness of the clustering and highlights the reliability of the criterion used for model selection. This approach is applied on a sociological dataset collected during a television program trade fair, the inter-organizational level being the economic network between companies and the inter-individual level being the informal network between their representatives. It brings a synthetic representation of the two networks unraveling their intertwined structure and confirms the coopetition at stake.

[355] 2206.00560

Learning common structures in a collection of networks. An application to food webs

Let a collection of networks represent interactions within several (social or ecological) systems. We pursue two objectives: identifying similarities in the topological structures that are held in common between the networks and clustering the collection into sub-collections of structurally homogeneous networks. We tackle these two questions with a probabilistic model based approach. We propose an extension of the Stochastic Block Model (SBM) adapted to the joint modeling of a collection of networks. The networks in the collection are assumed to be independent realizations of SBMs. The common connectivity structure is imposed through the equality of some parameters. The model parameters are estimated with a variational Expectation-Maximization (EM) algorithm. We derive an ad-hoc penalized likelihood criterion to select the number of blocks and to assess the adequacy of the consensus found between the structures of the different networks. This same criterion can also be used to cluster networks on the basis of their connectivity structure. It thus provides a partition of the collection into subsets of structurally homogeneous networks. The relevance of our proposition is assessed on two collections of ecological networks. First, an application to three stream food webs reveals the homogeneity of their structures and the correspondence between groups of species in different ecosystems playing equivalent ecological roles. Moreover, the joint analysis allows a finer analysis of the structure of smaller networks. Second, we cluster 67 food webs according to their connectivity structures and demonstrate that five mesoscale structures are sufficient to describe this collection.

[356] 2312.00001

On random pairwise comparisons matrices and their geometry

We describe a framework for random pairwise comparisons matrices, inspired by selected constructions releted to the so called inconsistency reduction of pairwise comparisons (PC) matrices. In to build up structures on random pairwise comparisons matrices, the set up for (deterministic) PC matrices for non-reciprocal PC matrices is completed. The extension of basic concepts such as inconsistency indexes and geometric mean method are extended to random pairwise comparisons matrices and completed by new notions which seem useful to us. Two procedures for (random) inconsistency reduction are sketched, based on well-known existing objects, and a fiber bundle-like decomposition of random pairwise comparisons is proposed.

[357] 2312.00015

On the $\ell_0$ Isoperimetric Coefficient of Measurable Sets

In this paper we prove that the $\ell_0$ isoperimetric coefficient for any axis-aligned cubes, $\psi_{\mathcal{C}}$, is $\Theta(n^{-1/2})$ and that the isoperimetric coefficient for any measurable body $K$, $\psi_K$, is of order $O(n^{-1/2})$. As a corollary we deduce that axis-aligned cubes essentially "maximize" the $\ell_0$ isoperimetric coefficient: There exists a positive constant $q > 0$ such that $\psi_K \leq q \cdot \psi_{\mathcal{C}}$, whenever $\mathcal{C}$ is an axis-aligned cube and $K$ is any measurable set. Lastly, we give immediate applications of our results to the mixing time of Coordinate-Hit-and-Run for sampling points uniformly from convex bodies.

[358] 2312.00037

Photonic Neural Networks and Optics-informed Deep Learning Fundamentals

The recent explosive compute growth, mainly fueled by the boost of AI and DNNs, is currently instigating the demand for a novel computing paradigm that can overcome the insurmountable barriers imposed by conventional electronic computing architectures. PNNs implemented on silicon integration platforms stand out as a promising candidate to endow NN hardware, offering the potential for energy efficient and ultra-fast computations through the utilization of the unique primitives of photonics i.e. energy efficiency, THz bandwidth and low-latency. Thus far, several demonstrations have revealed the huge potential of PNNs in performing both linear and non-linear NN operations at unparalleled speed and energy consumption metrics. Transforming this potential into a tangible reality for DL applications requires, however, a deep understanding of the basic PNN principles, requirements and challenges across all constituent architectural, technological and training aspects. In this tutorial, we, initially, review the principles of DNNs along with their fundamental building blocks, analyzing also the key mathematical operations needed for their computation in a photonic hardware. Then, we investigate, through an intuitive mathematical analysis, the interdependence of bit precision and energy efficiency in analog photonic circuitry, discussing the opportunities and challenges of PNNs. Followingly, a performance overview of PNN architectures, weight technologies and activation functions is presented, summarizing their impact in speed, scalability and power consumption. Finally, we provide an holistic overview of the optics-informed NN training framework that incorporates the physical properties of photonic building blocks into the training process in order to improve the NN classification accuracy and effectively elevate neuromorphic photonic hardware into high-performance DL computational settings.

[359] 2312.00038

A Physics-Constrained NeuralODE Approach for Robust Learning of Stiff Chemical Kinetics

The high computational cost associated with solving for detailed chemistry poses a significant challenge for predictive computational fluid dynamics (CFD) simulations of turbulent reacting flows. These models often require solving a system of coupled stiff ordinary differential equations (ODEs). While deep learning techniques have been experimented with to develop faster surrogate models, they often fail to integrate reliably with CFD solvers. This instability arises because deep learning methods optimize for training error without ensuring compatibility with ODE solvers, leading to accumulation of errors over time. Recently, NeuralODE-based techniques have offered a promising solution by effectively modeling chemical kinetics. In this study, we extend the NeuralODE framework for stiff chemical kinetics by incorporating mass conservation constraints directly into the loss function during training. This ensures that the total mass and the elemental mass are conserved, a critical requirement for reliable downstream integration with CFD solvers. Our results demonstrate that this enhancement not only improves the physical consistency with respect to mass conservation criteria but also ensures better robustness and makes the training process more computationally efficient.

[360] 2312.00042

DeepTreeGANv2: Iterative Pooling of Point Clouds

In High Energy Physics, detailed and time-consuming simulations are used for particle interactions with detectors. To bypass these simulations with a generative model, the generation of large point clouds in a short time is required, while the complex dependencies between the particles must be correctly modelled. Particle showers are inherently tree-based processes, as each particle is produced by the decay or detector interaction of a particle of the previous generation. In this work, we present a significant extension to DeepTreeGAN, featuring a critic, that is able to aggregate such point clouds iteratively in a tree-based manner. We show that this model can reproduce complex distributions, and we evaluate its performance on the public JetNet 150 dataset.

[361] 2312.00054

Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning?

Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an \emph{expert policy} -- plays a critical role in developing intelligent systems, such as those that understand and imitate human behavior. While widely used in applications, theoretical understandings of IRL admit unique challenges and remain less developed compared with standard RL theory. For example, it remains open how to do IRL efficiently in standard \emph{offline} settings with pre-collected data, where states are obtained from a \emph{behavior policy} (which could be the expert policy itself), and actions are sampled from the expert policy. This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime. We first design a new IRL algorithm for the offline setting, Reward Learning with Pessimism (RLP), and show that it achieves polynomial sample complexity in terms of the size of the MDP, a concentrability coefficient between the behavior policy and the expert policy, and the desired accuracy. Building on RLP, we further design an algorithm Reward Learning with Exploration (RLE), which operates in a natural online setting where the learner can both actively explore the environment and query the expert policy, and obtain a stronger notion of IRL guarantee from polynomial samples. We establish sample complexity lower bounds for both settings showing that RLP and RLE are nearly optimal. Finally, as an application, we show that the learned reward functions can \emph{transfer} to another target MDP with suitable guarantees when the target MDP satisfies certain similarity assumptions with the original (source) MDP.

[362] 2312.00067

Predicting breast cancer with AI for individual risk-adjusted MRI screening and early detection

Women with an increased life-time risk of breast cancer undergo supplemental annual screening MRI. We propose to predict the risk of developing breast cancer within one year based on the current MRI, with the objective of reducing screening burden and facilitating early detection. An AI algorithm was developed on 53,858 breasts from 12,694 patients who underwent screening or diagnostic MRI and accrued over 12 years, with 2,331 confirmed cancers. A first U-Net was trained to segment lesions and identify regions of concern. A second convolutional network was trained to detect malignant cancer using features extracted by the U-Net. This network was then fine-tuned to estimate the risk of developing cancer within a year in cases that radiologists considered normal or likely benign. Risk predictions from this AI were evaluated with a retrospective analysis of 9,183 breasts from a high-risk screening cohort, which were not used for training. Statistical analysis focused on the tradeoff between number of omitted exams versus negative predictive value, and number of potential early detections versus positive predictive value. The AI algorithm identified regions of concern that coincided with future tumors in 52% of screen-detected cancers. Upon directed review, a radiologist found that 71.3% of cancers had a visible correlate on the MRI prior to diagnosis, 65% of these correlates were identified by the AI model. Reevaluating these regions in 10% of all cases with higher AI-predicted risk could have resulted in up to 33% early detections by a radiologist. Additionally, screening burden could have been reduced in 16% of lower-risk cases by recommending a later follow-up without compromising current interval cancer rate. With increasing datasets and improving image quality we expect this new AI-aided, adaptive screening to meaningfully reduce screening burden and improve early detection.

[363] 2312.00070

Fully lifted random duality theory

We study a generic class of \emph{random optimization problems} (rops) and their typical behavior. The foundational aspects of the random duality theory (RDT), associated with rops, were discussed in \cite{StojnicRegRndDlt10}, where it was shown that one can often infer rops' behavior even without actually solving them. Moreover, \cite{StojnicRegRndDlt10} uncovered that various quantities relevant to rops (including, for example, their typical objective values) can be determined (in a large dimensional context) even completely analytically. The key observation was that the \emph{strong deterministic duality} implies the, so-called, \emph{strong random duality} and therefore the full exactness of the analytical RDT characterizations. Here, we attack precisely those scenarios where the strong deterministic duality is not necessarily present and connect them to the recent progress made in studying bilinearly indexed (bli) random processes in \cite{Stojnicnflgscompyx23,Stojnicsflgscompyx23}. In particular, utilizing a fully lifted (fl) interpolating comparison mechanism introduced in \cite{Stojnicnflgscompyx23}, we establish corresponding \emph{fully lifted} RDT (fl RDT). We then rely on a stationarized fl interpolation realization introduced in \cite{Stojnicsflgscompyx23} to obtain complete \emph{statitionarized} fl RDT (sfl RDT). A few well known problems are then discussed as illustrations of a wide range of practical applications implied by the generality of the considered rops.

[364] 2312.00071

Studying Hopfield models via fully lifted random duality theory

Relying on a recent progress made in studying bilinearly indexed (bli) random processes in \cite{Stojnicnflgscompyx23,Stojnicsflgscompyx23}, the main foundational principles of fully lifted random duality theory (fl RDT) were established in \cite{Stojnicflrdt23}. We here study famous Hopfield models and show that their statistical behavior can be characterized via the fl RDT. Due to a nestedly lifted nature, the resulting characterizations and, therefore, the whole analytical machinery that produces them, become fully operational only if one can successfully conduct underlying numerical evaluations. After conducting such evaluations for both positive and negative Hopfield models, we observe a remarkably fast convergence of the fl RDT mechanism. Namely, for the so-called square case, the fourth decimal precision is achieved already on the third (second non-trivial) level of lifting (3-sfl RDT) for the positive and on the fourth (third non-trivial) level of lifting (4-sfl RDT) for the corresponding negative model. In particular, we obtain the scaled ground state free energy $\approx 1.7788$ for the positive and $\approx 0.3279$ for the negative model.

[365] 2312.00073

Binary perceptrons capacity via fully lifted random duality theory

We study the statistical capacity of the classical binary perceptrons with general thresholds $\kappa$. After recognizing the connection between the capacity and the bilinearly indexed (bli) random processes, we utilize a recent progress in studying such processes to characterize the capacity. In particular, we rely on \emph{fully lifted} random duality theory (fl RDT) established in \cite{Stojnicflrdt23} to create a general framework for studying the perceptrons' capacities. Successful underlying numerical evaluations are required for the framework (and ultimately the entire fl RDT machinery) to become fully practically operational. We present results obtained in that directions and uncover that the capacity characterizations are achieved on the second (first non-trivial) level of \emph{stationarized} full lifting. The obtained results \emph{exactly} match the replica symmetry breaking predictions obtained through statistical physics replica methods in \cite{KraMez89}. Most notably, for the famous zero-threshold scenario, $\kappa=0$, we uncover the well known $\alpha\approx0.8330786$ scaled capacity.

[366] 2312.00080

PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design

Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput $\textit{de novo}$ designed proteins, and mega-scale experimental mutagenesis experiments, and in doing so, present the $\textbf{PDB-Struct}$ benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at

[367] 2312.00082

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

Functional Magnetic Resonance Imaging (fMRI) data is a kind of widely used four-dimensional biomedical data, demanding effective compression but presenting unique challenges for compression due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and using proper initialization together with nonlinear fusion to describe the inter-region similarity. The above scheme properly incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work in this paper paves the way for sharing massive fMRI data at low bandwidth and high fidelity.

[368] 2312.00086

Star colouring and locally constrained graph homomorphisms

Dvo\v{r}\'ak, Mohar and \v{S}\'amal (J. Graph Theory, 2013) proved that for every 3-regular graph $G$, the line graph of $G$ is 4-star colourable if and only if $G$ admits a locally bijective homomorphism to the cube $Q_3$. We generalise this result as follows: for $p\geq 2$, a $K_{1,p+1}$-free $2p$-regular graph $G$ admits a $(p + 2)$-star colouring if and only if $G$ admits a locally bijective homomorphism to a fixed $2p$-regular graph named $G_{2p}$. We also prove the following: (i) for $p\geq 2$, a $2p$-regular graph $G$ admits a $(p + 2)$-star colouring if and only if $G$ has an orientation $\vec{G}$ that admits an out-neighbourhood bijective homomorphism to a fixed orientation $\vec{G_{2p}}$ of $G2p$; (ii) for every 3-regular graph $G$, the line graph of $G$ is 4-star colourable if and only if $G$ is bipartite and distance-two 4-colourable; and (iii) it is NP-complete to check whether a planar 4-regular 3-connected graph is 4-star colourable.

[369] 2312.00123

Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information

We introduce the first generative model trained on the JetClass dataset. Our model generates jets at the constituent level, and it is a permutation-equivariant continuous normalizing flow (CNF) trained with the flow matching technique. It is conditioned on the jet type, so that a single model can be used to generate the ten different jet types of JetClass. For the first time, we also introduce a generative model that goes beyond the kinematic features of jet constituents. The JetClass dataset includes more features, such as particle-ID and track impact parameter, and we demonstrate that our CNF can accurately model all of these additional features as well. Our generative model for JetClass expands on the versatility of existing jet generation techniques, enhancing their potential utility in high-energy physics research, and offering a more comprehensive understanding of the generated jets.

[370] 2312.00125

Scalable Bayesian uncertainty quantification with data-driven priors for radio interferometric imaging

Next-generation radio interferometers like the Square Kilometer Array have the potential to unlock scientific discoveries thanks to their unprecedented angular resolution and sensitivity. One key to unlocking their potential resides in handling the deluge and complexity of incoming data. This challenge requires building radio interferometric imaging methods that can cope with the massive data sizes and provide high-quality image reconstructions with uncertainty quantification (UQ). This work proposes a method coined QuantifAI to address UQ in radio-interferometric imaging with data-driven (learned) priors for high-dimensional settings. Our model, rooted in the Bayesian framework, uses a physically motivated model for the likelihood. The model exploits a data-driven convex prior, which can encode complex information learned implicitly from simulations and guarantee the log-concavity of the posterior. We leverage probability concentration phenomena of high-dimensional log-concave posteriors that let us obtain information about the posterior, avoiding MCMC sampling techniques. We rely on convex optimisation methods to compute the MAP estimation, which is known to be faster and better scale with dimension than MCMC sampling strategies. Our method allows us to compute local credible intervals, i.e., Bayesian error bars, and perform hypothesis testing of structure on the reconstructed image. In addition, we propose a novel blazing-fast method to compute pixel-wise uncertainties at different scales. We demonstrate our method by reconstructing radio-interferometric images in a simulated setting and carrying out fast and scalable UQ, which we validate with MCMC sampling. Our method shows an improved image quality and more meaningful uncertainties than the benchmark method based on a sparsity-promoting prior. QuantifAI's source code:

[371] 2312.00128

Low latency optical-based mode tracking with machine learning deployed on FPGAs on a tokamak

Active feedback control in magnetic confinement fusion devices is desirable to mitigate plasma instabilities and enable robust operation. Optical high-speed cameras provide a powerful, non-invasive diagnostic and can be suitable for these applications. In this study, we process fast camera data, at rates exceeding 100kfps, on $\textit{in situ}$ Field Programmable Gate Array (FPGA) hardware to track magnetohydrodynamic (MHD) mode evolution and generate control signals in real-time. Our system utilizes a convolutional neural network (CNN) model which predicts the $n$=1 MHD mode amplitude and phase using camera images with better accuracy than other tested non-deep-learning-based methods. By implementing this model directly within the standard FPGA readout hardware of the high-speed camera diagnostic, our mode tracking system achieves a total trigger-to-output latency of 17.6$\mu$s and a throughput of up to 120kfps. This study at the High Beta Tokamak-Extended Pulse (HBT-EP) experiment demonstrates an FPGA-based high-speed camera data acquisition and processing system, enabling application in real-time machine-learning-based tokamak diagnostic and control as well as potential applications in other scientific domains.

[372] 2312.00134

Markovian Embeddings of Non-Markovian Quantum Systems: Coupled Stochastic and Quantum Master Equations for Non-Markovian Quantum Systems

Quantum Markov models are employed ubiquitously in quantum physics and in quantum information theory due to their relative simplicity and analytical tractability. In particular, these models are known to give accurate approximations for a wide range of quantum optical and mesoscopic systems. However, in general, the validity of the Markov approximation entails assumptions regarding properties of the system of interest and its environment, which may not be satisfied or accurate in arbitrary physical systems. Therefore, developing useful modelling tools for general non-Markovian quantum systems for which the Markov approximation is inappropriate or deficient is an undertaking of significant importance. This work considers non-Markovian principal quantum systems that can be embedded in a larger Markovian quantum system with one or more compound baths consisting of an auxiliary quantum system and a quantum white noise field, and derives a set of coupled stochastic and quantum master equations for embedded non-Markovian quantum systems. The case of a purely Hamiltonian coupling between the principal and auxiliary systems as a closed system without coupling to white noises is included as a special case. The results are expected to be of interest for (open-loop and feedback) control of continuous-time non-Markovian systems and studying reduced models for numerical simulation of such systems. They may also shed more light on the general structure of continuous-time non-Markovian quantum systems.

[373] 2312.00137

The Multiverse of Dynamic Mode Decomposition Algorithms

Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique used to decompose complex, nonlinear systems into a set of modes, revealing underlying patterns and dynamics through spectral analysis. This review presents a comprehensive and pedagogical examination of DMD, emphasizing the role of Koopman operators in transforming complex nonlinear dynamics into a linear framework. A distinctive feature of this review is its focus on the relationship between DMD and the spectral properties of Koopman operators, with particular emphasis on the theory and practice of DMD algorithms for spectral computations. We explore the diverse "multiverse" of DMD methods, categorized into three main areas: linear regression-based methods, Galerkin approximations, and structure-preserving techniques. Each category is studied for its unique contributions and challenges, providing a detailed overview of significant algorithms and their applications as outlined in Table 1. We include a MATLAB package with examples and applications to enhance the practical understanding of these methods. This review serves as both a practical guide and a theoretical reference for various DMD methods, accessible to both experts and newcomers, and enabling readers to delve into their areas of interest in the expansive field of DMD.

[374] 2312.00140

The Stochastic Dynamic Post-Disaster Inventory Allocation Problem with Trucks and UAVs

Humanitarian logistics operations face increasing difficulties due to rising demands for aid in disaster areas. This paper investigates the dynamic allocation of scarce relief supplies across multiple affected districts over time. It introduces a novel stochastic dynamic post-disaster inventory allocation problem with trucks and unmanned aerial vehicles delivering relief goods under uncertain supply and demand. The relevance of this humanitarian logistics problem lies in the importance of considering the inter-temporal social impact of deliveries. We achieve this by incorporating deprivation costs when allocating scarce supplies. Furthermore, we consider the inherent uncertainties of disaster areas and the potential use of cargo UAVs to enhance operational efficiency. This study proposes two anticipatory solution methods based on approximate dynamic programming, specifically decomposed linear value function approximation and neural network value function approximation to effectively manage uncertainties in the dynamic allocation process. We compare DL-VFA and NN-VFA with various state-of-the-art methods (exact re-optimization, PPO) and results show a 6-8% improvement compared to the best benchmarks. NN-VFA provides the best performance and captures nonlinearities in the problem, whereas DL-VFA shows excellent scalability against a minor performance loss. The experiments reveal that consideration of deprivation costs results in improved allocation of scarce supplies both across affected districts and over time. Finally, results show that deploying UAVs can play a crucial role in the allocation of relief goods, especially in the first stages after a disaster. The use of UAVs reduces transportation- and deprivation costs together by 16-20% and reduces maximum deprivation times by 19-40%, while maintaining similar levels of demand coverage, showcasing efficient and effective operations.

[375] 2312.00174

Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices like mobile phones and laptops. The image-to-speech (ITS) systems can assist them in mitigating this problem, but their huge model size makes it extremely hard to be deployed on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient endto-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduced a vision transformers-based image encoder and utilized knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to a very minimal drop in performance and can speed up the inference time by 22%.

[376] 2312.00186

Planning Reliability Assurance Tests for Autonomous Vehicles

Artificial intelligence (AI) technology has become increasingly prevalent and transforms our everyday life. One important application of AI technology is the development of autonomous vehicles (AV). However, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan for an assurance test, one needs to determine how many AVs need to be tested for how many miles and the standard for passing the test. Existing research has made great efforts in developing reliability demonstration tests in the other fields of applications for product development and assessment. However, statistical methods have not been utilized in AV test planning. This paper aims to fill in this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. We explore the relationship between multiple criteria of interest in the context of planning AV reliability assurance tests. Specifically, we develop two test planning strategies based on homogeneous and non-homogeneous Poisson processes while balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. The disengagement events data from the California Department of Motor Vehicles AV testing program is used to illustrate the proposed assurance test planning methods.

[377] 2312.00191

Enhancing Ligand Pose Sampling for Molecular Docking

Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions-and to perform molecular docking-one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe two improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. To benefit the research community, we provide these cross-docking datasets and an open-source Python implementation of GLOW and IVES at .

[378] 2312.00223

Convolutional Neural Networks for Segmentation of Malignant Pleural Mesothelioma: Analysis of Probability Map Thresholds (CALGB 30901, Alliance)

Malignant pleural mesothelioma (MPM) is the most common form of mesothelioma. To assess response to treatment, tumor measurements are acquired and evaluated based on a patient's longitudinal computed tomography (CT) scans. Tumor volume, however, is the more accurate metric for assessing tumor burden and response. Automated segmentation methods using deep learning can be employed to acquire volume, which otherwise is a tedious task performed manually. The deep learning-based tumor volume and contours can then be compared with a standard reference to assess the robustness of the automated segmentations. The purpose of this study was to evaluate the impact of probability map threshold on MPM tumor delineations generated using a convolutional neural network (CNN). Eighty-eight CT scans from 21 MPM patients were segmented by a VGG16/U-Net CNN. A radiologist modified the contours generated at a 0.5 probability threshold. Percent difference of tumor volume and overlap using the Dice Similarity Coefficient (DSC) were compared between the standard reference provided by the radiologist and CNN outputs for thresholds ranging from 0.001 to 0.9. CNN annotations consistently yielded smaller tumor volumes than radiologist contours. Reducing the probability threshold from 0.5 to 0.1 decreased the absolute percent volume difference, on average, from 43.96% to 24.18%. Median and mean DSC ranged from 0.58 to 0.60, with a peak at a threshold of 0.5; no distinct threshold was found for percent volume difference. No single output threshold in the CNN probability maps was optimal for both tumor volume and DSC. This work underscores the need to assess tumor volume and spatial overlap when evaluating CNN performance. While automated segmentations may yield comparable tumor volumes to that of the reference standard, the spatial region delineated by the CNN at a specific threshold is equally important.

[379] 2312.00258

Precipitation Nowcasting With Spatial And Temporal Transfer Learning Using Swin-UNETR

Climate change has led to an increase in frequency of extreme weather events. Early warning systems can prevent disasters and loss of life. Managing such events remain a challenge for both public and private institutions. Precipitation nowcasting can help relevant institutions to better prepare for such events. Numerical weather prediction (NWP) has traditionally been used to make physics based forecasting, and recently deep learning based approaches have been used to reduce turn-around time for nowcasting. In this work, recently proposed Swin-UNETR (Swin UNEt TRansformer) is used for precipitation nowcasting for ten different regions of Europe. Swin-UNETR utilizes a U-shaped network within which a swin transformer-based encoder extracts multi-scale features from multiple input channels of satellite image, while CNN-based decoder makes the prediction. Trained model is capable of nowcasting not only for the regions for which data is available, but can also be used for new regions for which data is not available.

[380] 2312.00264

Skipper: Improving the Reach and Fidelity of Quantum Annealers by Skipping Long Chains

Quantum Annealers (QAs) operate as single-instruction machines, lacking a SWAP operation to overcome limited qubit connectivity. Consequently, multiple physical qubits are chained to form a program qubit with higher connectivity, resulting in a drastically diminished effective QA capacity by up to 33x. We observe that in QAs: (a) chain lengths exhibit a power-law distribution, a few dominant chains holding substantially more qubits than others; and (b) about 25% of physical qubits remain unused, getting isolated between these chains. We propose Skipper, a software technique that enhances the capacity and fidelity of QAs by skipping dominant chains and substituting their program qubit with two readout results. Using a 5761-qubit QA, we demonstrate that Skipper can tackle up to 59% (Avg. 28%) larger problems when eleven chains are skipped. Additionally, Skipper can improve QA fidelity by up to 44% (Avg. 33%) when cutting five chains (32 runs). Users can specify up to eleven chain cuts in Skipper, necessitating about 2,000 distinct quantum executable runs. To mitigate this, we introduce Skipper-G, a greedy scheme that skips sub-problems less likely to hold the global optimum, executing a maximum of 23 quantum executables with eleven chain trims. Skipper-G can boost QA fidelity by up to 41% (Avg. 29%) when cutting five chains (11 runs).

[381] 2312.00286

Complexity-theoretic foundations of BosonSampling with a linear number of modes

BosonSampling is the leading candidate for demonstrating quantum computational advantage in photonic systems. While we have recently seen many impressive experimental demonstrations, there is still a formidable distance between the complexity-theoretic hardness arguments and current experiments. One of the largest gaps involves the ratio of photons to modes: all current hardness evidence assumes a "high-mode" regime in which the number of linear optical modes scales at least quadratically in the number of photons. By contrast, current experiments operate in a "low-mode" regime with a linear number of modes. In this paper we bridge this gap, bringing the hardness evidence for the low-mode experiments to the same level as had been previously established for the high-mode regime. This involves proving a new worst-to-average-case reduction for computing the Permanent that is robust to large numbers of row repetitions and also to distributions over matrices with correlated entries.

[382] 2312.00299

QIENet: Quantitative irradiance estimation network using recurrent neural network based on satellite remote sensing data

Global horizontal irradiance (GHI) plays a vital role in estimating solar energy resources, which are used to generate sustainable green energy. In order to estimate GHI with high spatial resolution, a quantitative irradiance estimation network, named QIENet, is proposed. Specifically, the temporal and spatial characteristics of remote sensing data of the satellite Himawari-8 are extracted and fused by recurrent neural network (RNN) and convolution operation, respectively. Not only remote sensing data, but also GHI-related time information (hour, day, and month) and geographical information (altitude, longitude, and latitude), are used as the inputs of QIENet. The satellite spectral channels B07 and B11 - B15 and time are recommended as model inputs for QIENet according to the spatial distributions of annual solar energy. Meanwhile, QIENet is able to capture the impact of various clouds on hourly GHI estimates. More importantly, QIENet does not overestimate ground observations and can also reduce RMSE by 27.51%/18.00%, increase R2 by 20.17%/9.42%, and increase r by 8.69%/3.54% compared with ERA5/NSRDB. Furthermore, QIENet is capable of providing a high-fidelity hourly GHI database with spatial resolution 0.02{\deg} * 0.02{\deg}(approximately 2km * 2km) for many applied energy fields.

[383] 2312.00305

Multiple Testing of Linear Forms for Noisy Matrix Completion

Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff of and an intricate dependence among the estimated entries induced by the low-rank structure. In this paper, we develop a general approach to overcome these difficulties by introducing new statistics for individual tests with sharp asymptotics both marginally and jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric aggregation scheme. We show that valid FDR control can be achieved with guaranteed power under nearly optimal sample size requirements using the proposed methodology. Extensive numerical simulations and real data examples are also presented to further illustrate its practical merits.

[384] 2312.00306

RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts

Creating radio galaxy catalogues from next-generation deep surveys requires automated identification of associated components of extended sources and their corresponding infrared hosts. In this paper, we introduce RadioGalaxyNET, a multimodal dataset, and a suite of novel computer vision algorithms designed to automate the detection and localization of multi-component extended radio galaxies and their corresponding infrared hosts. The dataset comprises 4,155 instances of galaxies in 2,800 images with both radio and infrared channels. Each instance provides information about the extended radio galaxy class, its corresponding bounding box encompassing all components, the pixel-level segmentation mask, and the keypoint position of its corresponding infrared host galaxy. RadioGalaxyNET is the first dataset to include images from the highly sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope, corresponding infrared images, and instance-level annotations for galaxy detection. We benchmark several object detection algorithms on the dataset and propose a novel multimodal approach to simultaneously detect radio galaxies and the positions of infrared hosts.

[385] 2312.00339

Propagation of chaos in path spaces via information theory

We study the mean-field limit of the stochastic interacting particle systems via tools from information theory. The key in our method is that, after applying the data processing inequality, one only needs to handle independent copies of solutions to the mean-field McKean stochastic differential equations, which then allows one to apply the law of large numbers. Our result on the propagation of chaos in path space is valid for both first and second order interacting particle systems; in particular, for the latter one our convergence rate is independent of the particle mass and also only linear in time. Our framework is different from current approaches in literature and could provide new insight for the study of interacting particle systems.

[386] 2312.00352

Quantum Kernel t-Distributed Stochastic Neighbor Embedding

Data visualization is important in understanding the characteristics of data that are difficult to see directly. It is used to visualize loss landscapes and optimization trajectories to analyze optimization performance. Popular optimization analysis is performed by visualizing a loss landscape around the reached local or global minimum using principal component analysis. However, this visualization depends on the variational parameters of a quantum circuit rather than quantum states, which makes it difficult to understand the mechanism of optimization process through the property of quantum states. Here, we propose a quantum data visualization method using quantum kernels, which enables us to offer fast and highly accurate visualization of quantum states. In our numerical experiments, we visualize hand-written digits dataset and apply $k$-nearest neighbor algorithm to the low-dimensional data to quantitatively evaluate our proposed method compared with a classical kernel method. As a result, our proposed method achieves comparable accuracy to the state-of-the-art classical kernel method, meaning that the proposed visualization method based on quantum machine learning does not degrade the separability of the input higher dimensional data. Furthermore, we visualize the optimization trajectories of finding the ground states of transverse field Ising model and successfully find the trajectory characteristics. Since quantum states are higher dimensional objects that can only be seen via observables, our visualization method, which inherits the similarity of quantum data, would be useful in understanding the behavior of quantum circuits and algorithms.

[387] 2312.00356

Transfer learning for predicting source terms of principal component transport in chemically reactive flow

The objective of this study is to evaluate whether the number of requisite training samples can be reduced with the use of various transfer learning models for predicting, for example, the chemical source terms of the data-driven reduced-order model that represents the homogeneous ignition process of a hydrogen/air mixture. Principal component analysis is applied to reduce the dimensionality of the hydrogen/air mixture in composition space. Artificial neural networks (ANNs) are used to tabulate the reaction rates of principal components, and subsequently, a system of ordinary differential equations is solved. As the number of training samples decreases at the target task (i.e.,for T0 > 1000 K and various phi), the reduced-order model fails to predict the ignition evolution of a hydrogen/air mixture. Three transfer learning strategies are then applied to the training of the ANN model with a sparse dataset. The performance of the reduced-order model with a sparse dataset is found to be remarkably enhanced if the training of the ANN model is restricted by a regularization term that controls the degree of knowledge transfer from source to target tasks. To this end, a novel transfer learning method is introduced, parameter control via partial initialization and regularization (PaPIR), whereby the amount of knowledge transferred is systemically adjusted for the initialization and regularization of the ANN model in the target task. It is found that an additional performance gain can be achieved by changing the initialization scheme of the ANN model in the target task when the task similarity between source and target tasks is relatively low.

[388] 2312.00357

A Generalizable Deep Learning System for Cardiac MRI

Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks; including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.

[389] 2312.00358

Impact of Data Augmentation on QCNNs

In recent years, Classical Convolutional Neural Networks (CNNs) have been applied for image recognition successfully. Quantum Convolutional Neural Networks (QCNNs) are proposed as a novel generalization to CNNs by using quantum mechanisms. The quantum mechanisms lead to an efficient training process in QCNNs by reducing the size of input from $N$ to $log_2N$. This paper implements and compares both CNNs and QCNNs by testing losses and prediction accuracy on three commonly used datasets. The datasets include the MNIST hand-written digits, Fashion MNIST and cat/dog face images. Additionally, data augmentation (DA), a technique commonly used in CNNs to improve the performance of classification by generating similar images based on original inputs, is also implemented in QCNNs. Surprisingly, the results showed that data augmentation didn't improve QCNNs performance. The reasons and logic behind this result are discussed, hoping to expand our understanding of Quantum machine learning theory.

[390] 2312.00366

Unbounded Donoho-Stark-Elad-Bruckstein-Ricaud-Torrésani Uncertainty Principles

Let $(\Omega, \mu)$, $(\Delta, \nu)$ be measure spaces and $p=1$ or $p=\infty$. Let $(\{f_\alpha\}_{\alpha\in \Omega}, \{\tau_\alpha\}_{\alpha\in \Omega})$ and $(\{g_\beta\}_{\beta\in \Delta}, \{\omega_\beta\}_{\beta\in \Delta})$ be unbounded continuous p-Schauder frames for a Banach space $\mathcal{X}$. Then for every $x \in ( \mathcal{D}(\theta_f) \cap\mathcal{D}(\theta_g))\setminus\{0\}$, we show that \begin{align}\label{UB} (1) \quad \quad \quad \quad \mu(\operatorname{supp}(\theta_f x))\nu(\operatorname{supp}(\theta_g x)) \geq \frac{1}{\left(\displaystyle\sup_{\alpha \in \Omega, \beta \in \Delta}|f_\alpha(\omega_\beta)|\right)\left(\displaystyle\sup_{\alpha \in \Omega , \beta \in \Delta}|g_\beta(\tau_\alpha)|\right)}, \end{align} where \begin{align*} &\theta_f:\mathcal{D}(\theta_f) \ni x \mapsto \theta_fx \in \mathcal{L}^p(\Omega, \mu); \quad \theta_fx: \Omega \ni \alpha \mapsto (\theta_fx) (\alpha):= f_\alpha (x) \in \mathbb{K},\\ &\theta_g: \mathcal{D}(\theta_g) \ni x \mapsto \theta_gx \in \mathcal{L}^p(\Delta, \nu); \quad \theta_gx: \Delta \ni \beta \mapsto (\theta_gx) (\beta):= g_\beta (x) \in \mathbb{K}. \end{align*} We call Inequality (1) as \textbf{Unbounded Donoho-Stark-Elad-Bruckstein-Ricaud-Torr\'{e}sani Uncertainty Principle}. Along with recent \textbf{Functional Continuous Uncertainty Principle} [arXiv:2308.00312], Inequality (1) also improves Ricaud-Torr\'{e}sani uncertainty principle [IEEE Trans. Inform. Theory, 2013]. In particular, it improves Elad-Bruckstein uncertainty principle [IEEE Trans. Inform. Theory, 2002] and Donoho-Stark uncertainty principle [SIAM J. Appl. Math., 1989].

[391] 2312.00386

Local monotone operator learning using non-monotone operators: MnM-MOL

The recovery of magnetic resonance (MR) images from undersampled measurements is a key problem that has seen extensive research in recent years. Unrolled approaches, which rely on end-to-end training of convolutional neural network (CNN) blocks within iterative reconstruction algorithms, offer state-of-the-art performance. These algorithms require a large amount of memory during training, making them difficult to employ in high-dimensional applications. Deep equilibrium (DEQ) models and the recent monotone operator learning (MOL) approach were introduced to eliminate the need for unrolling, thus reducing the memory demand during training. Both approaches require a Lipschitz constraint on the network to ensure that the forward and backpropagation iterations converge. Unfortunately, the constraint often results in reduced performance compared to unrolled methods. The main focus of this work is to relax the constraint on the CNN block in two different ways. Inspired by convex-non-convex regularization strategies, we now impose the monotone constraint on the sum of the gradient of the data term and the CNN block, rather than constrain the CNN itself to be a monotone operator. This approach enables the CNN to learn possibly non-monotone score functions, which can translate to improved performance. In addition, we only restrict the operator to be monotone in a local neighborhood around the image manifold. Our theoretical results show that the proposed algorithm is guaranteed to converge to the fixed point and that the solution is robust to input perturbations, provided that it is initialized close to the true solution. Our empirical results show that the relaxed constraints translate to improved performance and that the approach enjoys robustness to input perturbations similar to MOL.

[392] 2312.00387

Partition-based K-space Synthesis for Multi-contrast Parallel Imaging

Multi-contrast magnetic resonance imaging is a significant and essential medical imaging technique.However, multi-contrast imaging has longer acquisition time and is easy to cause motion artifacts. In particular, the acquisition time for a T2-weighted image is prolonged due to its longer repetition time (TR). On the contrary, T1-weighted image has a shorter TR. Therefore,utilizing complementary information across T1 and T2-weighted image is a way to decrease the overall imaging time. Previous T1-assisted T2 reconstruction methods have mostly focused on image domain using whole-based image fusion approaches. The image domain reconstruction method has the defects of high computational complexity and limited flexibility. To address this issue, we propose a novel multi-contrast imaging method called partition-based k-space synthesis (PKS) which can achieve super reconstruction quality of T2-weighted image by feature fusion. Concretely, we first decompose fully-sampled T1 k-space data and under-sampled T2 k-space data into two sub-data, separately. Then two new objects are constructed by combining the two sub-T1/T2 data. After that, the two new objects as the whole data to realize the reconstruction of T2-weighted image. Finally, the objective T2 is synthesized by extracting the sub-T2 data of each part. Experimental results showed that our combined technique can achieve comparable or better results than using traditional k-space parallel imaging(SAKE) that processes each contrast independently.

[393] 2312.00427

From Mutual Information to Expected Dynamics: New Generalization Bounds for Heavy-Tailed SGD

Understanding the generalization abilities of modern machine learning algorithms has been a major research topic over the past decades. In recent years, the learning dynamics of Stochastic Gradient Descent (SGD) have been related to heavy-tailed dynamics. This has been successfully applied to generalization theory by exploiting the fractal properties of those dynamics. However, the derived bounds depend on mutual information (decoupling) terms that are beyond the reach of computability. In this work, we prove generalization bounds over the trajectory of a class of heavy-tailed dynamics, without those mutual information terms. Instead, we introduce a geometric decoupling term by comparing the learning dynamics (depending on the empirical risk) with an expected one (depending on the population risk). We further upper-bound this geometric term, by using techniques from the heavy-tailed and the fractal literature, making it fully computable. Moreover, as an attempt to tighten the bounds, we propose a PAC-Bayesian setting based on perturbed dynamics, in which the same geometric term plays a crucial role and can still be bounded using the techniques described above.

[394] 2312.00429

Polygraphs: From Rewriting to Higher Categories

Polygraphs are a higher-dimensional generalization of the notion of directed graph. Based on those as unifying concept, this monograph on polygraphs revisits the theory of rewriting in the context of strict higher categories, adopting the abstract point of view offered by homotopical algebra. The first half explores the theory of polygraphs in low dimensions and its applications to the computation of the coherence of algebraic structures. It is meant to be progressive, with little requirements on the background of the reader, apart from basic category theory, and is illustrated with algorithmic computations on algebraic structures. The second half introduces and studies the general notion of n-polygraph, dealing with the homotopy theory of those. It constructs the folk model structure on the category of strict higher categories and exhibits polygraphs as cofibrant objects. This allows extending to higher dimensional structures the coherence results developed in the first half.

[395] 2312.00456

Auto-encoding GPS data to reveal individual and collective behaviour

We propose an innovative and generic methodology to analyse individual and collective behaviour through individual trajectory data. The work is motivated by the analysis of GPS trajectories of fishing vessels collected from regulatory tracking data in the context of marine biodiversity conservation and ecosystem-based fisheries management. We build a low-dimensional latent representation of trajectories using convolutional neural networks as non-linear mapping. This is done by training a conditional variational auto-encoder taking into account covariates. The posterior distributions of the latent representations can be linked to the characteristics of the actual trajectories. The latent distributions of the trajectories are compared with the Bhattacharyya coefficient, which is well-suited for comparing distributions. Using this coefficient, we analyse the variation of the individual behaviour of each vessel during time. For collective behaviour analysis, we build proximity graphs and use an extension of the stochastic block model for multiple networks. This model results in a clustering of the individuals based on their set of trajectories. The application to French fishing vessels enables us to obtain groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns over the period 2014-2018.

[396] 2312.00509

Bayesian causal discovery from unknown general interventions

We consider the problem of learning causal Directed Acyclic Graphs (DAGs) using combinations of observational and interventional experimental data. Current methods tailored to this setting assume that interventions either destroy parent-child relations of the intervened (target) nodes or only alter such relations without modifying the parent sets, even when the intervention targets are unknown. We relax this assumption by proposing a Bayesian method for causal discovery from general interventions, which allow for modifications of the parent sets of the unknown targets. Even in this framework, DAGs and general interventions may be identifiable only up to some equivalence classes. We provide graphical characterizations of such interventional Markov equivalence and devise compatible priors for Bayesian inference that guarantee score equivalence of indistinguishable structures. We then develop a Markov Chain Monte Carlo (MCMC) scheme to approximate the posterior distribution over DAGs, intervention targets and induced parent sets. Finally, we evaluate the proposed methodology on both simulated and real protein expression data.

[397] 2312.00529

Algorithm-based diagnostic application for diabetic retinopathy detection

Diabetic retinopathy (DR) is a growing health problem worldwide and is a leading cause of visual impairment and blindness, especially among working people aged 20-65. Its incidence is increasing along with the number of diabetes cases, and it is more common in developed countries than in developing countries. Recent research in the field of diabetic retinopathy diagnosis is using advanced technologies, such as analysis of images obtained by ophthalmoscopy. Automatic methods for analyzing eye images based on neural networks, deep learning and image analysis algorithms can improve the efficiency of diagnosis. This paper describes an automatic DR diagnosis method that includes processing and analysis of ophthalmoscopic images of the eye. It uses morphological algorithms to identify the optic disc and lesions characteristic of DR, such as microaneurysms, hemorrhages and exudates. Automated DR diagnosis has the potential to improve the efficiency of early detection of this disease and contribute to reducing the number of cases of diabetes-related visual impairment. The final step was to create an application with a graphical user interface that allowed retinal images taken at cooperating ophthalmology offices to be uploaded to the server. These images were then analyzed using a developed algorithm to make a diagnosis.

[398] 2312.00535

RIS-Based On-the-Air Semantic Communications -- a Diffractional Deep Neural Network Approach

Semantic communication has gained significant attention recently due to its advantages in achieving higher transmission efficiency by focusing on semantic information instead of bit-level information. However, current AI-based semantic communication methods require digital hardware for implementation. With the rapid advancement on reconfigurable intelligence surfaces (RISs), a new approach called on-the-air diffractional deep neural networks (D$^2$NN) can be utilized to enable semantic communications on the wave domain. This paper proposes a new paradigm of RIS-based on-the-air semantic communications, where the computational process occurs inherently as wireless signals pass through RISs. We present the system model and discuss the data and control flows of this scheme, followed by a performance analysis using image transmission as an example. In comparison to traditional hardware-based approaches, RIS-based semantic communications offer appealing features, such as light-speed computation, low computational power requirements, and the ability to handle multiple tasks simultaneously.

[399] 2312.00555

Dense, irregular, yet always graphic $3$-uniform hypergraph degree sequences

A $3$-uniform hypergraph is a generalization of simple graphs where each hyperedge is a subset of vertices of size $3$. The degree of a vertex in a hypergraph is the number of hyperedges incident with it. The degree sequence of a hypergraph is the sequence of the degrees of its vertices. The degree sequence problem for $3$-uniform hypergraphs is to decide if a $3$-uniform hypergraph exists with a prescribed degree sequence. Such a hypergraph is called a realization. Recently, Deza \emph{et al.} proved that the degree sequence problem for $3$-uniform hypergraphs is NP-complete. Some special cases are easy; however, polynomial algorithms have been known so far only for some very restricted degree sequences. The main result of our research is the following. If all degrees are between $\frac{2n^2}{63}+O(n)$ and $\frac{5n^2}{63}-O(n)$ in a degree sequence $D$, further, the number of vertices is at least $45$, and the degree sum can be divided by $3$, then $D$ has a $3$-uniform hypergraph realization. Our proof is constructive and in fact, it constructs a hypergraph realization in polynomial time for any degree sequence satisfying the properties mentioned above. To our knowledge, this is the first polynomial running time algorithm to construct a $3$-uniform hypergraph realization of a highly irregular and dense degree sequence.

[400] 2312.00585

Adaptive Parameter-Free Robust Learning using Latent Bernoulli Variables

We present an efficient parameter-free approach for statistical learning from corrupted training sets. We identify corrupted and non-corrupted samples using latent Bernoulli variables, and therefore formulate the robust learning problem as maximization of the likelihood where latent variables are marginalized out. The resulting optimization problem is solved via variational inference using an efficient Expectation-Maximization based method. The proposed approach improves over the state-of-the-art by automatically inferring the corruption level and identifying outliers, while adding minimal computational overhead. We demonstrate our robust learning method on a wide variety of machine learning tasks including online learning and deep learning where it exhibits ability to adapt to different levels of noise and attain high prediction accuracy.

[401] 2312.00621

Weighted Riesz Particles

Markov chain Monte Carlo (MCMC) methods are simulated by local exploration of complex statistical distributions, and while bypassing the cumbersome requirement of a specific analytical expression for the target, this stochastic exploration of an uncertain parameter space comes at the expense of a large number of samples, and this computational complexity increases with parameter dimensionality. Although at the exploration level, some methods are proposed to accelerate the convergence of the algorithm, such as tempering, Hamiltonian Monte Carlo, Rao-redwellization, and scalable methods for better performance, it cannot avoid the stochastic nature of this exploration. We consider the target distribution as a mapping where the infinite-dimensional Eulerian space of the parameters consists of a number of deterministic submanifolds and propose a generalized energy metric, termed weighted Riesz energy, where a number of points is generated through pairwise interactions, to discretize rectifiable submanifolds. We study the properties of the point, called Riesz particle, and embed it into sequential MCMC, and we find that there will be higher acceptance rates with fewer evaluations, we validate it through experimental comparative analysis from a linear Gaussian state-space model with synthetic data and a non-linear stochastic volatility model with real-world data.

[402] 2312.00634

A Recent Survey of Vision Transformers for Medical Image Segmentation

Medical image segmentation plays a crucial role in various healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. In recent years, Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation. In medical images, structures are usually highly interconnected and globally distributed. ViTs utilize their multi-scale attention mechanism to model the long-range relationships in the images. However, they do lack image-related inductive bias and translational invariance, potentially impacting their performance. Recently, researchers have come up with various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs) to capture local correlation in addition to the global information in the images. This survey paper provides a detailed review of the recent advancements in ViTs and HVTs for medical image segmentation. Along with the categorization of ViT and HVT-based medical image segmentation approaches we also present a detailed overview of their real-time applications in several medical image modalities. This survey may serve as a valuable resource for researchers, healthcare practitioners, and students in understanding the state-of-the-art approaches for ViT-based medical image segmentation.

[403] 2312.00640

One to beat them all: "RYU'' -- a unifying framework for the construction of safe balls

In this paper, we put forth a novel framework (named ``RYU'') for the construction of ``safe'' balls, i.e. regions that provably contain the dual solution of a target optimization problem. We concentrate on the standard setup where the cost function is the sum of two terms: a closed, proper, convex Lipschitz-smooth function and a closed, proper, convex function. The RYU framework is shown to generalize or improve upon all the results proposed in the last decade for the considered family of optimization problems.

[404] 2312.00661

Dual-Domain Multi-Contrast MRI Reconstruction with Synthesis-based Fusion Network

Purpose: To develop an efficient dual-domain reconstruction framework for multi-contrast MRI, with the focus on minimising cross-contrast misalignment in both the image and the frequency domains to enhance optimisation. Theory and Methods: Our proposed framework, based on deep learning, facilitates the optimisation for under-sampled target contrast using fully-sampled reference contrast that is quicker to acquire. The method consists of three key steps: 1) Learning to synthesise data resembling the target contrast from the reference contrast; 2) Registering the multi-contrast data to reduce inter-scan motion; and 3) Utilising the registered data for reconstructing the target contrast. These steps involve learning in both domains with regularisation applied to ensure their consistency. We also compare the reconstruction performance with existing deep learning-based methods using a dataset of brain MRI scans. Results: Extensive experiments demonstrate the superiority of our proposed framework, for up to an 8-fold acceleration rate, compared to state-of-the-art algorithms. Comprehensive analysis and ablation studies further present the effectiveness of the proposed components. Conclusion:Our dual-domain framework offers a promising approach to multi-contrast MRI reconstruction. It can also be integrated with existing methods to further enhance the reconstruction.

[405] 2312.00677

Unsupervised Adaptive Implicit Neural Representation Learning for Scan-Specific MRI Reconstruction

In recent studies on MRI reconstruction, advances have shown significant promise for further accelerating the MRI acquisition. Most state-of-the-art methods require a large amount of fully-sampled data to optimise reconstruction models, which is impractical and expensive under certain clinical settings. On the other hand, for unsupervised scan-specific reconstruction methods, overfitting is likely to happen due to insufficient supervision, while restrictions on acceleration rates and under-sampling patterns further limit their applicability. To this end, we propose an unsupervised, adaptive coarse-to-fine framework that enhances reconstruction quality without being constrained by the sparsity levels or patterns in under-sampling. The framework employs an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to their corresponding signal intensities. Moreover, we integrate a novel learning strategy that progressively refines the use of acquired k-space signals for self-supervision. This approach effectively adjusts the proportion of supervising signals from unevenly distributed information across different frequency bands, thus mitigating the issue of overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, has shown that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques, for up to 8-fold under-sampling.

[406] 2312.00689

Infrared Image Super-Resolution via GAN

The ability of generative models to accurately fit data distributions has resulted in their widespread adoption and success in fields such as computer vision and natural language processing. In this chapter, we provide a brief overview of the application of generative models in the domain of infrared (IR) image super-resolution, including a discussion of the various challenges and adversarial training methods employed. We propose potential areas for further investigation and advancement in the application of generative models for IR image super-resolution.

[407] 2312.00725

Algebra of Nonlocal Boxes and the Collapse of Communication Complexity

Communication complexity quantifies how difficult it is for two distant computers to evaluate a function $f(X,Y)$ where the strings $X$ and $Y$ are distributed to the first and second computer, respectively and under the constraint of exchanging as few bits as possible. Surprisingly, some nonlocal boxes, which are resources shared by the two computers, are so powerful that they allow to collapse communication complexity, in the sense that any Boolean function $f$ can be correctly estimated with the exchange of only one bit of communication. The Popescu-Rohrlich (PR) box is an example of such a collapsing resource, but a comprehensive description of the set of collapsing nonlocal boxes remains elusive. In this work, we carry out an algebraic study of the structure of wirings connecting nonlocal boxes, thus defining the notion of the "product of boxes" $\mathtt{P}\boxtimes\mathtt{Q}$, and we show related associativity and commutativity results. This gives rise to the notion of the "orbit of a box", unveiling surprising geometrical properties about the alignment and parallelism of distilled boxes. The power of this new framework is that it allows to prove previously-reported numerical intuitions concerning the best way to wire consecutive boxes, and to numerically and analytically recover recently-identified noisy PR boxes that collapse communication complexity for different types of noise models.

[408] 2312.00730

Asymptotic-preserving gyrokinetic implicit particle-orbit integrator for arbitrary electromagnetic fields

We extend the asymptotic preserving and energy conserving time integrator for charged-particle motion developed in [Ricketson & Chac\'on, JCP, 2020] to include finite Larmor-radius (FLR) effects in the presence of electric-field length-scales comparable to the particle gyro-radius (the gyro-kinetic limit). We introduce two modifications to the earlier scheme. The first is the explicit gyro-averaging of the electric field at the half time-step, along with an analogous modification to the current deposition, which we show preserves total energy conservation in implicit PIC schemes. The number of gyrophase samples is chosen adaptively, ensuring proper averaging for large timesteps, and the recovery of full-orbit dynamics in the small time-step limit. The second modification is an alternating large and small time-step strategy that ensures the particle trajectory samples gyrophases evenly. We show that this strategy relaxes the time-step restrictions on the scheme, allowing even larger speed-ups than previously achievable. We demonstrate the new method with several single-particle motion tests in a variety of electromagnetic field configurations featuring gyro-scale variation in the electric field. The results demonstrate the advertised ability to capture FLR effects accurately even when significantly stepping over the gyration time-scale.

[409] 2312.00733

Provable bounds for noise-free expectation values computed from noisy samples

In this paper, we explore the impact of noise on quantum computing, particularly focusing on the challenges when sampling bit strings from noisy quantum computers as well as the implications for optimization and machine learning applications. We formally quantify the sampling overhead to extract good samples from noisy quantum computers and relate it to the layer fidelity, a metric to determine the performance of noisy quantum processors. Further, we show how this allows us to use the Conditional Value at Risk of noisy samples to determine provable bounds on noise-free expectation values. We discuss how to leverage these bounds for different algorithms and demonstrate our findings through experiments on a real quantum computer involving up to 127 qubits. The results show a strong alignment with theoretical predictions.

[410] 2312.00742

Scalable Meta-Learning with Gaussian Processes

Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.