New articles on Computer Science


[1] 2303.12795

Named Entity Recognition Based Automatic Generation of Research Highlights

A scientific paper is traditionally prefaced by an abstract that summarizes the paper. Recently, research highlights that focus on the main findings of the paper have emerged as a complementary summary in addition to an abstract. However, highlights are not yet as common as abstracts, and are absent in many papers. In this paper, we aim to automatically generate research highlights using different sections of a research paper as input. We investigate whether the use of named entity recognition on the input improves the quality of the generated highlights. In particular, we have used two deep learning-based models: the first is a pointer-generator network, and the second augments the first model with coverage mechanism. We then augment each of the above models with named entity recognition features. The proposed method can be used to produce highlights for papers with missing highlights. Our experiments show that adding named entity information improves the performance of the deep learning-based summarizers in terms of ROUGE, METEOR and BERTScore measures.


[2] 2303.12796

An Analysis of Abstractive Text Summarization Using Pre-trained Models

People nowadays use search engines like Google, Yahoo, and Bing to find information on the Internet. Due to explosion in data, it is helpful for users if they are provided relevant summaries of the search results rather than just links to webpages. Text summarization has become a vital approach to help consumers swiftly grasp vast amounts of information.In this paper, different pre-trained models for text summarization are evaluated on different datasets. Specifically, we have used three different pre-trained models, namely, google/pegasus-cnn-dailymail, T5-base, facebook/bart-large-cnn. We have considered three different datasets, namely, CNN-dailymail, SAMSum and BillSum to get the output from the above three models. The pre-trained models are compared over these different datasets, each of 2000 examples, through ROUGH and BLEU metrics.


[3] 2303.12797

An algorithmic framework for the optimization of deep neural networks architectures and hyperparameters

In this paper, we propose an algorithmic framework to automatically generate efficient deep neural networks and optimize their associated hyperparameters. The framework is based on evolving directed acyclic graphs (DAGs), defining a more flexible search space than the existing ones in the literature. It allows mixtures of different classical operations: convolutions, recurrences and dense layers, but also more newfangled operations such as self-attention. Based on this search space we propose neighbourhood and evolution search operators to optimize both the architecture and hyper-parameters of our networks. These search operators can be used with any metaheuristic capable of handling mixed search spaces. We tested our algorithmic framework with an evolutionary algorithm on a time series prediction benchmark. The results demonstrate that our framework was able to find models outperforming the established baseline on numerous datasets.


[4] 2303.12798

Interpersonal Distance Tracking with mmWave Radar and IMUs

Tracking interpersonal distances is essential for real-time social distancing management and {\em ex-post} contact tracing to prevent spreads of contagious diseases. Bluetooth neighbor discovery has been employed for such purposes in combating COVID-19, but does not provide satisfactory spatiotemporal resolutions. This paper presents ImmTrack, a system that uses a millimeter wave radar and exploits the inertial measurement data from user-carried smartphones or wearables to track interpersonal distances. By matching the movement traces reconstructed from the radar and inertial data, the pseudo identities of the inertial data can be transferred to the radar sensing results in the global coordinate system. The re-identified, radar-sensed movement trajectories are then used to track interpersonal distances. In a broader sense, ImmTrack is the first system that fuses data from millimeter wave radar and inertial measurement units for simultaneous user tracking and re-identification. Evaluation with up to 27 people in various indoor/outdoor environments shows ImmTrack's decimeters-seconds spatiotemporal accuracy in contact tracing, which is similar to that of the privacy-intrusive camera surveillance and significantly outperforms the Bluetooth neighbor discovery approach.


[5] 2303.12799

Time Series as Images: Vision Transformer for Irregularly Sampled Time Series

Irregularly sampled time series are becoming increasingly prevalent in various domains, especially in medical applications. Although different highly-customized methods have been proposed to tackle irregularity, how to effectively model their complicated dynamics and high sparsity is still an open problem. This paper studies the problem from a whole new perspective: transforming irregularly sampled time series into line graph images and adapting powerful vision transformers to perform time series classification in the same way as image classification. Our approach largely simplifies algorithm designs without assuming prior knowledge and can be potentially extended as a general-purpose framework. Despite its simplicity, we show that it substantially outperforms state-of-the-art specialized algorithms on several popular healthcare and human activity datasets. Especially in the challenging leave-sensors-out setting where a subset of variables is masked during testing, the performance improvement is up to 54.0\% in absolute F1 score points. Our code and data are available at \url{https://github.com/Leezekun/ViTST}.


[6] 2303.12800

IoT Device Identification Based on Network Communication Analysis Using Deep Learning

Attack vectors for adversaries have increased in organizations because of the growing use of less secure IoT devices. The risk of attacks on an organization's network has also increased due to the bring your own device (BYOD) policy which permits employees to bring IoT devices onto the premises and attach them to the organization's network. To tackle this threat and protect their networks, organizations generally implement security policies in which only white listed IoT devices are allowed on the organization's network. To monitor compliance with such policies, it has become essential to distinguish IoT devices permitted within an organization's network from non white listed (unknown) IoT devices. In this research, deep learning is applied to network communication for the automated identification of IoT devices permitted on the network. In contrast to existing methods, the proposed approach does not require complex feature engineering of the network communication, because the 'communication behavior' of IoT devices is represented as small images which are generated from the device's network communication payload. The proposed approach is applicable for any IoT device, regardless of the protocol used for communication. As our approach relies on the network communication payload, it is also applicable for the IoT devices behind a network address translation (NAT) enabled router. In this study, we trained various classifiers on a publicly accessible dataset to identify IoT devices in different scenarios, including the identification of known and unknown IoT devices, achieving over 99% overall average detection accuracy.


[7] 2303.12802

Distributed Learning Meets 6G: A Communication and Computing Perspective

With the ever-improving computing capabilities and storage capacities of mobile devices in line with evolving telecommunication network paradigms, there has been an explosion of research interest towards exploring Distributed Learning (DL) frameworks to realize stringent key performance indicators (KPIs) that are expected in next-generation/6G cellular networks. In conjunction with Edge Computing, Federated Learning (FL) has emerged as the DL architecture of choice in prominent wireless applications. This article lays an outline of how DL in general and FL-based strategies specifically can contribute towards realizing a part of the 6G vision and strike a balance between communication and computing constraints. As a practical use case, we apply Multi-Agent Reinforcement Learning (MARL) within the FL framework to the Dynamic Spectrum Access (DSA) problem and present preliminary evaluation results. Top contemporary challenges in applying DL approaches to 6G networks are also highlighted.


[8] 2303.12803

Evolving Populations of Diverse RL Agents with MAP-Elites

Quality Diversity (QD) has emerged as a powerful alternative optimization paradigm that aims at generating large and diverse collections of solutions, notably with its flagship algorithm MAP-ELITES (ME) which evolves solutions through mutations and crossovers. While very effective for some unstructured problems, early ME implementations relied exclusively on random search to evolve the population of solutions, rendering them notoriously sample-inefficient for high-dimensional problems, such as when evolving neural networks. Follow-up works considered exploiting gradient information to guide the search in order to address these shortcomings through techniques borrowed from either Black-Box Optimization (BBO) or Reinforcement Learning (RL). While mixing RL techniques with ME unlocked state-of-the-art performance for robotics control problems that require a good amount of exploration, it also plagued these ME variants with limitations common among RL algorithms that ME was free of, such as hyperparameter sensitivity, high stochasticity as well as training instability, including when the population size increases as some components are shared across the population in recent approaches. Furthermore, existing approaches mixing ME with RL tend to be tied to a specific RL algorithm, which effectively prevents their use on problems where the corresponding RL algorithm fails. To address these shortcomings, we introduce a flexible framework that allows the use of any RL algorithm and alleviates the aforementioned limitations by evolving populations of agents (whose definition include hyperparameters and all learnable parameters) instead of just policies. We demonstrate the benefits brought about by our framework through extensive numerical experiments on a number of robotics control problems, some of which with deceptive rewards, taken from the QD-RL literature.


[9] 2303.12804

Features matching using natural language processing

The feature matching is a basic step in matching different datasets. This article proposes shows a new hybrid model of a pretrained Natural Language Processing (NLP) based model called BERT used in parallel with a statistical model based on Jaccard similarity to measure the similarity between list of features from two different datasets. This reduces the time required to search for correlations or manually match each feature from one dataset to another.


[10] 2303.12805

Digital Twins for Trust Building in Autonomous Drones through Dynamic Safety Evaluation

The adoption process of innovative software-intensive technologies leverages complex trust concerns in different forms and shapes. Perceived safety plays a fundamental role in technology adoption, being especially crucial in the case of those innovative software-driven technologies characterized by a high degree of dynamism and unpredictability, like collaborating autonomous systems. These systems need to synchronize their maneuvers in order to collaboratively engage in reactions to unpredictable incoming hazardous situations. That is however only possible in the presence of mutual trust. In this paper, we propose an approach for machine-to-machine dynamic trust assessment for collaborating autonomous systems that supports trust-building based on the concept of dynamic safety assurance within the collaborative process among the software-intensive autonomous systems. In our approach, we leverage the concept of digital twins which are abstract models fed with real-time data used in the run-time dynamic exchange of information. The information exchange is performed through the execution of specialized models that embed the necessary safety properties. More particularly, we examine the possible role of the Digital Twins in machine-to-machine trust building and present their design in supporting dynamic trust assessment of autonomous drones. Ultimately, we present a proof of concept of direct and indirect trust assessment by employing the Digital Twin in a use case involving two autonomous collaborating drones.


[11] 2303.12807

Granular-ball Optimization Algorithm

The existing intelligent optimization algorithms are designed based on the finest granularity, i.e., a point. This leads to weak global search ability and inefficiency. To address this problem, we proposed a novel multi-granularity optimization algorithm, namely granular-ball optimization algorithm (GBO), by introducing granular-ball computing. GBO uses many granular-balls to cover the solution space. Quite a lot of small and fine-grained granular-balls are used to depict the important parts, and a little number of large and coarse-grained granular-balls are used to depict the inessential parts. Fine multi-granularity data description ability results in a higher global search capability and faster convergence speed. In comparison with the most popular and state-of-the-art algorithms, the experiments on twenty benchmark functions demonstrate its better performance. The faster speed, higher approximation ability of optimal solution, no hyper-parameters, and simpler design of GBO make it an all-around replacement of most of the existing popular intelligent optimization algorithms.


[12] 2303.12808

PACO: Provocation Involving Action, Culture, and Oppression

In India, people identify with a particular group based on certain attributes such as religion. The same religious groups are often provoked against each other. Previous studies show the role of provocation in increasing tensions between India's two prominent religious groups: Hindus and Muslims. With the advent of the Internet, such provocation also surfaced on social media platforms such as WhatsApp. By leveraging an existing dataset of Indian WhatsApp posts, we identified three categories of provoking sentences against Indian Muslims. Further, we labeled 7,000 sentences for three provocation categories and called this dataset PACO. We leveraged PACO to train a model that can identify provoking sentences from a WhatsApp post. Our best model is fine-tuned RoBERTa and achieved a 0.851 average AUC score over five-fold cross-validation. Automatically identifying provoking sentences could stop provoking text from reaching out to the masses, and can prevent possible discrimination or violence against the target religious group. Further, we studied the provocative speech through a pragmatic lens, by identifying the dialog acts and impoliteness super-strategies used against the religious group.


[13] 2303.12810

Are LLMs the Master of All Trades? : Exploring Domain-Agnostic Reasoning Skills of LLMs

The potential of large language models (LLMs) to reason like humans has been a highly contested topic in Machine Learning communities. However, the reasoning abilities of humans are multifaceted and can be seen in various forms, including analogical, spatial and moral reasoning, among others. This fact raises the question whether LLMs can perform equally well across all these different domains. This research work aims to investigate the performance of LLMs on different reasoning tasks by conducting experiments that directly use or draw inspirations from existing datasets on analogical and spatial reasoning. Additionally, to evaluate the ability of LLMs to reason like human, their performance is evaluted on more open-ended, natural language questions. My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks. I believe these experiments are crucial for informing the future development of LLMs, particularly in contexts that require diverse reasoning proficiencies. By shedding light on the reasoning abilities of LLMs, this study aims to push forward our understanding of how they can better emulate the cognitive abilities of humans.


[14] 2303.12811

SignCRF: Scalable Channel-agnostic Data-driven Radio Authentication System

Radio Frequency Fingerprinting through Deep Learning (RFFDL) is a data-driven IoT authentication technique that leverages the unique hardware-level manufacturing imperfections associated with a particular device to recognize (fingerprint) the device based on variations introduced in the transmitted waveform. The proposed SignCRF is a scalable, channel-agnostic, data-driven radio authentication platform with unmatched precision in fingerprinting wireless devices based on their unique manufacturing impairments and independent of the dynamic channel irregularities caused by mobility. SignCRF consists of (i) a baseline classifier finely trained to authenticate devices with high accuracy and at scale; (ii) an environment translator carefully designed and trained to remove the dynamic channel impact from RF signals while maintaining the radio's specific signature; (iii) a Max-Rule module that selects the highest precision authentication technique between the baseline classifier and the environment translator per radio. We design, train, and validate the performance of SignCRF for multiple technologies in dynamic environments and at scale (100 LoRa and 20 WiFi devices). We demonstrate that SignCRF significantly improves the RFFDL performance by achieving as high as 5x and 8x improvement in correct authentication of WiFi and LoRa devices when compared to the state-of-the-art, respectively.


[15] 2303.12812

A Comparison of Graph Neural Networks for Malware Classification

Managing the threat posed by malware requires accurate detection and classification techniques. Traditional detection strategies, such as signature scanning, rely on manual analysis of malware to extract relevant features, which is labor intensive and requires expert knowledge. Function call graphs consist of a set of program functions and their inter-procedural calls, providing a rich source of information that can be leveraged to classify malware without the labor intensive feature extraction step of traditional techniques. In this research, we treat malware classification as a graph classification problem. Based on Local Degree Profile features, we train a wide range of Graph Neural Network (GNN) architectures to generate embeddings which we then classify. We find that our best GNN models outperform previous comparable research involving the well-known MalNet-Tiny Android malware dataset. In addition, our GNN models do not suffer from the overfitting issues that commonly afflict non-GNN techniques, although GNN models require longer training times.


[16] 2303.12816

From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding

Knowledge graph embedding (KGE) that maps entities and relations into vector representations is essential for downstream tasks. Conventional KGE methods require relatively high-dimensional entity representations to preserve the structural information of knowledge graph, but lead to oversized model parameters. Recent methods reduce model parameters by adopting low-dimensional entity representations, while developing techniques (e.g., knowledge distillation) to compensate for the reduced dimension. However, such operations produce degraded model accuracy and limited reduction of model parameters. Specifically, we view the concatenation of all entity representations as an embedding layer, and then conventional KGE methods that adopt high-dimensional entity representations equal to enlarging the width of the embedding layer to gain expressiveness. To achieve parameter efficiency without sacrificing accuracy, we instead increase the depth and propose a deeper embedding network for entity representations, i.e., a narrow embedding layer and a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that the proposed method (implemented based on TransE and DistMult) with 4-dimensional entity representations achieves more accurate link prediction results than counterpart parameter-efficient KGE methods and strong KGE baselines, including TransE and DistMult with 512-dimensional entity representations.


[17] 2303.12817

IRIS: a Record and Replay Framework to Enable Hardware-assisted Virtualization Fuzzing

Nowadays, industries are looking into virtualization as an effective means to build safe applications, thanks to the isolation it can provide among virtual machines (VMs) running on the same hardware. In this context, a fundamental issue is understanding to what extent the isolation is guaranteed, despite possible (or induced) problems in the virtualization mechanisms. Uncovering such isolation issues is still an open challenge, especially for hardware-assisted virtualization, since the search space should include all the possible VM states (and the linked hypervisor state), which is prohibitive. In this paper, we propose IRIS, a framework to record (learn) sequences of inputs (i.e., VM seeds) from the real guest execution (e.g., OS boot), replay them as-is to reach valid and complex VM states, and finally use them as valid seed to be mutated for enabling fuzzing solutions for hardware-assisted hypervisors. We demonstrate the accuracy and efficiency of IRIS in automatically reproducing valid VM behaviors, with no need to execute guest workloads. We also provide a proof-of-concept fuzzer, based on the proposed architecture, showing its potential on the Xen hypervisor.


[18] 2303.12818

An Empirical Analysis of the Shift and Scale Parameters in BatchNorm

Batch Normalization (BatchNorm) is a technique that improves the training of deep neural networks, especially Convolutional Neural Networks (CNN). It has been empirically demonstrated that BatchNorm increases performance, stability, and accuracy, although the reasons for such improvements are unclear. BatchNorm includes a normalization step as well as trainable shift and scale parameters. In this paper, we empirically examine the relative contribution to the success of BatchNorm of the normalization step, as compared to the re-parameterization via shifting and scaling. To conduct our experiments, we implement two new optimizers in PyTorch, namely, a version of BatchNorm that we refer to as AffineLayer, which includes the re-parameterization step without normalization, and a version with just the normalization step, that we call BatchNorm-minus. We compare the performance of our AffineLayer and BatchNorm-minus implementations to standard BatchNorm, and we also compare these to the case where no batch normalization is used. We experiment with four ResNet architectures (ResNet18, ResNet34, ResNet50, and ResNet101) over a standard image dataset and multiple batch sizes. Among other findings, we provide empirical evidence that the success of BatchNorm may derive primarily from improved weight initialization.


[19] 2303.12821

Towards A Visual Programming Tool to Create Deep Learning Models

Deep Learning (DL) developers come from different backgrounds, e.g., medicine, genomics, finance, and computer science. To create a DL model, they must learn and use high-level programming languages (e.g., Python), thus needing to handle related setups and solve programming errors. This paper presents DeepBlocks, a visual programming tool that allows DL developers to design, train, and evaluate models without relying on specific programming languages. DeepBlocks works by building on the typical model structure: a sequence of learnable functions whose arrangement defines the specific characteristics of the model. We derived DeepBlocks' design goals from a 5-participants formative interview, and we validated the first implementation of the tool through a typical use case. Results are promising and show that developers could visually design complex DL architectures.


[20] 2303.12822

Co-Speech Gesture Synthesis using Discrete Gesture Token Learning

Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions that can drive a humanoid robot to interact and communicate with human users. Such capability will improve the impressions of the robots by human users and will find applications in education, training, and medical services. One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance. The deterministic regression methods can not resolve the conflicting samples and may produce over-smoothed or damped motions. We proposed a two-stage model to address this uncertainty issue in gesture synthesis by modeling the gesture segments as discrete latent codes. Our method utilizes RQ-VAE in the first stage to learn a discrete codebook consisting of gesture tokens from training data. In the second stage, a two-level autoregressive transformer model is used to learn the prior distribution of residual codes conditioned on input speech context. Since the inference is formulated as token sampling, multiple gesture sequences could be generated given the same speech input using top-k sampling. The quantitative results and the user study showed the proposed method outperforms the previous methods and is able to generate realistic and diverse gesture motions.


[21] 2303.12823

Data-Driven Leader-following Consensus for Nonlinear Multi-Agent Systems against Composite Attacks: A Twins Layer Approach

This paper studies the leader-following consensuses of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequently, the resilient control task against CAs can be divided into two parts: One is distributed estimation against DoS attacks on the TL and the other is resilient decentralized tracking control against actuation attacks on the CPL. %The data-driven scheme is used to deal with both model non-linearity and model uncertainty, in which only the input and output data of the system are employed throughout the whole control process. First, a distributed observer based on switching estimation law against DoS is designed on TL. Second, a distributed model free adaptive control (DMFAC) protocol based on attack compensation against AAs is designed on CPL. Moreover, the uniformly ultimately bounded convergence of consensus error of the proposed double-layer DMFAC algorithm is strictly proved. Finally, the simulation verifies the effectiveness of the resilient double-layer control scheme.


[22] 2303.12848

Test-time Defense against Adversarial Attacks: Detection and Reconstruction of Adversarial Examples via Masked Autoencoder

Existing defense methods against adversarial attacks can be categorized into training time and test time defenses. Training time defense, i.e., adversarial training, requires a significant amount of extra time for training and is often not able to be generalized to unseen attacks. On the other hand, test time defense by test time weight adaptation requires access to perform gradient descent on (part of) the model weights, which could be infeasible for models with frozen weights. To address these challenges, we propose DRAM, a novel defense method to Detect and Reconstruct multiple types of Adversarial attacks via Masked autoencoder (MAE). We demonstrate how to use MAE losses to build a KS-test to detect adversarial attacks. Moreover, the MAE losses can be used to repair adversarial samples from unseen attack types. In this sense, DRAM neither requires model weight updates in test time nor augments the training set with more adversarial samples. Evaluating DRAM on the large-scale ImageNet data, we achieve the best detection rate of 82% on average on eight types of adversarial attacks compared with other detection baselines. For reconstruction, DRAM improves the robust accuracy by 6% ~ 41% for Standard ResNet50 and 3% ~ 8% for Robust ResNet50 compared with other self-supervision tasks, such as rotation prediction and contrastive learning.


[23] 2303.12850

Polyhedral Aspects of Feedback Vertex Set and Pseudoforest Deletion Set

We consider the feedback vertex set problem in undirected graphs (FVS). The input to FVS is an undirected graph $G=(V,E)$ with non-negative vertex costs. The goal is to find a least cost subset of vertices $S \subseteq V$ such that $G-S$ is acyclic. FVS is a well-known NP-hard problem with no $(2-\epsilon)$-approximation assuming the Unique Games Conjecture and it admits a $2$-approximation via combinatorial local-ratio methods (Bafna, Berman and Fujito, Algorithms and Computations '95; Becker and Geiger, Artificial Intelligence '96) which can also be interpreted as LP-based primal-dual algorithms (Chudak, Goemans, Hochbaum and Williamson, Operations Research Letters '98). Despite the existence of these algorithms for several decades, there is no known polynomial-time solvable LP relaxation for FVS with a provable integrality gap of at most $2$. More recent work (Chekuri and Madan SODA '16) developed a polynomial-sized LP relaxation for a more general problem, namely Subset FVS, and showed that its integrality gap is at most $13$ for Subset FVS, and hence also for FVS. Motivated by this gap in our knowledge, we undertake a polyhedral study of FVS and related problems. In this work, we formulate new integer linear programs (ILPs) for FVS whose LP-relaxation can be solved in polynomial time, and whose integrality gap is at most $2$. The new insights in this process also enable us to prove that the formulation in (Chekuri and Madan, SODA '16) has an integrality gap of at most $2$ for FVS. Our results for FVS are inspired by new formulations and polyhedral results for the closely-related pseudoforest deletion set problem (PFDS). Our formulations for PFDS are in turn inspired by a connection to the densest subgraph problem. We also conjecture an extreme point property for a LP-relaxation for FVS, and give evidence for the conjecture via a corresponding result for PFDS.


[24] 2303.12851

Semi-Oblivious Chase Termination for Linear Existential Rules: An Experimental Study

The chase procedure is a fundamental algorithmic tool in databases that allows us to reason with constraints, such as existential rules, with a plethora of applications. It takes as input a database and a set of constraints, and iteratively completes the database as dictated by the constraints. A key challenge, though, is the fact that it may not terminate, which leads to the problem of checking whether it terminates given a database and a set of constraints. In this work, we focus on the semi-oblivious version of the chase, which is well-suited for practical implementations, and linear existential rules, a central class of constraints with several applications. In this setting, there is a mature body of theoretical work that provides syntactic characterizations of when the chase terminates, algorithms for checking chase termination, precise complexity results, and worst-case optimal bounds on the size of the result of the chase (whenever is finite). Our main objective is to experimentally evaluate the existing chase termination algorithms with the aim of understanding which input parameters affect their performance, clarifying whether they can be used in practice, and revealing their performance limitations.


[25] 2303.12853

Three iterations of $(1-d)$-WL test distinguish non isometric clouds of $d$-dimensional points

The Weisfeiler--Lehman (WL) test is a fundamental iterative algorithm for checking isomorphism of graphs. It has also been observed that it underlies the design of several graph neural network architectures, whose capabilities and performance can be understood in terms of the expressive power of this test. Motivated by recent developments in machine learning applications to datasets involving three-dimensional objects, we study when the WL test is {\em complete} for clouds of euclidean points represented by complete distance graphs, i.e., when it can distinguish, up to isometry, any arbitrary such cloud. Our main result states that the $(d-1)$-dimensional WL test is complete for point clouds in $d$-dimensional Euclidean space, for any $d\ge 2$, and that only three iterations of the test suffice. Our result is tight for $d = 2, 3$. We also observe that the $d$-dimensional WL test only requires one iteration to achieve completeness.


[26] 2303.12856

Anti-symmetric Barron functions and their approximation with sums of determinants

A fundamental problem in quantum physics is to encode functions that are completely anti-symmetric under permutations of identical particles. The Barron space consists of high-dimensional functions that can be parameterized by infinite neural networks with one hidden layer. By explicitly encoding the anti-symmetric structure, we prove that the anti-symmetric functions which belong to the Barron space can be efficiently approximated with sums of determinants. This yields a factorial improvement in complexity compared to the standard representation in the Barron space and provides a theoretical explanation for the effectiveness of determinant-based architectures in ab-initio quantum chemistry.


[27] 2303.12860

Salient Span Masking for Temporal Understanding

Salient Span Masking (SSM) has shown itself to be an effective strategy to improve closed-book question answering performance. SSM extends general masked language model pretraining by creating additional unsupervised training sentences that mask a single entity or date span, thus oversampling factual information. Despite the success of this paradigm, the span types and sampling strategies are relatively arbitrary and not widely studied for other tasks. Thus, we investigate SSM from the perspective of temporal tasks, where learning a good representation of various temporal expressions is important. To that end, we introduce Temporal Span Masking (TSM) intermediate training. First, we find that SSM alone improves the downstream performance on three temporal tasks by an avg. +5.8 points. Further, we are able to achieve additional improvements (avg. +0.29 points) by adding the TSM task. These comprise the new best reported results on the targeted tasks. Our analysis suggests that the effectiveness of SSM stems from the sentences chosen in the training data rather than the mask choice: sentences with entities frequently also contain temporal expressions. Nonetheless, the additional targeted spans of TSM can still improve performance, especially in a zero-shot context.


[28] 2303.12862

LP-IOANet: Efficient High Resolution Document Shadow Removal

Document shadow removal is an integral task in document enhancement pipelines, as it improves visibility, readability and thus the overall quality. Assuming that the majority of practical document shadow removal scenarios require real-time, accurate models that can produce high-resolution outputs in-the-wild, we propose Laplacian Pyramid with Input/Output Attention Network (LP-IOANet), a novel pipeline with a lightweight architecture and an upsampling module. Furthermore, we propose three new datasets which cover a wide range of lighting conditions, images, shadow shapes and viewpoints. Our results show that we outperform the state-of-the-art by a 35% relative improvement in mean average error (MAE), while running real-time in four times the resolution (of the state-of-the-art method) on a mobile device.


[29] 2303.12865

NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions

Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of 3D neural representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pretrained NeRF-GANS. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the superior computational advantage of convolutional networks. The code will be available at: https://github.com/mshahbazi72/NeRF-GAN-Distillation


[30] 2303.12869

JaCoText: A Pretrained Model for Java Code-Text Generation

Pretrained transformer-based models have shown high performance in natural language generation task. However, a new wave of interest has surged: automatic programming language generation. This task consists of translating natural language instructions to a programming code. Despite the fact that well-known pretrained models on language generation have achieved good performance in learning programming languages, effort is still needed in automatic code generation. In this paper, we introduce JaCoText, a model based on Transformers neural network. It aims to generate java source code from natural language text. JaCoText leverages advantages of both natural language and code generation models. More specifically, we study some findings from the state of the art and use them to (1) initialize our model from powerful pretrained models, (2) explore additional pretraining on our java dataset, (3) carry out experiments combining the unimodal and bimodal data in the training, and (4) scale the input and output length during the fine-tuning of the model. Conducted experiments on CONCODE dataset show that JaCoText achieves new state-of-the-art results.


[31] 2303.12870

Consensus on Unknown Torus with Dense Byzantine Faults

We present a solution to consensus on a torus with Byzantine faults. Any solution to classic consensus that is tolerant to $f$ Byzantine faults requires $2f+1$ node-disjoint paths. Due to limited torus connectivity, this bound necessitates spatial separation between faults. Our solution does not require this many disjoint paths and tolerates dense faults. Specifically, we consider the case where all faults are in the one column. We address the version of consensus where only processes in fault-free columns must agree. We prove that even this weaker version is not solvable if the column may be completely faulty. We then present a solution for the case where at least one row is fault-free. The correct processes share orientation but do not know the identities of other processes or the torus dimensions. The communication is synchronous. To achieve our solution, we build and prove correct an all-to-all broadcast algorithm \PROG{BAT} that guarantees delivery to all processes in fault-free columns. We use this algorithm to solve our weak consensus problem. Our solution, \PROG{CBAT}, runs in $O(H+W)$ rounds, where $H$ and $W$ are torus height and width respectively. We extend our consensus solution to the fixed message size model where it runs in $O(H^3W^2)$ rounds. Our results are immediately applicable if the faults are located in a single row, rather than a column.


[32] 2303.12872

Human Uncertainty in Concept-Based AI Systems

Placing a human in the loop may abate the risks of deploying AI systems in safety-critical settings (e.g., a clinician working with a medical AI system). However, mitigating risks arising from human error and uncertainty within such human-AI interactions is an important and understudied issue. In this work, we study human uncertainty in the context of concept-based models, a family of AI systems that enable human feedback via concept interventions where an expert intervenes on human-interpretable concepts relevant to the task. Prior work in this space often assumes that humans are oracles who are always certain and correct. Yet, real-world decision-making by humans is prone to occasional mistakes and uncertainty. We study how existing concept-based models deal with uncertain interventions from humans using two novel datasets: UMNIST, a visual dataset with controlled simulated uncertainty based on the MNIST dataset, and CUB-S, a relabeling of the popular CUB concept dataset with rich, densely-annotated soft labels from humans. We show that training with uncertain concept labels may help mitigate weaknesses of concept-based systems when handling uncertain interventions. These results allow us to identify several open challenges, which we argue can be tackled through future multidisciplinary research on building interactive uncertainty-aware systems. To facilitate further research, we release a new elicitation platform, UElic, to collect uncertain feedback from humans in collaborative prediction tasks.


[33] 2303.12876

A Survey on Task Allocation and Scheduling in Robotic Network Systems

Cloud Robotics is helping to create a new generation of robots that leverage the nearly unlimited resources of large data centers (i.e., the cloud), overcoming the limitations imposed by on-board resources. Different processing power, capabilities, resource sizes, energy consumption, and so forth, make scheduling and task allocation critical components. The basic idea of task allocation and scheduling is to optimize performance by minimizing completion time, energy consumption, delays between two consecutive tasks, along with others, and maximizing resource utilization, number of completed tasks in a given time interval, and suchlike. In the past, several works have addressed various aspects of task allocation and scheduling. In this paper, we provide a comprehensive overview of task allocation and scheduling strategies and related metrics suitable for robotic network cloud systems. We discuss the issues related to allocation and scheduling methods and the limitations that need to be overcome. The literature review is organized according to three different viewpoints: Architectures and Applications, Methods and Parameters. In addition, the limitations of each method are highlighted for future research.


[34] 2303.12877

Resilient Trajectory Tracking to Partial Loss of Control Authority over Actuators with Actuation Delay

After the loss of control authority over thrusters of the Nauka module, the International Space Station lost attitude control for 45 minutes with potentially disastrous consequences. Motivated by a scenario of orbital inspection, we consider a similar malfunction occurring to the inspector satellite and investigate whether its mission can still be safely fulfilled. While a natural approach is to counteract in real-time the uncontrolled and undesirable thrust with the remaining controlled thrusters, vehicles are often subject to actuation delays hindering this approach. Instead, we extend resilience theory to systems suffering from actuation delay and build a resilient trajectory tracking controller with stability guarantees relying on a state predictor. We demonstrate that this controller can track accurately the reference trajectory of the inspection mission despite the actuation delay and the loss of control authority over one of the thrusters.


[35] 2303.12878

Robust Consensus in Ranking Data Analysis: Definitions, Properties and Computational Issues

As the issue of robustness in AI systems becomes vital, statistical learning techniques that are reliable even in presence of partly contaminated data have to be developed. Preference data, in the form of (complete) rankings in the simplest situations, are no exception and the demand for appropriate concepts and tools is all the more pressing given that technologies fed by or producing this type of data (e.g. search engines, recommending systems) are now massively deployed. However, the lack of vector space structure for the set of rankings (i.e. the symmetric group $\mathfrak{S}_n$) and the complex nature of statistics considered in ranking data analysis make the formulation of robustness objectives in this domain challenging. In this paper, we introduce notions of robustness, together with dedicated statistical methods, for Consensus Ranking the flagship problem in ranking data analysis, aiming at summarizing a probability distribution on $\mathfrak{S}_n$ by a median ranking. Precisely, we propose specific extensions of the popular concept of breakdown point, tailored to consensus ranking, and address the related computational issues. Beyond the theoretical contributions, the relevance of the approach proposed is supported by an experimental study.


[36] 2303.12883

HAPS-UAV-Enabled Heterogeneous Networks: A Deep Reinforcement Learning Approach

The integrated use of non-terrestrial network (NTN) entities such as the high-altitude platform station (HAPS) and low-altitude platform station (LAPS) has become essential elements in the space-air-ground integrated networks (SAGINs). However, the complexity, mobility, and heterogeneity of NTN entities and resources present various challenges from system design to deployment. This paper proposes a novel approach to designing a heterogeneous network consisting of HAPSs and unmanned aerial vehicles (UAVs) being LAPS entities. Our approach involves jointly optimizing the three-dimensional trajectory and channel allocation for aerial base stations, with a focus on ensuring fairness and the provision of quality of service (QoS) to ground users. Furthermore, we consider the load on base stations and incorporate this information into the optimization problem. The proposed approach utilizes a combination of deep reinforcement learning and fixed-point iteration techniques to determine the UAV locations and channel allocation strategies. Simulation results reveal that our proposed deep learning-based approach significantly outperforms learning-based and conventional benchmark models.


[37] 2303.12888

A dynamic risk score for early prediction of cardiogenic shock using machine learning

Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US. The morbidity and mortality are highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock is critical. Prompt implementation of treatment measures can prevent the deleterious spiral of ischemia, low blood pressure, and reduced cardiac output due to cardiogenic shock. However, early identification of cardiogenic shock has been challenging due to human providers' inability to process the enormous amount of data in the cardiac intensive care unit (ICU) and lack of an effective risk stratification tool. We developed a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict onset of cardiogenic shock. To develop and validate CShock, we annotated cardiac ICU datasets with physician adjudicated outcomes. CShock achieved an area under the receiver operator characteristic curve (AUROC) of 0.820, which substantially outperformed CardShock (AUROC 0.519), a well-established risk score for cardiogenic shock prognosis. CShock was externally validated in an independent patient cohort and achieved an AUROC of 0.800, demonstrating its generalizability in other cardiac ICUs.


[38] 2303.12889

AVOID: Autonomous Vehicle Operation Incident Dataset Across the Globe

Crash data of autonomous vehicles (AV) or vehicles equipped with advanced driver assistance systems (ADAS) are the key information to understand the crash nature and to enhance the automation systems. However, most of the existing crash data sources are either limited by the sample size or suffer from missing or unverified data. To contribute to the AV safety research community, we introduce AVOID: an open AV crash dataset. Three types of vehicles are considered: Advanced Driving System (ADS) vehicles, Advanced Driver Assistance Systems (ADAS) vehicles, and low-speed autonomous shuttles. The crash data are collected from the National Highway Traffic Safety Administration (NHTSA), California Department of Motor Vehicles (CA DMV) and incident news worldwide, and the data are manually verified and summarized in ready-to-use format. In addition, land use, weather, and geometry information are also provided. The dataset is expected to accelerate the research on AV crash analysis and potential risk identification by providing the research community with data of rich samples, diverse data sources, clear data structure, and high data quality.


[39] 2303.12890

Scale space radon transform-based inertia axis and object central symmetry estimation

Inertia Axes are involved in many techniques for image content measurement when involving information obtained from lines, angles, centroids... etc. We investigate, here, the estimation of the main axis of inertia of an object in the image. We identify the coincidence conditions of the Scale Space Radon Transform (SSRT) maximum and the inertia main axis. We show, that by choosing the appropriate scale parameter, it is possible to match the SSRT maximum and the main axis of inertia location and orientation of the embedded object in the image. Furthermore, an example of use case is presented where binary objects central symmetry computation is derived by means of SSRT projections and the axis of inertia orientation. To this end, some SSRT characteristics have been highlighted and exploited. The experimentations show the SSRT-based main axis of inertia computation effectiveness. Concerning the central symmetry, results are very satisfying as experimentations carried out on randomly created images dataset and existing datasets have permitted to divide successfully these images bases into centrally symmetric and non-centrally symmetric objects.


[40] 2303.12891

Feature Reduction Method Comparison Towards Explainability and Efficiency in Cybersecurity Intrusion Detection Systems

In the realm of cybersecurity, intrusion detection systems (IDS) detect and prevent attacks based on collected computer and network data. In recent research, IDS models have been constructed using machine learning (ML) and deep learning (DL) methods such as Random Forest (RF) and deep neural networks (DNN). Feature selection (FS) can be used to construct faster, more interpretable, and more accurate models. We look at three different FS techniques; RF information gain (RF-IG), correlation feature selection using the Bat Algorithm (CFS-BA), and CFS using the Aquila Optimizer (CFS-AO). Our results show CFS-BA to be the most efficient of the FS methods, building in 55% of the time of the best RF-IG model while achieving 99.99% of its accuracy. This reinforces prior contributions attesting to CFS-BA's accuracy while building upon the relationship between subset size, CFS score, and RF-IG score in final results.


[41] 2303.12892

A Small-Scale Switch Transformer and NLP-based Model for Clinical Narratives Classification

In recent years, Transformer-based models such as the Switch Transformer have achieved remarkable results in natural language processing tasks. However, these models are often too complex and require extensive pre-training, which limits their effectiveness for small clinical text classification tasks with limited data. In this study, we propose a simplified Switch Transformer framework and train it from scratch on a small French clinical text classification dataset at CHU Sainte-Justine hospital. Our results demonstrate that the simplified small-scale Transformer models outperform pre-trained BERT-based models, including DistillBERT, CamemBERT, FlauBERT, and FrALBERT. Additionally, using a mixture of expert mechanisms from the Switch Transformer helps capture diverse patterns; hence, the proposed approach achieves better results than a conventional Transformer with the self-attention mechanism. Finally, our proposed framework achieves an accuracy of 87\%, precision at 87\%, and recall at 85\%, compared to the third-best pre-trained BERT-based model, FlauBERT, which achieved an accuracy of 84\%, precision at 84\%, and recall at 84\%. However, Switch Transformers have limitations, including a generalization gap and sharp minima. We compare it with a multi-layer perceptron neural network for small French clinical narratives classification and show that the latter outperforms all other models.


[42] 2303.12896

Overcoming Algorithm Aversion: A Comparison between Process and Outcome Control

Algorithm aversion occurs when humans are reluctant to use algorithms despite their superior performance. Studies show that giving users outcome control by providing agency over how models' predictions are incorporated into decision-making mitigates algorithm aversion. We study whether algorithm aversion is mitigated by process control, wherein users can decide what input factors and algorithms to use in model training. We conduct a replication study of outcome control, and test novel process control study conditions on Amazon Mechanical Turk (MTurk) and Prolific. Our results partly confirm prior findings on the mitigating effects of outcome control, while also forefronting reproducibility challenges. We find that process control in the form of choosing the training algorithm mitigates algorithm aversion, but changing inputs does not. Furthermore, giving users both outcome and process control does not reduce algorithm aversion more than outcome or process control alone. This study contributes to design considerations around mitigating algorithm aversion.


[43] 2303.12897

Gyroscopic polynomials

Gyroscopic alignment of a fluid occurs when flow structures align with the rotation axis. This often gives rise to highly spatially anisotropic columnar structures that in combination with complex domain boundaries pose challenges for efficient numerical discretizations and computations. We define gyroscopic polynomials to be three-dimensional polynomials expressed in a coordinate system that conforms to rotational alignment. We remap the original domain with radius-dependent boundaries onto a right cylindrical or annular domain to create the computational domain in this coordinate system. We find the volume element expressed in gyroscopic coordinates leads naturally to a hierarchy of orthonormal bases. We build the bases out of Jacobi polynomials in the vertical and generalized Jacobi polynomials in the radial. Because these coordinates explicitly conform to flow structures found in rapidly rotating systems the bases represent fields with a relatively small number of modes. We develop the operator structure for one-dimensional semi-classical orthogonal polynomials as a building block for differential operators in the full three-dimensional cylindrical and annular domains. The differential operators of generalized Jacobi polynomials generate a sparse linear system for discretization of differential operators acting on the gyroscopic bases. This enables efficient simulation of systems with strong gyroscopic alignment.


[44] 2303.12898

Towards Understanding the Generalization of Medical Text-to-SQL Models and Datasets

Electronic medical records (EMRs) are stored in relational databases. It can be challenging to access the required information if the user is unfamiliar with the database schema or general database fundamentals. Hence, researchers have explored text-to-SQL generation methods that provide healthcare professionals direct access to EMR data without needing a database expert. However, currently available datasets have been essentially "solved" with state-of-the-art models achieving accuracy greater than or near 90%. In this paper, we show that there is still a long way to go before solving text-to-SQL generation in the medical domain. To show this, we create new splits of the existing medical text-to-SQL dataset MIMICSQL that better measure the generalizability of the resulting models. We evaluate state-of-the-art language models on our new split showing substantial drops in performance with accuracy dropping from up to 92% to 28%, thus showing substantial room for improvement. Moreover, we introduce a novel data augmentation approach to improve the generalizability of the language models. Overall, this paper is the first step towards developing more robust text-to-SQL models in the medical domain.\footnote{The dataset and code will be released upon acceptance.


[45] 2303.12901

Dynasparse: Accelerating GNN Inference through Dynamic Sparsity Exploitation

Graph Neural Network (GNN) inference is used in many real-world applications. Data sparsity in GNN inference, including sparsity in the input graph and the GNN model, offer opportunities to further speed up inference. Also, many pruning techniques have been proposed for model compression that increase the data sparsity of GNNs. We propose Dynasparse, a comprehensive hardware-software codesign on FPGA to accelerate GNN inference through dynamic sparsity exploitation. For this, we decouple the GNN computation kernels from the basic computation primitives, and explore hardware-software codesign as follows: 1) Hardware design: We propose a novel unified accelerator design on FPGA to efficiently execute various computation primitives. We develop a customized soft processor that is tightly coupled with the accelerator to execute a runtime system. Moreover, we develop efficient hardware mechanisms to profile the data sparsity and perform on-the-fly data format transformation to prepare the input data for various computation primitives; 2) Software design: We develop a runtime system that works synergistically with the accelerator to perform dynamic kernel-to-primitive mapping based on data sparsity. We implement Dynasparse on a state-of-the-art FPGA platform, Xilinx Alveo U250, and evaluate the design using widely used GNN models (GCN, GraphSAGE, GIN and SGC). For the above GNN models and various input graphs, the proposed accelerator and dynamic kernel-to-primitive mapping reduces the inference latency by $3.73\times$ on the average compared with the static mapping strategies employed in the state-of-the-art GNN accelerators. Compared with state-of-the-art CPU (GPU) implementations, Dynasparse achieves up to $56.9\times$ ($2.37\times$) speedup in end-to-end latency.


[46] 2303.12909

Ethics in Computing Education: Challenges and Experience with Embedded Ethics

The next generation of computer engineers and scientists must be proficient in not just the technical knowledge required to analyze, optimize, and create emerging microelectronics systems, but also with the skills required to make ethical decisions during design. Teaching computer ethics in computing curricula is therefore becoming an important requirement with significant ramifications for our increasingly connected and computing-reliant society. In this paper, we reflect on the many challenges and questions with effectively integrating ethics into modern computing curricula. We describe a case study of integrating ethics modules into the computer engineering curricula at Colorado State University.


[47] 2303.12910

Cross-Layer Design for AI Acceleration with Non-Coherent Optical Computing

Emerging AI applications such as ChatGPT, graph convolutional networks, and other deep neural networks require massive computational resources for training and inference. Contemporary computing platforms such as CPUs, GPUs, and TPUs are struggling to keep up with the demands of these AI applications. Non-coherent optical computing represents a promising approach for light-speed acceleration of AI workloads. In this paper, we show how cross-layer design can overcome challenges in non-coherent optical computing platforms. We describe approaches for optical device engineering, tuning circuit enhancements, and architectural innovations to adapt optical computing to a variety of AI workloads. We also discuss techniques for hardware/software co-design that can intelligently map and adapt AI software to improve its performance on non-coherent optical computing platforms.


[48] 2303.12913

What do Transgender Software Professionals say about a Career in the Software Industry?

Diversity is an essential aspect of software development because technology influences almost every aspect of modern society, and if the software industry lacks diversity, software products might unintentionally constrain groups of individuals instead of promoting an equalitarian experience to all. In this study, we investigate the perspectives of transgender software professionals about a career in software engineering as one of the aspects of diversity in the software industry. Our findings demonstrate that, on the one hand, trans people choose careers in software engineering for two primary reasons: a) even though software development environments are not exempt from discrimination, the software industry is safer than other industries for transgenders; b) trans people occasionally have to deal with gender dysphoria, anxiety, and fear of judgment, and the work flexibility offered by software companies allow them to cope with these issues more efficiently.


[49] 2303.12914

TRON: Transformer Neural Network Acceleration with Non-Coherent Silicon Photonics

Transformer neural networks are rapidly being integrated into state-of-the-art solutions for natural language processing (NLP) and computer vision. However, the complex structure of these models creates challenges for accelerating their execution on conventional electronic platforms. We propose the first silicon photonic hardware neural network accelerator called TRON for transformer-based models such as BERT, and Vision Transformers. Our analysis demonstrates that TRON exhibits at least 14x better throughput and 8x better energy efficiency, in comparison to state-of-the-art transformer accelerators.


[50] 2303.12915

Self-distillation for surgical action recognition

Surgical scene understanding is a key prerequisite for contextaware decision support in the operating room. While deep learning-based approaches have already reached or even surpassed human performance in various fields, the task of surgical action recognition remains a major challenge. With this contribution, we are the first to investigate the concept of self-distillation as a means of addressing class imbalance and potential label ambiguity in surgical video analysis. Our proposed method is a heterogeneous ensemble of three models that use Swin Transfomers as backbone and the concepts of self-distillation and multi-task learning as core design choices. According to ablation studies performed with the CholecT45 challenge data via cross-validation, the biggest performance boost is achieved by the usage of soft labels obtained by self-distillation. External validation of our method on an independent test set was achieved by providing a Docker container of our inference model to the challenge organizers. According to their analysis, our method outperforms all other solutions submitted to the latest challenge in the field. Our approach thus shows the potential of self-distillation for becoming an important tool in medical image analysis applications.


[51] 2303.12916

Deep learning-based stereo camera multi-video synchronization

Stereo vision is essential for many applications. Currently, the synchronization of the streams coming from two cameras is done using mostly hardware. A software-based synchronization method would reduce the cost, weight and size of the entire system and allow for more flexibility when building such systems. With this goal in mind, we present here a comparison of different deep learning-based systems and prove that some are efficient and generalizable enough for such a task. This study paves the way to a production ready software-based video synchronization system.


[52] 2303.12920

VRMoVi: Towards an Expressive Visualization for Human Motion and Object Interaction in Virtual Reality

Virtual reality (VR)-based immersive analysis has become an alternative to traditional approaches for analyzing complex, multidimensional human motion data. However, existing VR-based methods lack detailed information about hand motion and object interaction, which is essential for interpreting human activities and identifying their needs. To address that, we present a new VR system, VRMoVi, with a unique design of three expressive visualization layers: 1) a 3D tube layer for hand/object general motion, 2) a hand-object avatar layer for hand-object interaction animation, and 3) a particle-with-arrow layer for detailed hand positions and orientations. We validated VRMoVi with a real-world VR human motion dataset and conducted a user study with 24 participants. Compared with other visualization conditions, VRMoVi performed significantly better than the traditional 2D condition and slightly better than the standard VR-based condition; users found VRMoVi to be comprehensible, immersive, easy to use, and useful for interpreting human activity data.


[53] 2303.12921

Stability is Stable: Connections between Replicability, Privacy, and Adaptive Generalization

The notion of replicable algorithms was introduced in Impagliazzo et al. [STOC '22] to describe randomized algorithms that are stable under the resampling of their inputs. More precisely, a replicable algorithm gives the same output with high probability when its randomness is fixed and it is run on a new i.i.d. sample drawn from the same distribution. Using replicable algorithms for data analysis can facilitate the verification of published results by ensuring that the results of an analysis will be the same with high probability, even when that analysis is performed on a new data set. In this work, we establish new connections and separations between replicability and standard notions of algorithmic stability. In particular, we give sample-efficient algorithmic reductions between perfect generalization, approximate differential privacy, and replicability for a broad class of statistical problems. Conversely, we show any such equivalence must break down computationally: there exist statistical problems that are easy under differential privacy, but that cannot be solved replicably without breaking public-key cryptography. Furthermore, these results are tight: our reductions are statistically optimal, and we show that any computational separation between DP and replicability must imply the existence of one-way functions. Our statistical reductions give a new algorithmic framework for translating between notions of stability, which we instantiate to answer several open questions in replicability and privacy. This includes giving sample-efficient replicable algorithms for various PAC learning, distribution estimation, and distribution testing problems, algorithmic amplification of $\delta$ in approximate DP, conversions from item-level to user-level privacy, and the existence of private agnostic-to-realizable learning reductions under structured distributions.


[54] 2303.12922

Revisiting the Fragility of Influence Functions

In the last few years, many works have tried to explain the predictions of deep learning models. Few methods, however, have been proposed to verify the accuracy or faithfulness of these explanations. Recently, influence functions, which is a method that approximates the effect that leave-one-out training has on the loss function, has been shown to be fragile. The proposed reason for their fragility remains unclear. Although previous work suggests the use of regularization to increase robustness, this does not hold in all cases. In this work, we seek to investigate the experiments performed in the prior work in an effort to understand the underlying mechanisms of influence function fragility. First, we verify influence functions using procedures from the literature under conditions where the convexity assumptions of influence functions are met. Then, we relax these assumptions and study the effects of non-convexity by using deeper models and more complex datasets. Here, we analyze the key metrics and procedures that are used to validate influence functions. Our results indicate that the validation procedures may cause the observed fragility.


[55] 2303.12928

Leveraging Multi-time Hamilton-Jacobi PDEs for Certain Scientific Machine Learning Problems

Hamilton-Jacobi partial differential equations (HJ PDEs) have deep connections with a wide range of fields, including optimal control, differential games, and imaging sciences. By considering the time variable to be a higher dimensional quantity, HJ PDEs can be extended to the multi-time case. In this paper, we establish a novel theoretical connection between specific optimization problems arising in machine learning and the multi-time Hopf formula, which corresponds to a representation of the solution to certain multi-time HJ PDEs. Through this connection, we increase the interpretability of the training process of certain machine learning applications by showing that when we solve these learning problems, we also solve a multi-time HJ PDE and, by extension, its corresponding optimal control problem. As a first exploration of this connection, we develop the relation between the regularized linear regression problem and the Linear Quadratic Regulator (LQR). We then leverage our theoretical connection to adapt standard LQR solvers (namely, those based on the Riccati ordinary differential equations) to design new training approaches for machine learning. Finally, we provide some numerical examples that demonstrate the versatility and possible computational advantages of our Riccati-based approach in the context of continual learning, post-training calibration, transfer learning, and sparse dynamics identification.


[56] 2303.12930

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.


[57] 2303.12934

Real-World Community-in-the-Loop Smart Video Surveillance -- A Case Study at a Community College

Smart Video surveillance systems have become important recently for ensuring public safety and security, especially in smart cities. However, applying real-time artificial intelligence technologies combined with low-latency notification and alarming has made deploying these systems quite challenging. This paper presents a case study for designing and deploying smart video surveillance systems based on a real-world testbed at a community college. We primarily focus on a smart camera-based system that can identify suspicious/abnormal activities and alert the stakeholders and residents immediately. The paper highlights and addresses different algorithmic and system design challenges to guarantee real-time high-accuracy video analytics processing in the testbed. It also presents an example of cloud system infrastructure and a mobile application for real-time notification to keep students, faculty/staff, and responsible security personnel in the loop. At the same time, it covers the design decision to maintain communities' privacy and ethical requirements as well as hardware configuration and setups. We evaluate the system's performance using throughput and end-to-end latency. The experiment results show that, on average, our system's end-to-end latency to notify the end users in case of detecting suspicious objects is 5.3, 5.78, and 11.11 seconds when running 1, 4, and 8 cameras, respectively. On the other hand, in case of detecting anomalous behaviors, the system could notify the end users with 7.3, 7.63, and 20.78 seconds average latency. These results demonstrate that the system effectively detects and notifies abnormal behaviors and suspicious objects to the end users within a reasonable period. The system can run eight cameras simultaneously at a 32.41 Frame Per Second (FPS) rate.


[58] 2303.12936

Analyzing the Generalizability of Deep Contextualized Language Representations For Text Classification

This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification and sentiment analysis of product reviews. A "cross-context" setting is enabled using test sets that are distinct from the training data. Specifically, in the news classification task, the models are developed on local news from India and tested on the local news from China. In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews. This comparison is aimed at exploring the limits of the representative power of today's Natural Language Processing systems on the path to the systems that are generalizable to real-life scenarios. The models are fine-tuned and fed into a Feed-Forward Neural Network and a Bidirectional Long Short Term Memory network. Multinomial Naive Bayes and Linear Support Vector Machine are used as traditional baselines. The results show that, in binary text classification, DistilBERT is significantly better than ELMo on generalizing to the cross-context setting. ELMo is observed to be significantly more robust to the cross-context test data than both baselines. On the other hand, the baselines performed comparably well to ELMo when the training and test data are subsets of the same corpus (no cross-context). DistilBERT is also found to be 30% smaller and 83% faster than ELMo. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also favorable in incorporating into real-life systems for it requires a smaller computational training budget. When generalization is not the utmost preference and test domain is similar to the training domain, the traditional ML algorithms can still be considered as more economic alternatives to deep language representations.


[59] 2303.12937

Wireless Network Demands of Data Products from Small Uncrewed Aerial Systems at Hurricane Ian

Data collected at Hurricane Ian (2022) quantifies the demands that small uncrewed aerial systems (UAS), or drones, place on the network communication infrastructure and identifies gaps in the field. Drones have been increasingly used since Hurricane Katrina (2005) for disaster response, however getting the data from the drone to the appropriate decision makers throughout incident command in a timely fashion has been problematic. These delays have persisted even as countries such as the USA have made significant investments in wireless infrastructure, rapidly deployable nodes, and an increase in commercial satellite solutions. Hurricane Ian serves as a case study of the mismatch between communications needs and capabilities. In the first four days of the response, nine drone teams flew 34 missions under the direction of the State of Florida FL-UAS1, generating 636GB of data. The teams had access to six different wireless communications networks but had to resort to physically transferring data to the nearest intact emergency operations center in order to make the data available to the relevant agencies. The analysis of the mismatch contributes a model of the drone data-to-decision workflow in a disaster and quantifies wireless network communication requirements throughout the workflow in five factors. Four of the factors-availability, bandwidth, burstiness, and spatial distribution-were previously identified from analyses of Hurricanes Harvey (2017) and Michael (2018). This work adds upload rate as a fifth attribute. The analysis is expected to improve drone design and edge computing schemes as well as inform wireless communication research and development.


[60] 2303.12939

Managing Cyber Risk, a Science in the Making

Not a day goes by without news about a cyber attack. Fear spreads out and lots of wrong ideas circulate. This survey aims at showing how all these uncertainties about cyber can be transformed into manageable risk. After reviewing the main characteristics of cyber risk, we consider the three layers of cyber space: hardware, software and psycho-cognitive layer. We ask ourselves how is this risk different from others, how modelling has been tackled and needs to evolve, and what are the multi-facetted aspects of cyber risk management. This wide exploration pictures a science in the making and points out the questions to be solved for building a resilient society.


[61] 2303.12940

Cryptocurrency wallets: assessment and security

Digital wallet as a software program or a digital device allows users to conduct various transactions. Hot and cold digital wallets are considered as two types of this wallet. Digital wallets need an online connection fall into the first group, whereas digital wallets can operate without internet connection belong to the second group. Prior to buying a digital wallet, it is important to define for what purpose it will be utilized. The ease with which a mobile phone transaction may be completed in a couple of seconds and the speed with which transactions are executed are reflection of efficiency. One of the most important elements of digital wallets is data organization. Digital wallets are significantly less expensive than classic methods of transaction, which entails various charges and fees. Constantly, demand for their usage is growing due to speed, security, and the ability to conduct transactions between two users without the need of a third party. As the popularity of digital currency wallets grows, the number of security concerns impacting them increases significantly. The current status of digital wallets on the market, as well as the options for an efficient solution for obtaining and utilizing digital wallets. Finally, the digital wallets' security and future improvement prospects are discussed in this chapter.


[62] 2303.12942

A Survey on Explainable Artificial Intelligence for Network Cybersecurity

The black-box nature of artificial intelligence (AI) models has been the source of many concerns in their use for critical applications. Explainable Artificial Intelligence (XAI) is a rapidly growing research field that aims to create machine learning models that can provide clear and interpretable explanations for their decisions and actions. In the field of network cybersecurity, XAI has the potential to revolutionize the way we approach network security by enabling us to better understand the behavior of cyber threats and to design more effective defenses. In this survey, we review the state of the art in XAI for cybersecurity in network systems and explore the various approaches that have been proposed to address this important problem. The review follows a systematic classification of network-driven cybersecurity threats and issues. We discuss the challenges and limitations of current XAI methods in the context of cybersecurity and outline promising directions for future research.


[63] 2303.12944

Use of Federated Learning and Blockchain towards Securing Financial Services

In recent days, the proliferation of several existing and new cyber-attacks pose an axiomatic threat to the stability of financial services. It is hard to predict the nature of attacks that can trigger a serious financial crisis. The unprecedented digital transformation to financial services has been accelerated during the COVID-19 pandemic and it is still ongoing. Attackers are taking advantage of this transformation and pose a new global threat to financial stability and integrity. Many large organizations are switching from centralized finance (CeFi) to decentralized finance (DeFi) because decentralized finance has many advantages. Blockchain can bring big and far-reaching effects on the trustworthiness, safety, accessibility, cost-effectiveness, and openness of the financial sector. The present paper gives an in-depth look at how blockchain and federated learning (FL) are used in financial services. It starts with an overview of recent developments in both use cases. This paper explores and discusses existing financial service vulnerabilities, potential threats, and consequent risks. So, we explain the problems that can be fixed in financial services and how blockchain and FL could help solve them. These problems include data protection, storage optimization, and making more money in financial services. We looked at many blockchain-enabled FL methods and came up with some possible solutions that could be used in financial services to solve several challenges like cost-effectiveness, automation, and security control. Finally, we point out some future directions at the end of this study.


[64] 2303.12946

Underwater Camouflage Object Detection Dataset

We have made a dataset of camouflage object detection mainly for complex seabed scenes, and named it UnderWater RGB&Sonar,or UW-RS for short. The UW-RS dataset contains a total of 1972 image data. The dataset mainly consists of two parts, namely underwater optical data part (UW-R dataset) and underwater sonar data part (UW-S dataset).


[65] 2303.12947

Deep Attention Recognition for Attack Identification in 5G UAV scenarios: Novel Architecture and End-to-End Evaluation

Despite the robust security features inherent in the 5G framework, attackers will still discover ways to disrupt 5G unmanned aerial vehicle (UAV) operations and decrease UAV control communication performance in Air-to-Ground (A2G) links. Operating under the assumption that the 5G UAV communications infrastructure will never be entirely secure, we propose Deep Attention Recognition (DAtR) as a solution to identify attacks based on a small deep network embedded in authenticated UAVs. Our proposed solution uses two observable parameters: the Signal-to-Interference-plus-Noise Ratio (SINR) and the Reference Signal Received Power (RSSI) to recognize attacks under Line-of-Sight (LoS), Non-Line-of-Sight (NLoS), and a probabilistic combination of the two conditions. In the tested scenarios, a number of attackers are located in random positions, while their power is varied in each simulation. Moreover, terrestrial users are included in the network to impose additional complexity on attack detection. To improve the systems overall performance in the attack scenarios, we propose complementing the deep network decision with two mechanisms based on data manipulation and majority voting techniques. We compare several performance parameters in our proposed Deep Network. For example, the impact of Long Short-Term-Memory (LSTM) and Attention layers in terms of their overall accuracy, the window size effect, and test the accuracy when only partial data is available in the training process. Finally, we benchmark our deep network with six widely used classifiers regarding classification accuracy. Our algorithms accuracy exceeds 4% compared with the eXtreme Gradient Boosting (XGB) classifier in LoS condition and around 3% in the short distance NLoS condition. Considering the proposed deep network, all other classifiers present lower accuracy than XGB.


[66] 2303.12948

FTSO: Effective NAS via First Topology Second Operator

Existing one-shot neural architecture search (NAS) methods have to conduct a search over a giant super-net, which leads to the huge computational cost. To reduce such cost, in this paper, we propose a method, called FTSO, to divide the whole architecture search into two sub-steps. Specifically, in the first step, we only search for the topology, and in the second step, we search for the operators. FTSO not only reduces NAS's search time from days to 0.68 seconds, but also significantly improves the found architecture's accuracy. Our extensive experiments on ImageNet show that within 18 seconds, FTSO can achieve a 76.4% testing accuracy, 1.5% higher than the SOTA, PC-DARTS. In addition, FTSO can reach a 97.77% testing accuracy, 0.27% higher than the SOTA, with nearly 100% (99.8%) search time saved, when searching on CIFAR10.


[67] 2303.12949

Self-triggered output feedback control for nonlinear networked control systems based on hybrid Lyapunov functions

Most approaches for self-triggered control (STC) of nonlinear networked control systems (NCS) require measurements of the full system state to determine transmission times. However, for most control systems only a lower dimensional output is available. To bridge this gap, we present in this paper an output-feedback STC approach for nonlinear NCS. An asymptotically stable observer is used to reconstruct the plant state and transmission times are determined based on the observer state. The approach employs hybrid Lyapunov functions and a dynamic variable to encode past state information and to maximize the time between transmissions. It is non-conservative in the sense that the assumptions on plant and controller are the same as for dynamic STC based on hybrid Lyapunov functions with full state measurements and any asymptotically stabilizing observer can be used. We conclude that the proposed STC approach guarantees asymptotic stability of the origin for the closed-loop system.


[68] 2303.12950

LightPainter: Interactive Portrait Relighting with Freehand Scribble

Recent portrait relighting methods have achieved realistic results of portrait lighting effects given a desired lighting representation such as an environment map. However, these methods are not intuitive for user interaction and lack precise lighting control. We introduce LightPainter, a scribble-based relighting system that allows users to interactively manipulate portrait lighting effect with ease. This is achieved by two conditional neural networks, a delighting module that recovers geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. To train the relighting module, we propose a novel scribble simulation procedure to mimic real user scribbles, which allows our pipeline to be trained without any human annotations. We demonstrate high-quality and flexible portrait lighting editing capability with both quantitative and qualitative experiments. User study comparisons with commercial lighting editing tools also demonstrate consistent user preference for our method.


[69] 2303.12952

TSI-GAN: Unsupervised Time Series Anomaly Detection using Convolutional Cycle-Consistent Generative Adversarial Networks

Anomaly detection is widely used in network intrusion detection, autonomous driving, medical diagnosis, credit card frauds, etc. However, several key challenges remain open, such as lack of ground truth labels, presence of complex temporal patterns, and generalizing over different datasets. This paper proposes TSI-GAN, an unsupervised anomaly detection model for time-series that can learn complex temporal patterns automatically and generalize well, i.e., no need for choosing dataset-specific parameters, making statistical assumptions about underlying data, or changing model architectures. To achieve these goals, we convert each input time-series into a sequence of 2D images using two encoding techniques with the intent of capturing temporal patterns and various types of deviance. Moreover, we design a reconstructive GAN that uses convolutional layers in an encoder-decoder network and employs cycle-consistency loss during training to ensure that inverse mappings are accurate as well. In addition, we also instrument a Hodrick-Prescott filter in post-processing to mitigate false positives. We evaluate TSI-GAN using 250 well-curated and harder-than-usual datasets and compare with 8 state-of-the-art baseline methods. The results demonstrate the superiority of TSI-GAN to all the baselines, offering an overall performance improvement of 13% and 31% over the second-best performer MERLIN and the third-best performer LSTM-AE, respectively.


[70] 2303.12957

Reinforcement Learning with Exogenous States and Rewards

Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous space, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.


[71] 2303.12959

Variantional autoencoder with decremental information bottleneck for disentanglement

One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous incremental methods with only on latent space cannot optimize these two targets simultaneously, so they expand the Information Bottleneck while training to {optimize from disentanglement to reconstruction. However, a large bottleneck will lose the constraint of disentanglement, causing the information diffusion problem. To tackle this issue, we present a novel decremental variational autoencoder with disentanglement-invariant transformations to optimize multiple objectives in different layers, termed DeVAE, for balancing disentanglement and reconstruction fidelity by decreasing the information bottleneck of diverse latent spaces gradually. Benefiting from the multiple latent spaces, DeVAE allows simultaneous optimization of multiple objectives to optimize reconstruction while keeping the constraint of disentanglement, avoiding information diffusion. DeVAE is also compatible with large models with high-dimension latent space. Experimental results on dSprites and Shapes3D that DeVAE achieves \fix{R2q6}{a good balance between disentanglement and reconstruction.DeVAE shows high tolerant of hyperparameters and on high-dimensional latent spaces.


[72] 2303.12961

The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs

The successes of foundation models such as ChatGPT and AlphaFold have spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. We review over 80 foundation models trained on non-imaging EMR data (i.e. clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g. MIMIC-III) or broad, public biomedical corpora (e.g. PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. In light of these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.


[73] 2303.12963

Forecast-Aware Model Driven LSTM

Poor air quality can have a significant impact on human health. The National Oceanic and Atmospheric Administration (NOAA) air quality forecasting guidance is challenged by the increasing presence of extreme air quality events due to extreme weather events such as wild fires and heatwaves. These extreme air quality events further affect human health. Traditional methods used to correct model bias make assumptions about linearity and the underlying distribution. Extreme air quality events tend to occur without a strong signal leading up to the event and this behavior tends to cause existing methods to either under or over compensate for the bias. Deep learning holds promise for air quality forecasting in the presence of extreme air quality events due to its ability to generalize and learn nonlinear problems. However, in the presence of these anomalous air quality events, standard deep network approaches that use a single network for generalizing to future forecasts, may not always provide the best performance even with a full feature-set including geography and meteorology. In this work we describe a method that combines unsupervised learning and a forecast-aware bi-directional LSTM network to perform bias correction for operational air quality forecasting using AirNow station data for ozone and PM2.5 in the continental US. Using an unsupervised clustering method trained on station geographical features such as latitude and longitude, urbanization, and elevation, the learned clusters direct training by partitioning the training data for the LSTM networks. LSTMs are forecast-aware and implemented using a unique way to perform learning forward and backwards in time across forecasting days. When comparing the RMSE of the forecast model to the RMSE of the bias corrected model, the bias corrected model shows significant improvement (27\% lower RMSE for ozone) over the base forecast.


[74] 2303.12964

Continuous Indeterminate Probability Neural Network

This paper introduces a general model called CIPNN - Continuous Indeterminate Probability Neural Network, and this model is based on IPNN, which is used for discrete latent random variables. Currently, posterior of continuous latent variables is regarded as intractable, with the new theory proposed by IPNN this problem can be solved. Our contributions are Four-fold. First, we derive the analytical solution of the posterior calculation of continuous latent random variables and propose a general classification model (CIPNN). Second, we propose a general auto-encoder called CIPAE - Continuous Indeterminate Probability Auto-Encoder, the decoder part is not a neural network and uses a fully probabilistic inference model for the first time. Third, we propose a new method to visualize the latent random variables, we use one of N dimensional latent variables as a decoder to reconstruct the input image, which can work even for classification tasks, in this way, we can see what each latent variable has learned. Fourth, IPNN has shown great classification capability, CIPNN has pushed this classification capability to infinity. Theoretical advantages are reflected in experimental results.


[75] 2303.12965

Efficient Meshy Neural Fields for Animatable Human Avatars

Efficiently digitizing high-fidelity animatable human avatars from videos is a challenging and active research topic. Recent volume rendering-based neural representations open a new way for human digitization with their friendly usability and photo-realistic reconstruction quality. However, they are inefficient for long optimization times and slow inference speed; their implicit nature results in entangled geometry, materials, and dynamics of humans, which are hard to edit afterward. Such drawbacks prevent their direct applicability to downstream applications, especially the prominent rasterization-based graphic ones. We present EMA, a method that Efficiently learns Meshy neural fields to reconstruct animatable human Avatars. It jointly optimizes explicit triangular canonical mesh, spatial-varying material, and motion dynamics, via inverse rendering in an end-to-end fashion. Each above component is derived from separate neural fields, relaxing the requirement of a template, or rigging. The mesh representation is highly compatible with the efficient rasterization-based renderer, thus our method only takes about an hour of training and can render in real-time. Moreover, only minutes of optimization is enough for plausible reconstruction results. The disentanglement of meshes enables direct downstream applications. Extensive experiments illustrate the very competitive performance and significant speed boost against previous methods. We also showcase applications including novel pose synthesis, material editing, and relighting. The project page: https://xk-huang.github.io/ema/.


[76] 2303.12968

Ambient Intelligence for Next-Generation AR

Next-generation augmented reality (AR) promises a high degree of context-awareness - a detailed knowledge of the environmental, user, social and system conditions in which an AR experience takes place. This will facilitate both the closer integration of the real and virtual worlds, and the provision of context-specific content or adaptations. However, environmental awareness in particular is challenging to achieve using AR devices alone; not only are these mobile devices' view of an environment spatially and temporally limited, but the data obtained by onboard sensors is frequently inaccurate and incomplete. This, combined with the fact that many aspects of core AR functionality and user experiences are impacted by properties of the real environment, motivates the use of ambient IoT devices, wireless sensors and actuators placed in the surrounding environment, for the measurement and optimization of environment properties. In this book chapter we categorize and examine the wide variety of ways in which these IoT sensors and actuators can support or enhance AR experiences, including quantitative insights and proof-of-concept systems that will inform the development of future solutions. We outline the challenges and opportunities associated with several important research directions which must be addressed to realize the full potential of next-generation AR.


[77] 2303.12970

Examining Cashless Payment Services in a Post-Pandemic Environment

The global pandemic COVID-19 posed numerous challenges for U.S. restaurants and food services. Many businesses adopted contactless ordering and cashless payment policies to comply with emergency health mandates. Even with national and public health emergency mandates set to expire in May 2023, cashless payment services continue to thrive through online ordering platforms such as DoorDash and Uber Eats and social payment platforms such as Snackpass. At present, designers and policymakers must address the socioeconomic politics of cashless payment services and service accessibility for marginalized groups.


[78] 2303.12973

Uncertainty Calibration for Counterfactual Propensity Estimation in Recommendation

In recommendation systems, a large portion of the ratings are missing due to the selection biases, which is known as Missing Not At Random. The counterfactual inverse propensity scoring (IPS) was used to weight the imputation error of every observed rating. Although effective in multiple scenarios, we argue that the performance of IPS estimation is limited due to the uncertainty miscalibration of propensity estimation. In this paper, we propose the uncertainty calibration for the propensity estimation in recommendation systems with multiple representative uncertainty calibration techniques. Theoretical analysis on the bias and generalization bound shows the superiority of the calibrated IPS estimator over the uncalibrated one. Experimental results on the coat and yahoo datasets shows that the uncertainty calibration is improved and hence brings the better recommendation results.


[79] 2303.12974

Performance Analysis and Evaluation of Cloud Vision Emotion APIs

Facial expression is a way of communication that can be used to interact with computers or other electronic devices and the recognition of emotion from faces is an emerging practice with application in many fields. There are many cloud-based vision application programming interfaces available that recognize emotion from facial images and video. In this article, the performances of two well-known APIs were compared using a public dataset of 980 images of facial emotions. For these experiments, a client program was developed which iterates over the image set, calls the cloud services, and caches the results of the emotion detection for each image. The performance was evaluated in each class of emotions using prediction accuracy. It has been found that the prediction accuracy for each emotion varies according to the cloud service being used. Similarly, each service provider presents a strong variation of performance according to the class being analyzed, as can be seen with more detail in this artilects.


[80] 2303.12976

NVAutoNet: Fast and Accurate 360$^{\circ}$ 3D Perception For Self Driving

Robust real-time perception of 3D world is essential to the autonomous vehicle. We introduce an end-to-end surround camera perception system for self-driving. Our perception system is a novel multi-task, multi-camera network which takes a variable set of time-synced camera images as input and produces a rich collection of 3D signals such as sizes, orientations, locations of obstacles, parking spaces and free-spaces, etc. Our perception network is modular and end-to-end: 1) the outputs can be consumed directly by downstream modules without any post-processing such as clustering and fusion -- improving speed of model deployment and in-car testing 2) the whole network training is done in one single stage -- improving speed of model improvement and iterations. The network is well designed to have high accuracy while running at 53 fps on NVIDIA Orin SoC (system-on-a-chip). The network is robust to sensor mounting variations (within some tolerances) and can be quickly customized for different vehicle types via efficient model fine-tuning thanks of its capability of taking calibration parameters as additional inputs during training and testing. Most importantly, our network has been successfully deployed and being tested on real roads.


[81] 2303.12981

Connected Superlevel Set in (Deep) Reinforcement Learning and its Application to Minimax Theorems

The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property. To our best knowledge, these are novel and previously unknown discoveries. We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and is equiconnected on the other side observes the minimax equality (i.e. has a Nash equilibrium). We find that this exact structure is exhibited by an interesting robust reinforcement learning problem under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result is established in the literature.


[82] 2303.12982

Fault Prognosis of Turbofan Engines: Eventual Failure Prediction and Remaining Useful Life Estimation

In the era of industrial big data, prognostics and health management is essential to improve the prediction of future failures to minimize inventory, maintenance, and human costs. Used for the 2021 PHM Data Challenge, the new Commercial Modular Aero-Propulsion System Simulation dataset from NASA is an open-source benchmark containing simulated turbofan engine units flown under realistic flight conditions. Deep learning approaches implemented previously for this application attempt to predict the remaining useful life of the engine units, but have not utilized labeled failure mode information, impeding practical usage and explainability. To address these limitations, a new prognostics approach is formulated with a customized loss function to simultaneously predict the current health state, the eventual failing component(s), and the remaining useful life. The proposed method incorporates principal component analysis to orthogonalize statistical time-domain features, which are inputs into supervised regressors such as random forests, extreme random forests, XGBoost, and artificial neural networks. The highest performing algorithm, ANN-Flux, achieves AUROC and AUPR scores exceeding 0.95 for each classification. In addition, ANN-Flux reduces the remaining useful life RMSE by 38% for the same test split of the dataset compared to past work, with significantly less computational cost.


[83] 2303.12984

LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec.


[84] 2303.12987

Theoretical Model Construction of Deformation-Force for Soft Grippers Part I: Co-rotational Modeling and Force Control for Design Optimization

Compliant grippers, owing to adaptivity and safety, have attracted considerable attention for unstructured grasping in real applications, such as industrial or logistic scenarios. However, accurate construction of the mathematical model depicting the bidirectional relationship between shape deformation and contact force for such grippers, such as the Fin-Ray grippers, remains stagnant to date. To address this research gap, this article devises, presents, and experimentally validates a universal bidirectional force-displacement mathematical model for compliant grippers based on the co-rotational concept, which endows such grippers with an intrinsic force sensing capability and offers a better insight into the design optimization. In Part 1 of the article, we introduce the fundamental theory of the co-rotational approach, where arbitrary large deformation of beam elements can be modeled. Its intrinsic principle enables the theoretical modeling to consider various types of configurations and key design parameters with very few assumptions made. Further, a force control algorithm is proposed, providing accurate displacement estimations of the gripper under external forces with minor computational loads. The performance of the proposed method is experimentally verified through comparison with Finite Element Analysis, where the influence of four key design parameters on the gripper s performance is investigated, facilitating systematical design optimization. Part 2 of this article demonstrating the force sensing capabilities and the effects of representative co-rotational modeling parameters on model accuracy is released in Google Drive.


[85] 2303.12990

On Constant-Weight Binary $B_2$-Sequences

Motivated by applications in polymer-based data storage we introduced the new problem of characterizing the code rate and designing constant-weight binary $B_2$-sequences. Binary $B_2$-sequences are collections of binary strings of length $n$ with the property that the real-valued sums of all distinct pairs of strings are distinct. In addition to this defining property, constant-weight binary $B_2$-sequences also satisfy the constraint that each string has a fixed, relatively small weight $\omega$ that scales linearly with $n$. The constant-weight constraint ensures low-cost synthesis and uniform processing of the data readout via tandem mass spectrometers. Our main results include upper bounds on the size of the codes formulated as entropy-optimization problems and constructive lower bounds based on Sidon sequences.


[86] 2303.12992

A Survey of Historical Learning: Learning Models with Learning History

New knowledge originates from the old. The various types of elements, deposited in the training history, are a large amount of wealth for improving learning deep models. In this survey, we comprehensively review and summarize the topic--``Historical Learning: Learning Models with Learning History'', which learns better neural models with the help of their learning history during its optimization, from three detailed aspects: Historical Type (what), Functional Part (where) and Storage Form (how). To our best knowledge, it is the first survey that systematically studies the methodologies which make use of various historical statistics when training deep neural networks. The discussions with related topics like recurrent/memory networks, ensemble learning, and reinforcement learning are demonstrated. We also expose future challenges of this topic and encourage the community to pay attention to the think of historical learning principles when designing algorithms. The paper list related to historical learning is available at \url{https://github.com/Martinser/Awesome-Historical-Learning.}


[87] 2303.12993

Backdoor Defense via Adaptively Splitting Poisoned Dataset

Backdoor defenses have been studied to alleviate the threat of deep neural networks (DNNs) being backdoor attacked and thus maliciously altered. Since DNNs usually adopt some external training data from an untrusted third party, a robust backdoor defense strategy during the training stage is of importance. We argue that the core of training-time defense is to select poisoned samples and to handle them properly. In this work, we summarize the training-time defenses from a unified framework as splitting the poisoned dataset into two data pools. Under our framework, we propose an adaptively splitting dataset-based defense (ASD). Concretely, we apply loss-guided split and meta-learning-inspired split to dynamically update two data pools. With the split clean data pool and polluted data pool, ASD successfully defends against backdoor attacks during training. Extensive experiments on multiple benchmark datasets and DNN models against six state-of-the-art backdoor attacks demonstrate the superiority of our ASD. Our code is available at https://github.com/KuofengGao/ASD.


[88] 2303.12996

Perturbation-Resilient Sets for Dynamic Service Balancing

Balanced and swap-robust minimal trades, introduced in [1], are important for studying the balance and stability of server access request protocols under data popularity changes. Constructions of such trades have so far relied on paired sets obtained through iterative combining of smaller sets that have provable stability guarantees, coupled with exhaustive computer search. Currently, there exists a nonnegligible gap between the resulting total dynamic balance discrepancy and the known theoretical lower bound. We present both new upper and lower bounds on the total service requests discrepancy under limited popularity changes. Our constructive near-optimal approach uses a new class of paired graphs whose vertices are two balanced sets with edges (arcs) that capture the balance and potential balance changes induced by limited-magnitude popularity changes (swaps).


[89] 2303.12997

FER-former: Multi-modal Transformer for Facial Expression Recognition

The ever-increasing demands for intuitive interactions in Virtual Reality has triggered a boom in the realm of Facial Expression Recognition (FER). To address the limitations in existing approaches (e.g., narrow receptive fields and homogenous supervisory signals) and further cement the capacity of FER tools, a novel multifarious supervision-steering Transformer for FER in the wild is proposed in this paper. Referred as FER-former, our approach features multi-granularity embedding integration, hybrid self-attention scheme, and heterogeneous domain-steering supervision. In specific, to dig deep into the merits of the combination of features provided by prevailing CNNs and Transformers, a hybrid stem is designed to cascade two types of learning paradigms simultaneously. Wherein, a FER-specific transformer mechanism is devised to characterize conventional hard one-hot label-focusing and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domains-steering supervision module is proposed to make image features also have text-space semantic correlations by supervising the similarity between image features and text features. On top of the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over the existing state-of-the-arts.


[90] 2303.12998

The Universal NFT Vector Database: A Scaleable Vector Database for NFT Similarity Matching

Non-Fungible Tokens (NFTs) are a type of digital asset that represents a proof of ownership over a particular digital item such as art, music, or real estate. Due to the non-fungible nature of NFTs, duplicate tokens should not possess the same value. However, with the surge of new blockchains and a massive influx of NFTs being created, a wealth of NFT data is being generated without a method of tracking similarity. This enables people to create almost identical NFTs by changing one pixel or one byte of data. Despite the similarity among NFTs, each NFT is assigned a completely different token ID. To address the NFT duplication issue, we developed a modular, easily-extendable, hardware-agnostic, cloud-centered NFT processing system that represents NFTs as vectors. We established a database containing a vector representation of the NFTs in accordance with the Ethereum Request for Comment 721 (ERC-721) token standards to initiate the process of aggregating NFT data from various blockchains. Finally, we developed an NFT visualization dashboard application with a user-friendly graphical user interface (GUI) to provide non-technical users access to the aggregated NFT data. The Universal NFT Vector Database is an off-chain framework for NFT data aggregation based on similarity, which provides an organized way to query and analyze NFT data that was previously unavailable through on-chain solutions.


[91] 2303.12999

Automated Federated Learning in Mobile Edge Networks -- Fast Adaptation and Convergence

Federated Learning (FL) can be used in mobile edge networks to train machine learning models in a distributed manner. Recently, FL has been interpreted within a Model-Agnostic Meta-Learning (MAML) framework, which brings FL significant advantages in fast adaptation and convergence over heterogeneous datasets. However, existing research simply combines MAML and FL without explicitly addressing how much benefit MAML brings to FL and how to maximize such benefit over mobile edge networks. In this paper, we quantify the benefit from two aspects: optimizing FL hyperparameters (i.e., sampled data size and the number of communication rounds) and resource allocation (i.e., transmit power) in mobile edge networks. Specifically, we formulate the MAML-based FL design as an overall learning time minimization problem, under the constraints of model accuracy and energy consumption. Facilitated by the convergence analysis of MAML-based FL, we decompose the formulated problem and then solve it using analytical solutions and the coordinate descent method. With the obtained FL hyperparameters and resource allocation, we design a MAML-based FL algorithm, called Automated Federated Learning (AutoFL), that is able to conduct fast adaptation and convergence. Extensive experimental results verify that AutoFL outperforms other benchmark algorithms regarding the learning time and convergence performance.


[92] 2303.13000

Amalgamated Intermittent Computing Systems

Intermittent computing systems undergo frequent power failure, hindering necessary data sample capture or timely on-device computation. These missing samples and deadlines limit the potential usage of intermittent computing systems in many time-sensitive and fault-tolerant applications. However, a group/swarm of intermittent nodes may amalgamate to sense and process all the samples by taking turns in waking up and extending their collective on-time. However, coordinating a swarm of intermittent computing nodes requires frequent and power-hungry communication, often infeasible with limited energy. Though previous works have shown promises when all intermittent nodes have access to the same amount of energy to harvest, work has yet to be looked into scenarios when the available energy distribution is different for each node. The proposed AICS framework provides an amalgamated intermittent computing system where each node schedules its wake-up schedules based on the duty cycle without communication overhead. We propose one offline tailored duty cycle selection method (Prime-Co-Prime), which schedules wake-up and sleep cycles for each node based on the measured energy to harvest for each node and the prior knowledge or estimation regarding the relative energy distribution. However, when the energy is variable, the problem is formulated as a Decentralized-Partially Observable Markov Decision Process (Dec-POMDP). Each node uses a group of heuristics to solve the Dec-POMDP and schedule its wake-up cycle.


[93] 2303.13001

Is ChatGPT A Good Keyphrase Generator? A Preliminary Study

The emergence of ChatGPT has recently garnered significant attention from the computational linguistics community. To demonstrate its capabilities as a keyphrase generator, we conduct a preliminary evaluation of ChatGPT for the keyphrase generation task. We evaluate its performance in various aspects, including keyphrase generation prompts, keyphrase generation diversity, multi-domain keyphrase generation, and long document understanding. Our evaluation is based on six benchmark datasets, and we adopt the prompt suggested by OpenAI while extending it to six candidate prompts. We find that ChatGPT performs exceptionally well on all six candidate prompts, with minor performance differences observed across the datasets. Based on our findings, we conclude that ChatGPT has great potential for keyphrase generation. Moreover, we discover that ChatGPT still faces challenges when it comes to generating absent keyphrases. Meanwhile, in the final section, we also present some limitations and future expansions of this report.


[94] 2303.13002

Planning Goals for Exploration

Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command. Website: https://penn-pal-lab.github.io/peg/


[95] 2303.13003

Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance

Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures. Despite its effectiveness and convenience, the reliability of PTQ methods in the presence of some extrem cases such as distribution shift and data noise remains largely unexplored. This paper first investigates this problem on various commonly-used PTQ methods. We aim to answer several research questions related to the influence of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on PTQ reliability. A systematic evaluation process is conducted across a wide range of tasks and commonly-used PTQ paradigms. The results show that most existing PTQ methods are not reliable enough in term of the worst-case group performance, highlighting the need for more robust methods. Our findings provide insights for developing PTQ methods that can effectively handle distribution shift scenarios and enable the deployment of quantized DNNs in real-world applications.


[96] 2303.13004

Adversarially Contrastive Estimation of Conditional Neural Processes

Conditional Neural Processes~(CNPs) formulate distributions over functions and generate function observations with exact conditional likelihoods. CNPs, however, have limited expressivity for high-dimensional observations, since their predictive distribution is factorized into a product of unconstrained (typically) Gaussian outputs. Previously, this could be handled using latent variables or autoregressive likelihood, but at the expense of intractable training and quadratically increased complexity. Instead, we propose calibrating CNPs with an adversarial training scheme besides regular maximum likelihood estimates. Specifically, we train an energy-based model (EBM) with noise contrastive estimation, which enforces EBM to identify true observations from the generations of CNP. In this way, CNP must generate predictions closer to the ground-truth to fool EBM, instead of merely optimizing with respect to the fixed-form likelihood. From generative function reconstruction to downstream regression and classification tasks, we demonstrate that our method fits mainstream CNP members, showing effectiveness when unconstrained Gaussian likelihood is defined, requiring minimal computation overhead while preserving foundation properties of CNPs.


[97] 2303.13005

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image's category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find the non-target loss from it forces the student's non-target logits to match the teacher's, but the sum of the two non-target logits is different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at https://github.com/yzd-v/cls_KD.


[98] 2303.13006

Controllable Inversion of Black-Box Face-Recognition Models via Diffusion

Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e. black-box setting). A variety of methods have been proposed in literature for this task, but they have serious shortcomings such as a lack of realistic outputs, long inference times, and strong requirements for the data set and accessibility of the face recognition model. Through an analysis of the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity both qualitatively and quantitatively. Our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process and does not suffer from any of the common shortcomings from competing methods.


[99] 2303.13009

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately `transforms' individual loss functions and `melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.


[100] 2303.13010

Semantic Image Attack for Visual Model Diagnosis

In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models. This is partially due to the fact that obtaining a balanced, diverse, and perfectly labeled dataset is typically expensive, time-consuming, and error-prone. Rather than relying on a carefully designed test set to assess ML models' failures, fairness, or robustness, this paper proposes Semantic Image Attack (SIA), a method based on the adversarial attack that provides semantic adversarial images to allow model diagnosis, interpretability, and robustness. Traditional adversarial training is a popular methodology for robustifying ML models against attacks. However, existing adversarial methods do not combine the two aspects that enable the interpretation and analysis of the model's flaws: semantic traceability and perceptual quality. SIA combines the two features via iterative gradient ascent on a predefined semantic attribute space and the image space. We illustrate the validity of our approach in three scenarios for keypoint detection and classification. (1) Model diagnosis: SIA generates a histogram of attributes that highlights the semantic vulnerability of the ML model (i.e., attributes that make the model fail). (2) Stronger attacks: SIA generates adversarial examples with visually interpretable attributes that lead to higher attack success rates than baseline methods. The adversarial training on SIA improves the transferable robustness across different gradient-based attacks. (3) Robustness to imbalanced datasets: we use SIA to augment the underrepresented classes, which outperforms strong augmentation and re-balancing baselines.


[101] 2303.13013

GesGPT: Speech Gesture Synthesis With Text Parsing from GPT

Gesture synthesis has gained significant attention as a critical research area, focusing on producing contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of Large Language Models (LLMs), such as GPT. By capitalizing on the strengths of LLMs for text analysis, we design prompts to extract gesture-related information from textual input. Our method entails developing prompt principles that transform gesture generation into an intention classification problem based on GPT, and utilizing a curated gesture library and integration module to produce semantically rich co-speech gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures, offering a new perspective on semantic co-speech gesture generation.


[102] 2303.13014

Semantic Ray: Learning a Generalizable Semantic Field with Cross-Reprojection Attention

In this paper, we aim to learn a semantic radiance field from multiple scenes that is accurate, efficient and generalizable. While most existing NeRFs target at the tasks of neural scene rendering, image synthesis and multi-view reconstruction, there are a few attempts such as Semantic-NeRF that explore to learn high-level semantic understanding with the NeRF structure. However, Semantic-NeRF simultaneously learns color and semantic label from a single ray with multiple heads, where the single ray fails to provide rich semantic information. As a result, Semantic NeRF relies on positional encoding and needs to train one specific model for each scene. To address this, we propose Semantic Ray (S-Ray) to fully exploit semantic information along the ray direction from its multi-view reprojections. As directly performing dense attention over multi-view reprojected rays would suffer from heavy computational cost, we design a Cross-Reprojection Attention module with consecutive intra-view radial and cross-view sparse attentions, which decomposes contextual information along reprojected rays and cross multiple views and then collects dense connections by stacking the modules. Experiments show that our S-Ray is able to learn from multiple scenes, and it presents strong generalization ability to adapt to unseen scenes.


[103] 2303.13015

Failure-tolerant Distributed Learning for Anomaly Detection in Wireless Networks

The analysis of distributed techniques is often focused upon their efficiency, without considering their robustness (or lack thereof). Such a consideration is particularly important when devices or central servers can fail, which can potentially cripple distributed systems. When such failures arise in wireless communications networks, important services that they use/provide (like anomaly detection) can be left inoperable and can result in a cascade of security problems. In this paper, we present a novel method to address these risks by combining both flat- and star-topologies, combining the performance and reliability benefits of both. We refer to this method as "Tol-FL", due to its increased failure-tolerance as compared to the technique of Federated Learning. Our approach both limits device failure risks while outperforming prior methods by up to 8% in terms of anomaly detection AUROC in a range of realistic settings that consider client as well as server failure, all while reducing communication costs. This performance demonstrates that Tol-FL is a highly suitable method for distributed model training for anomaly detection, especially in the domain of wireless networks.


[104] 2303.13016

Feedback and Control of Dynamics and Robotics using Augmented Reality

Human-machine interaction (HMI) and human-robot interaction (HRI) can assist structural monitoring and structural dynamics testing in the laboratory and field. In vibratory experimentation, one mode of generating vibration is to use electrodynamic exciters. Manual control is a common way of setting the input of the exciter by the operator. To measure the structural responses to these generated vibrations sensors are attached to the structure. These sensors can be deployed by repeatable robots with high endurance, which require on-the-fly control. If the interface between operators and the controls was augmented, then operators can visualize the experiments, exciter levels, and define robot input with a better awareness of the area of interest. Robots can provide better aid to humans if intelligent on-the-fly control of the robot is: (1) quantified and presented to the human; (2) conducted in real-time for human feedback informed by data. Information provided by the new interface would be used to change the control input based on their understanding of real-time parameters. This research proposes using Augmented Reality (AR) applications to provide humans with sensor feedback and control of actuators and robots. This method improves cognition by allowing the operator to maintain awareness of structures while adjusting conditions accordingly with the assistance of the new real-time interface. One interface application is developed to plot sensor data in addition to voltage, frequency, and duration controls for vibration generation. Two more applications are developed under similar framework, one to control the position of a mediating robot and one to control the frequency of the robot movement. This paper presents the proposed model for the new control loop and then compares the new approach with a traditional method by measuring time delay in control input and user efficiency.


[105] 2303.13018

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when using coarse tokens due to the limited available computational power. In this paper, we propose an online Mono3D framework, called MonoATT, which leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes to facilitate mobile Mono3D. The core idea of MonoATT is to adaptively assign finer tokens to areas of more significance before utilizing a transformer to enhance Mono3D. To this end, we first use prior knowledge to design a scoring network for selecting the most important areas of the image, and then propose a token clustering and merging network with an attention mechanism to gradually merge tokens around the selected areas in multiple stages. Finally, a pixel-level feature map is reconstructed from heterogeneous tokens before employing a SOTA Mono3D detector as the underlying detection core. Experiment results on the real-world KITTI dataset demonstrate that MonoATT can effectively improve the Mono3D accuracy for both near and far objects and guarantee low latency. MonoATT yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.


[106] 2303.13019

Construction Methods Based Minimum Weight Distribution for Polar Codes with Successive Cancellation List Decoding

In this paper, we focus on the construction methods based MWD for polar codes to improve the performance with successive cancellation list (SCL) decoding. We first propose an ordered and nested reliability sequence, namely MWD sequence, to improve the ML performance of polar codes and apply fast construction without the original channel information. In the MWD sequence, the synthetic channels are sorted by the partial MWD which is used to evaluate the influence of information bit on MWD and we prove the MWD sequence is the optimum sequence under ML decoding. Then, since the list size of SCL decoding is limited, we introduce an entropy constraint to establish a relationship between the list size and the ML performance and propose a heuristic and greedy construction method named bit grouping reorder based MWD (BGR-MWD) algorithm. In the algorithm, we divide the synthetic channels into groups by the partial MWD and greedily reorder the synthetic channels in some groups until the entropy constraint is satisfied. The simulation results show the MWD sequence is suitable for constructing polar codes with short code length. Meanwhile, the BGR-MWD algorithm has superior performance over the traditional construction methods for long code length.


[107] 2303.13022

ENVIDR: Implicit Differentiable Renderer with Neural Environment Lighting

Recent advances in neural rendering have shown great potential for reconstructing scenes from multiview images. However, accurately representing objects with glossy surfaces remains a challenge for existing methods. In this work, we introduce ENVIDR, a rendering and modeling framework for high-quality rendering and reconstruction of surfaces with challenging specular reflections. To achieve this, we first propose a novel neural renderer with decomposed rendering components to learn the interaction between surface and environment lighting. This renderer is trained using existing physically based renderers and is decoupled from actual scene representations. We then propose an SDF-based neural surface model that leverages this learned neural renderer to represent general scenes. Our model additionally synthesizes indirect illuminations caused by inter-reflections from shiny surfaces by marching surface-reflected rays. We demonstrate that our method outperforms state-of-art methods on challenging shiny scenes, providing high-quality rendering of specular reflections while also enabling material editing and scene relighting.


[108] 2303.13024

Self-Supervised Clustering of Multivariate Time-Series Data for Identifying TBI Physiological States

Determining clinically relevant physiological states from multivariate time series data with missing values is essential for providing appropriate treatment for acute conditions such as Traumatic Brain Injury (TBI), respiratory failure, and heart failure. Utilizing non-temporal clustering or data imputation and aggregation techniques may lead to loss of valuable information and biased analyses. In our study, we apply the SLAC-Time algorithm, an innovative self-supervision-based approach that maintains data integrity by avoiding imputation or aggregation, offering a more useful representation of acute patient states. By using SLAC-Time to cluster data in a large research dataset, we identified three distinct TBI physiological states and their specific feature profiles. We employed various clustering evaluation metrics and incorporated input from a clinical domain expert to validate and interpret the identified physiological states. Further, we discovered how specific clinical events and interventions can influence patient states and state transitions.


[109] 2303.13026

A Cycle-level Unified DRAM Cache Controller Model for 3DXPoint Memory Systems in gem5

To accommodate the growing memory footprints of today's applications, CPU vendors have employed large DRAM caches, backed by large non-volatile memories like Intel Optane (e.g., Intel's Cascade Lake). The existing computer architecture simulators do not provide support to model and evaluate systems which use DRAM devices as a cache to the non-volatile main memory. In this work, we present a cycle-level DRAM cache model which is integrated with gem5. This model leverages the flexibility of gem5's memory devices models and full system support to enable exploration of many different DRAM cache designs. We demonstrate the usefulness of this new tool by exploring the design space of a DRAM cache controller through several case studies including the impact of scheduling policies, required buffering, combining different memory technologies (e.g., HBM, DDR3/4/5, 3DXPoint, High latency) as the cache and main memory, and the effect of wear-leveling when DRAM cache is backed by NVM main memory. We also perform experiments with real workloads in full-system simulations to validate the proposed model and show the sensitivity of these workloads to the DRAM cache sizes.


[110] 2303.13029

Enabling Design Space Exploration of DRAM Caches in Emerging Memory Systems

The increasing growth of applications' memory capacity and performance demands has led the CPU vendors to deploy heterogeneous memory systems either within a single system or via disaggregation. For instance, systems like Intel's Knights Landing and Sapphire Rapids can be configured to use high bandwidth memory as a cache to main memory. While there is significant research investigating the designs of DRAM caches, there has been little research investigating DRAM caches from a full system point of view, because there is not a suitable model available to the community to accurately study largescale systems with DRAM caches at a cycle-level. In this work we describe a new cycle-level DRAM cache model in the gem5 simulator which can be used for heterogeneous and disaggregated systems. We believe this model enables the community to perform a design space exploration for future generation of memory systems supporting DRAM caches.


[111] 2303.13031

Learning a Practical SDR-to-HDRTV Up-conversion using New Dataset and Degradation Models

In media industry, the demand of SDR-to-HDRTV up-conversion arises when users possess HDR-WCG (high dynamic range-wide color gamut) TVs while most off-the-shelf footage is still in SDR (standard dynamic range). The research community has started tackling this low-level vision task by learning-based approaches. When applied to real SDR, yet, current methods tend to produce dim and desaturated result, making nearly no improvement on viewing experience. Different from other network-oriented methods, we attribute such deficiency to training set (HDR-SDR pair). Consequently, we propose new HDRTV dataset (dubbed HDRTV4K) and new HDR-to-SDR degradation models. Then, it's used to train a luminance-segmented network (LSN) consisting of a global mapping trunk, and two Transformer branches on bright and dark luminance range. We also update assessment criteria by tailored metrics and subjective experiment. Finally, ablation studies are conducted to prove the effectiveness. Our work is available at: https://github.com/AndreGuo/HDRTVDM.


[112] 2303.13032

V2V-based Collision-avoidance Decision Strategy for Autonomous Vehicles Interacting with Fully Occluded Pedestrians at Midblock on Multilane Roadways

Pedestrian occlusion is challenging for autonomous vehicles (AVs) at midblock locations on multilane roadways because an AV cannot detect crossing pedestrians that are fully occluded by downstream vehicles in adjacent lanes. This paper tests the capability of vehicle-to-vehicle (V2V) communication between an AV and its downstream vehicles to share midblock pedestrian crossings information. The researchers developed a V2V-based collision-avoidance decision strategy and compared it to a base scenario (i.e., decision strategy without the utilization of V2V). Simulation results showed that for the base scenario, the near-zero time-to-collision (TTC) indicated no time for the AV to take appropriate action and resulted in dramatic braking followed by collisions. But the V2V-based collision-avoidance decision strategy allowed for a proportional braking approach to increase the TTC allowing the pedestrian to cross safely. To conclude, the V2V-based collision-avoidance decision strategy has higher safety benefits for an AV interacting with fully occluded pedestrians at midblock locations on multilane roadways.


[113] 2303.13034

Preference-Aware Constrained Multi-Objective Bayesian Optimization

This paper addresses the problem of constrained multi-objective optimization over black-box objective functions with practitioner-specified preferences over the objectives when a large fraction of the input space is infeasible (i.e., violates constraints). This problem arises in many engineering design problems including analog circuits and electric power system design. Our overall goal is to approximate the optimal Pareto set over the small fraction of feasible input designs. The key challenges include the huge size of the design space, multiple objectives and large number of constraints, and the small fraction of feasible input designs which can be identified only after performing expensive simulations. We propose a novel and efficient preference-aware constrained multi-objective Bayesian optimization approach referred to as PAC-MOO to address these challenges. The key idea is to learn surrogate models for both output objectives and constraints, and select the candidate input for evaluation in each iteration that maximizes the information gained about the optimal constrained Pareto front while factoring in the preferences over objectives. Our experiments on two real-world analog circuit design optimization problems demonstrate the efficacy of PAC-MOO over prior methods.


[114] 2303.13035

SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization

Electronic health records (EHRs) store an extensive array of patient information, encompassing medical histories, diagnoses, treatments, and test outcomes. These records are crucial for enabling healthcare providers to make well-informed decisions regarding patient care. Summarizing clinical notes further assists healthcare professionals in pinpointing potential health risks and making better-informed decisions. This process contributes to reducing errors and enhancing patient outcomes by ensuring providers have access to the most pertinent and current patient data. Recent research has shown that incorporating prompts with large language models (LLMs) substantially boosts the efficacy of summarization tasks. However, we show that this approach also leads to increased output variance, resulting in notably divergent outputs even when prompts share similar meanings. To tackle this challenge, we introduce a model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings on multiple clinical note tasks and LLMs indicate that our method not only bolsters performance but also effectively curbs variance for various LLMs, providing a more uniform and dependable solution for summarizing vital medical information.


[115] 2303.13036

Stochastic Optimal Control For Gaussian Disturbances with Unknown Mean and Variance Based on Sample Statistics

We propose an open loop methodology based on sample statistics to solve chance constrained stochastic optimal control problems with probabilistic safety guarantees for linear systems where the additive Gaussian noise has unknown mean and covariance. We consider a joint chance constraint for time-varying polytopic target sets under assumptions that the disturbance has been sufficiently sampled. We derive two theorems that allow us to bound the probability of the state being more than some number of sample standard deviations away from the sample mean. We use these theorems to reformulate the chance constraint into a series of convex and linear constraints. Here, solutions guarantee chance constraint satisfaction. We demonstrate our method on a satellite rendezvous maneuver and provide comparisons with the scenario approach.


[116] 2303.13040

Open-Vocabulary Object Detection using Pseudo Caption Labels

Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.


[117] 2303.13041

gDoc: Automatic Generation of Structured API Documentation

Generating and maintaining API documentation with integrity and consistency can be time-consuming and expensive for evolving APIs. To solve this problem, several approaches have been proposed to automatically generate high-quality API documentation based on a combination of knowledge from different web sources. However, current researches are weak in handling unpopular APIs and cannot generate structured API documentation. Hence, in this poster, we propose a hybrid technique(namely \textit{gDoc}) for the automatic generation of structured API documentation. We first present a fine-grained search-based strategy to generate the description for partial API parameters via computing the relevance between various APIs, ensuring the consistency of API documentation. Then, we employ the cross-modal pretraining Seq2Seq model M6 to generate a structured API document for each API, which treats the document generation problem as a translation problem. Finally, we propose a heuristic algorithm to extract practical parameter examples from API request logs. The experiments evaluated on the online system show that this work's approach significantly improves the effectiveness and efficiency of API document generation.


[118] 2303.13043

Top-Down Visual Attention from Analysis by Synthesis

Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.


[119] 2303.13046

Quantized Phase Alignment by Discrete Phase Shifts for Reconfigurable Intelligent Surface-Assisted Communication Systems

Reconfigurable intelligent surface (RIS) has aroused a surge of interest in recent years. In this paper, we investigate the joint phase alignment and phase quantization on discrete phase shift designs for RIS-assisted single-input single-output (SISO) system. Firstly, the phenomena of phase distribution in far field and near field are respectively unveiled, paving the way for discretization of phase shift for RIS. Then, aiming at aligning phases, the phase distribution law and its underlying degree-of-freedom (DoF) are characterized, serving as the guideline of phase quantization strategies. Subsequently, two phase quantization methods, dynamic threshold phase quantization (DTPQ) and equal interval phase quantization (EIPQ), are proposed to strengthen the beamforming effect of RIS. DTPQ is capable of calculating the optimal discrete phase shifts with linear complexity in the number of unit cells on RIS, whilst EIPQ is a simplified method with a constant complexity yielding sub-optimal solution. Simulation results demonstrate that both methods achieve substantial improvements on power gain, stability, and robustness over traditional quantization methods. The path loss (PL) scaling law under discrete phase shift of RIS is unveiled for the first time, with the phase shifts designed by DTPQ due to its optimality. Additionally, the field trials conducted at 2.6 GHz and 35 GHz validate the favourable performance of the proposed methods in practical communication environment.


[120] 2303.13047

Towards Better Dynamic Graph Learning: New Architecture and Unified Library

We propose DyGFormer, a new Transformer-based architecture for dynamic graph learning that solely learns from the sequences of nodes' historical first-hop interactions. DyGFormer incorporates two distinct designs: a neighbor co-occurrence encoding scheme that explores the correlations of the source node and destination node based on their sequences; a patching technique that divides each sequence into multiple patches and feeds them to Transformer, allowing the model to effectively and efficiently benefit from longer histories. We also introduce DyGLib, a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols to promote reproducible, scalable, and credible dynamic graph learning research. By performing extensive experiments on thirteen datasets from various domains for transductive/inductive dynamic link prediction and dynamic node classification tasks, we observe that: DyGFormer achieves state-of-the-art performance on most of the datasets, demonstrating the effectiveness of capturing nodes' correlations and long-term temporal dependencies; the results of baselines vary across different datasets and some findings are inconsistent with previous reports, which may be caused by their diverse pipelines and problematic implementations. We hope our work can provide new insights and facilitate the development of the dynamic graph learning field. All the resources including datasets, data loaders, algorithms, and executing scripts are publicly available at https://github.com/yule-BUAA/DyGLib.


[121] 2303.13050

Building Resilient Web 3.0 with Quantum Information Technologies and Blockchain: An Ambilateral View

Web 3.0 pursues the establishment of decentralized ecosystems based on blockchain technologies to drive the digital transformation of physical commerce and governance. Through consensus algorithms and smart contracts in blockchain, which are based on cryptography technologies, digital identity, digital asset management, decentralized autonomous organization, and decentralized finance are realized for secure and transparent digital economy services in Web 3.0 for promoting the integration of digital and physical economies. With the rapid realization of quantum devices, Web 3.0 is being developed in parallel with the deployment of quantum cloud computing and quantum Internet. In this regard, quantum computing first disrupts the original cryptographic systems that protect data security while reshaping modern cryptography with the advantages of quantum computing and communication. Therefore, this survey provides a comprehensive overview of blockchain-based Web 3.0 and its quantum and post-quantum enhancement from the ambilateral perspective. On the one hand, some post-quantum migration methods, and anti-quantum signatures offer potential ways to achieve unforgeable security under quantum attack for the internal technologies of blockchain. On the other hand, some quantum/post-quantum encryption and verification algorithms improve the external performance of the blockchain, enabling a decentralized, valuable, secure blockchain system. Finally, we discuss the future directions toward developing a provable secure decentralized digital ecosystem.


[122] 2303.13051

Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection

Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to enforce the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, for the sake of tackling rare normal activities, we design a skeleton-based motion augmentation to increase samples and refine the model further. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method.


[123] 2303.13052

Generative AI-aided Optimization for AI-Generated Content (AIGC) Services in Edge Networks

As Metaverse emerges as the next-generation Internet paradigm, the ability to efficiently generate content is paramount. AI-Generated Content (AIGC) offers a promising solution to this challenge. However, the training and deployment of large AI models necessitate significant resources. To address this issue, we introduce an AIGC-as-a-Service (AaaS) architecture, which deploys AIGC models in wireless edge networks, ensuring ubiquitous access to AIGC services for Metaverse users. Nonetheless, a key aspect of providing personalized user experiences requires the careful selection of AIGC service providers (ASPs) capable of effectively executing user tasks. This selection process is complicated by environmental uncertainty and variability, a challenge not yet addressed well in existing literature. Therefore, we first propose a diffusion model-based AI-generated optimal decision (AGOD) algorithm, which can generate the optimal ASP selection decisions. We then apply AGOD to deep reinforcement learning (DRL), resulting in the Deep Diffusion Soft Actor-Critic (D2SAC) algorithm, which achieves efficient and effective ASP selection. Our comprehensive experiments demonstrate that D2SAC outperforms seven leading DRL algorithms. Furthermore, the proposed AGOD algorithm has the potential for extension to various optimization problems in wireless networks, positioning it a promising approach for the future research on AIGC-driven services in Metaverse. The implementation of our proposed method is available at: https://github.com/Lizonghang/AGOD.


[124] 2303.13054

Sensorless Adaptive Vibration Suppression in Two-Mass Systems via Joint Estimation of Controller Parameters and System States

The scope of this study is to develop a novel sensorless adaptive vibration suppression controller for two-mass systems with joint estimation of states and controller parameters. Unlike existing solutions, we simultaneously: (i) propose an analytically proved, unified and singularity-issue-free scheme of parameters adjustment of a control law with additional feedbacks that ensures convergence of such parameters to their true values under extremely weak regressor finite excitation (FE) requirement, (ii) derive an adaptive observer of a two-mass electromechanical system physical states with guarantee of their convergence to the ground truth values under clear FE condition, (iii) rigorously prove the exponential stability of the obtained closed-loop system of adaptive vibration suppression for two-mass systems that includes the above-mentioned adaptive observer and adaptive controller. These approaches are grounded on the recently proposed method of parameters identification for one class of nonlinearly parameterized regression equation and thoroughly investigated dynamic regression extension and mixing procedure (DREM). The obtained theoretical results are confirmed via numerical experiments.


[125] 2303.13055

Reimagining Application User Interface (UI) Design using Deep Learning Methods: Challenges and Opportunities

In this paper, we present a review of the recent work in deep learning methods for user interface design. The survey encompasses well known deep learning techniques (deep neural networks, convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks) and datasets widely used to design user interface applications. We highlight important problems and emerging research frontiers in this field. We believe that the use of deep learning for user interface design automation tasks could be one of the high potential fields for the advancement of the software development industry.


[126] 2303.13059

Optimal Security Parameter for Encrypted Control Systems Against Eavesdropper and Malicious Server

A sample identifying complexity and a sample deciphering time have been introduced in a previous study to capture an estimation error and a computation time of system identification by adversaries. The quantities play a crucial role in defining the security of encrypted control systems and designing a security parameter. This study proposes an optimal security parameter for an encrypted control system under a network eavesdropper and a malicious controller server who attempt to identify system parameters using a least squares method. The security parameter design is achieved based on a modification of conventional homomorphic encryption for improving a sample deciphering time and a novel sample identifying complexity, characterized by controllability Gramians and the variance ratio of identification input to system noise. The effectiveness of the proposed design method for a security parameter is demonstrated through numerical simulations.


[127] 2303.13060

DiffPattern: Layout Pattern Generation via Discrete Diffusion

Deep generative models dominate the existing literature in layout pattern generation. However, leaving the guarantee of legality to an inexplicable neural network could be problematic in several applications. In this paper, we propose \tool{DiffPattern} to generate reliable layout patterns. \tool{DiffPattern} introduces a novel diverse topology generation method via a discrete diffusion model with compute-efficiently lossless layout pattern representation. Then a white-box pattern assessment is utilized to generate legal patterns given desired design rules. Our experiments on several benchmark settings show that \tool{DiffPattern} significantly outperforms existing baselines and is capable of synthesizing reliable layout patterns.


[128] 2303.13062

SIEDOB: Semantic Image Editing by Disentangling Object and Background

Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, \textbf{S}emantic \textbf{I}mage \textbf{E}diting by \textbf{D}isentangling \textbf{O}bject and \textbf{B}ackground (\textbf{SIEDOB}), the core idea of which is to explicitly leverages several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into the dedicated generators. Finally, all synthesized parts are embedded in their original locations and utilize a fusion network to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose some innovative designs, including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds.


[129] 2303.13063

Design of a Low-Cost Prototype Underwater Vehicle

In this study, a small, inexpensive remotely driven underwater vehicle that can navigate in shallow water for the purpose of monitoring water quality and demonstrating vehicle control algorithms is presented. The vehicle is operated by an onboard micro-controller, and the sensor payload comprises a turbidity sensor for determining the quality of the water, a depth sensor, and a 9-axis inertial measurement unit. The developed vehicle is an open frame remotely operated vehicle (ROV) with a small footprint and a modular physical and electrical architecture. With a net weight of 1.6 kg, a maximum depth rating of 20 meters, and a development cost of around $80, the ROV frame is composed of polyvinyl chloride tubes and has a length of 0.35 meters. As a ground station, a dedicated laptop shows crucial vehicle data in real time and can send commands to the vehicle. Initial testing in the pool demonstrates that the vehicle is completely operational and effectively complies with pilot commands.


[130] 2303.13064

Unmanned Surface Vehicle: Yaw Modeling and Identification

In this article, a simplified modeling and system identification procedure for yaw motion of an unmanned surface vehicle (USV) is presented. Two thrusters that allow for both speed and direction control propel the USV. The outputs of the vehicle under inquiry include parameters that define the mobility of the USV in horizontal plane, such as yaw angle and yaw rate. A linear second order model is first developed, and the unknown coefficients are then determined using data from pool trials. Finally, simulations are carried out to verify the model so that it may be used in a later study to implement various control algorithms.


[131] 2303.13065

Retrieval-Augmented Classification with Decoupled Representation

Pretrained language models (PLMs) have shown marvelous improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters, and completely ignore word information. Although Whole Word Masking can alleviate this, the semantics in words is still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words. To achieve this, we design objective functions for learning both character and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code has been released here~\footnote{\url{https://github.com/xnliang98/MigBERT}} and you can download our model here~\footnote{\url{https://huggingface.co/xnliang/MigBERT-large/}}.


[132] 2303.13069

Human Guided Ground-truth Generation for Realistic Image Super-resolution

How to generate the ground-truth (GT) image is a critical issue for training realistic image super-resolution (Real-ISR) models. Existing methods mostly take a set of high-resolution (HR) images as GTs and apply various degradations to simulate their low-resolution (LR) counterparts. Though great progress has been achieved, such an LR-HR pair generation scheme has several limitations. First, the perceptual quality of HR images may not be high enough, limiting the quality of Real-ISR outputs. Second, existing schemes do not consider much human perception in GT generation, and the trained models tend to produce over-smoothed results or unpleasant artifacts. With the above considerations, we propose a human guided GT generation scheme. We first elaborately train multiple image enhancement models to improve the perceptual quality of HR images, and enable one LR image having multiple HR counterparts. Human subjects are then involved to annotate the high quality regions among the enhanced HR images as GTs, and label the regions with unpleasant artifacts as negative samples. A human guided GT image dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models. Experiments show that the Real-ISR models trained on our dataset can produce perceptually more realistic results with less artifacts. Dataset and codes can be found at https://github.com/ChrisDud0257/HGGT


[133] 2303.13071

PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360$^{\circ}$

Synthesis and reconstruction of 3D human head has gained increasing interests in computer vision and computer graphics recently. Existing state-of-the-art 3D generative adversarial networks (GANs) for 3D human head synthesis are either limited to near-frontal views or hard to preserve 3D consistency in large view angles. We propose PanoHead, the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in $360^\circ$ with diverse appearance and detailed geometry using only in-the-wild unstructured images for training. At its core, we lift up the representation power of recent 3D GANs and bridge the data alignment gap when training from in-the-wild images with widely distributed views. Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively addresses front-face and back-head feature entanglement rooted in the widely-adopted tri-plane formulation. Our method instills prior knowledge of 2D image segmentation in adversarial learning of 3D neural scene structures, enabling compositable head synthesis in diverse backgrounds. Benefiting from these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long wavy and afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct full 3D heads from single input images for personalized realistic 3D avatars.


[134] 2303.13072

Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognit

Transformer-based models have recently made significant achievements in the application of end-to-end (E2E) automatic speech recognition (ASR). It is possible to deploy the E2E ASR system on smart devices with the help of Transformer-based models. While these models still have the disadvantage of requiring a large number of model parameters. To overcome the drawback of universal Transformer models for the application of ASR on edge devices, we propose a solution that can reuse the block in Transformer models for the occasion of the small footprint ASR system, which meets the objective of accommodating resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for speech Transformer (BRST) to enhance the effectiveness of parameters and propose an adapter module (ADM) that can produce a compact and adaptable model with only a few additional trainable parameters accompanying each reusing block. We conducted an experiment with the proposed method on the public AISHELL-1 corpus, and the results show that the proposed approach achieves the character error rate (CER) of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM, respectively. In addition, we also make a deeper analysis to show the effect of ADM in the general block-reusing method.


[135] 2303.13073

BlockFW -- Towards Blockchain-based Rule-Sharing Firewall

Central-managed security mechanisms are often utilized in many organizations, but such server is also a security breaking point. This is because the server has the authority for all nodes that share the security protection. Hence if the attackers successfully tamper the server, the organization will be in trouble. Also, the settings and policies saved on the server are usually not cryptographically secured and ensured with hash. Thus, changing the settings from alternative way is feasible, without causing the security solution to raise any alarms. To mitigate these issues, in this work, we develop BlockFW - a blockchain-based rule sharing firewall to create a managed security mechanism, which provides validation and monitoring from multiple nodes. For BlockFW, all occurred transactions are cryptographically protected to ensure its integrity, making tampering attempts in utmost challenging for attackers. In the evaluation, we explore the performance of BlockFW under several adversarial conditions and demonstrate its effectiveness.


[136] 2303.13075

Security Analysis on Social Media Networks via STRIDE Model

Security associated threats are often increased for online social media during a pandemic, such as COVID-19, along with changes in a work environment. For example, employees in many companies and organizations have started to work from home due to the COVID-19 pandemic. Such working style has increased many remote activities and further relied on email for communication, thus creating an ideal condition for email fraud schemes. Motivated by this observation, the main purpose of this work is to evaluate the privacy policy of online social media and identify potential security associated problems. First, we perform a risk analysis of online social media networks such as Facebook, Twitter and LinkedIn by using the STRIDE model. This aims to find threats and vulnerabilities in the online social media. Then in this analysis, the phishing attack was found to be a main threat in online social media, which is a social engineering attack, where users are convinced through some fake messages or emails to extract their personal credentials.


[137] 2303.13076

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.


[138] 2303.13077

Improving the Performance of Spiking Neural Networks on Event-based Datasets with Knowledge Transfer

Spiking neural networks (SNNs) have rich spatial-temporal dynamics, which are suitable for processing neuromorphic, event-based data. However, event-based datasets are usually less annotated than static datasets used in traditional deep learning. Small data scale makes SNNs prone to overfitting and limits the performance of the SNN. To enhance the generalizability of SNNs on event-based datasets, we propose a knowledge-transfer framework that leverages static images to assist in the training on neuromorphic datasets. Our method proposes domain loss and semantic loss to exploit both domain-invariant and unique features of these two domains, providing SNNs with more generalized knowledge for subsequent targeted training on neuromorphic data. Specifically, domain loss aligns the feature space and aims to capture common features between static and event-based images, while semantic loss emphasizes that the differences between samples from different categories should be as large as possible. Experimental results demonstrate that our method outperforms existing methods on all mainstream neuromorphic vision datasets. In particular, we achieve significant performance improvement of 2.7\% and 9.8\% when using only 10\% training data of CIFAR10-DVS and N-Caltech 101 datasets, respectively.


[139] 2303.13079

Implementation of communication media around a mixed reality experience with HoloLens headset, as part of a digitalization of a nutrition workshop

The release of Microsoft's HoloLens headset addresses new types of issues that would have been difficult to design without such a hardware. This semi-transparent visor headset allows the user who wears it to view the projection of 3D virtual objects placed in its real environment. The user can also interact with these 3D objects, which can interact with each other. The framework of this new technology is called mixed reality. We had the opportunity to numerically transform a conventional human nutrition workshop for patients waiting for bariatric surgery by developing a software called HOLO_NUTRI using the HoloLens headset. Unlike our experience of user and conventional programmer specialized in the development of interactive 3D graphics applications, we realized that such a mixed reality experience required specific programming concepts quite different from those of conventional software or those of virtual reality applications, but above all required a thorough reflection about communication for users. In this article, we propose to explain our design of communication (graphic supports, tutorials of use of material, explanatory videos), a step which was crucial for the good progress of our project. The software was used by thirty patients from Le Puy-en-Velay Hospital during 10 sessions of one hour and a half during which patients had to take in hand the headset and software HOLO_NUTRI. We also proposed a series of questions to patients to have an assessment of both the adequacy and the importance of this communication approach for such experience. As the mixed reality technology is very recent but the number of applications based on it significantly increases, the reflection on the implementation of the elements of communication described in this article (videos, exercise of learning for the use of the headset, communication leaflet, etc.) can help developers of such applications.


[140] 2303.13080

MSAT: Biologically Inspired Multi-Stage Adaptive Threshold for Conversion of Spiking Neural Networks

Spiking Neural Networks (SNNs) can do inference with low power consumption due to their spike sparsity. ANN-SNN conversion is an efficient way to achieve deep SNNs by converting well-trained Artificial Neural Networks (ANNs). However, the existing methods commonly use constant threshold for conversion, which prevents neurons from rapidly delivering spikes to deeper layers and causes high time delay. In addition, the same response for different inputs may result in information loss during the information transmission. Inspired by the biological model mechanism, we propose a multi-stage adaptive threshold (MSAT). Specifically, for each neuron, the dynamic threshold varies with firing history and input properties and is positively correlated with the average membrane potential and negatively correlated with the rate of depolarization. The self-adaptation to membrane potential and input allows a timely adjustment of the threshold to fire spike faster and transmit more information. Moreover, we analyze the Spikes of Inactivated Neurons error which is pervasive in early time steps and propose spike confidence accordingly as a measurement of confidence about the neurons that correctly deliver spikes. We use such spike confidence in early time steps to determine whether to elicit spike to alleviate this error. Combined with the proposed method, we examine the performance on non-trivial datasets CIFAR-10, CIFAR-100, and ImageNet. We also conduct sentiment classification and speech recognition experiments on the IDBM and Google speech commands datasets respectively. Experiments show near-lossless and lower latency ANN-SNN conversion. To the best of our knowledge, this is the first time to build a biologically inspired multi-stage adaptive threshold for converted SNN, with comparable performance to state-of-the-art methods while improving energy efficiency.


[141] 2303.13087

Robust Generalization against Photon-Limited Corruptions via Worst-Case Sharpness Minimization

Robust generalization aims to tackle the most challenging data distributions which are rare in the training set and contain severe noises, i.e., photon-limited corruptions. Common solutions such as distributionally robust optimization (DRO) focus on the worst-case empirical risk to ensure low training error on the uncommon noisy distributions. However, due to the over-parameterized model being optimized on scarce worst-case data, DRO fails to produce a smooth loss landscape, thus struggling on generalizing well to the test set. Therefore, instead of focusing on the worst-case risk minimization, we propose SharpDRO by penalizing the sharpness of the worst-case distribution, which measures the loss changes around the neighbor of learning parameters. Through worst-case sharpness minimization, the proposed method successfully produces a flat loss curve on the corrupted distributions, thus achieving robust generalization. Moreover, by considering whether the distribution annotation is available, we apply SharpDRO to two problem settings and design a worst-case selection process for robust generalization. Theoretically, we show that SharpDRO has a great convergence guarantee. Experimentally, we simulate photon-limited corruptions using CIFAR10/100 and ImageNet30 datasets and show that SharpDRO exhibits a strong generalization ability against severe corruptions and exceeds well-known baseline methods with large performance gains.


[142] 2303.13089

Box-Level Active Detection

Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meantime well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad.


[143] 2303.13090

Orthogonal Annotation Benefits Barely-supervised Medical Image Segmentation

Recent trends in semi-supervised learning have significantly boosted the performance of 3D semi-supervised medical image segmentation. Compared with 2D images, 3D medical volumes involve information from different directions, e.g., transverse, sagittal, and coronal planes, so as to naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation way and its corresponding semi-supervised model for effective segmentation. Specifically, we firstly propose the orthogonal annotation by only labeling two orthogonal slices in a labeled volume, which significantly relieves the burden of annotation. Then, we perform registration to obtain the initial pseudo labels for sparsely labeled volumes. Subsequently, by introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in early stage and sparse labels in later stage and meanwhile forces consistent output of two networks. Experimental results on three benchmark datasets validated our effectiveness in performance and efficiency in annotation. For example, with only 10 annotated slices, our method reaches a Dice up to 86.93% on KiTS19 dataset.


[144] 2303.13091

Limits of Predictability in Top-N Recommendation

Top-N recommendation aims to recommend each consumer a small set of N items from a large collection of items, and its accuracy is one of the most common indexes to evaluate the performance of a recommendation system. While a large number of algorithms are proposed to push the Top-N accuracy by learning the user preference from their history purchase data, a predictability question is naturally raised - whether there is an upper limit of such Top-N accuracy. This work investigates such predictability by studying the degree of regularity from a specific set of user behavior data. Quantifying the predictability of Top-N recommendations requires simultaneously quantifying the limits on the accuracy of the N behaviors with the highest probability. This greatly increases the difficulty of the problem. To achieve this, we firstly excavate the associations among N behaviors with the highest probability and describe the user behavior distribution based on the information theory. Then, we adopt the Fano inequality to scale and obtain the Top-N predictability. Extensive experiments are conducted on the real-world data where significant improvements are observed compared to the state-of-the-art methods. We have not only completed the predictability calculation for N targets but also obtained predictability that is much closer to the true value than existing methods. We expect our results to assist these research areas where the quantitative requirement of Top-N predictability is required.


[145] 2303.13092

On Generalizing Trace Minimization Principles, II

This paper is concerned with establishing a trace minimization principle for two Hermitian matrix pairs. Specifically, we will answer the question: when is $\inf_X\operatorname{tr}(\widehat AX^{\rm H}AX)$ subject to $\widehat BX^{\rm H}BX=I$ (the identity matrix of apt size) finite? Sufficient and necessary conditions are obtained and, when the infimum is finite, an explicit formula for it is presented in terms of the finite eigenvalues of the matrix pairs. Our results extend Fan's trace minimization principle (1949) for a Hermitian matrix, a minimization principle of Kova\v{c}-Striko and Veseli\'c (1995) for a Hermitian matrix pair, and most recent ones by the authors and their collaborators for a Hermitian matrix pair and a Hermitian matrix.


[146] 2303.13093

The Probabilistic Stability of Stochastic Gradient Descent

A fundamental open problem in deep learning theory is how to define and understand the stability of stochastic gradient descent (SGD) close to a fixed point. Conventional literature relies on the convergence of statistical moments, esp., the variance, of the parameters to quantify the stability. We revisit the definition of stability for SGD and use the \textit{convergence in probability} condition to define the \textit{probabilistic stability} of SGD. The proposed stability directly answers a fundamental question in deep learning theory: how SGD selects a meaningful solution for a neural network from an enormous number of solutions that may overfit badly. To achieve this, we show that only under the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as the phases of the complete loss of stability, incorrect learning, convergence to low-rank saddles, and correct learning. When applied to a neural network, these phase diagrams imply that SGD prefers low-rank saddles when the underlying gradient is noisy, thereby improving the learning performance. This result is in sharp contrast to the conventional wisdom that SGD prefers flatter minima to sharp ones, which we find insufficient to explain the experimental data. We also prove that the probabilistic stability of SGD can be quantified by the Lyapunov exponents of the SGD dynamics, which can easily be measured in practice. Our work potentially opens a new venue for addressing the fundamental question of how the learning algorithm affects the learning outcome in deep learning.


[147] 2303.13095

Modeling Entities as Semantic Points for Visual Information Extraction in the Wild

Recently, Visual Information Extraction (VIE) has been becoming increasingly important in both the academia and industry, due to the wide range of real-world applications. Previously, numerous works have been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in these benchmarks. As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common. All these factors may lead to failures in information extraction. Therefore, as the second contribution, we explore an alternative approach to precisely and robustly extract key information from document images under such tough conditions. Specifically, in contrast to previous methods, which usually either incorporate visual information into a multi-modal architecture or train text spotting and information extraction in an end-to-end fashion, we explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities, which could largely benefit entity labeling and linking. Extensive experiments on standard benchmarks in this field as well as the proposed dataset demonstrate that the proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models. Dataset is available at https://www.modelscope.cn/datasets/damo/SIBR/summary.


[148] 2303.13097

CP$^3$: Channel Pruning Plug-in for Point-based Networks

Channel pruning can effectively reduce both computational cost and memory footprint of the original network while keeping a comparable accuracy performance. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend the channel pruning methods to 3D point-based neural networks (PNNs). Directly implementing the 2D CNN channel pruning methods to PNNs undermine the performance of PNNs because of the different representations of 2D images and 3D point clouds as well as the network architecture disparity. In this paper, we proposed CP$^3$, which is a Channel Pruning Plug-in for Point-based network. CP$^3$ is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs. Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the discarded points in PNN's sampling process and reconsiders their potentially-exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP$^3$ constantly improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S on ScanObjectNN achieves an accuracy of 88.52% with a pruning rate of 57.8%, outperforming the baseline pruning methods with an accuracy gain of 1.94%.


[149] 2303.13099

Multi-View Zero-Shot Open Intent Induction from Dialogues: Multi Domain Batch and Proxy Gradient Transfer

In Task Oriented Dialogue (TOD) system, detecting and inducing new intents are two main challenges to apply the system in the real world. In this paper, we suggest the semantic multi-view model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantic. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning the multiple domain knowledge. We introduce a novel method PGT, which employs the Siamese network to fine-tune the model with a clustering method directly.Our model can learn how to cluster dialogue utterances by using PGT. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves the Open Intent Induction performance compared to baseline systems.


[150] 2303.13100

PointGame: Geometrically and Adaptively Masked Auto-Encoder on Point Clouds

Self-supervised learning is attracting large attention in point cloud understanding. However, exploring discriminative and transferable features still remains challenging due to their nature of irregularity and sparsity. We propose a geometrically and adaptively masked auto-encoder for self-supervised learning on point clouds, termed \textit{PointGame}. PointGame contains two core components: GATE and EAT. GATE stands for the geometrical and adaptive token embedding module; it not only absorbs the conventional wisdom of geometric descriptors that captures the surface shape effectively, but also exploits adaptive saliency to focus on the salient part of a point cloud. EAT stands for the external attention-based Transformer encoder with linear computational complexity, which increases the efficiency of the whole pipeline. Unlike cutting-edge unsupervised learning models, PointGame leverages geometric descriptors to perceive surface shapes and adaptively mines discriminative features from training data. PointGame showcases clear advantages over its competitors on various downstream tasks under both global and local fine-tuning strategies. The code and pre-trained models will be publicly available.


[151] 2303.13101

MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

To benefit the complementary information between heterogeneous data, we introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image classification using Hyperspectral Image (HSI) accompanied by another source of data such as Light Detection and Ranging (LiDAR). Compared with traditional Vision Transformer (ViT) lacking inductive biases of convolutions, we first introduce convolutional layers to our MMFormer to tokenize patches from multimodal data of HSI and LiDAR. Then we propose a Multi-scale Multi-head Self-Attention (MSMHSA) module to address the problem of compatibility which often limits to fuse HSI with high spectral resolution and LiDAR with relatively low spatial resolution. The proposed MSMHSA module can incorporate HSI to LiDAR data in a coarse-to-fine manner enabling us to learn a fine-grained representation. Extensive experiments on widely used benchmarks (e.g., Trento and MUUFL) demonstrate the effectiveness and superiority of our proposed MMFormer for RS image classification.


[152] 2303.13102

Keypoint-Guided Optimal Transport

Existing Optimal Transport (OT) methods mainly derive the optimal transport plan/matching under the criterion of transport cost/distance minimization, which may cause incorrect matching in some cases. In many applications, annotating a few matched keypoints across domains is reasonable or even effortless in annotation burden. It is valuable to investigate how to leverage the annotated keypoints to guide the correct matching in OT. In this paper, we propose a novel KeyPoint-Guided model by ReLation preservation (KPG-RL) that searches for the optimal matching (i.e., transport plan) guided by the keypoints in OT. To impose the keypoints in OT, first, we propose a mask-based constraint of the transport plan that preserves the matching of keypoint pairs. Second, we propose to preserve the relation of each data point to the keypoints to guide the matching. The proposed KPG-RL model can be solved by Sinkhorn's algorithm and is applicable even when distributions are supported in different spaces. We further utilize the relation preservation constraint in the Kantorovich Problem and Gromov-Wasserstein model to impose the guidance of keypoints in them. Meanwhile, the proposed KPG-RL model is extended to the partial OT setting. Moreover, we deduce the dual formulation of the KPG-RL model, which is solved using deep learning techniques. Based on the learned transport plan from dual KPG-RL, we propose a novel manifold barycentric projection to transport source data to the target domain. As applications, we apply the proposed KPG-RL model to the heterogeneous domain adaptation and image-to-image translation. Experiments verified the effectiveness of the proposed approach.


[153] 2303.13112

A Simple Explanation for the Phase Transition in Large Language Models with List Decoding

Various recent experimental results show that large language models (LLM) exhibit emergent abilities that are not present in small models. System performance is greatly improved after passing a certain critical threshold of scale. In this letter, we provide a simple explanation for such a phase transition phenomenon. For this, we model an LLM as a sequence-to-sequence random function. Instead of using instant generation at each step, we use a list decoder that keeps a list of candidate sequences at each step and defers the generation of the output sequence at the end. We show that there is a critical threshold such that the expected number of erroneous candidate sequences remains bounded when an LLM is below the threshold, and it grows exponentially when an LLM is above the threshold. Such a threshold is related to the basic reproduction number in a contagious disease.


[154] 2303.13113

Adaptive Regularization for Class-Incremental Learning

Class-Incremental Learning updates a deep classifier with new categories while maintaining the previously observed class accuracy. Regularizing the neural network weights is a common method to prevent forgetting previously learned classes while learning novel ones. However, existing regularizers use a constant magnitude throughout the learning sessions, which may not reflect the varying levels of difficulty of the tasks encountered during incremental learning. This study investigates the necessity of adaptive regularization in Class-Incremental Learning, which dynamically adjusts the regularization strength according to the complexity of the task at hand. We propose a Bayesian Optimization-based approach to automatically determine the optimal regularization magnitude for each learning task. Our experiments on two datasets via two regularizers demonstrate the importance of adaptive regularization for achieving accurate and less forgetful visual incremental learning.


[155] 2303.13121

DetOFA: Efficient Training of Once-for-All Networks for Object Detection by Using Pre-trained Supernet and Path Filter

We address the challenge of training a large supernet for the object detection task, using a relatively small amount of training data. Specifically, we propose an efficient supernet-based neural architecture search (NAS) method that uses transfer learning and search space pruning. First, the supernet is pre-trained on a classification task, for which large datasets are available. Second, the search space defined by the supernet is pruned by removing candidate models that are predicted to perform poorly. To effectively remove the candidates over a wide range of resource constraints, we particularly design a performance predictor, called path filter, which can accurately predict the relative performance of the models that satisfy similar resource constraints. Hence, supernet training is more focused on the best-performing candidates. Our path filter handles prediction for paths with different resource budgets. Compared to once-for-all, our proposed method reduces the computational cost of the optimal network architecture by 30% and 63%, while yielding better accuracy-floating point operations Pareto front (0.85 and 0.45 points of improvement on average precision for Pascal VOC and COCO, respectively).


[156] 2303.13122

Exploring Visual Prompts for Whole Slide Image Classification with Multiple Instance Learning

Multiple instance learning (MIL) has emerged as a popular method for classifying histopathology whole slide images (WSIs). However, existing approaches typically rely on pre-trained models from large natural image datasets, such as ImageNet, to generate instance features, which can be sub-optimal due to the significant differences between natural images and histopathology images that lead to a domain shift. In this paper, we present a novel, simple yet effective method for learning domain-specific knowledge transformation from pre-trained models to histopathology images. Our approach entails using a prompt component to assist the pre-trained model in discerning differences between the pre-trained dataset and the target histopathology dataset, resulting in improved performance of MIL models. We validate our method on two publicly available datasets, Camelyon16 and TCGA-NSCLC. Extensive experimental results demonstrate the significant performance improvement of our method for different MIL models and backbones. Upon publication of this paper, we will release the source code for our method.


[157] 2303.13123

Laplacian Segmentation Networks: Improved Epistemic Uncertainty from Spatial Aleatoric Uncertainty

Out of distribution (OOD) medical images are frequently encountered, e.g. because of site- or scanner differences, or image corruption. OOD images come with a risk of incorrect image segmentation, potentially negatively affecting downstream diagnoses or treatment. To ensure robustness to such incorrect segmentations, we propose Laplacian Segmentation Networks (LSN) that jointly model epistemic (model) and aleatoric (data) uncertainty in image segmentation. We capture data uncertainty with a spatially correlated logit distribution. For model uncertainty, we propose the first Laplace approximation of the weight posterior that scales to large neural networks with skip connections that have high-dimensional outputs. Empirically, we demonstrate that modelling spatial pixel correlation allows the Laplacian Segmentation Network to successfully assign high epistemic uncertainty to out-of-distribution objects appearing within images.


[158] 2303.13125

Finite Volume Approximations for Non-Linear Parabolic Problems with Stochastic Forcing

We propose a two-point flux approximation finite-volume scheme for a stochastic non-linear parabolic equation with a multiplicative noise. The time discretization is implicit except for the stochastic noise term in order to be compatible with stochastic integration in the sense of It\^{o}. We show existence and uniqueness of solutions to the scheme and the appropriate measurability for stochastic integration follows from the uniqueness of approximate solutions.


[159] 2303.13126

MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. While few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. Project page is available at https://magicfusion.github.io/.


[160] 2303.13129

Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations

Digital human motion synthesis is a vibrant research field with applications in movies, AR/VR, and video games. Whereas methods were proposed to generate natural and realistic human motions, most only focus on modeling humans and largely ignore object movements. Generating task-oriented human-object interaction motions in simulation is challenging. For different intents of using the objects, humans conduct various motions, which requires the human first to approach the objects and then make them move consistently with the human instead of staying still. Also, to deploy in downstream applications, the synthesized motions are desired to be flexible in length, providing options to personalize the predicted motions for various purposes. To this end, we propose TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations, which generates full human-object interaction motions to conduct specific tasks, given only the task type, the object, and a starting human status. TOHO generates human-object motions in three steps: 1) it first estimates the keyframe poses of conducting a task given the task type and object information; 2) then, it infills the keyframes and generates continuous motions; 3) finally, it applies a compact closed-form object motion estimation to generate the object motion. Our method generates continuous motions that are parameterized only by the temporal coordinate, which allows for upsampling or downsampling of the sequence to arbitrary frames and adjusting the motion speeds by designing the temporal coordinate vector. We demonstrate the effectiveness of our method, both qualitatively and quantitatively. This work takes a step further toward general human-scene interaction simulation.


[161] 2303.13131

Watch Out for the Confusing Faces: Detecting Face Swapping with the Probability Distribution of Face Identification Models

Recently, face swapping has been developing rapidly and achieved a surprising reality, raising concerns about fake content. As a countermeasure, various detection approaches have been proposed and achieved promising performance. However, most existing detectors struggle to maintain performance on unseen face swapping methods and low-quality images. Apart from the generalization problem, current detection approaches have been shown vulnerable to evasion attacks crafted by detection-aware manipulators. Lack of robustness under adversary scenarios leaves threats for applying face swapping detection in real world. In this paper, we propose a novel face swapping detection approach based on face identification probability distributions, coined as IdP_FSD, to improve the generalization and robustness. IdP_FSD is specially designed for detecting swapped faces whose identities belong to a finite set, which is meaningful in real-world applications. Compared with previous general detection methods, we make use of the available real faces with concerned identities and require no fake samples for training. IdP_FSD exploits face swapping's common nature that the identity of swapped face combines that of two faces involved in swapping. We reflect this nature with the confusion of a face identification model and measure the confusion with the maximum value of the output probability distribution. What's more, to defend our detector under adversary scenarios, an attention-based finetuning scheme is proposed for the face identification models used in IdP_FSD. Extensive experiments show that the proposed IdP_FSD not only achieves high detection performance on different benchmark datasets and image qualities but also raises the bar for manipulators to evade the detection.


[162] 2303.13132

Masked Image Training for Generalizable Deep Image Denoising

When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method.


[163] 2303.13133

Generative Image Inpainting with Segmentation Confusion Adversarial Training and Contrastive Learning

This paper presents a new adversarial training framework for image inpainting with segmentation confusion adversarial training (SCAT) and contrastive learning. SCAT plays an adversarial game between an inpainting generator and a segmentation network, which provides pixel-level local training signals and can adapt to images with free-form holes. By combining SCAT with standard global adversarial training, the new adversarial training framework exhibits the following three advantages simultaneously: (1) the global consistency of the repaired image, (2) the local fine texture details of the repaired image, and (3) the flexibility of handling images with free-form holes. Moreover, we propose the textural and semantic contrastive learning losses to stabilize and improve our inpainting model's training by exploiting the feature representation space of the discriminator, in which the inpainting images are pulled closer to the ground truth images but pushed farther from the corrupted images. The proposed contrastive losses better guide the repaired images to move from the corrupted image data points to the real image data points in the feature representation space, resulting in more realistic completed images. We conduct extensive experiments on two benchmark datasets, demonstrating our model's effectiveness and superiority both qualitatively and quantitatively.


[164] 2303.13136

Approximation of Functions of Several Variables by Multidimensional A- and J-fractions with Independent Variables

The paper deals with the problem of approximating the functions of several variables by branched continued fractions, in particular, multidimensional A- and J-fractions with independent variables. A generalization of Gragg's algorithm is constructed that enables us to compute, by the coefficients of the given formal multiple power series, the coefficients of the corresponding multidimensional A-fraction with independent variables. This algorithm can also be used to construct the multidimensional J-fraction with independent variables corresponding to a given formal multiple Laurent series. Some numerical experiments of approximating the functions of several variables by these branched continued fractions are given.


[165] 2303.13137

FedGH: Heterogeneous Federated Learning with Generalized Global Header

Federated learning (FL) is an emerging machine learning paradigm that allows multiple parties to train a shared model collaboratively in a privacy-preserving manner. Existing horizontal FL methods generally assume that the FL server and clients hold the same model structure. However, due to system heterogeneity and the need for personalization, enabling clients to hold models with diverse structures has become an important direction. Existing model-heterogeneous FL approaches often require publicly available datasets and incur high communication and/or computational costs, which limit their performances. To address these limitations, we propose the Federated Global prediction Header (FedGH) approach. It is a communication and computation-efficient model-heterogeneous FL framework which trains a shared generalized global prediction header with representations extracted by heterogeneous extractors for clients' models at the FL server. The trained generalized global prediction header learns from different clients. The acquired global knowledge is then transferred to clients to substitute each client's local prediction header. We derive the non-convex convergence rate of FedGH. Extensive experiments on two real-world datasets demonstrate that FedGH achieves significantly more advantageous performance in both model-homogeneous and -heterogeneous FL scenarios compared to seven state-of-the-art personalized FL models, beating the best-performing baseline by up to 8.87% (for model-homogeneous FL) and 1.83% (for model-heterogeneous FL) in terms of average test accuracy, while saving up to 85.53% of communication overhead.


[166] 2303.13138

Online search is more likely to lead students to validate true news than to refute false ones

With the spread of high-speed Internet and portable smart devices, the way people access and consume information has drastically changed. However, this presents many challenges, including information overload, personal data leakage, and misinformation diffusion. Across the spectrum of risks that Internet users can face nowadays, this work focuses on understanding how young people perceive and deal with false information. Within an experimental campaign involving 261 students, we presented to the participants six different news items and invited them to browse the Internet to assess the veracity of the presented information. Our results suggest that online search is more likely to lead students to validate true news than to refute false ones. We found that students change their opinion related to a specific piece of information more often than their global idea about a broader topic. Also, our experiment reflected that the majority of participants rely on online sources to obtain information and access the news, and those getting information from books and Internet browsing are the most accurate in assessing the veracity of a news item. This work provides a principled understanding of how young people perceive and distinguish true and false pieces of information, identifying strengths and weaknesses amidst young subjects and contributing to build tailored digital information literacy strategies for youth.


[167] 2303.13141

Numerical evaluation of singular integrals on non-disjoint self-similar fractal sets

We consider the numerical evaluation of a class of double integrals with respect to a pair of self-similar measures over a self-similar fractal set, with a weakly singular integrand of logarithmic or algebraic type. In a recent paper [Gibbs, Hewett and Moiola, Numer. Alg., 2023] it was shown that when the fractal set is ``disjoint'' in a certain sense (an example being the Cantor set), the self-similarity of the measures, combined with the homogeneity properties of the integrand, can be exploited to express the singular integral exactly in terms of regular integrals, which can be readily approximated numerically. In this paper we present a methodology for extending these results to cases where the fractal is non-disjoint. Our approach applies to many well-known examples including the Sierpinski triangle, the Vicsek fractal, the Sierpinski carpet, and the Koch snowflake.


[168] 2303.13148

Calibrated Out-of-Distribution Detection with a Generic Representation

Out-of-distribution detection is a common issue in deploying vision models in practice and solving it is an essential building block in safety critical applications. Existing OOD detection solutions focus on improving the OOD robustness of a classification model trained exclusively on in-distribution (ID) data. In this work, we take a different approach and propose to leverage generic pre-trained representations. We first investigate the behaviour of simple classifiers built on top of such representations and show striking performance gains compared to the ID trained representations. We propose a novel OOD method, called GROOD, that achieves excellent performance, predicated by the use of a good generic representation. Only a trivial training process is required for adapting GROOD to a particular problem. The method is simple, general, efficient, calibrated and with only a few hyper-parameters. The method achieves state-of-the-art performance on a number of OOD benchmarks, reaching near perfect performance on several of them. The source code is available at https://github.com/vojirt/GROOD.


[169] 2303.13151

Defining Quality Requirements for a Trustworthy AI Wildflower Monitoring Platform

For an AI solution to evolve from a trained machine learning model into a production-ready AI system, many more things need to be considered than just the performance of the machine learning model. A production-ready AI system needs to be trustworthy, i.e. of high quality. But how to determine this in practice? For traditional software, ISO25000 and its predecessors have since long time been used to define and measure quality characteristics. Recently, quality models for AI systems, based on ISO25000, have been introduced. This paper applies one such quality model to a real-life case study: a deep learning platform for monitoring wildflowers. The paper presents three realistic scenarios sketching what it means to respectively use, extend and incrementally improve the deep learning platform for wildflower identification and counting. Next, it is shown how the quality model can be used as a structured dictionary to define quality requirements for data, model and software. Future work remains to extend the quality model with metrics, tools and best practices to aid AI engineering practitioners in implementing trustworthy AI systems.


[170] 2303.13157

Adiabatic replay for continual learning

Conventional replay-based approaches to continual learning (CL) require, for each learning phase with new data, the replay of samples representing all of the previously learned knowledge in order to avoid catastrophic forgetting. Since the amount of learned knowledge grows over time in CL problems, generative replay spends an increasing amount of time just re-learning what is already known. In this proof-of-concept study, we propose a replay-based CL strategy that we term adiabatic replay (AR), which derives its efficiency from the (reasonable) assumption that each new learning phase is adiabatic, i.e., represents only a small addition to existing knowledge. Each new learning phase triggers a sampling process that selectively replays, from the body of existing knowledge, just such samples that are similar to the new data, in contrast to replaying all of it. Complete replay is not required since AR represents the data distribution by GMMs, which are capable of selectively updating their internal representation only where data statistics have changed. As long as additions are adiabatic, the amount of to-be-replayed samples need not to depend on the amount of previously acquired knowledge at all. We verify experimentally that AR is superior to state-of-the-art deep generative replay using VAEs.


[171] 2303.13158

Improvement of Color Image Analysis Using a New Hybrid Face Recognition Algorithm based on Discrete Wavelets and Chebyshev Polynomials

This work is unique in the use of discrete wavelets that were built from or derived from Chebyshev polynomials of the second and third kind, filter the Discrete Second Chebyshev Wavelets Transform (DSCWT), and derive two effective filters. The Filter Discrete Third Chebyshev Wavelets Transform (FDTCWT) is used in the process of analyzing color images and removing noise and impurities that accompany the image, as well as because of the large amount of data that makes up the image as it is taken. These data are massive, making it difficult to deal with each other during transmission. However to address this issue, the image compression technique is used, with the image not losing information due to the readings that were obtained, and the results were satisfactory. Mean Square Error (MSE), Peak Signal Noise Ratio (PSNR), Bit Per Pixel (BPP), and Compression Ratio (CR) Coronavirus is the initial treatment, while the processing stage is done with network training for Convolutional Neural Networks (CNN) with Discrete Second Chebeshev Wavelets Convolutional Neural Network (DSCWCNN) and Discrete Third Chebeshev Wavelets Convolutional Neural Network (DTCWCNN) to create an efficient algorithm for face recognition, and the best results were achieved in accuracy and in the least amount of time. Two samples of color images that were made or implemented were used. The proposed theory was obtained with fast and good results; the results are evident shown in the tables below.


[172] 2303.13166

Take 5: Interpretable Image Classification with a Handful of Features

Deep Neural Networks use thousands of mostly incomprehensible features to identify a single class, a decision no human can follow. We propose an interpretable sparse and low dimensional final decision layer in a deep neural network with measurable aspects of interpretability and demonstrate it on fine-grained image classification. We argue that a human can only understand the decision of a machine learning model, if the features are interpretable and only very few of them are used for a single decision. For that matter, the final layer has to be sparse and, to make interpreting the features feasible, low dimensional. We call a model with a Sparse Low-Dimensional Decision SLDD-Model. We show that a SLDD-Model is easier to interpret locally and globally than a dense high-dimensional decision layer while being able to maintain competitive accuracy. Additionally, we propose a loss function that improves a model's feature diversity and accuracy. Our more interpretable SLDD-Model only uses 5 out of just 50 features per class, while maintaining 97% to 100% of the accuracy on four common benchmark datasets compared to the baseline model with 2048 features.


[173] 2303.13168

An elementary belief function logic

Non-additive uncertainty theories, typically possibility theory, belief functions and imprecise probabilities share a common feature with modal logic: the duality properties between possibility and necessity measures, belief and plausibility functions as well as between upper and lower probabilities extend the duality between possibility and necessity modalities to the graded environment. It has been shown that the all-or-nothing version of possibility theory can be exactly captured by a minimal epistemic logic (MEL) that uses a very small fragment of the KD modal logic, without resorting to relational semantics. Besides, the case of belief functions has been studied independently, and a belief function logic has been obtained by extending the modal logic S5 to graded modalities using {\L}ukasiewicz logic, albeit using relational semantics. This paper shows that a simpler belief function logic can be devised by adding {\L}ukasiewicz logic on top of MEL. It allows for a more natural semantics in terms of Shafer basic probability assignments.


[174] 2303.13171

Adaptive step-size control for global approximation of SDEs driven by countably dimensional Wiener process

In this paper we deal with global approximation of solutions of stochastic differential equations (SDEs) driven by countably dimensional Wiener process. Under certain regularity conditions imposed on the coefficients, we show lower bounds for exact asymptotic error behaviour. For that reason, we analyse separately two classes of admissible algorithms: based on equidistant, and possibly not equidistant meshes. Our results indicate that in both cases, decrease of any method error requires significant increase of the cost term, which is illustrated by the product of cost and error diverging to infinity. This is, however, not visible in the finite dimensional case. In addition, we propose an implementable, path-independent Euler algorithm with adaptive step-size control, which is asymptotically optimal among algorithms using specified truncation levels of the underlying Wiener process. Our theoretical findings are supported by numerical simulation in Python language.


[175] 2303.13173

Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository

Systems with artificial intelligence components, so-called AI-based systems, have gained considerable attention recently. However, many organizations have issues with achieving production readiness with such systems. As a means to improve certain software quality attributes and to address frequently occurring problems, design patterns represent proven solution blueprints. While new patterns for AI-based systems are emerging, existing patterns have also been adapted to this new context. The goal of this study is to provide an overview of design patterns for AI-based systems, both new and adapted ones. We want to collect and categorize patterns, and make them accessible for researchers and practitioners. To this end, we first performed a multivocal literature review (MLR) to collect design patterns used with AI-based systems. We then integrated the created pattern collection into a web-based pattern repository to make the patterns browsable and easy to find. As a result, we selected 51 resources (35 white and 16 gray ones), from which we extracted 70 unique patterns used for AI-based systems. Among these are 34 new patterns and 36 traditional ones that have been adapted to this context. Popular pattern categories include "architecture" (25 patterns), "deployment" (16), "implementation" (9), or "security & safety" (9). While some patterns with four or more mentions already seem established, the majority of patterns have only been mentioned once or twice (51 patterns). Our results in this emerging field can be used by researchers as a foundation for follow-up studies and by practitioners to discover relevant patterns for informing the design of AI-based systems.


[176] 2303.13174

3D-POP -- An automated annotation approach to facilitate markerless 2D-3D tracking of freely moving birds with marker-based motion capture

Recent advances in machine learning and computer vision are revolutionizing the field of animal behavior by enabling researchers to track the poses and locations of freely moving animals without any marker attachment. However, large datasets of annotated images of animals for markerless pose tracking, especially high-resolution images taken from multiple angles with accurate 3D annotations, are still scant. Here, we propose a method that uses a motion capture (mo-cap) system to obtain a large amount of annotated data on animal movement and posture (2D and 3D) in a semi-automatic manner. Our method is novel in that it extracts the 3D positions of morphological keypoints (e.g eyes, beak, tail) in reference to the positions of markers attached to the animals. Using this method, we obtained, and offer here, a new dataset - 3D-POP with approximately 300k annotated frames (4 million instances) in the form of videos having groups of one to ten freely moving birds from 4 different camera views in a 3.6m x 4.2m area. 3D-POP is the first dataset of flocking birds with accurate keypoint annotations in 2D and 3D along with bounding box and individual identities and will facilitate the development of solutions for problems of 2D to 3D markerless pose, trajectory tracking, and identification in birds.


[177] 2303.13175

Enhancement of theColor Image Compression Using a New Algorithm based on Discrete Hermite Wavelet Transform

The Internet has turned the entire world into a small village;this is because it has made it possible to share millions of images and videos. However, sending and receiving a huge amount of data is considered to be a main challenge. To address this issue, a new algorithm is required to reduce image bits and represent the data in a compressed form. Nevertheless, image compression is an important application for transferring large files and images. This requires appropriate and efficient transfers in this field to achieve the task and reach the best results. In this work, we propose a new algorithm based on discrete Hermite wavelets transformation (DHWT) that shows the efficiency and quality of the color images. By compressing the color image, this method analyzes it and divides it into approximate coefficients and detail coefficients after adding the wavelets into MATLAB. With Multi-Resolution Analyses (MRA), the appropriate filter is derived, and the mathematical aspects prove to be validated by testing a new filter and performing its operation. After the decomposition of the rows and upon the process of the reconstruction, taking the inverse of the filter and dealing with the columns of the matrix, the original matrix is improved by measuring the parameters of the image to achieve the best quality of the resulting image, such as the peak signal-to-noise ratio (PSNR), compression ratio (CR), bits per pixel (BPP), and mean square error (MSE).


[178] 2303.13177

It is all Connected: A New Graph Formulation for Spatio-Temporal Forecasting

With an ever-increasing number of sensors in modern society, spatio-temporal time series forecasting has become a de facto tool to make informed decisions about the future. Most spatio-temporal forecasting models typically comprise distinct components that learn spatial and temporal dependencies. A common methodology employs some Graph Neural Network (GNN) to capture relations between spatial locations, while another network, such as a recurrent neural network (RNN), learns temporal correlations. By representing every recorded sample as its own node in a graph, rather than all measurements for a particular location as a single node, temporal and spatial information is encoded in a similar manner. In this setting, GNNs can now directly learn both temporal and spatial dependencies, jointly, while also alleviating the need for additional temporal networks. Furthermore, the framework does not require aligned measurements along the temporal dimension, meaning that it also naturally facilitates irregular time series, different sampling frequencies or missing data, without the need for data imputation. To evaluate the proposed methodology, we consider wind speed forecasting as a case study, where our proposed framework outperformed other spatio-temporal models using GNNs with either Transformer or LSTM networks as temporal update functions.


[179] 2303.13182

CMG-Net: An End-to-End Contact-Based Multi-Finger Dexterous Grasping Network

In this paper, we propose a novel representation for grasping using contacts between multi-finger robotic hands and objects to be manipulated. This representation significantly reduces the prediction dimensions and accelerates the learning process. We present an effective end-to-end network, CMG-Net, for grasping unknown objects in a cluttered environment by efficiently predicting multi-finger grasp poses and hand configurations from a single-shot point cloud. Moreover, we create a synthetic grasp dataset that consists of five thousand cluttered scenes, 80 object categories, and 20 million annotations. We perform a comprehensive empirical study and demonstrate the effectiveness of our grasping representation and CMG-Net. Our work significantly outperforms the state-of-the-art for three-finger robotic hands. We also demonstrate that the model trained using synthetic data performs very well for real robots.


[180] 2303.13186

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a very fundamental task for human-robot interaction. The recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from the difficulty of low recognition accuracy in cases of multiple adjacent objects with similar appearances.To address this issue, this work intuitively introduces the human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our codes and dataset are ready to be available publicly.


[181] 2303.13188

A Multiple Linear Regression Analysis to Measure the Journal Contribution to the Social Attention of Research

This paper proposes a three-year average of social attention as a more reliable measure of social impact for journals, since the social attention of research can vary widely among scientific articles, even within the same journal. The proposed measure is used to evaluate a journal's contribution to social attention in comparison to other bibliometric indicators. The study uses Dimensions as a data source and examines research articles from 76 disciplinary library and information science journals through multiple linear regression analysis. The study identifies socially influential journals whose contribution to social attention is twice that of scholarly impact as measured by citations. In addition, the study finds that the number of authors and open access have a moderate impact on social attention, while the journal impact factor has a negative impact and funding has a small impact.


[182] 2303.13190

Marching-Primitives: Shape Abstraction from Signed Distance Function

Representing complex objects with basic geometric primitives has long been a topic in computer vision. Primitive-based representations have the merits of compactness and computational efficiency in higher-level tasks such as physics simulation, collision checking, and robotic manipulation. Unlike previous works which extract polygonal meshes from a signed distance function (SDF), in this paper, we present a novel method, named Marching-Primitives, to obtain a primitive-based abstraction directly from an SDF. Our method grows geometric primitives (such as superquadrics) iteratively by analyzing the connectivity of voxels while marching at different levels of signed distance. For each valid connected volume of interest, we march on the scope of voxels from which a primitive is able to be extracted in a probabilistic sense and simultaneously solve for the parameters of the primitive to capture the underlying local geometry. We evaluate the performance of our method on both synthetic and real-world datasets. The results show that the proposed method outperforms the state-of-the-art in terms of accuracy, and is directly generalizable among different categories and scales. The code is open-sourced at https://github.com/ChirikjianLab/Marching-Primitives.git.


[183] 2303.13191

Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications

Detecting sets of relevant patterns from a given dataset is an important challenge in data mining. The relevance of a pattern, also called utility in the literature, is a subjective measure and can be actually assessed from very different points of view. Rule-based languages like Answer Set Programming (ASP) seem well suited for specifying user-provided criteria to assess pattern utility in a form of constraints; moreover, declarativity of ASP allows for a very easy switch between several criteria in order to analyze the dataset from different points of view. In this paper, we make steps toward extending the notion of High Utility Pattern Mining (HUPM); in particular we introduce a new framework that allows for new classes of utility criteria not considered in the previous literature. We also show how recent extensions of ASP with external functions can support a fast and effective encoding and testing of the new framework. To demonstrate the potential of the proposed framework, we exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients. Finally, an extensive experimental activity demonstrates both from a quantitative and a qualitative point of view the effectiveness of the proposed approach. Under consideration in Theory and Practice of Logic Programming (TPLP)


[184] 2303.13192

Mechanism Design for Ad Auctions with Display Prices

In many applications, ads are displayed together with the prices, so as to provide a direct comparison among similar products or services. The price-displaying feature not only influences the consumers' decisions, but also affects the advertisers' bidding behaviors. In this paper, we study ad auctions with display prices from the perspective of mechanism design, in which advertisers are asked to submit both the costs and prices of their products. We provide a characterization for all incentive compatible auctions with display prices, and use it to design auctions under two scenarios. In the former scenario, the display prices are assumed to be exogenously determined. For this setting, we derive the welfare-maximizing and revenue-maximizing auctions for any realization of the price profile. In the latter, advertisers are allowed to strategize display prices in their own interests. We investigate two families of allocation policies within the scenario and identify the equilibrium prices accordingly. Our results reveal that the display prices do affect the design of ad auctions and the platform can leverage such information to optimize the performance of ad delivery.


[185] 2303.13193

VADER: Video Alignment Differencing and Retrieval

We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to any changes due to any residual temporal misalignments or artifacts arising from non-editorial changes of the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered.


[186] 2303.13194

Complementary Pseudo Multimodal Feature for Point Cloud Anomaly Detection

Point cloud (PCD) anomaly detection steadily emerges as a promising research area. This study aims to improve PCD anomaly detection performance by combining handcrafted PCD descriptions with powerful pre-trained 2D neural networks. To this end, this study proposes Complementary Pseudo Multimodal Feature (CPMF) that incorporates local geometrical information in 3D modality using handcrafted PCD descriptors and global semantic information in the generated pseudo 2D modality using pre-trained 2D neural networks. For global semantics extraction, CPMF projects the origin PCD into a pseudo 2D modality containing multi-view images. These images are delivered to pre-trained 2D neural networks for informative 2D modality feature extraction. The 3D and 2D modality features are aggregated to obtain the CPMF for PCD anomaly detection. Extensive experiments demonstrate the complementary capacity between 2D and 3D modality features and the effectiveness of CPMF, with 95.15% image-level AU-ROC and 92.93% pixel-level PRO on the MVTec3D benchmark. Code is available on https://github.com/caoyunkang/CPMF.


[187] 2303.13199

First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning

In Class-Incremental Learning (CIL) an image classification system is exposed to new classes in each learning session and must be updated incrementally. Methods approaching this problem have updated both the classification head and the feature extractor body at each session of CIL. In this work, we develop a baseline method, First Session Adaptation (FSA), that sheds light on the efficacy of existing CIL approaches and allows us to assess the relative performance contributions from head and body adaption. FSA adapts a pre-trained neural network body only on the first learning session and fixes it thereafter; a head based on linear discriminant analysis (LDA), is then placed on top of the adapted body, allowing exact updates through CIL. FSA is replay-free i.e.~it does not memorize examples from previous sessions of continual learning. To empirically motivate FSA, we first consider a diverse selection of 22 image-classification datasets, evaluating different heads and body adaptation techniques in high/low-shot offline settings. We find that the LDA head performs well and supports CIL out-of-the-box. We also find that Featurewise Layer Modulation (FiLM) adapters are highly effective in the few-shot setting, and full-body adaption in the high-shot setting. Second, we empirically investigate various CIL settings including high-shot CIL and few-shot CIL, including settings that have previously been used in the literature. We show that FSA significantly improves over the state-of-the-art in 15 of the 16 settings considered. FSA with FiLM adapters is especially performant in the few-shot setting. These results indicate that current approaches to continuous body adaptation are not working as expected. Finally, we propose a measure that can be applied to a set of unlabelled inputs which is predictive of the benefits of body adaptation.


[188] 2303.13204

A Privacy-Preserving Energy Theft Detection Model for Effective Demand-Response Management in Smart Grids

The detection of energy thefts is vital for the safety of the whole smart grid system. However, the detection alone is not enough since energy thefts can crucially affect the electricity supply leading to some blackouts. Moreover, privacy is one of the major challenges that must be preserved when dealing with clients' energy data. This is often overlooked in energy theft detection research as most current detection techniques rely on raw, unencrypted data, which may potentially expose sensitive and personal data. To solve this issue, we present a privacy-preserving energy theft detection technique with effective demand management that employs two layers of privacy protection. We explore a split learning mechanism that trains a detection model in a decentralised fashion without the need to exchange raw data. We also employ a second layer of privacy by the use of a masking scheme to mask clients' outputs in order to prevent inference attacks. A privacy-enhanced version of this mechanism also employs an additional layer of privacy protection by training a randomisation layer at the end of the client-side model. This is done to make the output as random as possible without compromising the detection performance. For the energy theft detection part, we design a multi-output machine learning model to identify energy thefts, estimate their volume, and effectively predict future demand. Finally, we use a comprehensive set of experiments to test our proposed scheme. The experimental results show that our scheme achieves high detection accuracy and greatly improves the privacy preservation degree.


[189] 2303.13209

Taking A Closer Look at Visual Relation: Unbiased Video Scene Graph Generation with Decoupled Label Learning

Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented due to the inherent biased distribution in the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.


[190] 2303.13211

Don't FREAK Out: A Frequency-Inspired Approach to Detecting Backdoor Poisoned Samples in DNNs

In this paper we investigate the frequency sensitivity of Deep Neural Networks (DNNs) when presented with clean samples versus poisoned samples. Our analysis shows significant disparities in frequency sensitivity between these two types of samples. Building on these findings, we propose FREAK, a frequency-based poisoned sample detection algorithm that is simple yet effective. Our experimental results demonstrate the efficacy of FREAK not only against frequency backdoor attacks but also against some spatial attacks. Our work is just the first step in leveraging these insights. We believe that our analysis and proposed defense mechanism will provide a foundation for future research and development of backdoor defenses.


[191] 2303.13212

A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation

Knowledge distillation is a popular technique for transferring the knowledge from a large teacher model to a smaller student model by mimicking. However, distillation by directly aligning the feature maps between teacher and student may enforce overly strict constraints on the student thus degrade the performance of the student model. To alleviate the above feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student, with pixel-wise transformation. In this paper, we newly find that aligning the feature maps between teacher and student along the channel-wise dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Based on it, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task specific loss. Extensive experimental results show that our method achieves significant performance improvements in various computer vision tasks including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet in semantic segmentation on Cityscapes), which demonstrates the effectiveness and the versatility of the proposed method. The code will be made publicly available.


[192] 2303.13213

Stochastic Graph Neural Network-based Value Decomposition for MARL in Internet of Vehicles

Autonomous driving has witnessed incredible advances in the past several decades, while Multi-Agent Reinforcement Learning (MARL) promises to satisfy the essential need of autonomous vehicle control in a wireless connected vehicle networks. In MARL, how to effectively decompose a global feedback into the relative contributions of individual agents belongs to one of the most fundamental problems. However, the environment volatility due to vehicle movement and wireless disturbance could significantly shape time-varying topological relationships among agents, thus making the Value Decomposition (VD) challenging. Therefore, in order to cope with this annoying volatility, it becomes imperative to design a dynamic VD framework. Hence, in this paper, we propose a novel Stochastic VMIX (SVMIX) methodology by taking account of dynamic topological features during the VD and incorporating the corresponding components into a multi-agent actor-critic architecture. In particular, Stochastic Graph Neural Network (SGNN) is leveraged to effectively capture underlying dynamics in topological features and improve the flexibility of VD against the environment volatility. Finally, the superiority of SVMIX is verified through extensive simulations.


[193] 2303.13216

A Case Study on AI Engineering Practices: Developing an Autonomous Stock Trading System

Today, many systems use artificial intelligence (AI) to solve complex problems. While this often increases system effectiveness, developing a production-ready AI-based system is a difficult task. Thus, solid AI engineering practices are required to ensure the quality of the resulting system and to improve the development process. While several practices have already been proposed for the development of AI-based systems, detailed practical experiences of applying these practices are rare. In this paper, we aim to address this gap by collecting such experiences during a case study, namely the development of an autonomous stock trading system that uses machine learning functionality to invest in stocks. We selected 10 AI engineering practices from the literature and systematically applied them during development, with the goal to collect evidence about their applicability and effectiveness. Using structured field notes, we documented our experiences. Furthermore, we also used field notes to document challenges that occurred during the development, and the solutions we applied to overcome them. Afterwards, we analyzed the collected field notes, and evaluated how each practice improved the development. Lastly, we compared our evidence with existing literature. Most applied practices improved our system, albeit to varying extent, and we were able to overcome all major challenges. The qualitative results provide detailed accounts about 10 AI engineering practices, as well as challenges and solutions associated with such a project. Our experiences therefore enrich the emerging body of evidence in this field, which may be especially helpful for practitioner teams new to AI engineering.


[194] 2303.13217

Fairness-guided Few-shot Prompting for Large Language Models

Large language models have demonstrated surprising ability to perform in-context learning, i.e., these models can be directly applied to solve numerous downstream tasks by conditioning on a prompt constructed by a few input-output examples. However, prior research has shown that in-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats. Therefore, the construction of an appropriate prompt is essential for improving the performance of in-context learning. In this paper, we revisit this problem from the view of predictive bias. Specifically, we introduce a metric to evaluate the predictive bias of a fixed prompt against labels or a given attributes. Then we empirically show that prompts with higher bias always lead to unsatisfactory predictive quality. Based on this observation, we propose a novel search strategy based on the greedy search to identify the near-optimal prompt for improving the performance of in-context learning. We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner.


[195] 2303.13220

Parameter-Efficient Sparse Retrievers and Rerankers using Adapters

Parameter-Efficient transfer learning with Adapters have been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottle-neck layers added between transformer layers while keeping the large pretrained language model (PLMs) frozen. In spite of showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retriever or in a cross lingual retrieval scenario, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by finetuning, but are memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes just 2\% of training parameters, but outperforms fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval thanks to adapters on cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first stage rankers. Overall, our study complete the examination of adapters for neural IR


[196] 2303.13221

Explore the Power of Synthetic Data on Few-shot Object Detection

Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. The few training samples restrict the performance of FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are for FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, saliency object detection is applied to the original generated image, and the minimum enclosing box is used for cropping the main object based on the saliency map. After that, the cropped object is randomly pasted on the image, which comes from the base dataset. We also study the influence of the input text of text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, the severe problem of high false positives (FP) ratio of novel categories in FSOD can not be solved by using synthetic data. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter 90% of FP by defining a threshold for the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, in which performance gain is up to 21.9% compared to the few-shot baseline.


[197] 2303.13223

Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels

Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, \ie, CLIP~\cite{radford2021clip}, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at https://github.com/jameslahm/SCPNet.


[198] 2303.13226

LearnedFTL: A Learning-based Page-level FTL for Improving Random Reads in Flash-based SSDs

We present LearnedFTL, which applies learned indexes to on-demand page-level flash translation layer (FTL) designs to improve the random read performance of flash-based solid-state drives (SSDs). The first of its kind, it minimizes the number of double reads induced by address translation in random read accesses. To apply the learned indexes to address translation, LearnedFTL proposes dynamic piece-wise regression to efficiently build learned indexes. LearnedFTL also exploits the unique feature of page relocation in SSD internal garbage collection (GC), and embeds the learned index training in GC, which can minimize additional delay on normal read and write operations. Additionally, LearnedFTL employs a bitmap prediction filter to guarantee the accuracy of learned indexes' predictions. With these designs, LearnedFTL considerably speeds up address translation while reducing the number of flash read accesses caused by the demand-based page-level FTL. Our benchmark-driven experiments on a FEMU-based prototype show that LearnedFTL reduces the 99th percentile tail latency by 4.8$\times$, on average, compared to the state-of-the-art TPFTL scheme.


[199] 2303.13227

Confidence-Aware and Self-Supervised Image Anomaly Localisation

Universal anomaly detection still remains a challenging problem in machine learning and medical image analysis. It is possible to learn an expected distribution from a single class of normative samples, e.g., through epistemic uncertainty estimates, auto-encoding models, or from synthetic anomalies in a self-supervised way. The performance of self-supervised anomaly detection approaches is still inferior compared to methods that use examples from known unknown classes to shape the decision boundary. However, outlier exposure methods often do not identify unknown unknowns. Here we discuss an improved self-supervised single-class training strategy that supports the approximation of probabilistic inference with loosen feature locality constraints. We show that up-scaling of gradients with histogram-equalised images is beneficial for recently proposed self-supervision tasks. Our method is integrated into several out-of-distribution (OOD) detection models and we show evidence that our method outperforms the state-of-the-art on various benchmark datasets. Source code will be publicly available by the time of the conference.


[200] 2303.13228

Enriching Neural Network Training Dataset to Improve Worst-Case Performance Guarantees

Machine learning algorithms, especially Neural Networks (NNs), are a valuable tool used to approximate non-linear relationships, like the AC-Optimal Power Flow (AC-OPF), with considerable accuracy -- and achieving a speedup of several orders of magnitude when deployed for use. Often in power systems literature, the NNs are trained with a fixed dataset generated prior to the training process. In this paper, we show that adapting the NN training dataset during training can improve the NN performance and substantially reduce its worst-case violations. This paper proposes an algorithm that identifies and enriches the training dataset with critical datapoints that reduce the worst-case violations and deliver a neural network with improved worst-case performance guarantees. We demonstrate the performance of our algorithm in four test power systems, ranging from 39-buses to 162-buses.


[201] 2303.13231

Trading Communication for Computation in Byzantine-Resilient Gradient Coding

We consider gradient coding in the presence of an adversary, controlling so-called malicious workers trying to corrupt the computations. Previous works propose the use of MDS codes to treat the inputs of the malicious workers as errors and correct them using the error-correction properties of the code. This comes at the expense of increasing the replication, i.e., the number of workers each partial gradient is computed by. In this work, we reduce replication by proposing a method that detects the erroneous inputs from the malicious workers, hence transforming them into erasures. For $s$ malicious workers, our solution can reduce the replication to $s+1$ instead of $2s+1$ for each partial gradient at the expense of only $s$ additional computations at the main node and additional rounds of light communication between the main node and the workers. We give fundamental limits of the general framework for fractional repetition data allocation. Our scheme is optimal in terms of replication and local computation but incurs a communication cost that is asymptotically, in the size of the dataset, a multiplicative factor away from the derived bound.


[202] 2303.13232

Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization

Recent advances in 3D scene representation and novel view synthesis have witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not trivial to exploit NeRF for the photorealistic 3D scene stylization task, which aims to generate visually consistent and photorealistic stylized scenes from novel views. Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses. Through a thorough analysis, we demonstrate that this non-trivial task can be simplified in a new light: When transforming the appearance representation of a pre-trained NeRF with Lipschitz mapping, the consistency and photorealism across source views will be seamlessly encoded into the syntheses. That motivates us to build a concise and flexible learning framework namely LipRF, which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the 3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct the 3D scene, and then emulates the style on each view by 2D PST as the prior to learn a Lipschitz network to stylize the pre-trained appearance. In view of that Lipschitz condition highly impacts the expressivity of the neural network, we devise an adaptive regularization to balance the reconstruction and stylization. A gradual gradient aggregation strategy is further introduced to optimize LipRF in a cost-efficient manner. We conduct extensive experiments to show the high quality and robust performance of LipRF on both photorealistic 3D stylization and object appearance editing.


[203] 2303.13233

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer long-tail distribution issues that tail-predicates are more costly to train and hard to distinguish due to a small amount of annotated data compared to frequent predicates. Existing re-balancing strategies try to haddle it via prior rules but are still confined to pre-defined conditions, which are not scalable for various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthen existing SGG to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.


[204] 2303.13241

6D Object Pose Estimation from Approximate 3D Models for Orbital Robotics

We present a novel technique to estimate the 6D pose of objects from single images where the 3D geometry of the object is only given approximately and not as a precise 3D model. To achieve this, we employ a dense 2D-to-3D correspondence predictor that regresses 3D model coordinates for every pixel. In addition to the 3D coordinates, our model also estimates the pixel-wise coordinate error to discard correspondences that are likely wrong. This allows us to generate multiple 6D pose hypotheses of the object, which we then refine iteratively using a highly efficient region-based approach. We also introduce a novel pixel-wise posterior formulation by which we can estimate the probability for each hypothesis and select the most likely one. As we show in experiments, our approach is capable of dealing with extreme visual conditions including overexposure, high contrast, or low signal-to-noise ratio. This makes it a powerful technique for the particularly challenging task of estimating the pose of tumbling satellites for in-orbit robotic applications. Our method achieves state-of-the-art performance on the SPEED+ dataset and has won the SPEC2021 post-mortem competition.


[205] 2303.13245

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning

Learning dense visual representations without labels is an arduous task and more so from scene-centric data. We propose to tackle this challenging problem by proposing a Cross-view consistency objective with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm conjointly operates on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets and similarly for video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC.


[206] 2303.13246

Continuification control of large-scale multiagent systems under limited sensing and structural perturbations

We investigate the stability and robustness properties of a continuification-based strategy for the control of large-scale multiagent systems. Within continuation-based strategy, one transforms the microscopic, agent-level description of the system dynamics into a macroscopic continuum-level, for which a control action can be synthesized to steer the macroscopic dynamics towards a desired distribution. Such an action is ultimately discretized to obtain a set of deployable control inputs for the agents to achieve the goal. The mathematical proof of convergence toward the desired distribution typically relies on the assumptions that no disturbance is present and that each agent possesses global knowledge of all the others' positions. Here, we analytically and numerically address the possibility of relaxing these assumptions for the case of a one-dimensional system of agents moving in a ring. We offer compelling evidence in favor of the use of a continuification-based strategy when agents only possess a finite sensing capability and spatio-temporal perturbations affect the macroscopic dynamics of the ensemble. We also discuss some preliminary results about the role of an integral action in the macroscopic control solution.


[207] 2303.13247

Optimizing Duplicate Size Thresholds in IDEs

In this paper, we present an approach for transferring an optimal lower size threshold for clone detection from one language to another by analyzing their clone distributions. We showcase this method by transferring the threshold from regular Python scripts to Jupyter notebooks for using in two JetBrains IDEs, Datalore and DataSpell.


[208] 2303.13251

A Bag-of-Prototypes Representation for Dataset-Level Applications

This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to the two tasks is measuring the underlying relationship between datasets. This needs a desirable dataset vectorization scheme, which should preserve as much discriminative dataset information as possible so that the distance between the resulting dataset vectors can reflect dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to dataset-level bag consisting of semantic prototypes. Specifically, we develop a codebook consisting of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a certain prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset semantic distribution. Furthermore, BoP representations cooperate well with Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently shows its advantage over existing representations on a series of benchmarks for two dataset-level tasks.


[209] 2303.13252

Nominal Sets in Agda -- A Fresh and Immature Mechanization

In this paper we present our current development on a new formalization of nominal sets in Agda. Our first motivation in having another formalization was to understand better nominal sets and to have a playground for testing type systems based on nominal logic. Not surprisingly, we have independently built up the same hierarchy of types leading to nominal sets. We diverge from other formalizations in how to conceive finite permutations: in our formalization a finite permutation is a permutation (i.e. a bijection) whose domain is finite. Finite permutations have different representations, for instance as compositions of transpositions (the predominant in other formalizations) or compositions of disjoint cycles. We prove that these representations are equivalent and use them to normalize (up to composition order of independent transpositions) compositions of transpositions.


[210] 2303.13254

Paraconsistent Transition Systems

Often in Software Engineering, a modeling formalism has to support scenarios of inconsistency in which several requirements either reinforce or contradict each other. Paraconsistent transition systems are proposed in this paper as one such formalism: states evolve through two accessibility relations capturing weighted evidence of a transition or its absence, respectively. Their weights come from a specific residuated lattice. A category of these systems, and the corresponding algebra, is defined as providing a formal setting to model different application scenarios. One of them, dealing with the effect of quantum decoherence in quantum programs, is used for illustration purposes.


[211] 2303.13255

ReLo: a Dynamic Logic to Reason About Reo Circuits

Critical systems require high reliability and are present in many domains. They are systems in which failure may result in financial damage or even loss of lives. Standard techniques of software engineering are not enough to ensure the absence of unacceptable failures and/or that critical requirements are fulfilled. Reo is a component-based modelling language that aims to provide a framework to build software based on existing pieces of software, which has been used in a wide variety of domains. Its formal semantics provides grounds to certify that systems based on Reo models satisfy specific requirements (i.e., absence of deadlocks). Current logical approaches for reasoning over Reo require the conversion of formal semantics into a logical framework. ReLo is a dynamic logic that naturally subsumes Reo's semantics. It provides a means to reason over Reo circuits. This work extends ReLo by introducing the iteration operator, and soundness and completeness proofs for its axiomatization.


[212] 2303.13256

Analyzing Innermost Runtime Complexity Through Tuple Interpretations

Time complexity in rewriting is naturally understood as the number of steps needed to reduce terms to normal forms. Establishing complexity bounds to this measure is a well-known problem in the rewriting community. A vast majority of techniques to find such bounds consist of modifying termination proofs in order to recover complexity information. This has been done for instance with semantic interpretations, recursive path orders, and dependency pairs. In this paper, we follow the same program by tailoring tuple interpretations to deal with innermost complexity analysis. A tuple interpretation interprets terms as tuples holding upper bounds to the cost of reduction and size of normal forms. In contrast with the full rewriting setting, the strongly monotonic requirement for cost components is dropped when reductions are innermost. This weakened requirement on cost tuples allows us to prove the innermost version of the compatibility result: if all rules in a term rewriting system can be strictly oriented, then the innermost rewrite relation is well-founded. We establish the necessary conditions for which tuple interpretations guarantee polynomial bounds to the runtime of compatible systems and describe a search procedure for such interpretations.


[213] 2303.13257

Equational Theorem Proving for Clauses over Strings

Although reasoning about equations over strings has been extensively studied for several decades, little research has been done for equational reasoning on general clauses over strings. This paper introduces a new superposition calculus with strings and present an equational theorem proving framework for clauses over strings. It provides a saturation procedure for clauses over strings and show that the proposed superposition calculus with contraction rules is refutationally complete. This paper also presents a new decision procedure for word problems over strings w.r.t. a set of conditional equations R over strings if R can be finitely saturated under the proposed inference system.


[214] 2303.13258

A Formal Proof of the Strong Normalization Theorem for System T in Agda

We present a framework for the formal meta-theory of lambda calculi in first-order syntax, with two sorts of names, one to represent both free and bound variables, and the other for constants, and by using Stoughton's multiple substitutions. On top of the framework we formalize Girard's proof of the Strong Normalization Theorem for both the simply-typed lambda calculus and System T. As to the latter, we also present a simplification of the original proof. The whole development has been machine-checked using the Agda system.


[215] 2303.13262

Noise impact on recurrent neural network with linear activation function

In recent years, more and more researchers in the field of neural networks are interested in creating hardware implementations where neurons and the connection between them are realized physically. The physical implementation of ANN fundamentally changes the features of noise influence. In the case hardware ANNs, there are many internal sources of noise with different properties. The purpose of this paper is to study the peculiarities of internal noise propagation in recurrent ANN on the example of echo state network (ESN), to reveal ways to suppress such noises and to justify the stability of networks to some types of noises. In this paper we analyse ESN in presence of uncorrelated additive and multiplicative white Gaussian noise. Here we consider the case when artificial neurons have linear activation function with different slope coefficients. Starting from studying only one noisy neuron we complicate the problem by considering how the input signal and the memory property affect the accumulation of noise in ESN. In addition, we consider the influence of the main types of coupling matrices on the accumulation of noise. So, as such matrices, we take a uniform matrix and a diagonal-like matrices with different coefficients called "blurring" coefficient. We have found that the general view of variance and signal-to-noise ratio of ESN output signal is similar to only one neuron. The noise is less accumulated in ESN with diagonal reservoir connection matrix with large "blurring" coefficient. Especially it concerns uncorrelated multiplicative noise.


[216] 2303.13264

Modular CSI Quantization for FDD Massive MIMO Communication

We consider high-dimensional MIMO transmissions in frequency division duplexing (FDD) systems. For precoding, the frequency selective channel has to be measured, quantized and fed back to the base station by the users. When the number of antennas is very high this typically leads to prohibitively high quantization complexity and large feedback. In 5G New Radio (NR), a modular quantization approach has been applied for this, where first a low-dimensional subspace is identified for the whole frequency selective channel, and then subband channels are linearly mapped to this subspace and quantized. We analyze how the components in such a modular scheme contribute to the overall quantization distortion. Based on this analysis we improve the technology components in the modular approach and propose an orthonormalized wideband precoding scheme and a sequential wideband precoding approach which provide considerable gains over the conventional method. We compare the performance of the developed quantization schemes to prior art by simulations in terms of the projection distortion, overall distortion and spectral efficiency, in a scenario with a realistic spatial channel model.


[217] 2303.13269

Disguise without Disruption: Utility-Preserving Face De-Identification

With the increasing ubiquity of cameras and smart sensors, humanity is generating data at an exponential rate. Access to this trove of information, often covering yet-underrepresented use-cases (e.g., AI in medical settings) could fuel a new generation of deep-learning tools. However, eager data scientists should first provide satisfying guarantees w.r.t. the privacy of individuals present in these untapped datasets. This is especially important for images or videos depicting faces, as their biometric information is the target of most identification methods. While a variety of solutions have been proposed to de-identify such images, they often corrupt other non-identifying facial attributes that would be relevant for downstream tasks. In this paper, we propose Disguise, a novel algorithm to seamlessly de-identify facial images while ensuring the usability of the altered data. Unlike prior arts, we ground our solution in both differential privacy and ensemble-learning research domains. Our method extracts and swaps depicted identities with fake ones, synthesized via variational mechanisms to maximize obfuscation and non-invertibility; while leveraging the supervision from a mixture-of-experts to disentangle and preserve other utility attributes. We extensively evaluate our method on multiple datasets, demonstrating higher de-identification rate and superior consistency than prior art w.r.t. various downstream tasks.


[218] 2303.13272

Frame-Level Multi-Label Playing Technique Detection Using Multi-Scale Network and Self-Attention Mechanism

Instrument playing technique (IPT) is a key element of musical presentation. However, most of the existing works for IPT detection only concern monophonic music signals, yet little has been done to detect IPTs in polyphonic instrumental solo pieces with overlapping IPTs or mixed IPTs. In this paper, we formulate it as a frame-level multi-label classification problem and apply it to Guzheng, a Chinese plucked string instrument. We create a new dataset, Guzheng\_Tech99, containing Guzheng recordings and onset, offset, pitch, IPT annotations of each note. Because different IPTs vary a lot in their lengths, we propose a new method to solve this problem using multi-scale network and self-attention. The multi-scale network extracts features from different scales, and the self-attention mechanism applied to the feature maps at the coarsest scale further enhances the long-range feature extraction. Our approach outperforms existing works by a large margin, indicating its effectiveness in IPT detection.


[219] 2303.13273

TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision

In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.


[220] 2303.13277

SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field

Despite the great success in 2D editing using user-friendly tools, such as Photoshop, semantic strokes, or even text prompts, similar capabilities in 3D areas are still limited, either relying on 3D modeling skills or allowing editing within only a few categories.In this paper, we present a novel semantic-driven NeRF editing approach, which enables users to edit a neural radiance field with a single image, and faithfully delivers edited novel views with high fidelity and multi-view consistency.To achieve this goal, we propose a prior-guided editing field to encode fine-grained geometric and texture editing in 3D space, and develop a series of techniques to aid the editing process, including cyclic constraints with a proxy mesh to facilitate geometric supervision, a color compositing mechanism to stabilize semantic-driven texture editing, and a feature-cluster-based regularization to preserve the irrelevant content unchanged.Extensive experiments and editing examples on both real-world and synthetic data demonstrate that our method achieves photo-realistic 3D editing using only a single edited image, pushing the bound of semantic-driven editing in 3D real-world scenes. Our project webpage: https://zju3dv.github.io/sine/.


[221] 2303.13279

Parameterized Algorithms for Topological Indices in Chemistry

We have developed efficient parameterized algorithms for the enumeration problems of graphs arising in chemistry. In particular, we have focused on the following problems: enumeration of Kekul\'e structures, computation of Hosoya index, computation of Merrifield-Simmons index, and computation of graph entropy based on matchings and independent sets. All these problems are known to be $\# P$-complete. We have developed FPT algorithms for bounded treewidth and bounded pathwidth for these problems with a better time complexity than the known state-of-the-art in the literature. We have also conducted experiments on the entire PubChem database of chemical compounds and tested our algorithms. We also provide a comparison with naive baseline algorithms for these problems, along with a distribution of treewidth for the chemical compounds available in the PubChem database.


[222] 2303.13283

Visual-Language Prompt Tuning with Knowledge-guided Context Optimization

Prompt tuning is an effective way to adapt the pre-trained visual-language model (VLM) to the downstream task using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, the specific textual knowledge is the worse generalization to the unseen classes because it forgets the essential general textual knowledge having a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. The key insight of KgCoOp is that forgetting about essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. Especially, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding the KgCoOp upon the contrastive loss can make a discriminative prompt for both seen and unseen tasks. Extensive evaluation of several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, \emph{i.e.,} achieves better performance with less training time.


[223] 2303.13284

GETT-QA: Graph Embedding based T2T Transformer for Knowledge Graph Question Answering

In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.


[224] 2303.13290

Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration

Deep point cloud registration methods face challenges to partial overlaps and rely on labeled data. To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for point clouds with partial overlaps. Specifically, we first adopt a network to learn posterior probability distributions of Gaussian mixture models (GMMs) from point clouds. To handle partial point cloud registration, we apply the Sinkhorn algorithm to predict the distribution-level correspondences under the constraint of the mixing weights of GMMs. To enable unsupervised learning, we design three distribution consistency-based losses: self-consistency, cross-consistency, and local contrastive. The self-consistency loss is formulated by encouraging GMMs in Euclidean and feature spaces to share identical posterior distributions. The cross-consistency loss derives from the fact that the points of two partially overlapping point clouds belonging to the same clusters share the cluster centroids. The cross-consistency loss allows the network to flexibly learn a transformation-invariant posterior distribution of two aligned point clouds. The local contrastive loss facilitates the network to extract discriminative local features. Our UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks.


[225] 2303.13293

LABRAD-OR: Lightweight Memory Scene Graphs for Accurate Bimodal Reasoning in Dynamic Operating Rooms

Modern surgeries are performed in complex and dynamic settings, including ever-changing interactions between medical staff, patients, and equipment. The holistic modeling of the operating room (OR) is, therefore, a challenging but essential task, with the potential to optimize the performance of surgical teams and aid in developing new surgical technologies to improve patient outcomes. The holistic representation of surgical scenes as semantic scene graphs (SGG), where entities are represented as nodes and relations between them as edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results achieving an +5% increase and a new SOTA of 0.88 in macro F1. This work opens the path for representing the entire surgery history with memory scene graphs and improves the holistic understanding in the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks.


[226] 2303.13294

Considerations on the Evaluation of Biometric Quality Assessment Algorithms

Quality assessment algorithms can be used to estimate the utility of a biometric sample for the purpose of biometric recognition. "Error versus Discard Characteristic" (EDC) plots, and "partial Area Under Curve" (pAUC) values of curves therein, are generally used by researchers to evaluate the predictive performance of such quality assessment algorithms. An EDC curve depends on an error type such as the "False Non Match Rate" (FNMR), a quality assessment algorithm, a biometric recognition system, a set of comparisons each corresponding to a biometric sample pair, and a comparison score threshold corresponding to a starting error. To compute an EDC curve, comparisons are progressively discarded based on the associated samples' lowest quality scores, and the error is computed for the remaining comparisons. Additionally, a discard fraction limit or range must be selected to compute pAUC values, which can then be used to quantitatively rank quality assessment algorithms. This paper discusses and analyses various details for this kind of quality assessment algorithm evaluation, including general EDC properties, interpretability improvements for pAUC values based on a hard lower error limit and a soft upper error limit, the use of relative instead of discrete rankings, stepwise vs. linear curve interpolation, and normalisation of quality scores to a [0, 100] integer range. We also analyse the stability of quantitative quality assessment algorithm rankings based on pAUC values across varying pAUC discard fraction limits and starting errors, concluding that higher pAUC discard fraction limits should be preferred. The analyses are conducted both with synthetic data and with real data for a face image quality assessment scenario, with a focus on general modality-independent conclusions for EDC evaluations.


[227] 2303.13297

Improving Generalization with Domain Convex Game

Domain generalization (DG) tends to alleviate the poor generalization capability of deep neural networks by learning model with multiple source domains. A classical solution to DG is domain augmentation, the common belief of which is that diversifying source domains will be conducive to the out-of-distribution generalization. However, these claims are understood intuitively, rather than mathematically. Our explorations empirically reveal that the correlation between model generalization and the diversity of domains may be not strictly positive, which limits the effectiveness of domain augmentation. This work therefore aim to guarantee and further enhance the validity of this strand. To this end, we propose a new perspective on DG that recasts it as a convex game between domains. We first encourage each diversified domain to enhance model generalization by elaborately designing a regularization term based on supermodularity. Meanwhile, a sample filter is constructed to eliminate low-quality samples, thereby avoiding the impact of potentially harmful information. Our framework presents a new avenue for the formal analysis of DG, heuristic analysis and extensive experiments demonstrate the rationality and effectiveness.


[228] 2303.13299

Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods for giving each feature in an input a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term alongside the standard term corresponding to accuracy, an additional term that measures the difference in feature attribution between a pair of explainers. We observe on three datasets that we can train a model with this loss term to improve explanation consensus on unseen data, and see improved consensus between explainers other than those used in the loss term. We examine the trade-off between improved consensus and model performance. And finally, we study the influence our method has on feature attribution explanations.


[229] 2303.13300

Innovation Slowdown: Decelerating Concept Creation and Declining Originality in New Technological Concepts

The creation of new technological concepts through design reuses, recombination, and synthesis of prior concepts to create new ones may lead to exponential growth of the concept space over time. However, our statistical analysis of a large-scale technology semantic network consisting of over four million concepts from patent texts found evidence of a persistent deceleration in the pace of concept creation and a decline in the originality of newly created concepts. These trends may be attributed to the limitations of human intelligence in innovating beyond an expanding space of prior art. To sustain innovation, we recommend the development and implementation of creative artificial intelligence that can augment various aspects of the innovation process, including learning, creation, and evaluation.


[230] 2303.13307

Don't Peek at My Chart: Privacy-preserving Visualization for Mobile Devices

Data visualizations have been widely used on mobile devices like smartphones for various tasks (e.g., visualizing personal health and financial data), making it convenient for people to view such data anytime and anywhere. However, others nearby can also easily peek at the visualizations, resulting in personal data disclosure. In this paper, we propose a perception-driven approach to transform mobile data visualizations into privacy-preserving ones. Specifically, based on human visual perception, we develop a masking scheme to adjust the spatial frequency and luminance contrast of colored visualizations. The resulting visualization retains its original information in close proximity but reduces the visibility when viewed from a certain distance or further away. We conducted two user studies to inform the design of our approach (N=16) and systematically evaluate its performance (N=18), respectively. The results demonstrate the effectiveness of our approach in terms of privacy preservation for mobile data visualizations.


[231] 2303.13309

New Results on Single User Massive MIMO

Achieving high bit rates is the main goal of wireless technologies like 5G and beyond. This translates to obtaining high spectral efficiencies using large number of antennas at the transmitter and receiver (single user massive multiple input multiple output or SU-MMIMO). It is possible to have a large number of antennas in the mobile handset at mm-wave frequencies in the range $30 - 300$ GHz due to the small antenna size. In this work, we investigate the bit-error-rate (BER) performance of SU-MMIMO in two scenarios (a) using serially concatenated turbo code (SCTC) in uncorrelated channel and (b) parallel concatenated turbo code (PCTC) in correlated channel. Computer simulation results indicate that the BER is quite insensitive to re-transmissions and wide variations in the number of transmit and receive antennas. Moreover, we have obtained a BER of $10^{-5}$ at an average signal-to-interference plus noise ratio (SINR) per bit of just 1.25 dB with 512 transmit and receive antennas ($512\times 512$ SU-MMIMO system) with a spectral efficiency of 256 bits/transmission or 256 bits/sec/Hz in an uncorrelated channel. Similar BER results have been obtained for SU-MMIMO using PCTC in correlated channel. A semi-analytic approach to estimating the BER of a turbo code has been derived.


[232] 2303.13310

SwissBERT: The Multilingual Language Model for Switzerland

We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert.


[233] 2303.13314

Leveraging Foundation Models for Clinical Text Analysis

Infectious diseases are a significant public health concern globally, and extracting relevant information from scientific literature can facilitate the development of effective prevention and treatment strategies. However, the large amount of clinical data available presents a challenge for information extraction. To address this challenge, this study proposes a natural language processing (NLP) framework that uses a pre-trained transformer model fine-tuned on task-specific data to extract key information related to infectious diseases from free-text clinical data. The proposed framework includes three components: a data layer for preparing datasets from clinical texts, a foundation model layer for entity extraction, and an assessment layer for performance analysis. The results of the evaluation indicate that the proposed method outperforms standard methods, and leveraging prior knowledge through the pre-trained transformer model makes it useful for investigating other infectious diseases in the future.


[234] 2303.13316

RGB-D-Inertial SLAM in Indoor Dynamic Environments with Long-term Large Occlusion

This work presents a novel RGB-D-inertial dynamic SLAM method that can enable accurate localisation when the majority of the camera view is occluded by multiple dynamic objects over a long period of time. Most dynamic SLAM approaches either remove dynamic objects as outliers when they account for a minor proportion of the visual input, or detect dynamic objects using semantic segmentation before camera tracking. Therefore, dynamic objects that cause large occlusions are difficult to detect without prior information. The remaining visual information from the static background is also not enough to support localisation when large occlusion lasts for a long period. To overcome these problems, our framework presents a robust visual-inertial bundle adjustment that simultaneously tracks camera, estimates cluster-wise dense segmentation of dynamic objects and maintains a static sparse map by combining dense and sparse features. The experiment results demonstrate that our method achieves promising localisation and object segmentation performance compared to other state-of-the-art methods in the scenario of long-term large occlusion.


[235] 2303.13318

Implicit Active Flux methods for linear advection

In this work we develop implicit Active Flux schemes for the scalar advection equation. At every cell interface we approximate the solution by a polynomial in time. This allows to evolve the point values using characteristics and to update the cell averages using fluxes obtained by integrating this polynomial. The resulting schemes have order of convergence up to five, but show almost no oscillations with high frequencies for discontinuous solutions. In numerical experiments we compare the different methods and show an application to network flows.


[236] 2303.13320

QDP: Learning to Sequentially Optimise Quasi-Static and Dynamic Manipulation Primitives for Robotic Cloth Manipulation

Pre-defined manipulation primitives are widely used for cloth manipulation. However, cloth properties such as its stiffness or density can highly impact the performance of these primitives. Although existing solutions have tackled the parameterisation of pick and place locations, the effect of factors such as the velocity or trajectory of quasi-static and dynamic manipulation primitives has been neglected. Choosing appropriate values for these parameters is crucial to cope with the range of materials present in house-hold cloth objects. To address this challenge, we introduce the Quasi-Dynamic Parameterisable (QDP) method, which optimises parameters such as the motion velocity in addition to the pick and place positions of quasi-static and dynamic manipulation primitives. In this work, we leverage the framework of Sequential Reinforcement Learning to decouple sequentially the parameters that compose the primitives. To evaluate the effectiveness of the method we focus on the task of cloth unfolding with a robotic arm in simulation and real-world experiments. Our results in simulation show that by deciding the optimal parameters for the primitives the performance can improve by 20% compared to sub-optimal ones. Real-world results demonstrate the advantage of modifying the velocity and height of manipulation primitives for cloths with different mass, stiffness, shape and size. Supplementary material, videos, and code, can be found at https://sites.google.com/view/qdp-srl.


[237] 2303.13325

DARE-GRAM : Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices

Unsupervised Domain Adaptation Regression (DAR) aims to bridge the domain gap between a labeled source dataset and an unlabelled target dataset for regression problems. Recent works mostly focus on learning a deep feature encoder by minimizing the discrepancy between source and target features. In this work, we present a different perspective for the DAR problem by analyzing the closed-form ordinary least square~(OLS) solution to the linear regressor in the deep domain adaptation context. Rather than aligning the original feature embedding space, we propose to align the inverse Gram matrix of the features, which is motivated by its presence in the OLS solution and the Gram matrix's ability to capture the feature correlations. Specifically, we propose a simple yet effective DAR method which leverages the pseudo-inverse low-rank property to align the scale and angle in a selected subspace generated by the pseudo-inverse Gram matrix of the two domains. We evaluate our method on three domain adaptation regression benchmarks. Experimental results demonstrate that our method achieves state-of-the-art performance. Our code is available at https://github.com/ismailnejjar/DARE-GRAM.


[238] 2303.13326

Decentralized Adversarial Training over Graphs

The vulnerability of machine learning models to adversarial attacks has been attracting considerable attention in recent years. Most existing studies focus on the behavior of stand-alone single-agent learners. In comparison, this work studies adversarial training over graphs, where individual agents are subjected to perturbations of varied strength levels across space. It is expected that interactions by linked agents, and the heterogeneity of the attack models that are possible over the graph, can help enhance robustness in view of the coordination power of the group. Using a min-max formulation of diffusion learning, we develop a decentralized adversarial training framework for multi-agent systems. We analyze the convergence properties of the proposed scheme for both convex and non-convex environments, and illustrate the enhanced robustness to adversarial attacks.


[239] 2303.13336

Audio Diffusion Model for Speech Synthesis: A Survey on Text To Speech and Speech Enhancement in Generative AI

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion model, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion model in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion model. As for the text-to-speech task, we divide the methods into three categories based on the stage where diffusion model is adopted: acoustic model, vocoder and end-to-end framework. Moreover, we categorize various speech enhancement tasks by either certain signals are removed or added into the input speech. Comparisons of experimental results and discussions are also covered in this survey.


[240] 2303.13340

Increasing Textual Context Size Boosts Medical Image-Text Matching

This short technical report demonstrates a simple technique that yields state of the art results in medical image-text matching tasks. We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has negative impact on downstream performance in the medical domain where encoding longer textual contexts is often required. We thus train and release ClipMD, which is trained with a simple sliding window technique to encode textual captions. ClipMD was tested on two medical image-text datasets and compared with other image-text matching models. The results show that ClipMD outperforms other models on both datasets by a large margin. We make our code and pretrained model publicly available.


[241] 2303.13344

Stochastic Decision Petri Nets

We introduce stochastic decision Petri nets (SDPNs), which are a form of stochastic Petri nets equipped with rewards and a control mechanism via the deactivation of controllable transitions. Such nets can be translated into Markov decision processes (MDPs), potentially leading to a combinatorial explosion in the number of states due to concurrency. Hence we restrict ourselves to instances where nets are either safe, free-choice and acyclic nets (SAFC nets) or even occurrence nets and policies are defined by a constant deactivation pattern. We obtain complexity-theoretic results for such cases via a close connection to Bayesian networks, in particular we show that for SAFC nets the question whether there is a policy guaranteeing a reward above a certain threshold is $\mathsf{NP}^\mathsf{PP}$-complete. We also introduce a partial-order procedure which uses an SMT solver to address this problem.


[242] 2303.13351

DBLP-QuAD: A Question Answering Dataset over the DBLP Scholarly Knowledge Graph

In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications published by more than 2.2 million authors. Our dataset consists of 10,000 question answer pairs with the corresponding SPARQL queries which can be executed over the DBLP KG to fetch the correct answer. DBLP-QuAD is the largest scholarly question answering dataset.


[243] 2303.13352

Planning for Complex Non-prehensile Manipulation Among Movable Objects by Interleaving Multi-Agent Pathfinding and Physics-Based Simulation

Real-world manipulation problems in heavy clutter require robots to reason about potential contacts with objects in the environment. We focus on pick-and-place style tasks to retrieve a target object from a shelf where some `movable' objects must be rearranged in order to solve the task. In particular, our motivation is to allow the robot to reason over and consider non-prehensile rearrangement actions that lead to complex robot-object and object-object interactions where multiple objects might be moved by the robot simultaneously, and objects might tilt, lean on each other, or topple. To support this, we query a physics-based simulator to forward simulate these interaction dynamics which makes action evaluation during planning computationally very expensive. To make the planner tractable, we establish a connection between the domain of Manipulation Among Movable Objects and Multi-Agent Pathfinding that lets us decompose the problem into two phases our M4M algorithm iterates over. First we solve a multi-agent planning problem that reasons about the configurations of movable objects but does not forward simulate a physics model. Next, an arm motion planning problem is solved that uses a physics-based simulator but does not search over possible configurations of movable objects. We run simulated and real-world experiments with the PR2 robot and compare against relevant baseline algorithms. Our results highlight that M4M generates complex 3D interactions, and solves at least twice as many problems as the baselines with competitive performance.


[244] 2303.13354

Trajectory-Prediction with Vision: A Survey

To plan a safe and efficient route, an autonomous vehicle should anticipate future trajectories of other agents around it. Trajectory prediction is an extremely challenging task which recently gained a lot of attention in the autonomous vehicle research community. Trajectory-prediction forecasts future state of all the dynamic agents in the scene given their current and past states. A good prediction model can prevent collisions on the road, and hence the ultimate goal for autonomous vehicles: Collision rate: collisions per Million miles. The objective of this paper is to provide an overview of the field trajectory-prediction. We categorize the relevant algorithms into different classes so that researchers can follow through the trends in the trajectory-prediction research field. Moreover we also touch upon the background knowledge required to formulate a trajectory-prediction problem.


[245] 2303.13355

Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension

Although the curse of multilinguality significantly restricts the language abilities of multilingual models in monolingual settings, researchers now still have to rely on multilingual models to develop state-of-the-art systems in Vietnamese Machine Reading Comprehension. This difficulty in researching is because of the limited number of high-quality works in developing Vietnamese language models. In order to encourage more work in this research field, we present a comprehensive analysis of language weaknesses and strengths of current Vietnamese monolingual models using the downstream task of Machine Reading Comprehension. From the analysis results, we suggest new directions for developing Vietnamese language models. Besides this main contribution, we also successfully reveal the existence of artifacts in Vietnamese Machine Reading Comprehension benchmarks and suggest an urgent need for new high-quality benchmarks to track the progress of Vietnamese Machine Reading Comprehension. Moreover, we also introduced a minor but valuable modification to the process of annotating unanswerable questions for Machine Reading Comprehension from previous work. Our proposed modification helps improve the quality of unanswerable questions to a higher level of difficulty for Machine Reading Comprehension systems to solve.


[246] 2303.13357

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

Transformer architectures have achieved SOTA performance on the human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficient model to reconstruct accurate human mesh is needed for real-world applications. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is memory and computationally expensive, we propose an efficient pooling attention module, which significantly reduces the memory and computational cost without sacrificing performance. Furthermore, we design a new transformer architecture by integrating a High-Resolution (HR) stream for the HMR task. The high-resolution local and global features from the HR stream can be utilized for recovering more accurate human mesh. Our POTTER outperforms the SOTA method METRO by only requiring 7% of total parameters and 14% of the Multiply-Accumulate Operations on the Human3.6M (PA-MPJPE metric) and 3DPW (all three metrics) datasets. The project webpage is https://zczcwh.github.io/potter_page.


[247] 2303.13360

Towards the Scalable Evaluation of Cooperativeness in Language Models

It is likely that AI systems driven by pre-trained language models (PLMs) will increasingly be used to assist humans in high-stakes interactions with other agents, such as negotiation or conflict resolution. Consistent with the goals of Cooperative AI \citep{dafoe_open_2020}, we wish to understand and shape the multi-agent behaviors of PLMs in a pro-social manner. An important first step is the evaluation of model behaviour across diverse cooperation problems. Since desired behaviour in an interaction depends upon precise game-theoretic structure, we focus on generating scenarios with particular structures with both crowdworkers and a language model. Our work proceeds as follows. First, we discuss key methodological issues in the generation of scenarios corresponding to particular game-theoretic structures. Second, we employ both crowdworkers and a language model to generate such scenarios. We find that the quality of generations tends to be mediocre in both cases. We additionally get both crowdworkers and a language model to judge whether given scenarios align with their intended game-theoretic structure, finding mixed results depending on the game. Third, we provide a dataset of scenario based on our data generated. We provide both quantitative and qualitative evaluations of UnifiedQA and GPT-3 on this dataset. We find that instruct-tuned models tend to act in a way that could be perceived as cooperative when scaled up, while other models seemed to have flat scaling trends.


[248] 2303.13363

FS-Real: Towards Real-World Cross-Device Federated Learning

Federated Learning (FL) aims to train high-quality models in collaboration with distributed clients while not uploading their local data, which attracts increasing attention in both academia and industry. However, there is still a considerable gap between the flourishing FL research and real-world scenarios, mainly caused by the characteristics of heterogeneous devices and its scales. Most existing works conduct evaluations with homogeneous devices, which are mismatched with the diversity and variability of heterogeneous devices in real-world scenarios. Moreover, it is challenging to conduct research and development at scale with heterogeneous devices due to limited resources and complex software stacks. These two key factors are important yet underexplored in FL research as they directly impact the FL training dynamics and final performance, making the effectiveness and usability of FL algorithms unclear. To bridge the gap, in this paper, we propose an efficient and scalable prototyping system for real-world cross-device FL, FS-Real. It supports heterogeneous device runtime, contains parallelism and robustness enhanced FL server, and provides implementations and extensibility for advanced FL utility features such as personalization, communication compression and asynchronous aggregation. To demonstrate the usability and efficiency of FS-Real, we conduct extensive experiments with various device distributions, quantify and analyze the effect of the heterogeneous device and various scales, and further provide insights and open discussions about real-world FL scenarios. Our system is released to help to pave the way for further real-world FL research and broad applications involving diverse devices and scales.


[249] 2303.13364

Reevaluating Data Partitioning for Emotion Detection in EmoWOZ

This paper focuses on the EmoWoz dataset, an extension of MultiWOZ that provides emotion labels for the dialogues. MultiWOZ was partitioned initially for another purpose, resulting in a distributional shift when considering the new purpose of emotion recognition. The emotion tags in EmoWoz are highly imbalanced and unevenly distributed across the partitions, which causes sub-optimal performance and poor comparison of models. We propose a stratified sampling scheme based on emotion tags to address this issue, improve the dataset's distribution, and reduce dataset shift. We also introduce a special technique to handle conversation (sequential) data with many emotional tags. Using our proposed sampling method, models built upon EmoWoz can perform better, making it a more reliable resource for training conversational agents with emotional intelligence. We recommend that future researchers use this new partitioning to ensure consistent and accurate performance evaluations.


[250] 2303.13365

Requirement Formalisation using Natural Language Processing and Machine Learning: A Systematic Review

Improvement of software development methodologies attracts developers to automatic Requirement Formalisation (RF) in the Requirement Engineering (RE) field. The potential advantages by applying Natural Language Processing (NLP) and Machine Learning (ML) in reducing the ambiguity and incompleteness of requirement written in natural languages is reported in different studies. The goal of this paper is to survey and classify existing work on NLP and ML for RF, identifying challenges in this domain and providing promising future research directions. To achieve this, we conducted a systematic literature review to outline the current state-of-the-art of NLP and ML techniques in RF by selecting 257 papers from common used libraries. The search result is filtered by defining inclusion and exclusion criteria and 47 relevant studies between 2012 and 2022 are selected. We found that heuristic NLP approaches are the most common NLP techniques used for automatic RF, primary operating on structured and semi-structured data. This study also revealed that Deep Learning (DL) technique are not widely used, instead classical ML techniques are predominant in the surveyed studies. More importantly, we identified the difficulty of comparing the performance of different approaches due to the lack of standard benchmark cases for RF.


[251] 2303.13367

ChatGPT and a New Academic Reality: AI-Written Research Papers and the Ethics of the Large Language Models in Scholarly Publishing

This paper discusses OpenAIs ChatGPT, a generative pre-trained transformer, which uses natural language processing to fulfill text-based user requests (i.e., a chatbot). The history and principles behind ChatGPT and similar models are discussed. This technology is then discussed in relation to its potential impact on academia and scholarly research and publishing. ChatGPT is seen as a potential model for the automated preparation of essays and other types of scholarly manuscripts. Potential ethical issues that could arise with the emergence of large language models like GPT-3, the underlying technology behind ChatGPT, and its usage by academics and researchers, are discussed and situated within the context of broader advancements in artificial intelligence, machine learning, and natural language processing for research and scholarly publishing.


[252] 2303.13371

Plug-and-Play Regulators for Image-Text Matching

Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are plug-and-play: both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR.


[253] 2303.13372

Adversarial Robustness of Learning-based Static Malware Classifiers

Malware detection has long been a stage for an ongoing arms race between malware authors and anti-virus systems. Solutions that utilize machine learning (ML) gain traction as the scale of this arms race increases. This trend, however, makes performing attacks directly on ML an attractive prospect for adversaries. We study this arms race from both perspectives in the context of MalConv, a popular convolutional neural network-based malware classifier that operates on raw bytes of files. First, we show that MalConv is vulnerable to adversarial patch attacks: appending a byte-level patch to malware files bypasses detection 94.3% of the time. Moreover, we develop a universal adversarial patch (UAP) attack where a single patch can drop the detection rate in constant time of any malware file that contains it by 80%. These patches are effective even being relatively small with respect to the original file size -- between 2%-8%. As a countermeasure, we then perform window ablation that allows us to apply de-randomized smoothing, a modern certified defense to patch attacks in vision tasks, to raw files. The resulting `smoothed-MalConv' can detect over 80% of malware that contains the universal patch and provides certified robustness up to 66%, outlining a promising step towards robust malware detection. To our knowledge, we are the first to apply universal adversarial patch attack and certified defense using ablations on byte level in the malware field.


[254] 2303.13373

Fine-tuning ClimateBert transformer with ClimaText for the disclosure analysis of climate-related financial risks

In recent years there has been a growing demand from financial agents, especially from particular and institutional investors, for companies to report on climate-related financial risks. A vast amount of information, in text format, can be expected to be disclosed in the short term by firms in order to identify these types of risks in their financial and non financial reports, particularly in response to the growing regulation that is being passed on the matter. To this end, this paper applies state-of-the-art NLP techniques to achieve the detection of climate change in text corpora. We use transfer learning to fine-tune two transformer models, BERT and ClimateBert -a recently published DistillRoBERTa-based model that has been specifically tailored for climate text classification-. These two algorithms are based on the transformer architecture which enables learning the contextual relationships between words in a text. We carry out the fine-tuning process of both models on the novel Clima-Text database, consisting of data collected from Wikipedia, 10K Files Reports and web-based claims. Our text classification model obtained from the ClimateBert fine-tuning process on ClimaText, outperforms the models created with BERT and the current state-of-the-art transformer in this particular problem. Our study is the first one to implement on the ClimaText database the recently published ClimateBert algorithm. Based on our results, it can be said that ClimateBert fine-tuned on ClimaText is an outstanding tool within the NLP pre-trained transformer models that may and should be used by investors, institutional agents and companies themselves to monitor the disclosure of climate risk in financial reports. In addition, our transfer learning methodology is cheap in computational terms, thus allowing any organization to perform it.


[255] 2303.13375

Capabilities of GPT-4 on Medical Challenge Problems

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.


[256] 2303.13378

Searching a Tree with Signals: Routing Mobile Sensors for Targets Emitting Radiation, Chemicals or Scents

Adversarial search of a network for an immobile Hider (or target) was introduced and solved for rooted trees by Gal (1979). In this zero-sum game, a Hider picks a point to hide on the tree and a Searcher picks a unit speed trajectory starting at the root. The payoff (to the Hider) is the search time. In Gal's model (and many subsequent investigations), the Searcher receives no additional information after the Hider chooses his location. In reality, the Searcher will often receive such locational information. For homeland security, mobile sensors on vehicles have been used to locate radioactive material stashed in an urban environment. In a military setting, mobile sensors can detect chemical signatures from land mines. In predator-prey search, the predator often has specially attuned senses (hearing for wolves, vision for eagles, smell for dogs, sonar for bats, pressure sensors for sharks) that may help it locate the prey. How can such noisy locational information be used by the Searcher to modify her route? We model such information as signals which indicate which of two branches of a binary tree should be searched first, where the signal has a known accuracy p<1. Our solution calculates which branch (at every branch node) is favored, meaning it should always be searched first when the signal is in that direction. When the signal is in the other direction, we calculate the probability the signal should be followed. Compared to the optimal Hider strategy in the classic search game of Gal, the Hider's optimal distribution for this model is more skewed towards leaf nodes that are further from the root.


[257] 2303.13379

Practical and Ethical Challenges of Large Language Models in Education: A Systematic Literature Review

Educational technology innovations that have been developed based on large language models (LLMs) have shown the potential to automate the laborious process of generating and analysing textual content. While various innovations have been developed to automate a range of educational tasks (e.g., question generation, feedback provision, and essay grading), there are concerns regarding the practicality and ethicality of these innovations. Such concerns may hinder future research and the adoption of LLMs-based innovations in authentic educational contexts. To address this, we conducted a systematic literature review of 118 peer-reviewed papers published since 2017 to pinpoint the current state of research on using LLMs to automate and support educational tasks. The practical and ethical challenges of LLMs-based innovations were also identified by assessing their technological readiness, model performance, replicability, system transparency, privacy, equality, and beneficence. The findings were summarised into three recommendations for future studies, including updating existing innovations with state-of-the-art models (e.g., GPT-3), embracing the initiative of open-sourcing models/systems, and adopting a human-centred approach throughout the developmental process. These recommendations could support future research to develop practical and ethical innovations for supporting diverse educational tasks and benefiting students, teachers, and institutions.


[258] 2303.13381

Cosys-AirSim: A Real-Time Simulation Framework Expanded for Complex Industrial Applications

Within academia and industry, there has been a need for expansive simulation frameworks that include model-based simulation of sensors, mobile vehicles, and the environment around them. To this end, the modular, real-time, and open-source AirSim framework has been a popular community-built system that fulfills some of those needs. However, the framework required adding systems to serve some complex industrial applications, including designing and testing new sensor modalities, Simultaneous Localization And Mapping (SLAM), autonomous navigation algorithms, and transfer learning with machine learning models. In this work, we discuss the modification and additions to our open-source version of the AirSim simulation framework, including new sensor modalities, vehicle types, and methods to generate realistic environments with changeable objects procedurally. Furthermore, we show the various applications and use cases the framework can serve.


[259] 2303.13382

Covariance Steering for Uncertain Contact-rich Systems

Planning and control for uncertain contact systems is challenging as it is not clear how to propagate uncertainty for planning. Contact-rich tasks can be modeled efficiently using complementarity constraints among other techniques. In this paper, we present a stochastic optimization technique with chance constraints for systems with stochastic complementarity constraints. We use a particle filter-based approach to propagate moments for stochastic complementarity system. To circumvent the issues of open-loop chance constrained planning, we propose a contact-aware controller for covariance steering of the complementarity system. Our optimization problem is formulated as Non-Linear Programming (NLP) using bilevel optimization. We present an important-particle algorithm for numerical efficiency for the underlying control problem. We verify that our contact-aware closed-loop controller is able to steer the covariance of the states under stochastic contact-rich tasks.


[260] 2303.13385

Planning for Manipulation among Movable Objects: Deciding Which Objects Go Where, in What Order, and How

We are interested in pick-and-place style robot manipulation tasks in cluttered and confined 3D workspaces among movable objects that may be rearranged by the robot and may slide, tilt, lean or topple. A recently proposed algorithm, M4M, determines which objects need to be moved and where by solving a Multi-Agent Pathfinding MAPF abstraction of this problem. It then utilises a nonprehensile push planner to compute actions for how the robot might realise these rearrangements and a rigid body physics simulator to check whether the actions satisfy physics constraints encoded in the problem. However, M4M greedily commits to valid pushes found during planning, and does not reason about orderings over pushes if multiple objects need to be rearranged. Furthermore, M4M does not reason about other possible MAPF solutions that lead to different rearrangements and pushes. In this paper, we extend M4M and present Enhanced-M4M (E-M4M) -- a systematic graph search-based solver that searches over orderings of pushes for movable objects that need to be rearranged and different possible rearrangements of the scene. We introduce several algorithmic optimisations to circumvent the increased computational complexity, discuss the space of problems solvable by E-M4M and show that experimentally, both on the real robot and in simulation, it significantly outperforms the original M4M algorithm, as well as other state-of-the-art alternatives when dealing with complex scenes.


[261] 2303.13386

Compositional Zero-Shot Domain Transfer with Text-to-Text Models

Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation which enables data augmentation for self-finetuning and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.


[262] 2303.13390

On Designing a Learning Robot: Improving Morphology for Enhanced Task Performance and Learning

As robots become more prevalent, optimizing their design for better performance and efficiency is becoming increasingly important. However, current robot design practices overlook the impact of perception and design choices on a robot's learning capabilities. To address this gap, we propose a comprehensive methodology that accounts for the interplay between the robot's perception, hardware characteristics, and task requirements. Our approach optimizes the robot's morphology holistically, leading to improved learning and task execution proficiency. To achieve this, we introduce a Morphology-AGnostIc Controller (MAGIC), which helps with the rapid assessment of different robot designs. The MAGIC policy is efficiently trained through a novel PRIvileged Single-stage learning via latent alignMent (PRISM) framework, which also encourages behaviors that are typical of robot onboard observation. Our simulation-based results demonstrate that morphologies optimized holistically improve the robot performance by 15-20% on various manipulation tasks, and require 25x less data to match human-expert made morphology performance. In summary, our work contributes to the growing trend of learning-based approaches in robotics and emphasizes the potential in designing robots that facilitate better learning.


[263] 2303.13391

Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis

Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which often is scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaption to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis in the clinical workflow, methods should be transparent and explainable, increasing medical professionals' trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations, which a radiologist would look for on an X-Ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.


[264] 2303.13395

Dual-Quaternion Interpolation

Transformations in the field of computer graphics and geometry are one of the most important concepts for efficient manipulation and control of objects in 2-dimensional and 3-dimensional space. Transformations take many forms each with their advantages and disadvantages. A particularly powerful tool for representing transforms in a unified form are dual-quaternions. A benefit of this unified form is the interpolation properties, which address a range of limitations (compact form that allows a rotational and translational components to be coupled). In this article, we examine various dual-quaternion interpolation options that achieve different trade-offs between computational cost, aesthetic factors and coupling dependency. Surprisingly, despite dual-quaternions being a common tool in graphics libraries, there are limited details on the interpolation details. Here we attempt to explain interpolation concept, elaborating on underpinning theories, while explaining concepts and bespoke modifications for added control.


[265] 2303.13396

Zero-guidance Segmentation Using Zero Segment Labels

CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and label them using natural language automatically? We propose a novel problem zero-guidance segmentation and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.


[266] 2303.13397

DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video

Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. Compared to single image-based methods, video-based methods can utilize temporal information to further improve performance by incorporating human body motion priors. However, many-to-many approaches such as VIBE suffer from motion smoothness and temporal inconsistency. While many-to-one approaches such as TCMR and MPS-Net rely on the future frames, which is non-causal and time inefficient during inference. To address these challenges, a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is presented. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. As a many-to-many approach, the decoder of our DDT outputs the human mesh of all the frames, making DDT more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments are conducted on the widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW), which demonstrated the effectiveness and efficiency of our DDT.


[267] 2303.13399

Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation

Interactive segmentation enables users to segment as needed by providing cues of objects, which introduces human-computer interaction for many fields, such as image editing and medical image analysis. Typically, massive and expansive pixel-level annotations are spent to train deep models by object-oriented interactions with manually labeled object masks. In this work, we reveal that informative interactions can be made by simulation with semantic-consistent yet diverse region exploration in an unsupervised paradigm. Concretely, we introduce a Multi-granularity Interaction Simulation (MIS) approach to open up a promising direction for unsupervised interactive segmentation. Drawing on the high-quality dense features produced by recent self-supervised models, we propose to gradually merge patches or regions with similar features to form more extensive regions and thus, every merged region serves as a semantic-meaningful multi-granularity proposal. By randomly sampling these proposals and simulating possible interactions based on them, we provide meaningful interaction at multiple granularities to teach the model to understand interactions. Our MIS significantly outperforms non-deep learning unsupervised methods and is even comparable with some previous deep-supervised methods without any annotation.


[268] 2303.13401

Optimization and Optimizers for Adversarial Robustness

Empirical robustness evaluation (RE) of deep learning models against adversarial perturbations entails solving nontrivial constrained optimization problems. Existing numerical algorithms that are commonly used to solve them in practice predominantly rely on projected gradient, and mostly handle perturbations modeled by the $\ell_1$, $\ell_2$ and $\ell_\infty$ distances. In this paper, we introduce a novel algorithmic framework that blends a general-purpose constrained-optimization solver PyGRANSO with Constraint Folding (PWCF), which can add more reliability and generality to the state-of-the-art RE packages, e.g., AutoAttack. Regarding reliability, PWCF provides solutions with stationarity measures and feasibility tests to assess the solution quality. For generality, PWCF can handle perturbation models that are typically inaccessible to the existing projected gradient methods; the main requirement is the distance metric to be almost everywhere differentiable. Taking advantage of PWCF and other existing numerical algorithms, we further explore the distinct patterns in the solutions found for solving these optimization problems using various combinations of losses, perturbation models, and optimization algorithms. We then discuss the implications of these patterns on the current robustness evaluation and adversarial training.


[269] 2303.13402

Return on Investment Driven Observability

Observability, in cloud native systems, is the capability to continuously generate and discover actionable insights, based on signals from the system under observation. How do you know what insights are the most useful ones? What signals should you be using to generate insights? This article discusses challenges arising when rolling out observability in organizations and how you can, based on Return on Investment (RoI) analysis, address said challenges.


[270] 2303.13405

SC-MIL: Supervised Contrastive Multiple Instance Learning for Imbalanced Classification in Pathology

Multiple Instance learning (MIL) models have been extensively used in pathology to predict biomarkers and risk-stratify patients from gigapixel-sized images. Machine learning problems in medical imaging often deal with rare diseases, making it important for these models to work in a label-imbalanced setting. Furthermore, these imbalances can occur in out-of-distribution (OOD) datasets when the models are deployed in the real-world. We leverage the idea that decoupling feature and classifier learning can lead to improved decision boundaries for label imbalanced datasets. To this end, we investigate the integration of supervised contrastive learning with multiple instance learning (SC-MIL). Specifically, we propose a joint-training MIL framework in the presence of label imbalance that progressively transitions from learning bag-level representations to optimal classifier learning. We perform experiments with different imbalance settings for two well-studied problems in cancer pathology: subtyping of non-small cell lung cancer and subtyping of renal cell carcinoma. SC-MIL provides large and consistent improvements over other techniques on both in-distribution (ID) and OOD held-out sets across multiple imbalanced settings.


[271] 2303.13408

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.


[272] 2303.13412

Low-Light Image Enhancement by Learning Contrastive Representations in Spatial and Frequency Domains

Images taken under low-light conditions tend to suffer from poor visibility, which can decrease image quality and even reduce the performance of the downstream tasks. It is hard for a CNN-based method to learn generalized features that can recover normal images from the ones under various unknow low-light conditions. In this paper, we propose to incorporate the contrastive learning into an illumination correction network to learn abstract representations to distinguish various low-light conditions in the representation space, with the purpose of enhancing the generalizability of the network. Considering that light conditions can change the frequency components of the images, the representations are learned and compared in both spatial and frequency domains to make full advantage of the contrastive learning. The proposed method is evaluated on LOL and LOL-V2 datasets, the results show that the proposed method achieves better qualitative and quantitative results compared with other state-of-the-arts.


[273] 2303.13415

Block constrained pressure residual preconditioning for two-phase flow in porous media by mixed hybrid finite elements

This work proposes an original preconditioner that couples the Constrained Pressure Residual (CPR) method with block preconditioning for the efficient solution of the linearized systems of equations arising from fully implicit multiphase flow models. This preconditioner, denoted as Block CPR (BCPR), is specifically designed for Lagrange multipliers-based flow models, such as those generated by Mixed Hybrid Finite Element (MHFE) approximations. An original MHFE-based formulation of the two-phase flow model is taken as a reference for the development of the BCPR preconditioner, in which the set of system unknowns comprises both element and face pressures, in addition to the cell saturations, resulting in a $3 \times 3$ block-structured Jacobian matrix with a $2 \times 2$ inner pressure problem. The CPR method is one of the most established techniques for reservoir simulations, but most research focused on solutions for Two-Point Flux Approximation (TPFA)-based discretizations that do not readily extend to our problem formulation. Therefore, we designed a dedicated two-stage strategy, inspired by the CPR algorithm, where a block preconditioner is used for the pressure part with the aim at exploiting the inner $2 \times 2$ structure. The proposed preconditioning framework is tested by an extensive experimentation, comprising both synthetic and realistic applications in Cartesian and non-Cartesian domains.


[274] 2303.13416

A Unified Framework for Learned Sparse Retrieval

Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial differences in effectiveness and efficiency. Differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method's effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can reduce latency significantly while maintaining effectiveness on MSMarco and TripClick benchmarks. Our code is publicly available at https://github.com/thongnt99/learned-sparse-retrieval


[275] 2303.13418

GiveMeLabeledIssues: An Open Source Issue Recommendation System

Developers often struggle to navigate an Open Source Software (OSS) project's issue-tracking system and find a suitable task. Proper issue labeling can aid task selection, but current tools are limited to classifying the issues according to their type (e.g., bug, question, good first issue, feature, etc.). In contrast, this paper presents a tool (GiveMeLabeledIssues) that mines project repositories and labels issues based on the skills required to solve them. We leverage the domain of the APIs involved in the solution (e.g., User Interface (UI), Test, Databases (DB), etc.) as a proxy for the required skills. GiveMeLabeledIssues facilitates matching developers' skills to tasks, reducing the burden on project maintainers. The tool obtained a precision of 83.9% when predicting the API domains involved in the issues. The replication package contains instructions on executing the tool and including new projects. A demo video is available at https://www.youtube.com/watch?v=ic2quUue7i8


[276] 2303.13419

Modular Retrieval for Generalization and Interpretation

New retrieval tasks have always been emerging, thus urging the development of new retrieval models. However, instantiating a retrieval model for each new retrieval task is resource-intensive and time-consuming, especially for a retrieval model that employs a large-scale pre-trained language model. To address this issue, we shift to a novel retrieval paradigm called modular retrieval, which aims to solve new retrieval tasks by instead composing multiple existing retrieval modules. Built upon the paradigm, we propose a retrieval model with modular prompt tuning named REMOP. It constructs retrieval modules subject to task attributes with deep prompt tuning, and yields retrieval models subject to tasks with module composition. We validate that, REMOP inherently with modularity not only has appealing generalizability and interpretability in preliminary explorations, but also achieves comparable performance to state-of-the-art retrieval models on a zero-shot retrieval benchmark.\footnote{Our code is available at \url{https://github.com/FreedomIntelligence/REMOP}}


[277] 2303.13430

Medical diffusion on a budget: textual inversion for medical image generation

Diffusion-based models for text-to-image generation have gained immense popularity due to recent advancements in efficiency, accessibility, and quality. Although it is becoming increasingly feasible to perform inference with these systems using consumer-grade GPUs, training them from scratch still requires access to large datasets and significant computational resources. In the case of medical image generation, the availability of large, publicly accessible datasets that include text reports is limited due to legal and ethical concerns. While training a diffusion model on a private dataset may address this issue, it is not always feasible for institutions lacking the necessary computational resources. This work demonstrates that pre-trained Stable Diffusion models, originally trained on natural images, can be adapted to various medical imaging modalities by training text embeddings with textual inversion. In this study, we conducted experiments using medical datasets comprising only 100 samples from three medical modalities. Embeddings were trained in a matter of hours, while still retaining diagnostic relevance in image generation. Experiments were designed to achieve several objectives. Firstly, we fine-tuned the training and inference processes of textual inversion, revealing that larger embeddings and more examples are required. Secondly, we validated our approach by demonstrating a 2\% increase in the diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we performed simulations by interpolating between healthy and diseased states, combining multiple pathologies, and inpainting to show embedding flexibility and control of disease appearance. Finally, the embeddings trained in this study are small (less than 1 MB), which facilitates easy sharing of medical data with reduced privacy concerns.


[278] 2303.13434

Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

Endeavors have been recently made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for targeted samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game theory's perspective with the proposed model dubbed as PMTrans, which bridges source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on the game-theoretical models. This way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.


[279] 2303.13439

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .


[280] 2303.13440

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

In this paper, we leverage CLIP for zero-shot sketch based image retrieval (ZS-SBIR). We are largely inspired by recent advances on foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First we show just via factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts, by a large margin (24.8%) - a great testimony on studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establishing instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over previous state-of-the-art. The take-home message, if any, is the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.


[281] 2303.13446

On the Utility of Koopman Operator Theory in Learning Dexterous Manipulation Skills

Recent advances in learning-based approaches have led to impressive dexterous manipulation capabilities. Yet, we haven't witnessed widespread adoption of these capabilities beyond the laboratory. This is likely due to practical limitations, such as significant computational burden, inscrutable policy architectures, sensitivity to parameter initializations, and the considerable technical expertise required for implementation. In this work, we investigate the utility of Koopman operator theory in alleviating these limitations. Koopman operators are simple yet powerful control-theoretic structures that help represent complex nonlinear dynamics as linear systems in higher-dimensional spaces. Motivated by the fact that complex nonlinear dynamics underlie dexterous manipulation, we develop an imitation learning framework that leverages Koopman operators to simultaneously learn the desired behavior of both robot and object states. We demonstrate that a Koopman operator-based framework is surprisingly effective for dexterous manipulation and offers a number of unique benefits. First, the learning process is analytical, eliminating the sensitivity to parameter initializations and painstaking hyperparameter optimization. Second, the learned reference dynamics can be combined with a task-agnostic tracking controller such that task changes and variations can be handled with ease. Third, a Koopman operator-based approach can perform comparably to state-of-the-art imitation learning algorithms in terms of task success rate and imitation error, while being an order of magnitude more computationally efficient. In addition, we discuss a number of avenues for future research made available by this work.


[282] 2303.13447

Towards Transparent, Reusable, and Customizable Data Science in Computational Notebooks

Data science workflows are human-centered processes involving on-demand programming and analysis. While programmable and interactive interfaces such as widgets embedded within computational notebooks are suitable for these workflows, they lack robust state management capabilities and do not support user-defined customization of the interactive components. The absence of such capabilities hinders workflow reusability and transparency while limiting the scope of exploration of the end-users. In response, we developed MAGNETON, a framework for authoring interactive widgets within computational notebooks that enables transparent, reusable, and customizable data science workflows. The framework enhances existing widgets to support fine-grained interaction history management, reusable states, and user-defined customizations. We conducted three case studies in a real-world knowledge graph construction and serving platform to evaluate the effectiveness of these widgets. Based on the observations, we discuss future implications of employing MAGNETON widgets for general-purpose data science workflows.


[283] 2303.13450

Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes

Recent breakthroughs in text-guided image generation have led to remarkable progress in the field of 3D synthesis from text. By optimizing neural radiance fields (NeRF) directly from text, recent methods are able to produce remarkable results. Yet, these methods are limited in their control of each object's placement or appearance, as they represent the scene as a whole. This can be a major issue in scenarios that require refining or manipulating objects in the scene. To remedy this deficit, we propose a novel GlobalLocal training framework for synthesizing a 3D scene using object proxies. A proxy represents the object's placement in the generated scene and optionally defines its coarse geometry. The key to our approach is to represent each object as an independent NeRF. We alternate between optimizing each NeRF on its own and as part of the full scene. Thus, a complete representation of each object can be learned, while also creating a harmonious scene with style and lighting match. We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object, removing objects from a scene, or refining an object. Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation, filling a crucial gap in controllable text-to-3D synthesis.


[284] 2303.13451

Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse

The objective of this study is to address the critical issue of de-identification of clinical reports in order to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities, and built a hybrid system, merging the results of a deep learning model as well as manual rules. Our results show an overall performance of 0.99 of F1-score. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, or rule addition. We share guidelines and code under a 3-Clause BSD license.


[285] 2303.13452

Human Behavior in the Time of COVID-19: Learning from Big Data

Since the World Health Organization (WHO) characterized COVID-19 as a pandemic in March 2020, there have been over 600 million confirmed cases of COVID-19 and more than six million deaths as of October 2022. The relationship between the COVID-19 pandemic and human behavior is complicated. On one hand, human behavior is found to shape the spread of the disease. On the other hand, the pandemic has impacted and even changed human behavior in almost every aspect. To provide a holistic understanding of the complex interplay between human behavior and the COVID-19 pandemic, researchers have been employing big data techniques such as natural language processing, computer vision, audio signal processing, frequent pattern mining, and machine learning. In this study, we present an overview of the existing studies on using big data techniques to study human behavior in the time of the COVID-19 pandemic. In particular, we categorize these studies into three groups - using big data to measure, model, and leverage human behavior, respectively. The related tasks, data, and methods are summarized accordingly. To provide more insights into how to fight the COVID-19 pandemic and future global catastrophes, we further discuss challenges and potential opportunities.


[286] 2303.13455

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with contrastive objective like CLIP, image-to-text generative objective like PaLI, or text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other as contrasting provides global alignment capacity and generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning.


[287] 2303.13458

Optimization Dynamics of Equivariant and Augmented Neural Networks

We investigate the optimization of multilayer perceptrons on symmetric data. We compare the strategy of constraining the architecture to be equivariant to that of using augmentation. We show that, under natural assumptions on the loss and non-linearities, the sets of equivariant stationary points are identical for the two strategies, and that the set of equivariant layers is invariant under the gradient flow for augmented models. Finally, we show that stationary points may be unstable for augmented training although they are stable for the equivariant models


[288] 2303.13463

W2KPE: Keyphrase Extraction with Word-Word Relation

This paper describes our submission to ICASSP 2023 MUG Challenge Track 4, Keyphrase Extraction, which aims to extract keyphrases most relevant to the conference theme from conference materials. We model the challenge as a single-class Named Entity Recognition task and developed techniques for better performance on the challenge: For the data preprocessing, we encode the split keyphrases after word segmentation. In addition, we increase the amount of input information that the model can accept at one time by fusing multiple preprocessed sentences into one segment. We replace the loss function with the multi-class focal loss to address the sparseness of keyphrases. Besides, we score each appearance of keyphrases and add an extra output layer to fit the score to rank keyphrases. Exhaustive evaluations are performed to find the best combination of the word segmentation tool, the pre-trained embedding model, and the corresponding hyperparameters. With these proposals, we scored 45.04 on the final test set.


[289] 2303.13465

Deep RL with Hierarchical Action Exploration for Dialogue Generation

Conventionally, since the natural language action space is astronomical, approximate dynamic programming applied to dialogue generation involves policy improvement with action sampling. However, such a practice is inefficient for reinforcement learning (RL) because the eligible (high action value) responses are very sparse, and the greedy policy sustained by the random sampling is flabby. This paper shows that the performance of dialogue policy positively correlated with sampling size by theoretical and experimental. We introduce a novel dual-granularity Q-function to alleviate this limitation by exploring the most promising response category to intervene in the sampling. It extracts the actions following the grained hierarchy, which can achieve the optimum with fewer policy iterations. Our approach learns in the way of offline RL from multiple reward functions designed to recognize human emotional details. Empirical studies demonstrate that our algorithm outperforms the baseline methods. Further verification presents that ours can generate responses with higher expected rewards and controllability.


[290] 2303.13466

Extracting Physical Rehabilitation Exercise Information from Clinical Notes: a Comparison of Rule-Based and Machine Learning Natural Language Processing Techniques

Physical rehabilitation plays a crucial role in the recovery process of post-stroke patients. By personalizing therapies for patients leveraging predictive modeling and electronic health records (EHRs), healthcare providers can make the rehabilitation process more efficient. Before predictive modeling can provide decision support for the assignment of treatment plans, automated methods are necessary to extract physical rehabilitation exercise information from unstructured EHRs. We introduce a rule-based natural language processing algorithm to annotate therapeutic procedures for stroke patients and compare it to several small machine learning models. We find that our algorithm outperforms these models in extracting half of the concepts where sufficient data is available, and individual exercise descriptions can be assigned binary labels with an f-score of no less than 0.75 per concept. More research needs to be done before these algorithms can be deployed on unlabeled documents, but current progress gives promise to the potential of precision rehabilitation research.


[291] 2303.13471

Egocentric Audio-Visual Object Localization

Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) The out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally available audio-visual temporal synchronization as the ``free'' self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes.


[292] 2303.13472

Plotting Behind the Scenes: Towards Learnable Game Engines

Game engines are powerful tools in computer graphics. Their power comes at the immense cost of their development. In this work, we present a framework to train game-engine-like neural models, solely from monocular annotated videos. The result-a Learnable Game Engine (LGE)-maintains states of the scene, objects and agents in it, and enables rendering the environment from a controllable viewpoint. Similarly to a game engine, it models the logic of the game and the underlying rules of physics, to make it possible for a user to play the game by specifying both high- and low-level action sequences. Most captivatingly, our LGE unlocks the director's mode, where the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents in the form of language and desired states. This requires learning "game AI", encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, devise the strategy to win a point. The key to learning such game AI is the exploitation of a large and diverse text corpus, collected in this work, describing detailed actions in a game and used to train our animation model. To render the resulting state of the environment and its agents, we use a compositional NeRF representation used in our synthesis model. To foster future research, we present newly collected, annotated and calibrated large-scale Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality. Besides, our LGEs unlock applications beyond capabilities of the current state of the art. Our framework, data, and models are available at https://learnable-game-engines.github.io/lge-website.


[293] 2303.13475

Learning Semantic Text Similarity to rank Hypernyms of Financial Terms

Over the years, there has been a paradigm shift in how users access financial services. With the advancement of digitalization more users have been preferring the online mode of performing financial activities. This has led to the generation of a huge volume of financial content. Most investors prefer to go through these contents before making decisions. Every industry has terms that are specific to the domain it operates in. Banking and Financial Services are not an exception to this. In order to fully comprehend these contents, one needs to have a thorough understanding of the financial terms. Getting a basic idea about a term becomes easy when it is explained with the help of the broad category to which it belongs. This broad category is referred to as hypernym. For example, "bond" is a hypernym of the financial term "alternative debenture". In this paper, we propose a system capable of extracting and ranking hypernyms for a given financial term. The system has been trained with financial text corpora obtained from various sources like DBpedia [4], Investopedia, Financial Industry Business Ontology (FIBO), prospectus and so on. Embeddings of these terms have been extracted using FinBERT [3], FinISH [1] and fine-tuned using SentenceBERT [54]. A novel approach has been used to augment the training set with negative samples. It uses the hierarchy present in FIBO. Finally, we benchmark the system performance with that of the existing ones. We establish that it performs better than the existing ones and is also scalable.


[294] 2303.13477

TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation

We propose a novel method for joint estimation of shape and pose of rigid objects from their sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we propose to formulate it as a neural optimization that learns to efficiently estimate the shape and pose. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, for efficient error computation in 2D image space. We formulate the joint estimation itself as a Transformer which we refer to as TransPoser. We fully leverage the tokenization and multi-head attention to sequentially process the growing set of observations and to efficiently update the shape and pose with a learned momentum, respectively. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation and TransPoser achieves state-of-the-art accuracy efficiently for joint shape and pose estimation.


[295] 2303.13479

Prior-free Category-level Pose Estimation with Implicit Space Transformation

Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods attained great success and have become a major research stream. However, obtaining category-specific priors requires collecting a large amount of 3D models, which is labor-consuming and often not accessible in practice. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not the credit to the high performance. The keypoint actually is the explicit deformation process, which aligns camera and world coordinates supervised by world-space 3D models (also called canonical space). Inspired by these observation, we introduce a simple prior-free implicit space transformation network, namely IST-Net, to transform camera-space features to world-space counterparts and build correspondence between them in an implicit manner without relying on 3D priors. Besides, we design camera- and world-space enhancers to enrich the features with pose-sensitive information and geometrical constraints, respectively. Albeit simple, IST-Net becomes the first prior-free method that achieves state-of-the-art performance, with top inference speed on the REAL275 dataset. Our code and models will be publicly available.


[296] 2303.13482

TactoFind: A Tactile Only System for Object Retrieval

We study the problem of object retrieval in scenarios where visual sensing is absent, object shapes are unknown beforehand and objects can move freely, like grabbing objects out of a drawer. Successful solutions require localizing free objects, identifying specific object instances, and then grasping the identified objects, only using touch feedback. Unlike vision, where cameras can observe the entire scene, touch sensors are local and only observe parts of the scene that are in contact with the manipulator. Moreover, information gathering via touch sensors necessitates applying forces on the touched surface which may disturb the scene itself. Reasoning with touch, therefore, requires careful exploration and integration of information over time -- a challenge we tackle. We present a system capable of using sparse tactile feedback from fingertip touch sensors on a dexterous hand to localize, identify and grasp novel objects without any visual feedback. Videos are available at https://taochenshh.github.io/projects/tactofind.


[297] 2303.13483

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. Modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance on settings of data-efficiency and generalization, and demonstrate zero-shot transfer to an unseen 3D question-answering task.


[298] 2303.13484

Why and When: Understanding System Initiative during Conversational Collaborative Search

In the last decade, conversational search has attracted considerable attention. However, most research has focused on systems that can support a \emph{single} searcher. In this paper, we explore how systems can support \emph{multiple} searchers collaborating over an instant messaging platform (i.e., Slack). We present a ``Wizard of Oz'' study in which 27 participant pairs collaborated on three information-seeking tasks over Slack. Participants were unable to search on their own and had to gather information by interacting with a \emph{searchbot} directly from the Slack channel. The role of the searchbot was played by a reference librarian. Conversational search systems must be capable of engaging in \emph{mixed-initiative} interaction by taking and relinquishing control of the conversation to fulfill different objectives. Discourse analysis research suggests that conversational agents can take \emph{two} levels of initiative: dialog- and task-level initiative. Agents take dialog-level initiative to establish mutual belief between agents and task-level initiative to influence the goals of the other agents. During the study, participants were exposed to three experimental conditions in which the searchbot could take different levels of initiative: (1) no initiative, (2) only dialog-level initiative, and (3) both dialog- and task-level initiative. In this paper, we focus on understanding the Wizard's actions. Specifically, we focus on understanding the Wizard's motivations for taking initiative and their rationale for the timing of each intervention. To gain insights about the Wizard's actions, we conducted a stimulated recall interview with the Wizard. We present findings from a qualitative analysis of this interview data and discuss implications for designing conversational search systems to support collaborative search.


[299] 2303.13485

Timely Multi-Process Estimation Over Erasure Channels With and Without Feedback

We consider a multi-process remote estimation system observing $K$ independent Ornstein-Uhlenbeck processes. In this system, a shared sensor samples the $K$ processes in such a way that the long-term average sum mean square error (MSE) is minimized. The sensor operates under a total sampling frequency constraint $f_{\max}$. The samples from all processes consume random processing delays in a shared queue and then are transmitted over an erasure channel with probability $\epsilon$. We study two variants of the problem: first, when the samples are scheduled according to a Maximum-Age-First (MAF) policy, and the receiver provides an erasure status feedback; and second, when samples are scheduled according to a Round-Robin (RR) policy, when there is no erasure status feedback from the receiver. Aided by optimal structural results, we show that the optimal sampling policy for both settings, under some conditions, is a \emph{threshold policy}. We characterize the optimal threshold and the corresponding optimal long-term average sum MSE as a function of $K$, $f_{\max}$, $\epsilon$, and the statistical properties of the observed processes. Our results show that, with an exponentially distributed service rate, the optimal threshold $\tau^*$ increases as the number of processes $K$ increases, for both settings. Additionally, we show that the optimal threshold is an \emph{increasing} function of $\epsilon$ in the case of \emph{available} erasure status feedback, while it exhibits the \emph{opposite behavior}, i.e., $\tau^*$ is a \emph{decreasing} function of $\epsilon$, in the case of \emph{absent} erasure status feedback.


[300] 2303.13489

Boosting Reinforcement Learning and Planning with Demonstrations: A Survey

Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments. The use of demonstrations, on the other hand, enables agents to benefit from expert knowledge rather than having to discover the best action to take through exploration. In this survey, we discuss the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision making paradigms (for example, reinforcement learning and planning in the learned models), and how to collect the demonstrations in various scenarios. Additionally, we exemplify a practical pipeline for generating and utilizing demonstrations in the recently proposed ManiSkill robot learning benchmark.


[301] 2303.13491

FraudAuditor: A Visual Analytics Approach for Collusive Fraud in Health Insurance

Collusive fraud, in which multiple fraudsters collude to defraud health insurance funds, threatens the operation of the healthcare system. However, existing statistical and machine learning-based methods have limited ability to detect fraud in the scenario of health insurance due to the high similarity of fraudulent behaviors to normal medical visits and the lack of labeled data. To ensure the accuracy of the detection results, expert knowledge needs to be integrated with the fraud detection process. By working closely with health insurance audit experts, we propose FraudAuditor, a three-stage visual analytics approach to collusive fraud detection in health insurance. Specifically, we first allow users to interactively construct a co-visit network to holistically model the visit relationships of different patients. Second, an improved community detection algorithm that considers the strength of fraud likelihood is designed to detect suspicious fraudulent groups. Finally, through our visual interface, users can compare, investigate, and verify suspicious patient behavior with tailored visualizations that support different time scales. We conducted case studies in a real-world healthcare scenario, i.e., to help locate the actual fraud group and exclude the false positive group. The results and expert feedback proved the effectiveness and usability of the approach.


[302] 2303.13493

Green Time-Critical Fog Communication and Computing

Fog computing allows computationally-heavy problems with tight time constraints to be solved even if end devices have limited computational resources and latency induced by cloud computing is too high. How can energy consumed by fog computing be saved while obeying latency constraints and considering not only computations but also transmission through wireless and wired links? This work examines the latency and energy consumption sources in fog networks and discusses models describing these costs for various technologies. Next, resource allocation strategies are discussed considering the various degrees of freedom available in such a complex system, and their influence on energy consumption and latency. Finally, a vision for a future distributed, AI-driven resources allocation strategy is presented and justified.


[303] 2303.13494

Attention! Dynamic Epistemic Logic Models of (In)attentive Agents

Attention is the crucial cognitive ability that limits and selects what information we observe. Previous work by Bolander et al. (2016) proposes a model of attention based on dynamic epistemic logic (DEL) where agents are either fully attentive or not attentive at all. While introducing the realistic feature that inattentive agents believe nothing happens, the model does not represent the most essential aspect of attention: its selectivity. Here, we propose a generalization that allows for paying attention to subsets of atomic formulas. We introduce the corresponding logic for propositional attention, and show its axiomatization to be sound and complete. We then extend the framework to account for inattentive agents that, instead of assuming nothing happens, may default to a specific truth-value of what they failed to attend to (a sort of prior concerning the unattended atoms). This feature allows for a more cognitively plausible representation of the inattentional blindness phenomenon, where agents end up with false beliefs due to their failure to attend to conspicuous but unexpected events. Both versions of the model define attention-based learning through appropriate DEL event models based on a few and clear edge principles. While the size of such event models grow exponentially both with the number of agents and the number of atoms, we introduce a new logical language for describing event models syntactically and show that using this language our event models can be represented linearly in the number of agents and atoms. Furthermore, representing our event models using this language is achieved by a straightforward formalisation of the aforementioned edge principles.


[304] 2303.13495

ReVersion: Diffusion-Based Relation Inversion from Images

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior" - real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.


[305] 2303.13496

The effectiveness of MAE pre-pretraining for billion-scale pretraining

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.


[306] 2303.13497

TriPlaneNet: An Encoder for EG3D Inversion

Recent progress in NeRF-based GANs has introduced a number of approaches for high-resolution and high-fidelity generative modeling of human heads with a possibility for novel view rendering. At the same time, one must solve an inverse problem to be able to re-render or modify an existing image or video. Despite the success of universal optimization-based methods for 2D GAN inversion, those, applied to 3D GANs, may fail to produce 3D-consistent renderings. Fast encoder-based techniques, such as those developed for StyleGAN, may also be less appealing due to the lack of identity preservation. In our work, we introduce a real-time method that bridges the gap between the two approaches by directly utilizing the tri-plane representation introduced for EG3D generative model. In particular, we build upon a feed-forward convolutional encoder for the latent code and extend it with a fully-convolutional predictor of tri-plane numerical offsets. As shown in our work, the renderings are similar in quality to optimization-based techniques and significantly outperform the baselines for novel view. As we empirically prove, this is a consequence of directly operating in the tri-plane space, not in the GAN parameter space, while making use of an encoder-based trainable approach.


[307] 2303.13500

A Closer Look at Model Adaptation using Feature Distortion and Simplicity Bias

Advances in the expressivity of pretrained models have increased interest in the design of adaptation protocols which enable safe and effective transfer learning. Going beyond conventional linear probing (LP) and fine tuning (FT) strategies, protocols that can effectively control feature distortion, i.e., the failure to update features orthogonal to the in-distribution, have been found to achieve improved out-of-distribution generalization (OOD). In order to limit this distortion, the LP+FT protocol, which first learns a linear probe and then uses this initialization for subsequent FT, was proposed. However, in this paper, we find when adaptation protocols (LP, FT, LP+FT) are also evaluated on a variety of safety objectives (e.g., calibration, robustness, etc.), a complementary perspective to feature distortion is helpful to explain protocol behavior. To this end, we study the susceptibility of protocols to simplicity bias (SB), i.e. the well-known propensity of deep neural networks to rely upon simple features, as SB has recently been shown to underlie several problems in robust generalization. Using a synthetic dataset, we demonstrate the susceptibility of existing protocols to SB. Given the strong effectiveness of LP+FT, we then propose modified linear probes that help mitigate SB, and lead to better initializations for subsequent FT. We verify the effectiveness of the proposed LP+FT variants for decreasing SB in a controlled setting, and their ability to improve OOD generalization and safety on three adaptation datasets.


[308] 2303.13501

Chordal Averaging on Flag Manifolds and Its Applications

This paper presents a new, provably-convergent algorithm for computing the flag-mean and flag-median of a set of points on a flag manifold under the chordal metric. The flag manifold is a mathematical space consisting of flags, which are sequences of nested subspaces of a vector space that increase in dimension. The flag manifold is a superset of a wide range of known matrix groups, including Stiefel and Grassmanians, making it a general object that is useful in a wide variety computer vision problems. To tackle the challenge of computing first order flag statistics, we first transform the problem into one that involves auxiliary variables constrained to the Stiefel manifold. The Stiefel manifold is a space of orthogonal frames, and leveraging the numerical stability and efficiency of Stiefel-manifold optimization enables us to compute the flag-mean effectively. Through a series of experiments, we show the competence of our method in Grassmann and rotation averaging, as well as principal component analysis.


[309] 2303.13504

ReBotNet: Fast Real-time Video Enhancement

Most video restoration networks are slow, have high computational load, and can't be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use-cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features form the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach where the last frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.


[310] 2303.13505

A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition

The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observation suggests that current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR


[311] 2303.13506

The Quantization Model of Neural Scaling

We propose the $\textit{Quantization Model}$ of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the $\textit{Quantization Hypothesis}$, where learned network capabilities are quantized into discrete chunks ($\textit{quanta}$). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model internals, we auto-discover diverse model capabilities (quanta) and find tentative evidence that the distribution over corresponding subproblems in the prediction of natural text is compatible with the power law predicted from the neural scaling exponent as predicted from our theory.


[312] 2303.13508

DreamBooth3D: Subject-Driven Text-to-3D Generation

We present DreamBooth3D, an approach to personalize text-to-3D generative models from as few as 3-6 casually captured images of a subject. Our approach combines recent advances in personalizing text-to-image models (DreamBooth) with text-to-3D generation (DreamFusion). We find that naively combining these methods fails to yield satisfactory subject-specific 3D assets due to personalized text-to-image models overfitting to the input viewpoints of the subject. We overcome this through a 3-stage optimization strategy where we jointly leverage the 3D consistency of neural radiance fields together with the personalization capability of text-to-image models. Our method can produce high-quality, subject-specific 3D assets with text-driven modifications such as novel poses, colors and attributes that are not seen in any of the input images of the subject.


[313] 2303.13509

Position-Guided Point Cloud Panoptic Segmentation Transformer

DEtection TRansformer (DETR) started a trend that uses a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in the sparse point clouds are relatively small to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, which are rare in the image domain. Considering instances in 3D are more featured by their positional information, we emphasize their roles during the modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on SemanticKITTI and nuScenes benchmark, respectively. The source code and models are available at https://github.com/SmartBot-PJLab/P3Former .


[314] 2303.13510

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR .


[315] 2303.13511

Neural Preset for Color Style Transfer

In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization.


[316] 2303.13512

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition

To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use human feedback as channels to learn the desired behavior. We describe the competition and provide an overview of the top solutions. We conclude by discussing the impact of the competition and future directions for improvement.


[317] 2303.13514

SAOR: Single-View Articulated Object Reconstruction

We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.


[318] 2303.13515

Persistent Nature: A Generative Model of Unbounded 3D Worlds

Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency--for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/.


[319] 2303.13516

Ablating Concepts in Text-to-Image Diffusion Models

Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model.


[320] 2303.13518

Three ways to improve feature alignment for open vocabulary detection

The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs thus improving detection performance on classes with no human annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new stateof-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.


[321] 2303.13519

Learning and Verification of Task Structure in Instructional Videos

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.


[322] 2112.08977

Artificial diffusion for convective and acoustic low Mach number flows I: Analysis of the modified equations, and application to Roe-type schemes

Three asymptotic limits exist for the Euler equations at low Mach number - purely convective, purely acoustic, and mixed convective-acoustic. Standard collocated density-based numerical schemes for compressible flow are known to fail at low Mach number due to the incorrect asymptotic scaling of the artificial diffusion. Previous studies of this class of schemes have shown a variety of behaviours across the different limits and proposed guidelines for the design of low-Mach schemes. However, these studies have primarily focused on specific discretisations and/or only the convective limit. In this paper, we review the low-Mach behaviour using the modified equations - the continuous Euler equations augmented with artificial diffusion terms - which are representative of a wide range of schemes in this class. By considering both convective and acoustic effects, we show that three diffusion scalings naturally arise. Single- and multiple-scale asymptotic analysis of these scalings shows that many of the important low-Mach features of this class of schemes can be reproduced in a straightforward manner in the continuous setting. As an example, we show that many existing low-Mach Roe-type finite-volume schemes match one of these three scalings. Our analysis corroborates previous analysis of these schemes, and we are able to refine previous guidelines on the design of low-Mach schemes by including both convective and acoustic effects. Discrete analysis and numerical examples demonstrate the behaviour of minimal Roe-type schemes with each of the three scalings for convective, acoustic, and mixed flows.


[323] 2303.10740

Artificial diffusion for convective and acoustic low Mach number flows II: Application to Liou-Steffen, Zha-Bilgen and Toro-Vasquez convection-pressure flux splittings

Liou-Steffen splitting (AUSM) schemes are popular for low Mach number simulations, however, like many numerical schemes for compressible flow they require careful modification to accurately resolve convective features in this regime. Previous analyses of these schemes usually focus only on a single discrete scheme at the convective limit, only considering flow with acoustic effects empirically, if at all. In our recent paper Hope-Collins & di Mare, 2023 we derived constraints on the artificial diffusion scaling of low Mach number schemes for flows both with and without acoustic effects, and applied this analysis to Roe-type finite-volume schemes. In this paper we form approximate diffusion matrices for the Liou-Steffen splitting, as well as the closely related Zha-Bilgen and Toro-Vasquez splittings. We use the constraints found in Hope-Collins & di Mare, 2023 to derive and analyse the required scaling of each splitting at low Mach number. By transforming the diffusion matrices to the entropy variables we can identify erroneous diffusion terms compared to the ideal form used in Hope-Collins & di Mare, 2023. These terms vanish asymptotically for the Liou-Steffen splitting, but result in spurious entropy generation for the Zha-Bilgen and Toro-Vasquez splittings unless a particular form of the interface pressure is used. Numerical examples for acoustic and convective flow verify the results of the analysis, and show the importance of considering the resolution of the entropy field when assessing schemes of this type.


[324] 2303.12801

A Data Augmentation Method and the Embedding Mechanism for Detection and Classification of Pulmonary Nodules on Small Samples

Detection of pulmonary nodules by CT is used for screening lung cancer in early stages.omputer aided diagnosis (CAD) based on deep-learning method can identify the suspected areas of pulmonary nodules in CT images, thus improving the accuracy and efficiency of CT diagnosis. The accuracy and robustness of deep learning models. Method:In this paper, we explore (1) the data augmentation method based on the generation model and (2) the model structure improvement method based on the embedding mechanism. Two strategies have been introduced in this study: a new data augmentation method and a embedding mechanism. In the augmentation method, a 3D pixel-level statistics algorithm is proposed to generate pulmonary nodule and by combing the faked pulmonary nodule and healthy lung, we generate new pulmonary nodule samples. The embedding mechanism are designed to better understand the meaning of pixels of the pulmonary nodule samples by introducing hidden variables. Result: The result of the 3DVNET model with the augmentation method for pulmonary nodule detection shows that the proposed data augmentation method outperforms the method based on generative adversarial network (GAN) framework, training accuracy improved by 1.5%, and with embedding mechanism for pulmonary nodules classification shows that the embedding mechanism improves the accuracy and robustness for the classification of pulmonary nodules obviously, the model training accuracy is close to 1 and the model testing F1-score is 0.90.Conclusion:he proposed data augmentation method and embedding mechanism are beneficial to improve the accuracy and robustness of the model, and can be further applied in other common diagnostic imaging tasks.


[325] 2303.12806

Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma

Although artificial intelligence (AI) systems have been shown to improve the accuracy of initial melanoma diagnosis, the lack of transparency in how these systems identify melanoma poses severe obstacles to user acceptance. Explainable artificial intelligence (XAI) methods can help to increase transparency, but most XAI methods are unable to produce precisely located domain-specific explanations, making the explanations difficult to interpret. Moreover, the impact of XAI methods on dermatologists has not yet been evaluated. Extending on two existing classifiers, we developed an XAI system that produces text and region based explanations that are easily interpretable by dermatologists alongside its differential diagnoses of melanomas and nevi. To evaluate this system, we conducted a three-part reader study to assess its impact on clinicians' diagnostic accuracy, confidence, and trust in the XAI-support. We showed that our XAI's explanations were highly aligned with clinicians' explanations and that both the clinicians' trust in the support system and their confidence in their diagnoses were significantly increased when using our XAI compared to using a conventional AI system. The clinicians' diagnostic accuracy was numerically, albeit not significantly, increased. This work demonstrates that clinicians are willing to adopt such an XAI system, motivating their future use in the clinic.


[326] 2303.12814

Fixed points of arbitrarily deep 1-dimensional neural networks

In this paper, we introduce a new class of functions on $\mathbb{R}$ that is closed under composition, and contains the logistic sigmoid function. We use this class to show that any 1-dimensional neural network of arbitrary depth with logistic sigmoid activation functions has at most three fixed points. While such neural networks are far from real world applications, we are able to completely understand their fixed points, providing a foundation to the much needed connection between application and theory of deep neural networks.


[327] 2303.12834

The power and limitations of learning quantum dynamics incoherently

Quantum process learning is emerging as an important tool to study quantum systems. While studied extensively in coherent frameworks, where the target and model system can share quantum information, less attention has been paid to whether the dynamics of quantum systems can be learned without the system and target directly interacting. Such incoherent frameworks are practically appealing since they open up methods of transpiling quantum processes between the different physical platforms without the need for technically challenging hybrid entanglement schemes. Here we provide bounds on the sample complexity of learning unitary processes incoherently by analyzing the number of measurements that are required to emulate well-established coherent learning strategies. We prove that if arbitrary measurements are allowed, then any efficiently representable unitary can be efficiently learned within the incoherent framework; however, when restricted to shallow-depth measurements only low-entangling unitaries can be learned. We demonstrate our incoherent learning algorithm for low entangling unitaries by successfully learning a 16-qubit unitary on \texttt{ibmq\_kolkata}, and further demonstrate the scalabilty of our proposed algorithm through extensive numerical experiments.


[328] 2303.12861

Cube-Based 3D Denoising Diffusion Probabilistic Model for Cone Beam Computed Tomography Reconstruction with Incomplete Data

Deep learning (DL) has been extensively researched in the field of computed tomography (CT) reconstruction with incomplete data, particularly in sparse-view CT reconstruction. However, applying DL to sparse-view cone beam CT (CBCT) remains challenging. Many models learn the mapping from sparse-view CT images to ground truth but struggle to achieve satisfactory performance in terms of global artifact removal. Incorporating sinogram data and utilizing dual-domain information can enhance anti-artifact performance, but this requires storing the entire sinogram in memory. This presents a memory issue for high-resolution CBCT sinograms, limiting further research and application. In this paper, we propose a cube-based 3D denoising diffusion probabilistic model (DDPM) for CBCT reconstruction using down-sampled data. A DDPM network, trained on cubes extracted from paired fully sampled sinograms and down-sampled sinograms, is employed to inpaint down-sampled sinograms. Our method divides the entire sinogram into overlapping cubes and processes these cubes in parallel using multiple GPUs, overcoming memory limitations. Experimental results demonstrate that our approach effectively suppresses few-view artifacts while preserving textural details faithfully.


[329] 2303.12863

Peak Estimation of Time Delay Systems using Occupation Measures

This work proposes a method to compute the maximum value obtained by a state function along trajectories of a Delay Differential Equation (DDE). An example of this task is finding the maximum number of infected people in an epidemic model with a nonzero incubation period. The variables of this peak estimation problem include the stopping time and the original history (restricted to a class of admissible histories). The original nonconvex DDE peak estimation problem is approximated by an infinite-dimensional Linear Program (LP) in occupation measures, inspired by existing measure-based methods in peak estimation and optimal control. This LP is approximated from above by a sequence of Semidefinite Programs (SDPs) through the moment-Sum of Squares (SOS) hierarchy. Effectiveness of this scheme in providing peak estimates for DDEs is demonstrated with provided examples


[330] 2303.12873

From Compact Plasma Particle Sources to Advanced Accelerators with Modeling at Exascale

Developing complex, reliable advanced accelerators requires a coordinated, extensible, and comprehensive approach in modeling, from source to the end of beam lifetime. We present highlights in Exascale Computing to scale accelerator modeling software to the requirements set for contemporary science drivers. In particular, we present the first laser-plasma modeling on an exaflop supercomputer using the US DOE Exascale Computing Project WarpX. Leveraging developments for Exascale, the new DOE SCIDAC-5 Consortium for Advanced Modeling of Particle Accelerators (CAMPA) will advance numerical algorithms and accelerate community modeling codes in a cohesive manner: from beam source, over energy boost, transport, injection, storage, to application or interaction. Such start-to-end modeling will enable the exploration of hybrid accelerators, with conventional and advanced elements, as the next step for advanced accelerator modeling. Following open community standards, we seed an open ecosystem of codes that can be readily combined with each other and machine learning frameworks. These will cover ultrafast to ultraprecise modeling for future hybrid accelerator design, even enabling virtual test stands and twins of accelerators that can be used in operations.


[331] 2303.12895

Caching Through the Skies: The Case of LEO Satellites Connected Edges for 6G and Beyond

The deployment of low earth orbit (LEO) satellites with terrestrial networks can potentially increase the efficiency and reduce the cost of relaying content from a data center to a set of edge caches hosted by 6G and beyond enabled macro base stations. In this work, the characteristics of the communication system and the mobility of LEO satellites are thoroughly discussed to describe the channel characteristics of LEO satellites, in terms of their frequency bands, latency, Doppler shifts, fading effects, and satellite access. Three different scenarios are proposed for the relay of data from data centers to edge caches via LEO satellites, which are the "Immediate Forward", "Relay and Forward", and "Store and Forward" scenarios. A comparative problem formulation is utilized to obtain numerical results from simulations to demonstrate the effectiveness and validity as well as the trade-offs of the proposed system model. The simulation results indicate that the integration of LEO satellites in edge caching for 6G and beyond networks decreased the required transmission power for relaying the data from the data center to the edge caches. Future research directions based on the proposed model are discussed.


[332] 2303.12908

Self-supervised Learning with Speech Modulation Dropout

We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work exhibits that, once trained on large volumes of unlabelled data, the outputs of the self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an Automatic Speech Recognition system to reduce its reliance on labeled speech data greatly.


[333] 2303.12966

Rate-Tunable Control Barrier Functions: Methods and Algorithms for Online Adaptation

Control Barrier Functions offer safety certificates by dictating controllers that enforce safety constraints. However, their response depends on the classK function that is used to restrict the rate of change of the barrier function along the system trajectories. This paper introduces the notion of Rate Tunable Control Barrier Function (RT-CBF), which allows for online tuning of the response of CBF-based controllers. In contrast to the existing CBF approaches that use a fixed (predefined) classK function to ensure safety, we parameterize and adapt the classK function parameters online. Furthermore, we discuss the challenges associated with multiple barrier constraints, namely ensuring that they admit a common control input that satisfies them simultaneously for all time. In practice, RT-CBF enables designing parameter dynamics for (1) a better-performing response, where performance is defined in terms of the cost accumulated over a time horizon, or (2) a less conservative response. We propose a model-predictive framework that computes the sensitivity of the future states with respect to the parameters and uses Sequential Quadratic Programming for deriving an online law to update the parameters in the direction of improving the performance. When prediction is not possible, we also provide point-wise sufficient conditions to be imposed on any user-given parameter dynamics so that multiple CBF constraints continue to admit common control input with time. Finally, we introduce RT-CBFs for decentralized uncooperative multi-agent systems, where a trust factor, computed based on the instantaneous ease of constraint satisfaction, is used to update parameters online for a less conservative response.


[334] 2303.12971

Differentiable hybrid neural modeling for fluid-structure interaction

Solving complex fluid-structure interaction (FSI) problems, which are described by nonlinear partial differential equations, is crucial in various scientific and engineering applications. Traditional computational fluid dynamics based solvers are inadequate to handle the increasing demand for large-scale and long-period simulations. The ever-increasing availability of data and rapid advancement in deep learning (DL) have opened new avenues to tackle these challenges through data-enabled modeling. The seamless integration of DL and classic numerical techniques through the differentiable programming framework can significantly improve data-driven modeling performance. In this study, we propose a differentiable hybrid neural modeling framework for efficient simulation of FSI problems, where the numerically discretized FSI physics based on the immersed boundary method is seamlessly integrated with sequential neural networks using differentiable programming. All modules are programmed in JAX, where automatic differentiation enables gradient back-propagation over the entire model rollout trajectory, allowing the hybrid neural FSI model to be trained as a whole in an end-to-end, sequence-to-sequence manner. Through several FSI benchmark cases, we demonstrate the merit and capability of the proposed method in modeling FSI dynamics for both rigid and flexible bodies. The proposed model has also demonstrated its superiority over baseline purely data-driven neural models, weakly-coupled hybrid neural models, and purely numerical FSI solvers in terms of accuracy, robustness, and generalizability.


[335] 2303.12980

GMP-Featurizer: A parallelized Python package for efficiently computing the Gaussian Multipole features of atomic systems

GMP-Featurizer is a lightweight, accurate, efficient, and scalable software package for calculating the Gaussian Multipole (GMP) features \cite{GMP} for a variety of atomic systems with elements across the periodic table. Starting from the GMP feature computation module from AmpTorch \cite{amptorch}, the capability of GMP-Featurizer has since been greatly improved, including its accuracy and efficiency, as well as the ability to parallelize on different cores, even machines. Moreover, this python package only has very few dependencies that are all standard python libraries, plus cffi for C++ code interfacing and Ray \cite{Ray} for parallelization, making it lightweight and robust. A set of unit tests are designed to ensure the reliability of its outputs. A set of extensive examples and tutorials, as well as two sets of pseudopotential files (needed for specifying the GMP feature set), are also included in this package for its users. Overall, this package is designed to serve as a standard implementation for chemical and material scientists who are interested in developing models based on GMP features. The source code for this package is freely available to the public under the Apache 2.0 license.


[336] 2303.13027

Weighted Pressure and Mode Matching for Sound Field Reproduction: Theoretical and Experimental Comparisons

Two sound field reproduction methods, weighted pressure matching and weighted mode matching, are theoretically and experimentally compared. The weighted pressure and mode matching are a generalization of conventional pressure and mode matching, respectively. Both methods are derived by introducing a weighting matrix in the pressure and mode matching. The weighting matrix in the weighted pressure matching is defined on the basis of the kernel interpolation of the sound field from pressure at a discrete set of control points. In the weighted mode matching, the weighting matrix is defined by a regional integration of spherical wavefunctions. It is theoretically shown that the weighted pressure matching is a special case of the weighted mode matching by infinite-dimensional harmonic analysis for estimating expansion coefficients from pressure observations. The difference between the two methods are discussed through experiments.


[337] 2303.13033

Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging

Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the prediction. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients for learning a global representation of fundus images, while a novel temperature-warmed uncertainty head (TWEU) is utilized for each client for local personalized staging criteria. Our TWEU employs an evidential deep layer to produce the uncertainty score with the DR staging results for client reliability evaluation. Furthermore, we developed a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to conduct a dataset for federated DR staging to satisfy the real non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.


[338] 2303.13037

Universal Linear Intensity Transformations Using Spatially-Incoherent Diffractive Processors

Under spatially-coherent light, a diffractive optical network composed of structured surfaces can be designed to perform any arbitrary complex-valued linear transformation between its input and output fields-of-view (FOVs) if the total number (N) of optimizable phase-only diffractive features is greater than or equal to ~2 Ni x No, where Ni and No refer to the number of useful pixels at the input and the output FOVs, respectively. Here we report the design of a spatially-incoherent diffractive optical processor that can approximate any arbitrary linear transformation in time-averaged intensity between its input and output FOVs. Under spatially-incoherent monochromatic light, the spatially-varying intensity point spread functon(H) of a diffractive network, corresponding to a given, arbitrarily-selected linear intensity transformation, can be written as H(m,n;m',n')=|h(m,n;m',n')|^2, where h is the spatially-coherent point-spread function of the same diffractive network, and (m,n) and (m',n') define the coordinates of the output and input FOVs, respectively. Using deep learning, supervised through examples of input-output profiles, we numerically demonstrate that a spatially-incoherent diffractive network can be trained to all-optically perform any arbitrary linear intensity transformation between its input and output if N is greater than or equal to ~2 Ni x No. These results constitute the first demonstration of universal linear intensity transformations performed on an input FOV under spatially-incoherent illumination and will be useful for designing all-optical visual processors that can work with incoherent, natural light.


[339] 2303.13056

Predicting the Initial Conditions of the Universe using Deep Learning

Finding the initial conditions that led to the current state of the universe is challenging because it involves searching over a vast input space of initial conditions, along with modeling their evolution via tools such as N-body simulations which are computationally expensive. Deep learning has emerged as an alternate modeling tool that can learn the mapping between the linear input of an N-body simulation and the final nonlinear displacements at redshift zero, which can significantly accelerate the forward modeling. However, this does not help reduce the search space for initial conditions. In this paper, we demonstrate for the first time that a deep learning model can be trained for the reverse mapping. We train a V-Net based convolutional neural network, which outputs the linear displacement of an N-body system, given the current time nonlinear displacement and the cosmological parameters of the system. We demonstrate that this neural network accurately recovers the initial linear displacement field over a wide range of scales ($<1$-$2\%$ error up to nearly $k = 1\ \mathrm{Mpc}^{-1}\,h$), despite the ill-defined nature of the inverse problem at smaller scales. Specifically, smaller scales are dominated by nonlinear effects which makes the backward dynamics much more susceptible to numerical and computational errors leading to highly divergent backward trajectories and a one-to-many backward mapping. The results of our method motivate that neural network based models can act as good approximators of the initial linear states and their predictions can serve as good starting points for sampling-based methods to infer the initial states of the universe.


[340] 2303.13110

OCELOT: Overlapped Cell on Tissue Dataset for Histopathology

Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, there is a lack of efforts to reflect such behaviors by pathologists in the cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: proposed OCELOT, public TIGER, and internal CARP datasets. On the OCELOT test set in particular, we show up to 6.79 improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at https://lunit-io.github.io/research/publications/ocelot are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computation pathology.


[341] 2303.13111

A Permutable Hybrid Network for Volumetric Medical Image Segmentation

The advent of Vision Transformer (ViT) has brought substantial advancements in 3D volumetric benchmarks, particularly in 3D medical image segmentation. Concurrently, Multi-Layer Perceptron (MLP) networks have regained popularity among researchers due to their comparable results to ViT, albeit with the exclusion of the heavy self-attention module. This paper introduces a permutable hybrid network for volumetric medical image segmentation, named PHNet, which exploits the advantages of convolution neural network (CNN) and MLP. PHNet addresses the intrinsic isotropy problem of 3D volumetric data by utilizing both 2D and 3D CNN to extract local information. Besides, we propose an efficient Multi-Layer Permute Perceptron module, named MLPP, which enhances the original MLP by obtaining long-range dependence while retaining positional information. Extensive experimental results validate that PHNet outperforms the state-of-the-art methods on two public datasets, namely, COVID-19-20 and Synapse. Moreover, the ablation study demonstrates the effectiveness of PHNet in harnessing the strengths of both CNN and MLP. The code will be accessible to the public upon acceptance.


[342] 2303.13117

RLOR: A Flexible Framework of Deep Reinforcement Learning for Operation Research

Reinforcement learning has been applied in operation research and has shown promise in solving large combinatorial optimization problems. However, existing works focus on developing neural network architectures for certain problems. These works lack the flexibility to incorporate recent advances in reinforcement learning, as well as the flexibility of customizing model architectures for operation research problems. In this work, we analyze the end-to-end autoregressive models for vehicle routing problems and show that these models can benefit from the recent advances in reinforcement learning with a careful re-implementation of the model architecture. In particular, we re-implemented the Attention Model and trained it with Proximal Policy Optimization (PPO) in CleanRL, showing at least 8 times speed up in training time. We hereby introduce RLOR, a flexible framework for Deep Reinforcement Learning for Operation Research. We believe that a flexible framework is key to developing deep reinforcement learning models for operation research problems. The code of our work is publicly available at https://github.com/cpwan/RLOR.


[343] 2303.13203

A Confident Labelling Strategy Based on Deep Learning for Improving Early Detection of Knee OsteoArthritis

Knee OsteoArthritis (KOA) is a prevalent musculoskeletal disorder that causes decreased mobility in seniors. The diagnosis provided by physicians is subjective, however, as it relies on personal experience and the semi-quantitative Kellgren-Lawrence (KL) scoring system. KOA has been successfully diagnosed by Computer-Aided Diagnostic (CAD) systems that use deep learning techniques like Convolutional Neural Networks (CNN). In this paper, we propose a novel Siamese-based network, and we introduce a new hybrid loss strategy for the early detection of KOA. The model extends the classical Siamese network by integrating a collection of Global Average Pooling (GAP) layers for feature extraction at each level. Then, to improve the classification performance, a novel training strategy that partitions each training batch into low-, medium- and high-confidence subsets, and a specific hybrid loss function are used for each new label attributed to each sample. The final loss function is then derived by combining the latter loss functions with optimized weights. Our test results demonstrate that our proposed approach significantly improves the detection performance.


[344] 2303.13243

Pyramid Multi-branch Fusion DCNN with Multi-Head Self-Attention for Mandarin Speech Recognition

As one of the major branches of automatic speech recognition, attention-based models greatly improves the feature representation ability of the model. In particular, the multi-head mechanism is employed in the attention, hoping to learn speech features of more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small head size will lead to an obvious shortage of learnable aspects. On the other hand, we need to reduce the dimension of each subspace to keep the size of the overall feature space unchanged when we increase the number of heads, which will significantly weaken the ability to represent the feature of each subspace. Therefore, this paper explores how to use a small attention subspace to represent complete speech features while ensuring many heads. In this work we propose a novel neural network architecture, namely, pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolution Neural Networks (DCNN), it uses multiple branches with DCNN to extract the feature of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all the branches are merged into one. Thus, its shape is like a pyramid rotated 90 degrees. We demonstrate that on Aishell-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test sets.


[345] 2303.13248

Numerical Bifurcation Analysis of Turing and Symmetry Broken Patterns of a Vegetation PDE Model

We study the mechanisms of pattern formation for vegetation dynamics in water-limited regions. Our analysis is based on a set of two partial differential equations (PDEs) of reaction-diffusion type for the biomass and water and one ordinary differential equation (ODE) describing the dependence of the toxicity on the biomass. We perform a linear stability analysis in the one-dimensional finite space, we derive analytically the conditions for the appearance of Turing instability that gives rise to spatio-temporal patterns emanating from the homogeneous solution, and provide its dependence with respect to the size of the domain. Furthermore, we perform a numerical bifurcation analysis in order to study the pattern formation of the inhomogeneous solution, with respect to the precipitation rate, thus analyzing the stability and symmetry properties of the emanating patterns. Based on the numerical bifurcation analysis, we have found new patterns, which form due to the onset of secondary bifurcations from the primary Turing instability, thus giving rise to a multistability of asymmetric solutions.


[346] 2303.13278

Improved Anisotropic Gaussian Filters

Elongated anisotropic Gaussian filters are used for the orientation estimation of fibers. In cases where computed tomography images are noisy, roughly resolved, and of low contrast, they are the method of choice even if being efficient only in virtual 2D slices. However, minor inaccuracies in the anisotropic Gaussian filters can carry over to the orientation estimation. Therefore, we propose a modified algorithm for 2D anisotropic Gaussian filters and show that this improves their precision. Applied to synthetic images of fiber bundles, it is more accurate and robust to noise. Finally, we demonstrate the effectiveness of our approach by applying it to real-world images of sheet molding compounds.


[347] 2303.13322

Tight Data-Driven Linear Relaxations for Constraint Screening in Robust Unit Commitment

The daily operation of real-world power systems and their underlying markets relies on the timely solution of the unit commitment problem. However, given its computational complexity, several optimization-based methods have been proposed to lighten its problem formulation by removing redundant line flow constraints. These approaches often ignore the spatial couplings of renewable generation and demand, which have an inherent impact of market outcomes. Moreover, the elimination procedures primarily focus on the feasible region and exclude how the problem's objective function plays a role here. To address these pitfalls, we move to rule out redundant and inactive constraints over a tight linear programming relaxation of the original unit commitment feasibility region by adding valid inequality constraints. We extend the optimization-based approach called umbrella constraint discovery through the enforcement of a consistency logic on the set of constraints by adding the proposed inequality constraints to the formulation. Hence, we reduce the conservativeness of the screening approach using the available historical data and thus lead to a tighter unit commitment formulation. Numerical tests are performed on standard IEEE test networks to substantiate the effectiveness of the proposed approach.


[348] 2303.13323

Deep Generative Multi-Agent Imitation Model as a Computational Benchmark for Evaluating Human Performance in Complex Interactive Tasks: A Case Study in Football

Evaluating the performance of human is a common need across many applications, such as in engineering and sports. When evaluating human performance in completing complex and interactive tasks, the most common way is to use a metric having been proved efficient for that context, or to use subjective measurement techniques. However, this can be an error prone and unreliable process since static metrics cannot capture all the complex contexts associated with such tasks and biases exist in subjective measurement. The objective of our research is to create data-driven AI agents as computational benchmarks to evaluate human performance in solving difficult tasks involving multiple humans and contextual factors. We demonstrate this within the context of football performance analysis. We train a generative model based on Conditional Variational Recurrent Neural Network (VRNN) Model on a large player and ball tracking dataset. The trained model is used to imitate the interactions between two teams and predict the performance from each team. Then the trained Conditional VRNN Model is used as a benchmark to evaluate team performance. The experimental results on Premier League football dataset demonstrates the usefulness of our method to existing state-of-the-art static metric used in football analytics.


[349] 2303.13332

Clinically Relevant Latent Space Embedding of Cancer Histopathology Slides through Variational Autoencoder Based Image Compression

In this paper, we introduce a Variational Autoencoder (VAE) based training approach that can compress and decompress cancer pathology slides at a compression ratio of 1:512, which is better than the previously reported state of the art (SOTA) in the literature, while still maintaining accuracy in clinical validation tasks. The compression approach was tested on more common computer vision datasets such as CIFAR10, and we explore which image characteristics enable this compression ratio on cancer imaging data but not generic images. We generate and visualize embeddings from the compressed latent space and demonstrate how they are useful for clinical interpretation of data, and how in the future such latent embeddings can be used to accelerate search of clinical imaging data.


[350] 2303.13337

Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin \textit{et al}. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015 pp. 193-200 (01 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach that uses point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It has here been implemented as a C++ library called S3DFT and tested on the JUWELS Cluster at the J\"ulich Supercomputing Center. Our implementation of the shared memory tensor-matrix multiplication attained 88\% of the theoretical single node peak performance. One variant of the distributed memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that currently iMKL performs best overall, followed in order by FFTW3 and S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters that use network connections with higher latency, e.g. on cloud platforms.


[351] 2303.13393

Synthetic aperture radar imaging below a random rough surface

Motivated by applications in unmanned aerial based ground penetrating radar for detecting buried landmines, we consider the problem of imaging small point like scatterers situated in a lossy medium below a random rough surface. Both the random rough surface and the absorption in the lossy medium significantly impede the target detection and imaging process. Using principal component analysis we effectively remove the reflection from the air-soil interface. We then use a modification of the classical synthetic aperture radar imaging functional to image the targets. This imaging method introduces a user-defined parameter, $\delta$, which scales the resolution by $\sqrt{\delta}$ allowing for target localization with sub wavelength accuracy. Numerical results in two dimensions illustrate the robustness of the approach for imaging multiple targets. However, the depth at which targets are detectable is limited due to the absorption in the lossy medium.


[352] 2303.13404

MSFA-Frequency-Aware Transformer for Hyperspectral Images Demosaicing

Hyperspectral imaging systems that use multispectral filter arrays (MSFA) capture only one spectral component in each pixel. Hyperspectral demosaicing is used to recover the non-measured components. While deep learning methods have shown promise in this area, they still suffer from several challenges, including limited modeling of non-local dependencies, lack of consideration of the periodic MSFA pattern that could be linked to periodic artifacts, and difficulty in recovering high-frequency details. To address these challenges, this paper proposes a novel de-mosaicing framework, the MSFA-frequency-aware Transformer network (FDM-Net). FDM-Net integrates a novel MSFA-frequency-aware multi-head self-attention mechanism (MaFormer) and a filter-based Fourier zero-padding method to reconstruct high pass components with greater difficulty and low pass components with relative ease, separately. The advantage of Maformer is that it can leverage the MSFA information and non-local dependencies present in the data. Additionally, we introduce a joint spatial and frequency loss to transfer MSFA information and enhance training on frequency components that are hard to recover. Our experimental results demonstrate that FDM-Net outperforms state-of-the-art methods with 6dB PSNR, and reconstructs high-fidelity details successfully.


[353] 2303.13407

Adaptive Endpointing with Deep Contextual Multi-armed Bandits

Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. Also, it is a common practice to utilize costly grid-search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting, while avoiding hyperparameter grid-search. Our method does not require ground truth labels, and only uses online learning from reward signals without requiring annotated labels. Specifically, we propose a deep contextual multi-armed bandit-based approach, which combines the representational power of neural networks with the action exploration behavior of Thompson modeling algorithms. We compare our approach to several baselines, and show that our deep bandit models also succeed in reducing early cutoff errors while maintaining low latency.


[354] 2303.13410

Volume and Mass Conservation in Lagrangian Meshfree Methods

Meshfree Lagrangian frameworks for free surface flow simulations do not conserve fluid volume. Meshfree particle methods like SPH are not mimetic, in the sense that discrete mass conservation does not imply discrete volume conservation. On the other hand, meshfree collocation methods typically do not use any notion of mass. As a result, they are neither mass conservative nor volume conservative at the discrete level. In this paper, we give an overview of various sources of conservation errors across different meshfree methods. The present work focuses on one specific issue: unreliable volume and mass definitions. We introduce the concept of representative masses and densities, which are essential for accurate post-processing especially in meshfree collocation methods. Using these, we introduce an artificial compression or expansion in the fluid to rectify errors in volume conservation. Numerical experiments show that the introduced frameworks significantly improve volume conservation behaviour, even for complex industrial test cases such as automotive water crossing.


[355] 2303.13443

Cliques, Chromatic Number, and Independent Sets in the Semi-random Process

The semi-random graph process is a single player game in which the player is initially presented an empty graph on $n$ vertices. In each round, a vertex $u$ is presented to the player independently and uniformly at random. The player then adaptively selects a vertex $v$, and adds the edge $uv$ to the graph. For a fixed monotone graph property, the objective of the player is to force the graph to satisfy this property with high probability in as few rounds as possible. In this paper, we investigate the following three properties: containing a complete graph of order $k$, having the chromatic number at least $k$, and not having an independent set of size at least $k$.


[356] 2303.13453

Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV

In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking in the dialogue estimate are reassigned to the background estimate. A clear improvement of the audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.


[357] 2303.13460

Complexity reduction of large-scale stochastic systems using linear quadratic Gaussian balancing

In this paper, we consider a model reduction technique for stabilizable and detectable stochastic systems. It is based on a pair of Gramians that we analyze in terms of well-posedness. Subsequently, dominant subspaces of the stochastic systems are identified exploiting these Gramians. An associated balancing related scheme is proposed that removes unimportant information from the stochastic dynamics in order to obtain a reduced system. We show that this reduced model preserves important features like stabilizability and detectability. Additionally, a comprehensive error analysis based on eigenvalues of the Gramian pair product is conducted. This provides an a-priori criterion for the reduction quality which we illustrate in numerical experiments.


[358] 2303.13462

Generalization with quantum geometry for learning unitaries

Generalization is the ability of quantum machine learning models to make accurate predictions on new data by learning from training data. Here, we introduce the data quantum Fisher information metric (DQFIM) to determine when a model can generalize. For variational learning of unitaries, the DQFIM quantifies the amount of circuit parameters and training data needed to successfully train and generalize. We apply the DQFIM to explain when a constant number of training states and polynomial number of parameters are sufficient for generalization. Further, we can improve generalization by removing symmetries from training data. Finally, we show that out-of-distribution generalization, where training and testing data are drawn from different data distributions, can be better than using the same distribution. Our work opens up new approaches to improve generalization in quantum machine learning.


[359] 2303.13486

The strength of a simplex is the key to a continuous isometry classification of Euclidean clouds of unlabelled points

This paper solves the continuous classification problem for finite clouds of unlabelled points under Euclidean isometry. The Lipschitz continuity of required invariants in a suitable metric under perturbations of points is motivated by the inevitable noise in measurements of real objects. The best solved case of this isometry classification is known as the SSS theorem in school geometry saying that any triangle up to congruence (isometry in the plane) has a continuous complete invariant of three side lengths. However, there is no easy extension of the SSS theorem even to four points in the plane partially due to a 4-parameter family of 4-point clouds that have the same six pairwise distances. The computational time of most past metrics that are invariant under isometry was exponential in the size of the input. The final obstacle was the discontinuity of previous invariants at singular configurations, for example, when a triangle degenerates to a straight line. All the challenges above are now resolved by the Simplexwise Centred Distributions that combine inter-point distances of a given cloud with the new strength of a simplex that finally guarantees the Lipschitz continuity. The computational times of new invariants and metrics are polynomial in the number of points for a fixed Euclidean dimension.