Advancing Multi-Modal Sensing through Expandable Modality Alignment


Abstract

Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, existing approaches require data from all modalities to be paired. How to leverage multi-modality data with only partial pairings remains an open problem.

To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies. Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of complete paired data, the key idea of Babel involves transforming the N-modality alignment into a series of two-modality alignments by devising the expandable network architecture. This concept is also realized via a series of novel techniques, including the pre-trained modality tower that capitalizes on available single-modal networks, and the adaptive training strategy balancing the contribution of the newly incorporated modality with the previously established modality alignment.

Evaluation demonstrates Babel’s outstanding performance on eight human activity recognition datasets, compared to various baselines, e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhances the performance of individual modalities (12% average accuracy improvement). Case studies also highlight exciting application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging), and bridging LLM for sensing comprehension.


1 Introduction↩︎

Motivation: Sensing offers unique abilities to perceive the physical world. It has been widely deployed in various applications across diverse fields, including health-care, mixed reality, smart driving, and many others. Over the past decades, numerous sensing modalities have been explored [1][9]. Each sensing modality provides a unique and complementary viewpoint for observing the world, thereby necessitating the simultaneous use of multiple sensing modalities, known as multi-modal sensing.

Early methods for organizing multiple sensing modalities relied on handcrafted heuristics or features [10], which proved difficult to scale across various tasks due to the complexity of sensing signals and environments. Recent advancements in Deep Learning (DL) based methods have offered promising solutions [11][13]. Given paired modality data as inputs, DL methods could identify the complementarity among various sensing modalities, a process known as modality alignment.

Figure 1: Align multiple sensing modalities into one unified representation to enhance sensing and empower new applications.

Modality alignment projects the representations of each sensing modality into a unified and shared space, as depicted in Fig. 1. The alignment process could empower sensing in several ways: firstly, the alignment process restructures the features of one modality with the help from another, thereby allowing modalities to complement each other. Secondly, the unified feature space enables the ease of fusing multiple modalities. More intriguingly, the aligned modalities could potentially give rise to new applications. One could utilize the unified feature space as the proxy, using the signals from one sensing modality to retrieve the representations of another modality (e.g., leveraging Wi-Fi channel state information (CSI) to obtain visual representation, thus enabling Wi-Fi imaging). The aligned features across modalities could also naturally serve as a protocol to bridge sensing abilities with Large Language Models (LLMs).

To harness the power of modality alignment, our goal is to develop a pre-trained foundation sensing model that is capable of aligning a wide range of prevalent sensing modalities. This would enable the employment of either a singular modality or a fusion of aligned modalities for downstream sensing tasks.

Challenges: Although modality alignment in AI is a growing research area, its application in sensing presents significant hurdles. The fundamental challenge in supporting multi-modality in wireless sensing is data scarcity, specifically, (i) the scarcity of paired data, which is essential for aligning two modalities, and (ii) the scarcity of multi-paired modalities. For instance, the widely-used CLIP [14] required 400 million image-text pairs for pre-training. In sensing, paired data covering all modalities is lacking, since some modality data requires specialized hardware and expertise to collect. Existing datasets only contain data from a subset of modalities (e.g., up to 3) [15][18]. How to take advantage of partially paired data becomes increasingly important.

Despite considerable work, existing research [19][24] struggles to fully incorporate multiple sensing modalities. For instance, due to the shortage of paired data, OneLLM [19] supports a limited number of sensing modalities, i.e., IMU only, with subpar performance (see Table [tab:performance_mllms]). Cosmo [21] pioneered the alignment of multiple modalities, but due to the scarcity of multi-paired modalities, it aligns only a limited number of modalities, e.g., IMU and depth.

Our Work: We present Babel, establishing the first scalable pre-trained network aligning six sensing modalities as shown in Fig. 1. The design of Babel is underpinned by two key insights. (i) Despite the scarcity of paired data, there exist well-developed encoders or feature extractors for single modality sensing, which have been extensively explored by experts. By leveraging these encoders, the amount of paired data required for modality alignment can be significantly reduced. (ii) Even though few datasets provide more than three paired modalities, numerous paired datasets exist that share common sensing modalities. These shared modalities can serve as a bridge for multi-modality alignment (see Fig. 2).

Drawing from these insights, the key idea of Babel is to achieve expandable multi-sensing modality alignment by transforming an N-modality alignment problem into a sequence of binary modality alignments. The expandability presents a dual advantage: firstly, it allows for the integration of a new modality by aligning it with a previously incorporated modality in the network; secondly, it offers the capability to adapt to each different dataset by absorbing new insights while preserving the knowledge of previously aligned modalities. This expandability facilitates the effective utilization of partially paired data in the sensing community.

To realize this expandability, we innovatively introduce three key techniques: the pre-trained modality tower (§4), the expandable network architecture (§5), and the adaptive training strategy (§6). Each modality utilizes a modality tower to extract features from raw data. We build these towers using existing singular-modal sensing feature extractors (e.g., LIMU-BERT [25] for IMU), and extend them with concept alignment modules for contrastive learning with other modality towers. The expandable network architecture enables sequential training phases with only paired samples. Within it, we also propose the prototype network, shared by all modalities, which maintains the knowledge of aligned modalities when adding new ones. Lastly, our adaptive training strategy balances the contribution of newly added modalities to the unified representation, optimally assimilating new knowledge during model growth without disrupting established alignments.

We offer a comprehensive implementation of Babel, encompassing the network architecture, data preparation and processing, as well as the training strategies. In Babel, we currently align six sensing modalities: two for wireless sensing, namely Wi-Fi and mmWave, two for mobile sensing, specifically IMU and LiDAR, and two for general vision, namely RGB and depth. As an expandable framework, Babel allows more modalities to be aligned in the future through community effort. Aligned modalities can be chosen for downstream tasks, either individually or in combination. Five datasets are utilized to construct Babel, including UTD-MHAD [15], Kinetics-400 [18], OPERANet [17], XRF55 [26], and MM-Fi [16].

The current pre-trained Babel is evaluated on the typical downstream sensing task of Human Activity Recognition (HAR), across eight datasets, which include both multi-modal and singular-modal datasets [15][17], [26][30]. To demonstrate Babel’s capability in modality enhancement and fusion, we compared it to an array of baselines, including the state-of-the-art (SOTA) multi-modal sensing framework [21], singular-modal sensing networks [30][33], and the emerging Multi-modal Large Language Models (MLLMs) [19], [20], [23], [24]. Specifically, owing to the alignment across numerous sensing modalities, Babel improves the accuracy by up to 20%, and 12% on average across six modalities, compared to the performance before the alignment. Compared to the SOTA singular-modal networks, Babel also brings consistent accuracy improvements across various datasets. Due to the unified representation space, Babel increases multi-modal sensing fusion accuracy by up to 22% compared to current multi-modal frameworks. Compared with MLLMs, Babel surpasses them by 25.2% across HAR datasets. In addition to HAR, we also present two real application case studies to highlight Babel’s potential. The first is sensing imaging, which illustrates cross-modality retrieval. With Babel, the original image-to-image unCLIP [34] diffusion model can be supplemented with non-visual data as input to generate images. The other case aims to bridge the gap between LLMs and sensing. By injecting the IMU sensing signal through Babel into Video-LLaMA [35], the LLM can understand the sensing signals without any training.

To summarize, the contributions of the paper include:

  • Babel, to the best of our knowledge, is the first expandable framework for multi-modal sensing built on partially paired sensing datasets, currently aligning six sensing modalities.

  • Within Babel, we introduce key techniques for learning with scarce paired sensing data and modalities, including the pre-trained modality tower, expandable network architecture, and adaptive training strategy.

  • We demonstrate Babel’s superior performance in modality enhancement and fusion. Additionally, we highlight Babel’s potential in the field of cross-modality retrieval, and its ability to bridge LLMs for enhanced comprehension of the physical world.

2 Background and Motivation↩︎

2.1 Multi-Modal Sensing↩︎

Sensing, through various modalities like vision, micro-electro-mechanical-system (MEMS) sensors, and wireless RF sensors, is now a ubiquitous method for comprehending the physical world, capturing a wide range of information from the environment or specific objects. Multiple sensing modalities provide complementary capabilities. For instance, LiDAR creates long-range 3D environmental maps for vehicles, ultrasonic sensors offer close-range detection, and cameras interpret road signs for driver-assistance systems. This combination leads to the concept of multi-modal sensing.

Efficient multi-modal sensing relies on effectively fusing insights from each modality. Early strategies used handcrafted heuristics to link modalities, like A3’s [10] use of a gyroscope for smartphone attitude estimation, supplemented by calibration from the magnetometer and accelerometer. However, these strategies aren’t scalable, especially with complex sensing signals and environments, making the construction of heuristics for each task increasingly unfeasible. Recently, DL has been used for multi-modal sensing, automatically revealing correlations among diverse sensing modalities through supervised or self-supervised learning, showing superior performance [36][40].

2.2 The Power of Modality Alignment↩︎

DL-based multi-modal sensing methods outperform heuristic ones due to modality alignment. The alignment restructures one modality’s features according to the representation space of another, eventually projecting each modality’s features into a unified representation space through training. This space facilitates ease of manipulation. Consequently, modality alignment offers a comprehensive understanding of sensing data by utilizing each modality’s unique strengths.

Beyond augmenting multi-modal sensing fusion, the modality alignment could further empower new sensing applications. As illustrated in Fig. 1, aligned modalities are mutually retrievable. Specifically, utilizing the joint feature space as a proxy, one could employ Wi-Fi channel state information (CSI) to derive the corresponding embedding in the visual modality, subsequently generating an image from the visual embeddings. Intuitively, this could be considered as an alternative realization of Wi-Fi imaging. The joint feature space also facilitates the use of multiple aligned modalities for retrieval, thereby enabling multi-modal imaging without necessitating additional intervention.

The rise of Large Language Models (LLMs) presents new opportunities for sensing to interact with and understand the physical world [41]. Despite the development of multi-modal LLMs, the wide range of sensing modalities requires a unified sensing ontology for seamless integration with LLMs [42]. This alignment creates a unified representation that bridges the gap between various sensing modalities and LLMs.

To demonstrate the power of modality alignment, Babel aligns six prevalent sensing modalities, including vision, depth, IMU, Wi-Fi, mmWave, and LiDAR.

2.3 Challenges and Opportunities↩︎

Figure 2: Five public datasets, XRF55 [26], OPERANet [17], MM-Fi [16], UTD-MHAD [15] and Kinetics-400 [18], with paired data cover six sensing modalities.

Modality alignment is a growing AI research field, involving various methods [43], [44]. Of these, contrastive learning (CL) is notable. CL, a self-supervised learning method, differentiates similar and dissimilar samples by comparing positive (similar) and negative (dissimilar) pairs. The goal is to generate representations where similar samples are close, and dissimilar ones are far apart in the feature space. Contrastive Language-Image Pretraining (CLIP) [14] exemplifies the effective use of CL in aligning text and image modalities. CLIP, trained on a large corpus of Internet image-caption pairs, learns to associate semantically related texts and images.

Nonetheless, it is still challenging to apply CL to align multiple sensing modalities, due to the fundamental data scarcity issue. For instance, CLIP’s training necessitates approximately \(400,000,000\) image-text pairs, a scale of data that public datasets with paired sensing samples fail to match. In fact, public multi-modal sensing datasets [15], [33], [45] contain a mere \(600\)-\(42,000\) sample pairs, a stark contrast to the required volume.

Additionally, there exist numerous sensing modalities. The alignment of N sensing modalities generally necessitates a substantial amount of N-tuple data. Regrettably, most public datasets currently available contain at most triplets of paired sensing modalities; no public dataset caters to the alignment of a greater number of sensing modalities, such as six or more.

Furthermore, the data scarcity issue cannot be easily resolved through large-scale data collection. Unlike visual or linguistic data, which are already available on the Internet, sensing data is generated at and confined to end clients. The collection of such data often necessitates specialized hardware or software. The transmission of this data is resource-intensive and poses significant privacy concerns, rendering ubiquitous sensing data collection impractical. The simultaneous collection of multi-modal sensing data further compounds this problem.

Therefore, existing research [19][22], [24], [46] struggles to align multiple sensing modalities in a scalable manner. For instance, ImageBind [24] and OneLLM [19] are primarily constrained to supporting IMU due to the scarcity of paired data on other sensing modalities. Cosmo [21] can only align a limited array of sensing modalities as the demand for N-tupled paired data escalates when aligning N modalities. In Babel, we endeavor to align six sensing modalities with the existing constrained datasets, based on the following opportunities we identified.

Over the past few decades, researchers have developed a range of feature extractors or encoders for single modality sensing, based on either signal processing or DL techniques. These have been proven effective in extracting representative features for various downstream tasks. This presents an opportunity to leverage existing single modality encoders in constructing the modality alignment network. In this way, we might significantly reduce the trainable parameters, thereby decreasing the data requirement in accordance.

Furthermore, we observe that while no dataset provides N-tuple samples for N modalities, numerous datasets do contain paired data for two modalities. Despite these datasets potentially being collected for different tasks, they may share common modalities. Fig. 2 illustrates this occurrence, with five datasets encompassing six modalities. This presents an opportunity to assemble a compact dataset with restricted modalities and exploit the shared modalities to incrementally expand the modality alignment network.

To cope with the data scarcity challenge, and to capitalize on the aforementioned opportunities, we introduce Babel.

3 Babel Overview↩︎

Figure 3: Overview of Babel.

Babel, to the best of our knowledge, is the first scalable multi-modal pre-trained network, specifically designed for sensing applications, suitable for a multitude of downstream tasks. Babel consists of the model architecture designs, training strategies, and data preparation and processing techniques. In Babel, we present two designs to build the network with constrained data, namely the pre-trained modality tower and the expandable model architecture, to cope with the scarcity of paired sensing data and multi-paired sensing modalities.

In the design of the pre-trained modality tower, our aim is to harness the power of existing feature extractors within singular-modality sensing to construct the modality alignment network, thereby significantly decreasing the necessity for extensive paired training samples.

The crux of this design lies in the efficient alignment of representations across pre-trained encoders. Therefore, we introduce the modality tower, consisting of a pre-trained encoder and a concept alignment module. The encoder could be based on signal processing or on neural networks from existing DL models. The concept alignment module then aligns the embeddings (features) from the encoders. During training, pre-trained encoders are frozen, and only the concept alignment module is updated.

In the design of the expandable model architecture, we try to convert the contrastive training process with N-tuple samples into a sequence of training phases involving only paired samples, thereby significantly reducing the need for tupled samples, rendering the alignment of multiple modalities truly feasible.

As illustrated in Fig. 3, we initially align two modalities to form a trunk network. We then introduce a new branch modality, identifying the junction modality within the trunk that pairs with the branch according to available training samples. Through CL, the branch is merged with the trunk to form the updated trunk, by aligning the branch and junction modalities. We refer to this process as growth. The crux of this design lies in effectively maintaining knowledge of aligned modalities while assimilating new insights from the newly merged modality. Therefore, we introduce the prototype network, which is shared by all modalities and is carefully updated during training with our adaptive training strategy.

The adaptive training strategy is explicitly engineered for sensing modality alignment. Particularly, during each training phase, we aim to create an embedding space where similar sample pairs converge by adjusting each modality’s representation. The adjustment weights are vital as modalities contribute differently to the final space. More weight should be given to modalities with clearer signals, while those with more noise or fewer insights should contribute less, to preserve the knowledge of aligned modalities. This balance varies depending on the modality combinations, datasets, and tasks. Hence, we propose an adaptive strategy for automatically determining the weights.

Next, we introduce these designs in detail.

4 Pre-trained Modality Tower↩︎

4.1 Assembling Modality Towers↩︎

In the alignment of each modality, our initial step involves constructing a modality tower. Subsequent to this, we execute contrastive learning on these modality towers. The modality tower incorporates two fundamental components: a pre-trained encoder and a concept alignment module.

Compared with conventional modality alignment methods, e.g., CLIP, Babel’s key design lies in the utilization of a pre-trained encoder within a singular modality, which proves particularly effective for sensing modalities.

Babel’s effectiveness can be attributed to two key factors. Firstly, the process of assembling the modality tower adheres to the proven method of parameter-efficient fine-tuning (PEFT) [47], a technique notably successful in addressing the vision-language modality alignment problem, as evidenced by models like LiT [48] and APE [49]. The concept alignment module could be regarded as an adapter in the context of PEFT practices. Secondly, the successful application of PEFT necessitates that the encoders can capture generic features. For modalities such as vision and language, it typically demands pre-training on a substantial corpus of data, ensuring that the pre-trained model does not exhibit significant domain shift and adequately covers representative features for a majority of downstream tasks.

Pertaining to sensing modalities, the input signals are typically modulated, bearing distinct physical interpretations, thereby making them distinctly defined and explicable in terms of physics. As sensing techniques advance, these representative features are further amplified. As a result, we note that the representative features of sensing modalities for a multitude of downstream tasks often remain consistent. This consistency facilitates our opportunity to leverage singular modality encoders in constructing the modality tower, following the practice of PEFT.

The particular encoder for each modality is chosen based on the following criteria. For modalities dedicated to sensing tasks, such as mmWave, we tend to choose signal processing-based encoders, due to their capability to extract universally applicable features with well-defined physical meanings. For more ubiquitous sensing modalities, like Wi-Fi, which are often noisy, we lean towards DL-based encoders, owing to their proficiency in de-noising. As these DL encoders are trained on specific datasets, it is essential to avoid models with significant domain shift during selection. Therefore, our primary consideration is the model’s capacity. We then evaluate and compare the selected candidates by fine-tuning and testing them on a variety of singular-modality datasets, and the encoder demonstrating superior generality is chosen.

To further boost the performance on certain modalities, we also propose our modality tower augmentation technique, which is elaborated upon in §4.3.
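To make the tower structure concrete, below is a minimal PyTorch-style sketch of a modality tower, assuming a generic pre-trained encoder module; the class names, adapter shape, and 512-dimensional shared space are illustrative choices of ours rather than Babel's released code.

```python
import torch
import torch.nn as nn

class ConceptAlignmentModule(nn.Module):
    """Trainable adapter projecting encoder features into the shared space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.proj(feat)

class ModalityTower(nn.Module):
    """Frozen pre-trained encoder followed by a trainable concept alignment module."""
    def __init__(self, encoder: nn.Module, encoder_out_dim: int, embed_dim: int = 512):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # only the adapter is updated during alignment
            p.requires_grad = False
        self.align = ConceptAlignmentModule(encoder_out_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feat = self.encoder(x)            # generic single-modality features
        return self.align(feat)               # embedding in the shared space
```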

Figure 4: The alignment of two modality towers with pre-trained encoders and concept alignment modules.

4.2 Aligning Modality Towers↩︎

Upon assembling the modality towers, we align them through contrastive learning. Next, we illustrate our modality alignment process using the alignment of two modalities as an example. The alignment of multiple modalities is discussed in §5.

As illustrated in Fig. 4, given the dataset \(E_{\alpha\beta}\) comprising paired samples of modality \(\alpha\) and modality \(\beta\), our first step is to structure the positive pairs \(P\) and negative pairs \(Z\) essential for the contrastive learning process. Specifically, the dataset \(E_{\alpha\beta}\) includes sample pairs \((\chi_\alpha, \chi_\beta)\) that are initially synchronized. For instance, in the UTD-MHAD dataset [15], each sample pair signifies a sequence of IMU readings and a concurrent video recording of the same human activity, captured within a span of 5 seconds. From the dataset \(E_{\alpha\beta}\), we randomly select a batch \(M\) comprising \(m\) sample pairs. Within this batch, for a given sample of modality \(\alpha\), denoted as \(\chi_\alpha^i\) where \(1 \leq i \leq m\), we construct its corresponding positive pair \(P_\alpha^i\) and negative pairs \(Z_\alpha^i\) in the following manner, \[\begin{align} P_\alpha^i &= (\chi_\alpha^i, \chi_\beta^i), 1 \leq i \leq m, \\ Z_\alpha^i &= \{(\chi_\alpha^i, \chi_\beta^j)\}, 1 \leq i,j \leq m, i \neq j, \end{align}\]

Likewise, we can construct the positive pair \(P_\beta^i\) and negative pairs \(Z_\beta^i\) for the \(i\)th sample of modality \(\beta\) within the batch \(M\). Ultimately, for the batch \(M\) consisting of \(m\) pairs, we could derive \(m\) positive pairs and \(m^2-m\) negative pairs, which will be utilized in the subsequent contrastive learning.

Throughout the training phase, the assembled positive pairs \(P\) and negative pairs \(Z\) are processed through the modality tower. The contrastive loss \(L\) is computed on a per-batch basis for each batch \(M\),

\[\label{equ:loss} L_{\alpha\beta}^M = \frac{L_{\alpha\leftarrow\beta}^M + L_{\beta\leftarrow\alpha}^M}{2},\tag{1}\]

where \(L_{\alpha\leftarrow\beta}^M\) and \(L_{\beta\leftarrow\alpha}^M\) denote the computed contrastive loss transitioning from modality \(\alpha\) to modality \(\beta\) and vice versa within the batch \(M\), as defined subsequently,

\[\label{equ:sim} L_{\alpha\leftarrow\beta}^M = -\sum_{i=1}^{m}\log\left(\frac{\exp(sim(P_\alpha^i)/\tau)}{\sum_{z\in Z_\alpha^i}\exp(sim(z)/\tau)}\right),\tag{2}\]

where \(\tau\) is a temperature parameter employed to scale the logits. In our implementation, we set \(\tau\) to 0.07. The function \(sim\) represents the cosine similarity function utilized to examine the output embeddings from \(\Gamma_\alpha\) and \(\Gamma_\beta\). Similarly, we can compute \(L_{\beta\leftarrow\alpha}^M\). Finally, we use \(L_{\alpha\beta}^M\) to update the concept alignment modules of the modality towers of \(\alpha\) and \(\beta\).
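As a concrete illustration, the sketch below computes the symmetric contrastive loss of Equations 1 and 2 with in-batch negatives. It follows the common InfoNCE formulation (the denominator also includes the positive pair, and the loss is averaged rather than summed over the batch), so it approximates rather than reproduces Babel's exact implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """emb_a, emb_b: (m, d) output embeddings of paired samples from two modality towers."""
    emb_a = F.normalize(emb_a, dim=-1)        # cosine similarity via normalized dot product
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / tau          # (m, m): diagonal = positives, rest = negatives
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    loss_a_from_b = F.cross_entropy(logits, targets)      # alpha <- beta direction
    loss_b_from_a = F.cross_entropy(logits.t(), targets)  # beta <- alpha direction
    return 0.5 * (loss_a_from_b + loss_b_from_a)
```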

As a pre-trained network, when Babel is incorporated into downstream tasks, we introduce an additional task-specific network. For instance, a classifier head is introduced for activity classification tasks. Owing to the modality alignment, the aligned embeddings from each modality can be straightforwardly concatenated for downstream tasks. As will be demonstrated in the evaluation, the output embeddings, enhanced by modality alignment, are significantly superior. Consequently, we can attain SOTA results even with a very simple classifier, such as a 2-layer MLP, when applying only one-shot learning.
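For reference, a downstream head of the kind described above could be as simple as the following sketch; the embedding size, hidden width, and class count are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """2-layer MLP over concatenated aligned embeddings (one per selected modality)."""
    def __init__(self, embed_dim: int = 512, num_modalities: int = 2, num_classes: int = 27):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim * num_modalities, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, embeddings: list) -> torch.Tensor:
        # embeddings: list of (batch, embed_dim) tensors from the frozen, aligned towers
        return self.head(torch.cat(embeddings, dim=-1))
```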

4.3 Augmenting Modality Towers↩︎

We also propose to augment the modality towers by employing multiple encoders for the particular modality. The concept of modality tower augmentation is inspired by model ensembling [50], [51], where multiple weak learners combine to create a stronger one, improving accuracy and performance. This method has proven to effectively decrease variance and bias in each weak learner.

In Babel, we construct an augmented modality tower when incorporating an additional encoder. We align the augmented modality towers following the process delineated in §4.2. Specifically, we construct two modality towers, \(\Gamma_\alpha^\epsilon\) and \(\Gamma_\alpha^\eta\), using pre-trained encoders \(\epsilon\) and \(\eta\), respectively. We align these towers using positive pairs \(P_\alpha^i=(\chi_\alpha^i, \chi_\alpha^i)\) and negative pairs \(Z_\alpha^i=\{(\chi_\alpha^i, \chi_\alpha^j)\}\) where \(i\neq{j}\). The alignment is achieved through the loss functions from Equations 1 and 2. The similarity \(sim\) is computed using output embeddings from both towers.

Figure 5: The alignment of multiple modalities with the prototype network in the expandable network architecture.

5 Expandable Model Architecture↩︎

Figure 6: t-SNE representations of three modalities obtained by different modality alignment approaches: (a) without alignment, (b) triplet alignment, (c) Babel (IMU-Skeleton-Video), (d) Babel (Skeleton-Video-IMU).

5.1 Prototype Network↩︎

Aligning multiple sensing modalities (such as six or more) with partially paired modalities is challenging. In response to this, one of the key designs in Babel is the expandable model architecture, which transforms the training process for \(N\) modality alignment into a series of two-modality alignment phases, exploiting existing datasets with paired modalities.

To elaborate, consider the alignment of three modalities: \(\alpha, \beta, \kappa\), with the available datasets \(E_{\alpha\beta}\) and \(E_{\alpha\kappa}\). We initially employ \(E_{\alpha\beta}\) to align the modalities \(\alpha\) and \(\beta\), as discussed in §4.2, yielding the network \(H_{\alpha\beta}\), which we term the trunk network. Subsequently, we aim to integrate an additional modality \(\kappa\) into the trunk \(H_{\alpha\beta}\).

Given that dataset \(E_{\alpha\kappa}\) provides corresponding pairs between the modalities \(\alpha\) and \(\kappa\), we designate \(\alpha\) as the junction modality. From the trunk \(H_{\alpha\beta}\), we select the trained modality tower \(\Gamma_{\alpha}\). We then construct a new modality tower \(\Gamma_{\kappa}\), referred to as the branch. This branch is integrated into the trunk network by aligning the junction modality tower \(\Gamma_{\alpha}\) with the branch modality tower \(\Gamma_{\kappa}\), utilizing samples from the dataset \(E_{\alpha\kappa}\). We refer to this procedure as network growth. Fig. 5 illustrates the network growth in our expandable network architecture.

The challenge of facilitating network growth lies in maintaining the knowledge of previously aligned modalities while concurrently assimilating new insights from the additional modality. Therefore, during this growth phase, it is not suitable to directly align \(\Gamma_{\alpha}\) and \(\Gamma_{\kappa}\) as outlined in §4.2, since any updates to the junction modality \(\Gamma_{\alpha}\) may significantly disrupt the already aligned modalities, such as modality \(\beta\).

To this end, we introduce the prototype network. As shown in Fig. 5, it is specifically incorporated into the trunk network, succeeding the concept alignment module of each modality tower. The prototype network is shared across all modality towers within the trunk network. It serves as a coordinating entity for all the learned knowledge across aligned modalities. Therefore, by adjusting the updates on the prototype network, we could strike a balance between acquiring new knowledge from the branch modality and avoiding catastrophic forgetting of the trunk network.

Revisiting our previous example, during the initial alignment of modalities \(\alpha\) and \(\beta\), we concurrently update the prototype network \(\Upsilon\) while training the concept alignment modules of the modality towers \(\Gamma_{\alpha}\) and \(\Gamma_{\beta}\). Subsequently, during the network growth phase involving the branch modality \(\kappa\) and the junction modality \(\alpha\), the contrastive learning process updates the branch and junction modality towers \(\Gamma_{\kappa}\) and \(\Gamma_{\alpha}\) along with the prototype network \(\Upsilon\).

In our implementation, the structure of the prototype network is kept relatively straightforward, resembling a 2-4 layer MLP. Despite its simplicity, this design enables several advantages for the alignment of multiple modalities. First, during each network growth phase, it allows us to utilize different datasets, even for disparate tasks. Second, this design facilitates the repeated enhancement of aligned modalities using varied datasets. By assimilating insights from these different datasets, it becomes feasible to construct a pre-trained network with substantial generality.
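A minimal sketch of how the shared prototype network could sit on top of every modality tower, and of which components a growth phase would update, is given below; layer sizes and names are illustrative, and the careful prototype update itself is governed by the strategy in §6.

```python
import torch.nn as nn

class PrototypeNetwork(nn.Module):
    """Shallow MLP (2-4 layers) shared by all modality towers in the trunk."""
    def __init__(self, embed_dim: int = 512, depth: int = 3):
        super().__init__()
        layers = []
        for _ in range(depth - 1):
            layers += [nn.Linear(embed_dim, embed_dim), nn.ReLU()]
        layers.append(nn.Linear(embed_dim, embed_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def embed(tower, prototype, x):
    """Full path: frozen encoder -> concept alignment module -> shared prototype."""
    return prototype(tower(x))

# One growth phase (branch kappa joins through junction alpha):
#   loss = contrastive(embed(tower_alpha, prototype, x_alpha),
#                      embed(tower_kappa, prototype, x_kappa))
# Updated parameters: the branch and junction concept alignment modules and the
# prototype network; all pre-trained encoders stay frozen (see Section 6 for how
# the prototype update is regulated).
```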

Together with the prototype network, we also devise the adaptive training strategy to regulate the extent to which the trunk network acquires new knowledge, which would be discussed in §6.

5.2 Growth Orders↩︎

Babel transforms the \(N\)-tuple modality alignment into a sequence of two-modality alignment phases, thereby raising a potential question regarding the differences between the conventional complete (joint) alignment and our expandable alignment with varying modality growth orders. To analyze this, we utilize a three-modality alignment, i.e., IMU, skeleton, and video from the UTD-MHAD dataset [15], as an example.

As depicted in Fig. 6, we utilize t-SNE to visualize the representation space of each modality. As evident in Fig. 6 (a), features that have not undergone alignment training exhibit significant distribution differences. Fig. 6 (b) shows that the conventional triplet alignment successfully bridges the modality gaps, aligning the three modalities. In contrast, the expandable network architecture within Babel employs a sequence of two-modality alignment training phases as a replacement for the joint alignment. As illustrated in Fig. 6 (c), we initially align the IMU and skeleton modalities, followed by the video modality, effectively bridging the modality gaps as well.

Our method is flexible regarding alignment order. Fig. 6 (d) shows representations from each modality achieved by an alternately ordered network: first aligning skeleton and video, then IMU. Despite varying sequences, a common representation space is achievable. Further evaluation is discussed in Section 8.1.5.

6 Adaptive Training Strategy↩︎

We further propose training strategies to optimally integrate the insights derived from the newly aligned modality during network growth. Specifically, we implement two strategies, for the training of the concept alignment module and the prototype network, respectively.

For the training of the concept alignment module during network growth, we employ adaptive weighted contrastive training. The key of this design lies in dynamically adjusting how strongly each modality moves toward the other during the alignment process.

As per Equation 3, the contrastive loss in aligning modalities \(\alpha\) and \(\beta\) includes two parts: \(L_{\alpha\leftarrow\beta}\), the loss when \(\beta\) approximates \(\alpha\), and \(L_{\beta\leftarrow\alpha}\), the loss when \(\alpha\) approximates \(\beta\). Across various modality combinations and datasets, some modalities are more reliable than others: modalities with robust encoders and abundant data are more reliable, so we expect less reliable ones to converge towards them. During network growth, careful updates are needed in the junction modality tower to add insights from the branch without disrupting aligned modalities. Hence, we integrate weights into Equation 1 as follows:

\[\label{equ:adaptive} L_{\alpha\beta}^M = \frac{w_{{\alpha\leftarrow\beta}}\cdot{L_{\alpha\leftarrow\beta}^M}+ w_{{\beta\leftarrow\alpha}}\cdot{L_{\beta\leftarrow\alpha}^M}}{2},\tag{3}\]

where \(M\) represents a batch randomly drawn from the dataset \(E_{\alpha\beta}\), and \(w_{{\alpha\leftarrow\beta}}\) and \(w_{{\beta\leftarrow\alpha}}\) denote the normalized weights. Intuitively, we lean towards attributing a larger weight \(w_{{\alpha\leftarrow\beta}}\) if modality \(\alpha\) is deemed more reliable and established, while a smaller weight is assigned otherwise.

Identifying the appropriate weights presents a challenge. A static weighting scheme is suboptimal as each modality may differ in respect to data volume and quality, encoder proficiency, as well as the fresh insights and contributions it brings to the aligned modalities. As such, we opt for a dynamic weighting strategy. Particularly, we employ gradients as an indicator to adaptively modify the weights, \[w_{\alpha\leftarrow\beta}^M = \frac{1}{\|\nabla_{\alpha\leftarrow\beta}^M({\Gamma_\alpha}, {\Gamma_\beta})\|},\]

where \(\nabla\) represents the accumulated gradients of all parameters within the concept alignment modules of the modality towers \({\Gamma_\alpha}\) and \({\Gamma_\beta}\) when computing the loss \(L_{\alpha\leftarrow\beta}^M\) within the batch \(M\). We calculate \(w_{\beta\leftarrow\alpha}^M\) in a similar way. Then we normalize them as, \[w_{\alpha\leftarrow\beta}^M+w_{\beta\leftarrow\alpha}^M=1,\]
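One plausible realization of this gradient-based weighting is sketched below, using autograd to measure the accumulated gradient norm over the concept alignment modules for each loss direction; the epsilon term and the exact accumulation scheme are our assumptions rather than Babel's precise implementation.

```python
import torch

def adaptive_weights(loss_a_from_b, loss_b_from_a, align_params, eps: float = 1e-8):
    """align_params: parameters of the concept alignment modules of both towers."""
    def grad_norm(loss):
        grads = torch.autograd.grad(loss, align_params,
                                    retain_graph=True, allow_unused=True)
        sq = [(g ** 2).sum() for g in grads if g is not None]
        return torch.sqrt(torch.stack(sq).sum())

    w_ab = 1.0 / (grad_norm(loss_a_from_b) + eps)   # alpha <- beta
    w_ba = 1.0 / (grad_norm(loss_b_from_a) + eps)   # beta <- alpha
    total = w_ab + w_ba
    return w_ab / total, w_ba / total               # normalized to sum to 1

# Weighted batch loss (Equation 3):
#   w_ab, w_ba = adaptive_weights(L_ab, L_ba,
#                                 list(tower_a.align.parameters())
#                                 + list(tower_b.align.parameters()))
#   loss = (w_ab * L_ab + w_ba * L_ba) / 2
```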

Figure 7: Adaptive training weights of branch and junction modality during the network growth.

Gradients effectively indicate how the loss function varies with the model parameters. If training pairs do not provide new insights to the trunk network during growth, the gradients in the junction modality tower are small, prompting a higher weight that pulls the branch network toward the trunk. When the branch modality tower’s gradients are significant, the assigned weight speeds up its absorption of insights from the trunk network, ensuring alignment in the unified representation space.

Fig. 7 shows the dynamic weight adaptation in the multi-modal alignment network construction using Babel. This merges Wi-Fi as a branch modality into the trunk network, with the skeleton as the junction modality, using the OPERANet dataset for training [17]. Initially, the skeleton modality, enriched with trunk network’s aligned knowledge, is more reliable than the Wi-Fi branch modality, thus, it’s assigned a near-one weight to speed up convergence with the junction modality. After around 6,000 training iterations, alignment is essentially achieved. Then, our dynamic weight adaptation mechanism adjusts to enable knowledge exchange between the junction and branch modalities, creating a comprehensive representation space.

For the training of the prototype network during network growth, we employ the exponential moving average (EMA) methodology. This strategy aids in preserving stability in the prototype representations by slowly incorporating fresh information while safeguarding the accumulated knowledge. We supplement this with knowledge distillation during the EMA process. This technique assists in preserving crucial information gleaned from prior modalities whilst incorporating novel ones.
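A minimal sketch of the EMA update and distillation term for the prototype network during growth is given below, reusing the PrototypeNetwork from the sketch in §5.1; the decay value, the use of an L1 distillation loss, and the teacher/student arrangement are our assumptions about one reasonable realization.

```python
import copy
import torch
import torch.nn.functional as F

prototype = PrototypeNetwork()                   # receives gradient updates during growth
ema_prototype = copy.deepcopy(prototype)         # slowly-updated copy used by the trunk
old_prototype = copy.deepcopy(prototype).eval()  # frozen pre-growth snapshot (teacher)

@torch.no_grad()
def ema_step(decay: float = 0.999):
    """Fold newly learned prototype weights in gradually to keep representations stable."""
    for p_ema, p_new in zip(ema_prototype.parameters(), prototype.parameters()):
        p_ema.mul_(decay).add_(p_new, alpha=1.0 - decay)

def distill_loss(trunk_feats: torch.Tensor) -> torch.Tensor:
    """Knowledge distillation: keep the growing prototype close to its pre-growth behavior."""
    return F.l1_loss(prototype(trunk_feats), old_prototype(trunk_feats))
```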

7 Implementations↩︎

7.1 Data Preparation↩︎

Overall, we utilize five datasets for the alignment, as itemized in Table 1, each comprising paired samples across two modalities. These datasets target human activity recognition (HAR) tasks, but the specific activities differ across datasets. Although these datasets provide activity labels, we adopt a self-supervised training approach and do not use the labels. Throughout every dataset, depth signals are converted into a human skeleton format. As such, we employ the term skeleton to denote the depth modality.

Table 1: Datasets and their corresponding data pairs utilized to train the hexa-modal alignment network.
Dataset Modalities # Pairs
UTD-MHAD [15] IMU and Skeleton \(613\)
MM-Fi [16] LiDAR and Video \(17,528\)
OPERANet [17] Wi-Fi and Skeleton \(25,433\)
XRF55 [26] mmWave and Wi-Fi \(42,900\)
Kinetics-400 [18] Video and Skeleton \(234,619\)

Skeleton and IMU pairs. The UTD-MHAD dataset [15] is used, encompassing skeleton and 9-axis IMU data pairs, captured via the Microsoft Kinect sensor and a wearable inertial sensor with sampling rates of 30Hz and 50Hz, respectively. The dataset contains 27 distinct actions performed by 8 subjects. Each subject repeated each action 4 times, totaling 861 paired samples. We use 613 pairs for training.

LiDAR and video pairs. The MM-Fi dataset [16] is used, which contains 27 distinct actions performed by 40 human subjects. The LiDAR data is collected in point cloud format. MM-Fi provides 17,528 pairs for our training.

Wi-Fi and skeleton pairs. The OPERANet dataset [17] is used, which contains paired Wi-Fi CSI and skeleton data. The Wi-Fi CSI is gathered from the Intel 5300 platform across 30 subcarriers, employing a sampling rate of 1600Hz, with 3 transmitters and 3 receivers. The skeleton data is obtained from the Microsoft Kinect sensor. The dataset encompasses roughly 8 hours of annotated measurements collected in two different rooms with 6 participants performing 6 daily activities. OPERANet provides 25,433 pairs for our training.

mmWave and Wi-Fi pairs. The XRF55 dataset [26] is used, which is collected with a TI IWR6843ISK radar for mmWave and an Intel 5300 NIC for Wi-Fi CSI. It includes HAR data from 39 subjects performing 55 unique actions, each repeated 20 times. In total, 42,900 pairs are provided for training.

Video and skeleton pairs. The Kinetics-400 dataset [18] is used, which contains 400 distinct human action classes, each characterized by at least 400 video clips extracted from YouTube. Each clip, approximately 10 seconds long, portrays a variety of human actions. The skeleton is extracted from the clips using OpenPose [52]. Overall, as a vision-modality dataset, Kinetics-400 provides 234,619 pairs.

7.2 Data Augmentation↩︎

We implement two data augmentation techniques on the raw data, ultimately enlarging the data pairs by 600\(\times\). (i) Down-sampling. Raw pairs undergo down-sampling at different ratios, simulating diverse sampling rates on various devices or accelerating the action at distinct ratios. This method augments the raw pairs by a factor of 300\(\times\). (ii) Action-segmentation. The raw action sequence is randomly truncated, simulating incomplete activity sensing. We ensure the segmented sequence’s shortest length is over 50% of the original length. This method amplifies the raw pairs by a factor of 300\(\times\).
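The sketch below illustrates the two augmentations applied to a synchronized sample pair; the NumPy implementation and the proportional cropping are illustrative, with the >= 50% length constraint taken from the description above.

```python
import numpy as np

def down_sample(seq_a: np.ndarray, seq_b: np.ndarray, ratio: int):
    """Keep every `ratio`-th frame of both modalities, simulating lower sampling rates."""
    return seq_a[::ratio], seq_b[::ratio]

def segment_action(seq_a: np.ndarray, seq_b: np.ndarray, rng: np.random.Generator):
    """Randomly truncate the synchronized pair, keeping at least 50% of the activity."""
    frac = rng.uniform(0.5, 1.0)                 # fraction of the activity to keep
    start = rng.uniform(0.0, 1.0 - frac)         # where the kept span begins
    def crop(seq):
        n = len(seq)
        lo, hi = int(start * n), int((start + frac) * n)
        return seq[lo:hi]
    return crop(seq_a), crop(seq_b)
```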

7.3 Selections of Pre-trained Encoders↩︎

Next we introduce the pre-trained encoders we use for building the modality alignment network.

IMU. We utilize the LIMU-BERT encoder [25], renowned for its proficiency in generating generalized representations. It is pre-trained on a range of IMU datasets.

Skeleton. We utilize the Spatial-Temporal Graph Convolutional Network (ST-GCN) [53] as our encoder, which is pre-trained on extensive datasets, notably the NTU-RGBD [54].

Video. We employ ResNet3D model  [55] as the encoder, which is pre-trained on Kinetics-400 dataset [18].

Wi-Fi. For Wi-Fi CSI, we could not obtain a single powerful pre-trained encoder. Therefore, we apply multiple encoders to augment the Wi-Fi modality tower. Specifically, we utilize a Vision Transformer (ViT) and a combination of a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) as our encoders. They are pre-trained on the UT-HAR [56] dataset.

mmWave. We employ a signal processing based encoder for this modality. We use Doppler fast Fourier transform (FFT) and angle FFT, generating range-Doppler and range-angle heatmaps, respectively. We apply an additional spatial ResNet18 [26] to further extract features from them.

LiDAR. We use the Point Transformer [57], which is pre-trained on the ModelNet40 dataset [58]. Since this encoder cannot extract temporal features, we add an ST-GCN, pre-trained on the NTU-RGBD [54] dataset, as a temporal feature extractor.

7.4 Training Details↩︎

We commence the training process with the IMU and skeleton modalities. Subsequently, we integrate the video modality, aligning it with the pre-existing skeleton modality. Next, we incorporate the Wi-Fi modality into our framework, leveraging the paired Wi-Fi and skeleton data. This is followed by the introduction of the mmWave modality, which is linked with the intermediate Wi-Fi modality. Ultimately, we incorporate the LiDAR modality, capitalizing on its integration with the paired video modality.

We employ the AdamW optimizer [59] with a batch size of 256 and an initial learning rate of \(1\times10^{-4}\). For each phase of network growth, we judiciously allocate a varying number of training epochs, typically up to 500, or cease the training process once convergence is attained. The learning rate for downstream tasks is adjusted between 0.001 and 0.1. We train on two NVIDIA A100 GPUs, spending around 20 hours to align six modalities.
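The schedule above can be summarized as a sequence of (dataset, junction, branch) growth phases; the `grow` routine below is a hypothetical wrapper around one two-modality alignment phase (§5), shown only to make the ordering and hyper-parameters explicit.

```python
# Each phase aligns a branch modality to a junction modality already in the trunk;
# in the very first phase both towers are new and jointly form the initial trunk.
GROWTH_SCHEDULE = [
    # (dataset,      junction,   branch)
    ("UTD-MHAD",     "skeleton", "imu"),
    ("Kinetics-400", "skeleton", "video"),
    ("OPERANet",     "skeleton", "wifi"),
    ("XRF55",        "wifi",     "mmwave"),
    ("MM-Fi",        "video",    "lidar"),
]

for dataset, junction, branch in GROWTH_SCHEDULE:
    grow(trunk, dataset=dataset, junction=junction, branch=branch,   # hypothetical helper
         optimizer="AdamW", lr=1e-4, batch_size=256, max_epochs=500)
```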

8 Evaluation↩︎

We evaluate the pre-trained Babel on a typical downstream sensing task, human activity recognition (HAR). Furthermore, we demonstrate two applications enabled by Babel, namely cross-modality retrieval and LLM integration.

8.1 Evaluation on HAR↩︎

We evaluate Babel on 8 datasets, including 4 multi-modal datasets, namely UTD-MHAD [15], OPERANet [17], XRF55 [26], and MM-Fi [16], and 4 singular-modal datasets, namely UCI [27], Widar3.0 [28], mRI [29], and MSRAction3D [30].

We compare Babel with a broad range of baselines, including the multi-modal sensing baseline Cosmo [21] and SOTA singular-modal sensing baselines, namely LIMU-BERT [31], SenseFi [60], MARS [61], MeteorNet [62], and PointTransformer [57]. We also compare with MLLMs that hold potential for interpreting sensing signals, including OneLLM [19] and M4 [20].

Unless otherwise noted, the results for Babel are obtained from a one-shot setting, where only one labeled sample per class is used to train the downstream classifier. Given the inherent difficulty of obtaining labeled samples for sensing applications, this setting highlights Babel’s performance as a pre-trained network.


8.1.1 Performance on Multi-modal Datasets.↩︎

Table [tab:overall_performance] shows the evaluation results on four multi-modal datasets. The modality alignment technique in Babel significantly improves performance in each individual modality, particularly in sensing modalities. For instance, classification accuracy for 27 human activities in the IMU modality increases from 20.19% to 31.77% after alignment. For the skeleton modality, there is a significant improvement of 12.02%. The Wi-Fi modality sees an approximate 10.74% enhancement. The mmWave modality shows a substantial increase from 30.32% to 50.30%, and the LiDAR modality achieves an accuracy of 43.91%, up from 28.43%. Overall, Babel brings around a 12% average accuracy improvement on the six aligned modalities across various datasets. Such gains are achieved by aligning each modality into a unified representation space, facilitating mutual learning. Sensing modalities benefit significantly, while gains are limited for the video modality.

The unified representation space in Babel allows for effective fusion, as shown in Table [tab:overall_performance]. When the IMU and video modalities are fused, Babel achieves 33.17% accuracy on UTD-MHAD [15], outperforming both the individual IMU and video modalities. Likewise, 58.97% accuracy is achieved on XRF55 [26] when the Wi-Fi and mmWave modalities are merged. Note that even modality combinations (like IMU & video fusion) not paired in the pre-training datasets are evaluated and achieve superior performance, highlighting Babel’s flexibility and offering developers numerous opportunities to choose any single or combined modalities for their tasks.

8.1.2 Performance on Singular-modal Datasets.↩︎

The supervised learning performance of Babel on four full singular-modal datasets is detailed in Table [tab:overall_performance_single]. Owing to the effectiveness of multi-modality alignment, Babel consistently outperforms the SOTA methods for each individual modality. Notably, Babel demonstrates significant improvements for the mmWave and LiDAR modalities, achieving gains of 8.93% and 6.14%, respectively. In the Wi-Fi modality, Babel outperforms SenseFi by 4.4%. For the IMU modality, Babel attains an accuracy of 81.4%. It is important to note that none of the datasets evaluated here were included in the pre-training dataset collection, highlighting the generality of Babel.

Besides performance improvements, individual modalities could be empowered with new capabilities through Babel. For example, Wi-Fi sensing can be considerably improved in terms of its cross-domain capability. To showcase this, we select two settings with different transmitter-receiver arrangements, denoted as S1 and S2, from the OPERANet dataset [17]. The ViT encoder [63] trained on UT-HAR [56] attains an accuracy of 26% for classifying 6 activities on S2; this serves as the baseline for cross-domain Wi-Fi sensing. We then fine-tune the encoder on S1 and test on S2, achieving an accuracy of \(54.5\)%, which showcases the performance of the conventional domain adaptation approach. Finally, we use Babel, which accomplishes \(62.47\)% accuracy without any fine-tuning, evidencing its cross-domain capability.


8.1.3 Comparison with Cosmo [21]↩︎

Cosmo is the SOTA sensing fusion framework, but unlike Babel, it requires all modalities to coexist within one dataset, limiting it to datasets with completely paired data. Thus, to compare with Cosmo, we utilize the same paired IMU-skeleton data from UTD-MHAD [15], on which Cosmo excels. An equal amount of data is employed to train both Cosmo and a bi-modality version of Babel. We train Cosmo and Babel 10 times with different random seeds. In the fusion of the IMU and skeleton modalities, Cosmo achieves an average classification accuracy of 56.3%, while Babel attains 63.02% on UTD-MHAD.

Moreover, Cosmo’s performance can partly be attributed to the integration of an additional network structure and its corresponding training procedure (for each downstream task), referred to as iterative fusion learning. Conversely, we aim to highlight Babel’s efficacy as a pre-trained network with a simple downstream task design. When applying the same downstream network (i.e., an MLP), Babel achieves around a 22% accuracy improvement compared to Cosmo. Furthermore, as an expandable solution, Babel allows aligning more modalities without retraining pre-existing ones, which further enhances Babel’s performance when new modalities are introduced.


8.1.4 Comparison with MLLMs.↩︎

There has been significant development in MLLMs [19], [23]. These models are capable of understanding multi-modal inputs, potentially including sensing modalities like IMU. For comparison, we select typical MLLMs, e.g., OneLLM [19] and M4 [20], and evaluate their performance on UTD-MHAD [15] for HAR tasks. OneLLM and M4 use Meta-Transformer [23] and ImageBind [24] to interpret sensing signals, respectively. The results are summarized in Table [tab:performance_mllms]. Firstly, current MLLMs can only support a limited number of sensing modalities, like IMU. Secondly, they only achieve a classification accuracy of around 5%-6%. In stark contrast, Babel significantly outperforms them with a classification accuracy of 31.77% on IMU while supporting five other sensing modalities.

The reason these MLLMs appear to support sensing modalities but struggle to comprehend IMU data and handle HAR tasks is that they are trained only on the Ego4D dataset [64]. Without sufficient training, these models are restricted to the data they were trained on, limiting their cross-domain capabilities. Furthermore, these MLLMs cannot be trained on other sensing datasets due to data scarcity and the absence of techniques like the pre-trained modality tower and the expandable architecture, which are introduced in Babel.

8.1.5 Ablation Study and Growth Orders.↩︎

The proposed techniques, including the pre-trained modality tower, expandable network architecture, and adaptive training strategy, are all essential for constructing Babel. Particularly, without the pre-trained modality tower, training would not converge due to limited samples. On UTD-MHAD [15], without the prototype network, the accuracy of the previously aligned modality would drop by about 44.7% (relative) on average after introducing a new modality. Without adaptive training, the overall performance would decrease by up to 7.2%.

Thanks to the techniques proposed in Babel, the order of modality growth does not significantly influence the end-to-end performance once the training is sufficient. To evaluate this, we devise four network growth sequences according to different heuristics: (i) random order; (ii) alignment from the most robust to the weakest modality (skeleton, video, LiDAR, IMU, Wi-Fi, mmWave); (iii) alignment based on data diversity, taking into account the number of actions, subjects, and scenes of the used datasets; (iv) alignment based on the data amount of the used datasets, organized from largest to smallest. As shown in Table [tab:performance_align_orders], the growth order does not significantly affect performance. For instance, with different growth orders, the performance on the IMU and Wi-Fi modalities varies by less than 3% and 2%, respectively. This highlights Babel’s robustness.


8.1.6 System Overhead↩︎

The pre-trained hexa-modal Babel takes around 1.1GB on disk, including the pre-trained encoders, concept alignment modules, and the prototype network. Babel takes 1.4-9.92GB of memory, depending on the selected modalities, using FP32 precision. We evaluate Babel’s inference latency on one NVIDIA A100 GPU. The exact inference latency depends on the selected modalities. For instance, per sample, i.e., a sequence of sensing data spanning 3-4 seconds, Babel takes 98.6ms for the IMU modality and 206.7ms for LiDAR; for skeleton, Wi-Fi, and mmWave, Babel takes only around 130ms. When fusing modalities, modality towers can be executed in parallel: Babel takes 265ms for Wi-Fi and video together, and 138ms for IMU and skeleton together.

8.2 Case Study↩︎

8.2.1 Cross-modality retrieval↩︎

The alignment of diverse sensing modalities in Babel potentially opens up the possibilities for cross-modality retrieval applications. This involves obtaining the representations of one modality using signals from other modalities as inputs. Such applications could be promising. For instance, using wireless sensing signals as input to retrieve visual representations could be considered an example of sensing imaging.

To showcase, we construct a prototype designed to retrieve visual representations and generate images using non-visual sensors, such as IMU. Specifically, we align Babel with unCLIP [34], an image-to-image diffusion model. unCLIP employs an image encoder to obtain the embeddings of the input image and then uses these embeddings to guide the diffusion process, thereby generating images that bear stylistic similarities to the input image. We incorporate unCLIP’s image encoder into our Babel network, enabling the sensing modalities to be interpreted by the diffusion module in unCLIP. We use L1 loss to align Babel and unCLIP.
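A rough sketch of the L1 bridge between Babel's shared space and unCLIP's image-embedding space is given below; `unclip_image_encoder` and `babel_embed` are placeholders for the frozen unCLIP image encoder and Babel's tower-plus-prototype path, and the embedding dimensions are assumptions, not values taken from either system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions: 512 for Babel's shared space, 768 for unCLIP's image embeddings.
bridge = nn.Linear(512, 768)

def bridge_loss(image_batch, imu_batch):
    """L1 loss pulling Babel's (IMU) embedding toward unCLIP's image embedding."""
    with torch.no_grad():
        target = unclip_image_encoder(image_batch)   # frozen unCLIP image encoder (placeholder)
    pred = bridge(babel_embed("imu", imu_batch))     # Babel tower + prototype path (placeholder)
    return F.l1_loss(pred, target)

# At inference, bridge(babel_embed("imu", imu_reading)) replaces the image embedding
# that normally conditions unCLIP's diffusion decoder.
```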

Fig. 8 demonstrates the images generated using IMU as input, representing the sensor readings of a person gesturing with their hands. Leveraging unCLIP, the actions captured by the IMU are visually represented. The environmental information and other visual styles are provided through text prompts. We believe this area of research opens up interesting possibilities, offering a pathway to visualizing the physical world through non-visual sensors.

Figure 8: Images generated through cross-modality retrieval. The action information (waving hands) is input via IMU; the environment information is input through text prompts.

8.2.2 Bridge with LLMs.↩︎

The alignment of diverse sensing modalities into a unified representation presents an advantageous prospect for integration with LLMs. To demonstrate, we integrate Babel with Video-LLaMA [35], which is a multi-modal LLM with the ability to understand both visual and audio contents.

We establish the alignment between the video modality in Babel and that in Video-LLaMA. Specifically, we judiciously select the video encoder from Video-LLaMA and construct a modality tower for integration into Babel. We employ the L1 loss in this scenario, ensuring the video encoder of Video-LLaMA remains frozen while all modalities in Babel align towards Video-LLaMA. This strategy aims to generate embeddings of sensing modalities that could potentially be interpreted by Video-LLaMA.

Figure 9: With Babel, Video-LLaMA accepts IMU readings as inputs and conducts a preliminary analysis of the actions represented by these IMU readings.

Fig. 9 provides an impressive illustration where we input an IMU sequence depicting a woman waving her hands. These IMU readings are processed by Babel and subsequently fed into Video-LLaMA. Remarkably, without any specific training of the LLM, it successfully deciphers the action captured by the IMU data and, when prompted, differentiates between diverse actions, such as squatting or waving hands. This exemplifies the potential of bridging sensing and LLMs via the modality alignment introduced by Babel. Our future research will concentrate on improving Babel, aiming to bolster the model’s capability to provide deeper insights and more accurate interpretations of the physical world based on a broader spectrum of sensing modalities, and to bring such capabilities to LLMs.

9 Related Work↩︎

Multi-modal Sensing. The development of multi-modal sensing networks is an emerging research area. Recently, Cosmo [21] pioneered the application of contrastive fusion learning in multi-modal sensing, incorporating the RGB, depth, and IMU modalities. MESEN [22] employs multi-modal contrastive learning to improve the performance of single-modal sensing. In contrast to Cosmo and MESEN, Babel is, to the best of our knowledge, the first expandable pre-trained network for multi-modal sensing, which seamlessly facilitates the alignment of additional modalities from diverse, scattered datasets. The concept of multi-modal sensing is employed across a wide array of applications. For instance, [65] integrates RFID and RGB for recognizing human-object interactions. [66] leverages LiDARs, cameras, and animal-worn IMU and GNSS devices to recognize animal behavior. To locate target individuals, [67] utilizes Wi-Fi Fine Timing Measurements and IMU data to associate individuals in a video with a matched query ID. GaitVibe+ [68] enhances structural vibration-based footstep localization using temporary cameras and vibration sensors for in-home gait analysis. [69] presents an acoustic and camera sensing system that improves range estimation for robotics and other domains. These applications could potentially benefit from Babel.

Modality Alignment and MLLMs. Contrastive Learning (CL) is widely used for modality alignment, particularly between the visual and linguistic modalities [14], [48], [49]. Despite the prevalence of CL, applying it to multi-modal sensing introduces significant challenges. To overcome data scarcity, we introduce several techniques in Babel that are essential for aligning multiple sensing modalities. There is also a growing trend of aligning a broader range of modalities. For example, Meta-Transformer [23] uses a unified frozen visual encoder to derive representations across 12 different modalities. ImageBind [24] combines six modalities using only image-paired data through contrastive learning. However, these approaches provide limited support for sensing modalities, and their performance is suboptimal, as demonstrated in Table [tab:performance_mllms], because without an expandable architecture like Babel they cannot be sufficiently trained on sensing data. There have also been discussions on the ability of LLMs to reason about the physical world [41]. Models such as [23], [24] usually serve as multi-modal encoders integrated with LLMs for understanding sensing signals; for instance, OneLLM [19] employs Meta-Transformer [23], while M4 [20] and AnyMAL [46] use ImageBind [24]. Babel, as a scalable pre-trained network, could potentially advance these MLLMs by enabling them to understand more sensing modalities, which is our future work.

10 Conclusion and Future Work↩︎

We present Babel, an expandable modality alignment framework designed for sensing applications. The pre-trained Babel aligns six prevalent sensing modalities: IMU, skeleton, video, Wi-Fi, LiDAR, and mmWave. Babel demonstrates superior performance on HAR tasks across various datasets compared to an array of baselines. As Babel is a scalable network, we invite the community to further enhance it and align additional helpful modalities into Babel.

References↩︎

[1]
Qifan Pu, Sidhant Gupta, Shyam Gollakota, and Shwetak Patel.2013. . In Proc. of ACM MOBICOM.
[2]
Fadel Adib and Dina Katabi. 2013. . In Proc. of ACM SIGCOMM.
[3]
Mingmin Zhao, Yonglong Tian, Hang Zhao, Mohammad Abu Alsheikh, Tianhong Li, Rumen Hristov, Zachary Kabelac, Dina Katabi, and Antonio Torralba.2018. RF-Based 3D Skeletons.
[4]
Jue Wang, Deepak Vasisht, and Dina Katabi.2014. . In Proc. of ACM SIGCOMM.
[5]
Teng Wei and Xinyu Zhang. 2016. . In Proc. of ACM MobiCom.
[6]
Teng Wei and Xinyu Zhang. 2015. . In Proc. of ACM MobiCom.
[7]
2021. . In Proc. of ACM MobiSys.
[8]
Rajalakshmi Nandakumar, Shyam Gollakota, and Nathaniel Watson.2015. . In Proc. of ACM MobiSys.
[9]
Wenguang Mao, Jian He, and Lili Qiu.2016. . In Proc. of ACM MobiCom.
[10]
Pengfei Zhou, Mo Li, and Guobin Shen.2014. . In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking(MobiCom ’14).
[11]
Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi.2018. . In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR ’18).
[12]
Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V. Smith, and Flora D. Salim.2022. . Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.6, 3(2022), 108:1–108:28.
[13]
Yash Jain, Chi Ian Tang, Chulhong Min, Fahim Kawsar, and Akhil Mathur.2022. . Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.6, 1(2022), 17:1–17:28.
[14]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.2021. Learning Transferable Visual Models From Natural Language Supervision.  [cs.CV].
[15]
Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz.2015. . In 2015 IEEE International conference on image processing (ICIP). IEEE, 168–172.
[16]
Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie.2023. MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing.  [eess.SP].
[17]
Mohammud J. Bocus, Wenda Li, Shelly Vishwakarma, Roget Kou, Chong Tang, Karl Woodbridge, Ian Craddock, Ryan McConville, Raul Santos-Rodriguez, Kevin Chetty, and Robert Piechocki.2021. OPERAnet: A Multimodal Activity Recognition Dataset Acquired from Radio Frequency and Vision-based Sensors.  [eess.SP].
[18]
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman.2017. The Kinetics Human Action Video Dataset.  [cs.CV].
[19]
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue.2023. OneLLM: One Framework to Align All Modalities with Language.  [cs.CV].
[20]
Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, and Mengwei Xu. 2024. . In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (Washington D.C., DC, USA) (ACM MobiCom ’24). Association for Computing Machinery, New York, NY, USA, 279–295. https://doi.org/10.1145/3636534.3649361.
[21]
Xiaomin Ouyang, Xian Shuai, Jiayu Zhou, Ivy Wang Shi, Zhiyuan Xie, Guoliang Xing, and Jianwei Huang.2022. . In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking. 324–337.
[22]
Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, and Jiming Chen.2023. . (2023).
[23]
Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, and Xiangyu Yue.2023. Meta-Transformer: A Unified Framework for Multimodal Learning.  [cs.CV].
[24]
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra.2023. ImageBind: One Embedding Space To Bind Them All.  [cs.CV].
[25]
Huatao Xu, Pengfei Zhou, Rui Tan, Mo Li, and Guobin Shen.2022. . GetMobile: Mobile Computing and Communications26, 3(2022), 39–42.
[26]
Fei Wang, Yizhe Lv, Mengdie Zhu, Han Ding, and Jinsong Han.2024. . Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.8, 1, Article 21(mar2024), 34 pages.
[27]
Jorge-L Reyes-Ortiz, Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita.2016. . Neurocomputing171(2016), 754–767.
[28]
Yi Zhang, Yue Zheng, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2022. . IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 11 (2022), 8671–8688. https://doi.org/10.1109/TPAMI.2021.3105387.
[29]
Sizhe An, Yin Li, and Umit Ogras. 2022. mRI: Multi-modal 3D Human Pose Estimation Dataset using mmWave, RGB-D, and Inertial Sensors. [cs.CV]. https://arxiv.org/abs/2210.08394.
[30]
Wanqing Li, Zhengyou Zhang, and Zicheng Liu.2010. . In 2010 IEEE computer society conference on computer vision and pattern recognition-workshops. IEEE, 9–14.
[31]
Huatao Xu, Pengfei Zhou, Rui Tan, Mo Li, and Guobin Shen.2021. . In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. 220–233.
[32]
Jianfei Yang, Xinyan Chen, Dazhuo Wang, Han Zou, Chris Xiaoxuan Lu, Sumei Sun, and Lihua Xie.2023. SenseFi: A Library and Benchmark on Deep-Learning-Empowered WiFi Human Sensing.  [cs.LG].
[33]
Sizhe An, Yin Li, and Umit Ogras.2022. . Advances in Neural Information Processing Systems35(2022), 27414–27426.
[34]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.2022. Hierarchical Text-Conditional Image Generation with CLIP Latents.  [cs.CV].
[35]
Hang Zhang, Xin Li, and Lidong Bing.2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding.  [cs.CL].
[36]
Isibor Kennedy Ihianle, Augustine O Nwajana, Solomon Henry Ebenuwa, Richard I Otuka, Kayode Owa, and Mobolaji O Orisatoki.2020. . IEEE Access8(2020), 179028–179038.
[37]
Jiaxin Li, Danfeng Hong, Lianru Gao, Jing Yao, Ke Zheng, Bing Zhang, and Jocelyn Chanussot.2022. . International Journal of Applied Earth Observation and Geoinformation112(2022), 102926.
[38]
Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D Lane, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar.2018. . Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies1, 4(2018), 1–27.
[39]
Seungeun Chung, Jiyoun Lim, Kyoung Ju Noh, Gague Kim, and Hyuntae Jeong.2019. . Sensors19, 7(2019), 1716.
[40]
Batool Salehi, Guillem Reus-Muns, Debashri Roy, Zifeng Wang, Tong Jian, Jennifer Dy, Stratis Ioannidis, and Kaushik Chowdhury.2022. . IEEE Transactions on Vehicular Technology71, 7(2022), 7639–7655.
[41]
Huatao Xu, Liying Han, Qirui Yang, Mo Li, and Mani Srivastava.2024. Penetrative AI: Making LLMs Comprehend the Physical World.  [cs.AI].
[42]
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu.2024. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security.  [cs.HC].
[43]
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. . CoRR abs/1807.03748 (2018). https://arxiv.org/abs/1807.03748.
[44]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. . CoRR abs/1906.05849 (2019). https://arxiv.org/abs/1906.05849.
[45]
Anjun Chen, Xiangyu Wang, Shaohao Zhu, Yanxu Li, Jiming Chen, and Qi Ye.2023. mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar.  [cs.CV].
[46]
Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, and Anuj Kumar. 2023. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model. [cs.LG]. https://arxiv.org/abs/2309.16058.
[47]
Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel.2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning.  [cs.LG].
[48]
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer.2022. LiT: Zero-Shot Transfer with Locked-image text Tuning.  [cs.CV].
[49]
Elan Rosenfeld, Preetum Nakkiran, Hadi Pouransari, Oncel Tuzel, and Fartash Faghri.2022. APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations.  [cs.LG].
[50]
Leo Breiman.1996. . Machine learning24(1996), 123–140.
[51]
Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes.2004. . In Proceedings of the twenty-first international conference on Machine learning. 18.
[52]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh.2017. . In Proceedings of the IEEE conference on computer vision and pattern recognition. 7291–7299.
[53]
Sijie Yan, Yuanjun Xiong, and Dahua Lin.2018. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition.  [cs.CV].
[54]
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang.2016. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis.  [cs.CV].
[55]
Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri.2018. A Closer Look at Spatiotemporal Convolutions for Action Recognition.  [cs.CV].
[56]
Siamak Yousefi, Hirokazu Narui, Sankalp Dayal, Stefano Ermon, and Shahrokh Valaee.2017. . IEEE Communications Magazine55, 10(2017), 98–104.
[57]
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. 2020. . CoRR abs/2012.09164 (2020). https://arxiv.org/abs/2012.09164.
[58]
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao.2015. . In Proceedings of the IEEE conference on computer vision and pattern recognition. 1912–1920.
[59]
Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. [cs.LG].
[60]
Jianfei Yang, Xinyan Chen, Dazhuo Wang, Han Zou, Chris Xiaoxuan Lu, Sumei Sun, and Lihua Xie. 2023. SenseFi: A Library and Benchmark on Deep-Learning-Empowered WiFi Human Sensing. [cs.LG]. https://arxiv.org/abs/2207.07859.
[61]
Sizhe An and Umit Y. Ogras. 2021. . ACM Transactions on Embedded Computing Systems (TECS) 20, 5s (2021), 1–22.
[62]
Xingyu Liu, Mengyuan Yan, and Jeannette Bohg.2019. . In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9246–9255.
[63]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. . arXiv preprint arXiv:2010.11929 (2020).
[64]
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. 2022. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18995–19012.
[65]
Xiulong Liu, Dongdong Liu, Jiuwu Zhang, Tao Gu, and Keqiu Li.2021. . In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking. 296–308.
[66]
Ziwei Wang, Jiajun Liu, Reza Arablouei, Greg Bishop-Hurley, Melissa Matthews, and Paulo Borges.2022. . In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking. 900–902.
[67]
Hansi Liu, Abrar Alali, Mohamed Ibrahim, Hongyu Li, Marco Gruteser, Shubham Jain, Kristin Dana, Ashwin Ashok, Bin Cheng, and Hongsheng Lu.2021. . In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services(MobiSys ’21). Association for Computing Machinery, New York, NY, USA, 499–500.
[68]
Yiwen Dong, Jingxiao Liu, and Hae Young Noh.2022. . In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems(SenSys ’22). ACM.
[69]
Lewis Girod and Deborah Estrin. 2001. . In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No. 01CH37180), Vol. 3. IEEE, 1312–1320.

  1. \(^\dagger\)The work was done during an internship at Microsoft Research.↩︎