T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Tanvir Mahmud
University of Texas at Austin


Yapeng Tian
University of Texas at Dallas


Diana Marculescu
University of Texas at Austin


Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main.

1 Introduction

While observing a conversation between two individuals, we can easily associate the audio signal with the corresponding speaking person in the visual scene. This remarkable ability to perceive audio-visual correspondence stems from our extensive exposure to both single-source sounds and multi-source mixtures in everyday life. Inspired by this human capability, significant research efforts [1]–[4] have been dedicated to developing visual sound source localization approaches in recent years.

Figure 1: Comparison of our T-VSL with state-of-the-art methods on single-source (Top Row) and multi-source (Bottom Row) sound localization on the VGGSound Sources [5], VGGSound-Instruments [4], and MUSIC [6] benchmark datasets. We use the same setup for all methods for a fair comparison.

Visual sound source localization aims to locate the visual regions representing sound sources present in a video. Earlier methods on single-source localization [5], [7]–[12] mostly used audio-visual correspondence as guidance for localizing sounding objects in each frame. These approaches, developed for single sound source localization, suffer a significant performance drop when localizing sources in multi-source mixtures. Multi-source localization from mixtures is very challenging, as the model must learn to distinguish each sounding source within a mixture and establish their cross-modal relationships, without access to individual source sounds. Earlier methods [1], [4], [13], [14] on multi-source localization attempt to infer fine-grained cross-modal associations directly from noisy multi-source mixtures, which often leads to sub-optimal performance in practice because improper source correspondences are learned. In this paper, we tackle the problem by introducing a unified solution for localizing visual sound sources in both single and multi-source mixtures.

The primary challenge of this task stems from the difficulty in disentangling the cross-modal correspondence of each sounding object from natural mixtures, given the absence of one-to-one corresponding single-source pairs. Additionally, the presence of silent visual objects and noise from invisible background sources further complicates this alignment process. To solve this problem, our key idea is to leverage the text modality as coarse supervision for disentangling categorical audio-visual correspondence in natural mixtures, which differentiates our method from existing works. Unlike the audio and visual modalities, which may contain significant noise from the different sources present in mixtures, a textual representation can explicitly discriminate across multiple sources. However, a critical obstacle to utilizing text guidance in visual sound source localization is grounding fine-grained audio and visual features with their textual representation. CLIP [15] introduced vision-language grounding learned from 400M web-scraped image-text pairs. More recently, AudioCLIP [16] added the audio modality to the existing CLIP architecture, thereby generating tri-modal feature grounding through large-scale training. Our approach capitalizes on the tri-modal joint embedding space of AudioCLIP to disentangle one-to-one audio-visual correspondence in natural mixtures.

To this end, we propose a novel text-guided multi-source localization framework that can discover fine-grained audio-visual semantic correspondence in multi-source mixtures. Initially, we detect the class instances of visual sounding objects in the frame using the noisy mixture features from the AudioCLIP image and audio encoders. Then, the text representation of each detected sounding source class instance is extracted with the AudioCLIP text encoder, which serves as coarse guidance for audio and visual feature separation. Specifically, we disentangle the categorical audio and visual features of each sounding source using the coarse text guidance through additional audio and image conditioning blocks, respectively. Finally, the extracted categorical audio-visual semantic features are further aligned through an audio-visual correspondence block for localizing each sounding source. In comparison with prior multi-source baselines, our approach can selectively localize the semantic visual region of each class of sounding objects in mixtures with a varying number of sources. Extensive experiments on MUSIC [6], VGGSound [17], and VGGSound-Instruments [4] demonstrate significant performance improvements compared to existing single and multi-source baselines. Moreover, our method shows promising zero-shot transferability to unseen classes at test time. In addition, thorough ablation studies and qualitative analysis showcase the effectiveness of the proposed framework.

Our main contributions can be summarized as follows:

  • We propose a novel text-guided multi-source localization framework to disentangle one-to-one audio-visual correspondence from natural mixtures.

  • Our work is the first to introduce AudioCLIP for utilizing the tri-modal feature grounding from large-scale pre-training in solving multi-source localization problems.

  • Our proposed framework can operate with a flexible number of sources and shows promising zero-shot performance on unseen audio-visual classes.

  • Extensive experiments on benchmark datasets clearly demonstrate the superiority of our proposed framework over other state-of-the-art methods.

2 Related Work

Figure 2: The proposed text-guided visual sound source localization (T-VSL) framework (for \(K = 2\)). We use the text modality to disentangle the fine-grained audio-visual correspondence from mixtures. Initially, we detect the audio-visual class instances from multi-source mixtures using the AudioCLIP joint embedding space. Later, the categorical text features of the \(K\) detected classes are used as coarse guidance in the conditioning blocks to extract categorical visual and audio features. Afterwards, cross-modal feature alignment on the extracted categorical features is performed in the audio-visual correspondence block. Finally, the cosine similarity between the mean categorical audio features and the aligned visual features is used recursively to extract the localization map of each class.

Visual Sound Source Localization. Visual sound source localization aims to locate the visual regions of sounding objects in video frames. Prior work explored diverse methods for audio-visual feature alignment, ranging from traditional statistical machine learning approaches [18]–[20] to recent deep neural networks [5], [7]–[13]. However, most of these self-supervised and weakly-supervised methods are designed for localizing single-source sounds. Attention10k [7] proposed a two-stream architecture that incorporates cross-attention to locate sounding sources. Later, motion cues present in videos were incorporated for finer localization [9]. Recently, various contrastive-learning-based methods were explored in LVS [5], EZ-VSL [10], and SLAVC [11] for single-source localization. Despite their promising performance, these single-source localization methods can hardly discriminate audio-visual correspondence in multi-source mixtures.

Sounds mostly appear in mixtures in the presence of significant background noise. The main challenge of multi-source localization is to disentangle the audio-visual correspondence in mixtures without having access to clean single-source audio-visual pairs. Several recent works have explored multi-source localization with explicit approaches for mixtures [1], [4], [13], [14]. To selectively isolate the silent objects in visual frames, DSOL [14] proposed a two-stage weakly-supervised training method. Mix-and-Localize [4] proposed self-supervised training with a contrastive random walk to inherently align each sounding source with its visual representation. Recently, AVGN [1] introduced weakly-supervised grouping of category-aware features with learnable prompts. However, these approaches tackle the audio-visual correspondence in mixtures without any explicit single-source guidance, thereby resulting in sub-optimal performance. In contrast, we leverage the text representation of fine-grained sounding sources to disentangle one-to-one audio-visual correspondence from multi-source mixtures. To the best of our knowledge, this work is the first to utilize text representations of sound sources for solving multi-source localization.

Audio, Visual, and Text Grounding. Large-scale pre-training on web-scraped data has been used to generate grounded features across modalities. CLIP [15] first introduced vision-language grounding using \(400\)M image-text pairs, and has since been extensively studied for zero-shot classification [21]–[26], text-video retrieval [27]–[29], open-vocabulary segmentation [30]–[36], and object detection [37]–[40]. Recently, AudioCLIP [16] and Wav2CLIP [41] introduced additional audio grounding into the original CLIP model by learning on image, text, and audio triplets. Very recently, a CLIP-based single-source localization method [42] was introduced that replaces the text-conditioning branch of a supervised image segmentation baseline, trained with dense supervision, with audio conditioning. In contrast to this concurrent work, our focus lies primarily on weakly supervised sound source localization in multi-source mixtures, leveraging the self-supervised AudioCLIP model.

3 Method

Given an audio mixture and a video frame, our objective is to spatially localize each sounding object in the frame. We propose a novel text-guided multi-source localization framework to disentangle the fine-grained audio-visual correspondence, which is illustrated in Fig. 2. To suppress background noise, we initially detect the class instances common to both the audio and visual modalities (Sec. 3.2). In particular, we leverage tri-modal AudioCLIP encoders to detect \(K\) visual-sounding class instances from \(N\) classes (\(K \leq N\)). The primary challenge in multi-source localization lies in establishing one-to-one audio-visual correspondences for individual sources without access to single-source samples. To overcome this, we leverage the text modality to generate a coarse representation of the \(K\) detected sources in the mixture. Later, by exploiting the multi-modal grounding of AudioCLIP, we extract categorical audio and visual features from the multi-source representation using text guidance (Sec. 3.3). Rather than aligning audio-visual mixture features as in prior work, we introduce categorical audio-visual feature alignment to further enhance one-to-one correspondence (Sec. 3.4). Finally, the cosine similarity of categorical audio features and aligned visual features is used to extract semantic localization maps.

3.1 Preliminaries

In this section, we formulate the multi-source localization problem, and revisit the AudioCLIP [16] for audio, visual, and text grounding.

Problem Formulation. Let’s assume the dataset contains a total of \(N\) classes of sounding events across all videos. Given a video frame and an audio mixture, our objective is to localize \(K (K \leq N)\) sound sources within the frame. For each audio-visual mixture pair, we can extract \(Y = \{y_i \}_{i=1}^{N}\) binary ground truth labels, along with categorical text representations for the on-screen sound sources in the mixture. During training, we are limited to video-level sounding class entities, excluding bounding boxes or mask-level annotations, thus constituting weakly-supervised learning.

Revisit AudioCLIP for multi-modal grounding. AudioCLIP [16] learns a joint embedding space of audio, visual, and text triplets through contrastive learning. Let’s consider the dataset \(\mathcal{D} = \{a, v, t\}_{i=1}^{Z}\) containing audio spectrogram \(a \in \mathbb{R}^{M \times F}\), text \(t \in \mathbb{R}^{l}\), and image \(v \in \mathbb{R}^{C \times H \times W}\) triplets. Also, separate audio encoder (\(E_a: \mathbb{R}^{M\times F} \rightarrow \mathbb{R}^{D}\)), visual encoder (\(E_v: \mathbb{R}^{C\times H\times W} \rightarrow \mathbb{R}^{D}\)), and text encoder (\(E_t: \mathbb{R}^{l} \rightarrow \mathbb{R}^{D}\)) are used for each modality. Hence, the extracted audio, visual, and text features are represented by \(A = E_a(a)\), \(V = E_v(v)\), and \(T = E_t(t)\), respectively, where \(A, V, T \in \mathbb{R}^{D}\). Later, the InfoNCE contrastive loss (\(\mathcal{L}_{cnt}\)) [43] across each modality pair is used for feature grounding, given by: \[\begin{align} & \mathcal{L}_{aclip} = \mathcal{L}_{cnt} (A, T) + \mathcal{L}_{cnt} (V, T) + \mathcal{L}_{cnt} (A, V) \end{align}\tag{1}\] This training objective, combined with large-scale pre-training on web-scraped data, establishes a grounded joint embedding space across all three modalities.
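As an illustration of Eq. (1), the following is a minimal NumPy sketch of a pairwise contrastive objective summed over the three modality pairs. It is not the AudioCLIP implementation: the symmetric (two-direction) form of InfoNCE, the temperature value of 0.07, and the function names `info_nce` and `audioclip_loss` are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row of x to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE for a batch of paired embeddings x[i] <-> y[i].

    ASSUMPTION: symmetric form and temperature are illustrative choices.
    """
    logits = l2_normalize(x) @ l2_normalize(y).T / temperature  # (B, B)
    idx = np.arange(len(x))
    loss_xy = -log_softmax(logits, axis=1)[idx, idx].mean()  # x -> y direction
    loss_yx = -log_softmax(logits, axis=0)[idx, idx].mean()  # y -> x direction
    return 0.5 * (loss_xy + loss_yx)

def audioclip_loss(A, V, T):
    """Sum of the three pairwise contrastive terms, as in Eq. (1)."""
    return info_nce(A, T) + info_nce(V, T) + info_nce(A, V)

rng = np.random.default_rng(0)
B, D = 8, 1024
A, V, T = (rng.normal(size=(B, D)) for _ in range(3))
loss_random = audioclip_loss(A, V, T)   # unrelated embeddings
loss_aligned = audioclip_loss(A, A, A)  # perfectly aligned triplets
```

In this toy setting the loss for perfectly aligned triplets is close to zero, while unrelated embeddings give a loss near \(3 \log B\), which is the behavior the contrastive objective is meant to reward.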

3.2 Audio-Visual Class Instance Detection

The target sounding objects for localization must be present in both the audio mixture and the corresponding video frame. However, audio mixtures may contain noise from off-screen background sources, while video frames may include non-sounding objects. To address these challenges, we first detect the class instances of visible sounding objects in the frame. Then, we utilize this detection to suppress redundant features in the mixture, ensuring that the localization focuses on the relevant sounding objects.

Token extraction from Uni-modal encoders. Multiple sounding objects can be spatially located in different regions of the reference frame, while the audio mixture may contain the temporally distributed signal of different sources. This makes it challenging to detect multiple class instances with uni-dimensional single-token features generated from AudioCLIP encoders. Instead of a single pooled token representation, we extract numerous patch tokens from audio and visual encoders with simple modifications, such that \(\widehat{E}_a: \mathbb{R}^{M\times F} \rightarrow \mathbb{R}^{n_a \times D}\) and \(\widehat{E}_v: \mathbb{R}^{C\times H\times W} \rightarrow \mathbb{R}^{n_v \times D}\). Here, \(n_a = m \times f\) and \(n_v = h \times w\) represent the number of audio and visual patch tokens, respectively. Afterwards, we apply linear projectors on these extracted patch tokens, such that \(P_a: \mathbb{R}^{n_a \times D} \rightarrow \mathbb{R}^{n_a \times D}\) and \(P_v: \mathbb{R}^{n_v \times D} \rightarrow \mathbb{R}^{n_v \times D}\). Thus, extracted audio and visual patch tokens are represented as \({A'} = P_a(\widehat{E}_a(a))\), \({V'} = P_v(\widehat{E}_v(v))\), respectively, where \({A'} \in \mathbb{R}^{n_a\times D}\) and \({V'} \in \mathbb{R}^{n_v\times D}\). Simultaneously, categorical text features \(\mathcal{T} = {\{e^t_i\}_{i=1}^{N}} \in \mathbb{R}^{N \times D}\) representing all \(N\) sound source classes in the dataset are extracted.

Multi-label class instance detection. With the categorical text features \(\mathcal{T}\), mixture audio \({A'}\), and visual \({V'}\) patch tokens, we detect class entities present in the mixture. Initially, we extract the mean-pooled audio-visual features \(X' \in \mathbb{R}^{2D}\) by simple concatenation, and then apply a fusion projector \(P_f : \mathbb{R}^{2D} \rightarrow \mathbb{R}^{D}\) to extract mixture audio-visual features \(F_{av} \in \mathbb{R}^{D}\). Finally, the cosine similarity between the \(N\) class text features \(\mathcal{T}\) and \(F_{av}\) is used to detect mixture class entities \(\widehat{Y} \in \mathbb{R}^{N}\), given by: \[\begin{align} & {F}_{av} = P_f (X'), \;{X}' = [\mathtt{Mean}(A') \oplus \mathtt{Mean}(V')],\\ & \widehat{Y} = \mathtt{sim} \left(\mathcal{T}, {F_{av}}\right) \end{align}\] where \([\;\oplus \;]\) denotes the concatenation operator, and \(\mathtt{sim}(X, Y)\) denotes cosine similarity between \(X\) and \(Y\). We formulate a multi-label classification objective \(\mathcal{L}_{mcid}\) by applying the binary cross-entropy loss \(\mathcal{L}_{bce}(\cdot)\) on each prediction using video-level ground-truth labels \(Y \in \mathbb{R}^{N}\) as: \[\mathcal{L}_{mcid} = \sum_{i=1}^N \mathcal{L}_{bce} (y_i, \widehat{y}_i)\] In particular, we train the audio, visual, and fusion projectors (\(P_a, P_v, P_f\)) on noisy multi-source data while keeping the AudioCLIP backbones frozen. Hence, these projectors assist in extracting relevant foreground features from the noisy audio and visual mixture features extracted from AudioCLIP.
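The detection step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the fusion projector is modeled as a single random matrix `W_f`, and the mapping from cosine similarity to a probability (a sigmoid with a hypothetical `scale` factor) is not specified in the text and is chosen here only so that binary cross-entropy is well defined.

```python
import numpy as np

def cosine_sim(X, y, eps=1e-8):
    """Cosine similarity between each row of X (N, D) and a vector y (D,)."""
    return (X @ y) / (np.linalg.norm(X, axis=1) * np.linalg.norm(y) + eps)

def detect_classes(A_tok, V_tok, T_all, W_f, scale=10.0):
    """Per-class scores in (0, 1) for one audio-visual pair."""
    x = np.concatenate([A_tok.mean(axis=0), V_tok.mean(axis=0)])  # X' in R^{2D}
    f_av = W_f @ x                                                # F_av in R^{D}
    sims = cosine_sim(T_all, f_av)                                # (N,)
    # ASSUMPTION: a scaled sigmoid turns similarities into probabilities.
    return 1.0 / (1.0 + np.exp(-scale * sims))

def multilabel_bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy summed over the N classes."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)).sum())

rng = np.random.default_rng(0)
n_a, n_v, D, N = 60, 49, 64, 5           # D is reduced here to keep the demo small
A_tok = rng.normal(size=(n_a, D))        # stand-in for audio patch tokens A'
V_tok = rng.normal(size=(n_v, D))        # stand-in for visual patch tokens V'
T_all = rng.normal(size=(N, D))          # stand-in for the N class text features
W_f = rng.normal(size=(D, 2 * D)) / np.sqrt(2 * D)  # hypothetical fusion projector
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # video-level weak labels
y_hat = detect_classes(A_tok, V_tok, T_all, W_f)
loss = multilabel_bce(y, y_hat)
```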

Table 1: Performance comparison of single-source localization on VGGSound-single, VGGSound-Instruments, and MUSIC-Solo datasets. For the fair comparison, we use pre-trained AudioCLIP audio and image encoders for baseline methods.
| Method | VGGSound-Single AP(%) | VGGSound-Single IoU@0.5(%) | VGGSound-Single AUC(%) | VGGSound-Instruments AP(%) | VGGSound-Instruments IoU@0.3(%) | VGGSound-Instruments AUC(%) | MUSIC-Solo AP(%) | MUSIC-Solo IoU@0.5(%) | MUSIC-Solo AUC(%) |
|---|---|---|---|---|---|---|---|---|---|
| Current SOTA Based Methods | | | | | | | | | |
| OTS [44] ECCV18 | 38.9 | 43.2 | 43.7 | 50.9 | 56.9 | 38.4 | 76.2 | 57.8 | 48.9 |
| CoarsetoFine [13] ECCV20 | 39.4 | 43.1 | 44.8 | 51.1 | 57.5 | 39.1 | 76.3 | 58.1 | 49.7 |
| LVS [5] CVPR21 | 39.7 | 44.6 | 45.3 | 51.5 | 57.9 | 39.4 | 76.1 | 57.1 | 49.3 |
| EZ-VSL [10] ECCV22 | 40.6 | 45.1 | 47.2 | 52.3 | 58.8 | 39.9 | 78.4 | 58.7 | 51.1 |
| Mix-and-Localize [4] CVPR22 | 41.8 | 45.6 | 47.4 | 53.8 | 59.3 | 41.5 | 78.9 | 59.9 | 52.7 |
| DSOL [14] NeurIPS20 | - | 46.8 | 47.9 | - | 61.6 | 43.7 | - | 62.7 | 54.6 |
| MarginNCE [45] ICASSP23 | 42.5 | 46.9 | 48.3 | 56.9 | 61.8 | 44.2 | 83.1 | 63.3 | 55.2 |
| FNAC [2] CVPR23 | 43.3 | 48.4 | 49.1 | 57.2 | 63.2 | 45.3 | 84.6 | 64.5 | 56.4 |
| AVGN [1] CVPR23 | 44.1 | 49.6 | 49.5 | 59.3 | 64.7 | 46.1 | 85.4 | 65.8 | 56.9 |
| Alignment [3] ICCV23 | 45.3 | 50.8 | 50.2 | 60.4 | 65.8 | 48.7 | 86.1 | 66.4 | 57.2 |
| CLIP-Based Baseline Methods | | | | | | | | | |
| AudioCLIP [16] ICASSP22 | 42.8 | 47.4 | 48.5 | 58.3 | 62.7 | 45.2 | 83.8 | 63.1 | 55.7 |
| Wav2CLIP [41] ICASSP22 | 39.3 | 43.6 | 44.7 | 53.8 | 57.2 | 42.0 | 78.9 | 59.9 | 51.2 |
| T-VSL (Ours) | 48.1 | 53.7 | 52.9 | 64.6 | 69.5 | 51.4 | 88.2 | 68.5 | 60.1 |

3.3 Audio and Visual Conditioning Blocks

Establishing one-to-one correspondences for each source in multi-source mixtures is particularly challenging, as both audio and visual features contain mixed representations of various sources. Prior works [1], [4] on multi-source localization have attempted to learn audio-visual alignment intrinsically without explicit single-source guidance. However, in the absence of single-source data, these methods struggle to adequately disentangle multi-source features. In contrast, we leverage the categorical text features of \(K\) classes present in the mixture as coarse guidance for fine-grained feature disentanglement from multi-source mixtures. We use ground truth weak labels \(Y\) of representative audios in the mixture to extract text embedding of \(K\) visual sounding sources \(\mathcal{T}_K = \{e^t_k\}_{k=1}^K \subseteq \mathcal{T}\) where \(e^t_k \in \mathbb{R}^{1 \times D}\).

With the categorical text features \(\mathcal{T}_K\) of \(K\) sources, we independently disentangle the multi-object visual patch tokens \({V'}\) and mixture audio patch tokens \({A'}\) by using cross-attention \(\phi^v (\cdot)\) and \(\phi^a (\cdot)\), respectively, as: \[\begin{align} & \widetilde{V}_k = \phi^v(V', e^t_k, V'), \;\mathnormal{f}^v_k = \mathtt{Mean}(\widetilde{V}_k), \\ & \widetilde{\mathnormal{f}}^v_k = P_{vc}(\mathnormal{f}^v_k), \;\;\forall k = \{1, \dots, K\} \end{align}\] \[\begin{align} & \widetilde{A}_k = \phi^a(A', e^t_k, A'), \;\mathnormal{f}^a_k = \mathtt{Mean}(\widetilde{A}_k), \\ & \widetilde{\mathnormal{f}}^a_k = P_{ac}(\mathnormal{f}^a_k), \;\;\forall k = \{1, \dots, K\} \end{align}\] \[\begin{align} \phi(A, B, C) = \left(\dfrac{A B^\top}{\lVert A \rVert \, \lVert B \rVert}\right) \odot C \end{align}\] where \(\widetilde{\mathnormal{f}}^a_k, \widetilde{\mathnormal{f}}^v_k, \mathnormal{f}^a_k, \mathnormal{f}^v_k \in\mathbb{R}^{1\times D}\), \(\widetilde{V}_k \in \mathbb{R}^{n_v\times D}\), \(\widetilde{A}_k \in \mathbb{R}^{n_a\times D}\), \(P_{ac}, P_{vc} : \mathbb{R}^D \rightarrow \mathbb{R}^D\) are linear audio and visual projectors on the conditioned mean features, respectively, and \(\odot\) is the Hadamard product.
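To make the conditioning operation concrete, here is a small NumPy sketch that reads \(\phi(A, B, C)\) as cosine-similarity weighting: each patch token in \(C\) is scaled by the cosine similarity between the matching row of \(A\) and the text query \(B\). The per-row normalization, the broadcasting of the \((n \times 1)\) similarity column over the feature dimension, and the helper name `condition_on_text` are assumptions of this sketch; the projector is replaced by an identity matrix for the toy example.

```python
import numpy as np

def phi(A, B, C, eps=1e-8):
    """Scale each row of C by cos(A[j], B), with A: (n, D), B: (1, D), C: (n, D)."""
    sim = (A @ B.T) / (np.linalg.norm(A, axis=1, keepdims=True) * np.linalg.norm(B) + eps)
    return sim * C  # (n, 1) similarity column broadcast over the D features

def condition_on_text(tokens, e_t, W_proj):
    """Text-conditioned tokens and the projected mean feature for one class."""
    cond = phi(tokens, e_t, tokens)  # tilde{V}_k or tilde{A}_k
    f = cond.mean(axis=0)            # mean-pooled categorical feature
    return cond, W_proj @ f          # linear projector (hypothetical weights)

# Toy example: one token aligned with the text query, one orthogonal to it.
e_t = np.array([[1.0, 0.0, 0.0, 0.0]])
V_tok = np.array([[2.0, 0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0]])
V_k, f_v = condition_on_text(V_tok, e_t, np.eye(4))
```

The aligned token passes through unchanged while the orthogonal token is suppressed to zero, which is the intended effect of using the class text feature as a coarse filter on mixture tokens.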

Additionally, to guide the extraction of categorical audio (\(\widetilde{A}_k\)) and visual (\(\widetilde{V}_k\)) patch tokens, we introduce coarse supervision on the projected mean patch tokens \(\widetilde{\mathnormal{f}}^a_k\), \(\widetilde{\mathnormal{f}}^v_k\) using the \(N\)-class text embedding \(\mathcal{T} \in \mathbb{R}^{N \times D}\). Hence, the class conditioning loss (\(\mathcal{L}_{cls}\)) is given by: \[\begin{align} & \mathbf{e}^v_k = \mathtt{sim}(\mathcal{T}, \widetilde{\mathnormal{f}}^v_k);\;\mathbf{e}^a_k = \mathtt{sim}(\mathcal{T}, \widetilde{\mathnormal{f}}^a_k) \\ & \mathcal{L}_{cls} = \sum_{k=1}^K \mathcal{L}_{ce}( \mathbf{e}^v_k, \mathbf{h}_k) + \mathcal{L}_{ce}(\mathbf{e}^a_k, \mathbf{h}_k) \end{align}\] where \(\mathbf{h}_k \in \mathbb{R}^{N}\) denotes a one-hot encoding of the class label, and \(\mathbf{e}^v_k \in \mathbb{R}^{N}\), \(\mathbf{e}^a_k \in \mathbb{R}^{N}\) are the visual and audio class predictions for the \(k^{th}\) source, respectively.
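The class conditioning loss can be illustrated as follows. This is a hedged sketch rather than the authors' code: cosine similarities are treated as logits of a softmax cross-entropy, and the temperature of 0.07 is an assumption introduced here, since the text does not state how similarities are scaled before the cross-entropy.

```python
import numpy as np

def cosine_logits(T_all, f, eps=1e-8):
    """Cosine similarity of a feature f (D,) against all N class text features."""
    return (T_all @ f) / (np.linalg.norm(T_all, axis=1) * np.linalg.norm(f) + eps)

def cross_entropy(logits, target, temperature=0.07):
    """Softmax cross-entropy for one sample; temperature is an ASSUMPTION."""
    z = logits / temperature
    z = z - z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def class_conditioning_loss(T_all, f_v, f_a, labels):
    """Sum over the K sources of the visual and audio cross-entropy terms."""
    total = 0.0
    for k, label in enumerate(labels):
        total += cross_entropy(cosine_logits(T_all, f_v[k]), label)
        total += cross_entropy(cosine_logits(T_all, f_a[k]), label)
    return total

N, D = 5, 16
T_all = np.eye(N, D)        # toy orthogonal class text features
labels = [1, 3]             # class indices of the K = 2 sources
f_good = T_all[labels]      # features aligned with the correct classes
f_bad = T_all[[0, 2]]       # features aligned with the wrong classes
loss_good = class_conditioning_loss(T_all, f_good, f_good, labels)
loss_bad = class_conditioning_loss(T_all, f_bad, f_bad, labels)
```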

3.4 Audio-Visual Correspondence Block

While text representation provides coarse guidance for disentangling fine-grained spatio-temporal audio and visual features, it has inherent limitations in practice. For example, multiple instances of a sounding object class can appear in the visual frame, including silent instances, which are difficult to distinguish using simple text class representation. To address this challenge, we introduce an audio-visual correspondence block to further refine the alignment of audio and visual features based on extracted categorical features.

With the disentangled categorical audio (\(\widetilde{A}_k\)) and visual (\(\widetilde{V}_k\)) patch tokens, we apply cross-attention \(\phi^{av}(\cdot)\) to enhance audio-visual patch alignment as: \[\begin{align} & g_k^a = \mathtt{Mean}(\widetilde{A}_k), \;\widehat{\mathnormal{g}}^a_k = P_{av}(\mathnormal{g}^a_k) \\ & \widehat{V}_k = \phi^{av}(\widetilde{V}_k, \widehat{\mathnormal{g}}^a_k, \widetilde{V}_k), \;\mathnormal{g}^v_k = \mathtt{Mean}(\widehat{V}_k) \\ & \widehat{\mathnormal{g}}^v_k = P_{va}(\mathnormal{g}^v_k), \;\forall k = \{1, \dots, K\} \end{align}\] where \(P_{av}, P_{va} : \mathbb{R}^D \rightarrow \mathbb{R}^D\) are linear projectors. Afterwards, to correspond between projected mean categorical audio and visual features \(\widehat{\mathnormal{g}}^a_k, \widehat{\mathnormal{g}}^v_k \in \mathbb{R}^{D}\), we apply audio-visual contrastive loss (\(\mathcal{L}_{av}\)) given by: \[\begin{align} & \mathcal{L}_{av} = \mathcal{L}_{cnt}(\widehat{g}_k^v, \widehat{g}_k^a) \\ & \mathcal{L}_{total} = \mathcal{L}_{av} + \mathcal{L}_{cls} + \mathcal{L}_{mcid} \end{align}\] where \(\mathcal{L}_{total}\) is the total accumulated loss.

Inference: During inference, we estimate cosine similarity across projected mean audio feature \(\widehat{g}_k^a \in \mathbb{R}^{D}\) and aligned visual patch tokens (\(\widehat{V}_k = \{\widehat{v}_k^j\}_{j=1}^{n_v}, \widehat{v}_k^j \in \mathbb{R}^{D}\)). Finally, after reshaping and up-scaling through bilinear interpolation, we generate \(K\)-source heat-maps \(\mathbf{{H}} \in \mathbb{R}^{K \times H \times W}\).
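The inference step can be sketched as below, assuming the aligned visual tokens \(\widehat{V}_k\) and projected audio features \(\widehat{g}^a_k\) are already available (random stand-ins are used here). The `bilinear_resize` helper is a simple corner-aligned bilinear interpolation written for this example; the exact interpolation convention used in the original implementation is not stated in the text.

```python
import numpy as np

def cosine_map(V_hat, g_a, h, w, eps=1e-8):
    """Cosine similarity of each visual token with the audio feature, as an h x w map."""
    sims = (V_hat @ g_a) / (np.linalg.norm(V_hat, axis=1) * np.linalg.norm(g_a) + eps)
    return sims.reshape(h, w)

def bilinear_resize(img, H, W):
    """Corner-aligned bilinear up-scaling of a 2D map to (H, W)."""
    h, w = img.shape
    ys, xs = np.linspace(0, h - 1, H), np.linspace(0, w - 1, W)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx  # blend along x, upper rows
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx  # blend along x, lower rows
    return top * (1 - wy) + bot * wy                       # blend along y

rng = np.random.default_rng(0)
K, h, w, D = 2, 7, 7, 32                   # D is reduced here to keep the demo small
V_hat = rng.normal(size=(K, h * w, D))     # stand-in for aligned visual tokens
g_a = rng.normal(size=(K, D))              # stand-in for projected audio features
coarse = np.stack([cosine_map(V_hat[k], g_a[k], h, w) for k in range(K)])
heatmaps = np.stack([bilinear_resize(c, 224, 224) for c in coarse])  # (K, H, W)
```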

Table 2: Performance comparison of multi-source localization on VGGSound-Duet, VGGSound-Instruments, and MUSIC-Duet datasets. For the fair comparison, we use pre-trained AudioCLIP audio and image encoders for baseline methods.
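| Method | VGGSound-Duet CAP(%) | VGGSound-Duet CIoU@0.3(%) | VGGSound-Duet AUC(%) | VGGSound-Instruments CAP(%) | VGGSound-Instruments CIoU@0.1(%) | VGGSound-Instruments AUC(%) | MUSIC-Duet CAP(%) | MUSIC-Duet CIoU@0.3(%) | MUSIC-Duet AUC(%) |
|---|---|---|---|---|---|---|---|---|---|
| Current SOTA Single-Source Methods | | | | | | | | | |
| OTS [44] ECCV18 | 23.9 | 26.4 | 27.5 | 28.7 | 77.8 | 18.2 | 51.2 | 29.4 | 24.3 |
| CoarsetoFine [13] ECCV20 | - | 28.4 | 27.9 | - | 78.8 | 19.7 | - | 31.8 | 24.1 |
| LVS [5] CVPR21 | - | 31.8 | 30.1 | - | 81.1 | 23.2 | - | 33.1 | 26.9 |
| EZ-VSL [10] ECCV22 | - | 32.4 | 30.6 | - | 80.3 | 22.9 | - | 34.3 | 26.7 |
| MarginNCE [45] ICASSP23 | 27.1 | 33.2 | 31.2 | 33.5 | 82.3 | 24.7 | 55.1 | 36.1 | 28.5 |
| FNAC [2] CVPR23 | 29.8 | 35.1 | 33.4 | 35.4 | 84.1 | 26.8 | 57.3 | 37.9 | 30.6 |
| Alignment [3] ICCV23 | 30.4 | 35.8 | 33.9 | 35.6 | 84.9 | 27.1 | 57.8 | 38.3 | 30.9 |
| Current SOTA Multi-Source Methods | | | | | | | | | |
| Mix-and-Localize [4] CVPR22 | 29.5 | 35.6 | 34.1 | 36.3 | 84.5 | 26.7 | 57.4 | 38.1 | 30.7 |
| DSOL [14] NeurIPS20 | - | 36.9 | 34.6 | - | 85.9 | 28.4 | - | 38.8 | 31.3 |
| AVGN [1] CVPR23 | 31.9 | 37.8 | 35.4 | 38.1 | 86.7 | 28.8 | 59.2 | 39.3 | 32.4 |
| CLIP-Based Baseline Methods | | | | | | | | | |
| AudioCLIP [16] ICASSP22 | 28.4 | 34.9 | 32.8 | 34.7 | 83.8 | 25.9 | 56.1 | 36.9 | 29.2 |
| Wav2CLIP [41] ICASSP22 | 26.3 | 31.4 | 28.9 | 31.2 | 80.3 | 22.2 | 52.6 | 32.2 | 27.2 |
| T-VSL (ours) | 35.7 | 40.1 | 37.9 | 41.8 | 89.6 | 31.5 | 62.9 | 43.2 | 35.9 |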

4 Results and Discussions

4.1 Experimental setup

Datasets. We use the MUSIC, VGGSound-Instruments, and VGGSound datasets for performance evaluation. The MUSIC [6] dataset contains \(445\) solo music videos of \(11\) instruments and \(142\) duet music videos of \(8\) instruments from YouTube. Following prior work [1], we use MUSIC-Solo for single-source and MUSIC-Duet for multi-source localization. From MUSIC-Solo, we use \(350\) videos for training and the remaining \(95\) for evaluation. For MUSIC-Duet, we use \(120\) videos for training and the remaining \(22\) videos for evaluation. VGGSound-Instruments [4] is a subset of the VGGSound dataset [17] containing around \(32\)k video clips of \(10\)s duration from \(37\) musical instruments. Apart from the musical instrument datasets, we gather around \(150\)k videos of \(10\)s duration from the original VGGSound dataset [17], representing single-source sounds of \(221\) categories, such as animals, people, nature, instruments, etc. For the single-source evaluation on VGGSound, we use the full test set of \(5158\) videos, while on VGGSound-Instruments, we use \(446\) videos following prior work [1], [4]. For multi-source training and evaluation on both VGGSound and VGGSound-Instruments, we randomly concatenate two frames from different sounding sources to produce a multi-source input image with a size of \(448\times224\), and mix the single-source audios, following prior work [1], [4].
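The synthetic multi-source construction described above can be sketched as follows. Two details are assumptions of this example and are not stated in the text: the two frames are placed side by side (so the width becomes \(448\)), and the two waveforms are mixed by averaging after truncating to the shorter clip. The function name `make_duet` is hypothetical.

```python
import numpy as np

def make_duet(frame_a, frame_b, wav_a, wav_b):
    """Build one synthetic two-source sample from two single-source samples."""
    # ASSUMPTION: frames are (H, W, C) and are concatenated along the width.
    image = np.concatenate([frame_a, frame_b], axis=1)
    # ASSUMPTION: truncate to the shorter clip and average the waveforms.
    n = min(len(wav_a), len(wav_b))
    mixture = 0.5 * (wav_a[:n] + wav_b[:n])
    return image, mixture

frame_a = np.zeros((224, 224, 3), dtype=np.uint8)      # toy frame for source A
frame_b = np.full((224, 224, 3), 255, dtype=np.uint8)  # toy frame for source B
wav_a = np.ones(16000)                                 # toy waveform for source A
wav_b = -np.ones(12000)                                # toy waveform for source B
image, mixture = make_duet(frame_a, frame_b, wav_a, wav_b)
```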

Evaluation Metrics. For evaluating single-source localization performance, we use Intersection over Union (IoU), average precision (AP), and Area Under Curve (AUC) following prior work [1], [4]. For the multi-source localization evaluation, we use Class-aware Average precision (CAP), Class-aware IoU (CIoU), and Area Under Curve (AUC) following prior work [1], [4]. Similarly, for fair comparison with the prior work [1], we use the same thresholds in these metrics: we use IoU@0.5 and CIoU@0.3 for the MUSIC-Solo and MUSIC-Duet datasets, IoU@0.3 and IoU@0.5 for single-source VGGSound-Instruments and VGGSound datasets, and CIoU@0.1 and CIoU@0.3 for multi-source VGGSound-Instruments and VGGSound.
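For readers unfamiliar with thresholded IoU metrics, the sketch below shows one common way to compute an IoU@\(\tau\) score: each heatmap is binarized, its IoU with the ground-truth mask is computed, and the score is the percentage of samples whose IoU reaches \(\tau\). The min-max normalization and the binarization threshold of \(0.5\) are assumptions of this example; the exact protocol follows prior work [1], [4] and may differ.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def iou_at(heatmaps, gt_masks, iou_thresh=0.5, bin_thresh=0.5):
    """Percentage of samples whose binarized heatmap reaches IoU >= iou_thresh."""
    hits = []
    for heat, gt in zip(heatmaps, gt_masks):
        # ASSUMPTION: min-max normalize, then binarize at bin_thresh.
        norm = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
        hits.append(iou(norm >= bin_thresh, gt) >= iou_thresh)
    return 100.0 * float(np.mean(hits))

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True                 # ground-truth region
good = gt.astype(float)             # heatmap that matches the region exactly
bad = np.zeros((8, 8))
bad[0:2, 0:2] = 1.0                 # heatmap on a disjoint region
score = iou_at([good, bad], [gt, gt], iou_thresh=0.5)
```

Here one of the two toy samples overlaps the ground truth perfectly and the other not at all, so half of the samples count as hits.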

Implementation details. We use the AudioCLIP [16] pre-trained encoders, which comprise a ResNet-50 [8] image encoder, an ESResNeXt [46] audio encoder, and a transformer-based [47] text encoder. For a fair comparison, we reproduce the results of existing methods with the AudioCLIP image and audio encoders, instead of using separately pre-trained encoders. We use \(224\times 224\) resolution for the single-source input image, \(D=1024\), \(n_v = 49\) for the \(7\times 7 (h\times w)\) spatial maps generated from the visual encoder, and \(n_a=60\) for the \(10 \times 6 (m\times f)\) time-frequency map generated from the audio encoder. We use the Adam optimizer [48] with a batch size of \(256\) and a learning rate of \(10^{-4}\).

CLIP-Based Baseline methods. We adopt several CLIP-based baseline methods for the sound source localization task, which are described as follows:

  • AudioCLIP [16]: We fine-tune the pre-trained image and audio encoders of AudioCLIP on the target dataset. As in our method, we estimate the localization heatmap for the sounding sources from the cosine similarity across patch tokens.

  • Wav2CLIP [41]: Similar to AudioCLIP baseline, we experiment with Wav2CLIP audio and image encoders to extract localization heatmaps of sounding sources.

Figure 3: We present qualitative comparisons on challenging multi-source localization with SOTA single and multi-source baseline methods. Here, blue color represents high-attention values to the sounding object, and red color represents low-attention values. The proposed T-VSL can selectively isolate the sounding regions from the background and generates more precise localization maps for sounding sources.

4.2 Comparison to SOTA Baselines

Single-source localization: We present the quantitative results on single-source localization in Tab. 1. We apply the AudioCLIP image and audio encoders in all baseline methods for a fair comparison with our method. We observe some performance improvements with several SOTA methods [1]–[3] over the AudioCLIP baseline. However, we also notice a reduction in performance relative to the AudioCLIP baseline in several methods [4], [10], [44]. We hypothesize that this is due to the alteration of the AudioCLIP grounding by the modified contrastive losses. In contrast, by utilizing all three modalities for sound source localization, the proposed T-VSL achieves the best performance over all CLIP-based baselines as well as SOTA methods. T-VSL outperforms the current SOTA single-source localization method Alignment [3] by \(+2.9\), \(+3.7\), and \(+2.1\) IoU on the VGGSound-Single, VGGSound-Instruments, and MUSIC-Solo datasets, respectively. Moreover, T-VSL achieves \(+6.3\), \(+6.8\), and \(+5.4\) higher IoUs over the AudioCLIP baseline [49]. We hypothesize that the enhanced refinement of noisy audio and visual features with text guidance in T-VSL effectively contributes to the performance improvement.

Multi-source localization: We present performance comparison in multi-source sound localization in Tab. 2. As before, we apply the AudioCLIP image and audio encoders in all baseline methods for the fair comparison with our methods. We observe some performance improvements, mostly in SOTA multi-source methods [1], [4], [14], over the AudioCLIP baseline. In most single source methods, we observe performance loss in multi-source cases over the baseline. The proposed T-VSL achieves the best performance that outperforms the SOTA multi-source baseline AVGN [1] by \(+2.3\), \(+2.9\), and \(+3.9\) CIoUs on multi-source VGGSound-Duet, VGGSound-Instruments, and MUSIC-Duet datasets, respectively. Also, T-VSL achieves \(+5.2\), \(+5.8\), and \(+6.3\) higher CIoUs over the AudioCLIP baseline. These significant performance gains on multi-source benchmarks demonstrate the effectiveness of the proposed T-VSL in disentangling multi-source mixtures.

Qualitative comparisons: In addition, we present qualitative comparisons on multi-source localization in Fig. 3 among the single-source baseline Alignment [3], the self-supervised multi-source baseline Mix-and-Localize [4], the weakly-supervised multi-source baseline AVGN [1], and the proposed T-VSL. The single-source SOTA baseline Alignment [3] struggles in multi-source localization due to the lack of audio-visual correspondence, mostly in the presence of noisy audio and silent visual objects. The multi-source baselines [1], [4], which lack any guidance for fine-grained source localization, also underperform on challenging multi-source mixtures. In contrast, our method generates superior localization maps, which shows the effectiveness of T-VSL in multi-source feature disentanglement.

4.3 Ablation on T-VSL building blocks↩︎

We present the ablation study of different building blocks of T-VSL in Tab. 3 under both single-source and multi-source settings. The vanilla baseline contains AudioCLIP image and audio encoders that directly operate on the input data. By integrating the audio-visual correspondence (AVC) block on extracted mixture patch tokens, we notice a considerable performance increase in single-source (by \(+1.1\) AP and \(+0.7\) IoU@0.5) and multi-source localization (by \(+1.3\) CAP and \(+1.4\) CIoU@0.3). By integrating the audio and image conditioning blocks following audio-visual instance detection for separating fine-grained class features, we observe additional improvements of \(+1.8\) AP and \(+2.3\) AP in single-source, and \(+3.1\) CAP and \(+3.6\) CAP in multi-source localization, respectively. Finally, the combination of all building blocks results in the best performance, improving the baseline by \(+5.3\) AP and \(+6.3\) IoU@0.5 in single-source and by \(+7.3\) CAP and \(+5.2\) CIoU@0.3 in multi-source localization. This further demonstrates the effective disentanglement of audio-visual features facilitated by intermediate text guidance.
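The gains quoted above can be recomputed from the rows of Tab. 3. The short sketch below does this arithmetic; the row labels are our reading of the ablation description (baseline, AVC only, audio or image conditioning added on top of AVC, all blocks) and are not names taken from the table itself.

```python
# Rows of Tab. 3 in the order they are listed. The labels are assumptions
# based on the ablation text, not identifiers from the paper.
METRICS = ("AP", "IoU@0.5", "CAP", "CIoU@0.3")
ROWS = {
    "baseline":   (42.8, 47.4, 28.4, 34.9),
    "avc":        (43.9, 48.1, 29.7, 36.3),
    "audio_cond": (45.7, 50.9, 32.8, 38.1),
    "image_cond": (46.2, 51.4, 33.3, 38.6),
    "all_blocks": (48.1, 53.7, 35.7, 40.1),
}


def delta(new, old):
    """Per-metric gain of configuration `new` over configuration `old`."""
    return {
        metric: round(n - o, 1)
        for metric, n, o in zip(METRICS, ROWS[new], ROWS[old])
    }


if __name__ == "__main__":
    for new, old in [("avc", "baseline"), ("audio_cond", "avc"),
                     ("image_cond", "avc"), ("all_blocks", "baseline")]:
        print(f"{new} vs {old}: {delta(new, old)}")
```

Under this reading, the conditioning blocks are compared against the AVC row, while the full model is compared against the vanilla baseline.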

Table 3: Ablation study on the audio conditioning block, image conditioning block, and audio-visual correspondence (AVC) block of T-VSL in single-source and multi-source localization tasks.

| Audio Cond. | Image Cond. | AVC | VGGSound-Single AP(%) | VGGSound-Single IoU@0.5(%) | VGGSound-Duet CAP(%) | VGGSound-Duet CIoU@0.3(%) |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 42.8 | 47.4 | 28.4 | 34.9 |
| ✗ | ✗ | ✓ | 43.9 | 48.1 | 29.7 | 36.3 |
| ✓ | ✗ | ✓ | 45.7 | 50.9 | 32.8 | 38.1 |
| ✗ | ✓ | ✓ | 46.2 | 51.4 | 33.3 | 38.6 |
| ✓ | ✓ | ✓ | 48.1 | 53.7 | 35.7 | 40.1 |

Table 4: Ablation on zero-shot transfer across single and multi-source datasets with a 50%-50% class split. We report IoU@0.5(%) for MUSIC-Solo, CIoU@0.3(%) for MUSIC-Duet, and CIoU@0.1(%) for VGGSound-Instruments. We use the same AudioCLIP encoders for all baselines.

| Target Dataset | Proposed T-VSL | Mix-and-Localize [4] | Alignment [3] | FNAC [2] |
|---|---|---|---|---|
| MUSIC-Duet | 55.8 | 50.9 | 51.7 | 51.2 |
| MUSIC-Solo | 80.5 | 75.1 | 77.6 | 77.4 |
| VGGSound-Instr. | 81.7 | 76.8 | 76.5 | 76.1 |

4.4 Zero-shot transfer across datasets↩︎

We can perform zero-shot transfer across datasets with the proposed T-VSL by simply replacing the \(N\)-class text prompts in Fig. 2. Since other weakly-supervised methods cannot perform zero-shot transfer, we present comparisons with the self-supervised Mix-and-Localize [4], Alignment [3], and FNAC [2] methods in Tab. 4. To ensure a fair comparison, we train all methods uniformly with a 50%-50% class split and the same AudioCLIP encoders. Our method achieves significantly better zero-shot transfer performance on the target datasets than the compared approaches. In particular, we achieve \(+4.9\), \(+5.4\), and \(+4.9\) improvements over the multi-source Mix-and-Localize [4] on MUSIC-Duet, MUSIC-Solo, and VGGSound-Instruments, respectively. We also achieve \(+4.1\), \(+2.9\), and \(+5.2\) higher scores than the SOTA single-source baseline, Alignment [3]. These results suggest that T-VSL generalizes better to zero-shot localization of unseen classes.
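The idea of transferring by swapping text prompts can be illustrated with a toy sketch. It is not the T-VSL pipeline: the 3-dimensional vectors stand in for embeddings in a joint space such as AudioCLIP's, and the function names `cosine` and `localize` are hypothetical. The only point it makes is that the localization routine stays fixed while the set of class prompts changes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize(patch_features, text_embeddings):
    """Return one similarity score per image patch for every class prompt."""
    return {
        name: [cosine(patch, emb) for patch in patch_features]
        for name, emb in text_embeddings.items()
    }


# Toy embeddings (assumed values): two classes seen in training, one unseen.
train_prompts = {"guitar": [1.0, 0.0, 0.0], "violin": [0.0, 1.0, 0.0]}
unseen_prompts = {"flute": [0.0, 0.0, 1.0]}

# Two image patches described in the same toy embedding space.
patches = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]

seen_maps = localize(patches, train_prompts)
# Zero-shot transfer: same routine, only the prompt set is replaced.
unseen_maps = localize(patches, unseen_prompts)
best_patch = max(range(len(patches)), key=lambda i: unseen_maps["flute"][i])
```

In this toy setup the second patch is the best match for the unseen "flute" prompt, even though no "flute" prompt was used alongside the training classes.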

4.5 Robustness to a higher number of sources↩︎

Unlike prior work [4], the proposed T-VSL is not limited by the number of sources present in training mixtures. In Tab. 5, we compare its robustness to a higher number of test sources on the VGGSound dataset against the SOTA multi-source method AVGN [1] and the self-supervised single-source methods FNAC [2] and Alignment [3]. We train all methods on the VGGSound-Single dataset with the same encoders for a fair comparison. Our method consistently maintains a significantly higher CIoU score across all test scenarios, and its relative performance drop is notably smaller as the number of sources in the mixture grows. These results further demonstrate the robustness of the proposed T-VSL in disentangling multi-source mixtures.
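To make the "smaller relative performance drop" concrete, the following sketch computes the relative CIoU@0.3 drop from two to four test sources using the values in Tab. 5. The helper name `relative_drop` is ours.

```python
# CIoU@0.3 (%) from Tab. 5, indexed by the number of test sources.
CIOU = {
    "T-VSL":     {2: 39.6, 3: 35.7, 4: 29.5},
    "AVGN":      {2: 36.8, 3: 30.8, 4: 22.4},
    "Alignment": {2: 35.4, 3: 28.9, 4: 19.3},
    "FNAC":      {2: 34.1, 3: 27.1, 4: 18.8},
}


def relative_drop(scores, start=2, end=4):
    """Percentage of the `start`-source score lost at `end` sources."""
    return 100.0 * (scores[start] - scores[end]) / scores[start]


drops = {method: round(relative_drop(s), 1) for method, s in CIOU.items()}

if __name__ == "__main__":
    for method, drop in drops.items():
        print(f"{method}: {drop}% relative drop from 2 to 4 sources")
```

With these numbers, T-VSL loses about 25.5% of its two-source score at four sources, compared with about 39.1% for AVGN, 45.5% for Alignment, and 44.9% for FNAC.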

Table 5: Robustness to a higher number of sources in the VGGSound dataset when trained with single-source data. The CIoU@0.3(%) score is reported. The same AudioCLIP encoders are used for all baselines.

| Test Source No. | Proposed T-VSL | AVGN [1] | Alignment [3] | FNAC [2] |
|---|---|---|---|---|
| 2 | 39.6 | 36.8 | 35.4 | 34.1 |
| 3 | 35.7 | 30.8 | 28.9 | 27.1 |
| 4 | 29.5 | 22.4 | 19.3 | 18.8 |

Table 6: Ablation study on the effect of learnable text prompts on VGGSound-Single and VGGSound-Duet datasets. Integration of learnable prompts results in noticeable performance improvements in both single and multi-source localization.

| Prompt Length | VGGSound-Single AP(%) | VGGSound-Single IoU@0.5(%) | VGGSound-Duet CAP(%) | VGGSound-Duet CIoU@0.3(%) |
|---|---|---|---|---|
| 0 | 48.1 | 53.7 | 35.7 | 40.1 |
| 2 | 48.5 | 54.1 | 36.0 | 40.4 |
| 4 | 49.2 | 54.6 | 36.8 | 41.3 |
| 8 | 50.3 | 55.2 | 37.1 | 42.5 |
| 16 | 50.2 | 55.0 | 37.4 | 42.9 |
| 32 | 49.7 | 54.5 | 37.1 | 42.7 |

4.6 Use of learnable text prompts↩︎

To disentangle multi-source mixtures, we primarily use the class label of each sounding source as the text prompt (e.g., dog barking, snake hissing). Inspired by the conditional text prompt learning in [50], we study the use of learnable prompts along with class label representations. Instead of using only the class label representation, a set of learnable embeddings is integrated with the class label embedding, which provides additional flexibility to adapt the text prompt to the target objective. We analyze the sound source localization performance with different lengths of learnable prompts and report the results in Table 6. We notice considerable performance improvements from optimizing learnable prompts in both single and multi-source cases. However, such prompts limit zero-shot use on unseen classes.
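A minimal sketch of how learnable embeddings might be combined with class label embeddings is given below. It assumes the learnable vectors are prepended to the class label token embeddings, as is common in prompt learning; the source does not state the placement, and the function names, initialization scale, and toy dimensions are all hypothetical. In practice the context vectors would be trained by gradient descent, which this sketch omits.

```python
import random


def init_learnable_prompt(length, dim, seed=0):
    """Create `length` context vectors of size `dim` with small random values.

    These stand in for trainable parameters; no optimization is performed here.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]


def build_prompt(context, class_tokens):
    """Concatenate learnable context vectors with class label token embeddings.

    Assumption: context comes first, followed by the class label tokens.
    """
    return list(context) + list(class_tokens)


# Toy token embeddings for the class label "dog barking" (two tokens, dim 4).
class_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

# Prompt length 0 reduces to the plain class label prompt (first row of Table 6).
plain_prompt = build_prompt(init_learnable_prompt(0, 4), class_tokens)

# Prompt length 8 prepends eight learnable vectors to the same class tokens.
learned_prompt = build_prompt(init_learnable_prompt(8, 4), class_tokens)
```

Because the same context vectors are tuned on the training classes, a prompt built this way is tied to those classes, which is consistent with the limitation on zero-shot use noted above.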

5 Conclusion↩︎

In this paper, we propose T-VSL, a novel text-guided multi-source visual sound source localization framework that can disentangle one-to-one audio-visual correspondence from multi-source mixtures. We leverage the text modality to guide fine-grained feature separation and localization from the noisy audio and visual features of mixtures by exploiting the joint embedding space of AudioCLIP. Compared with SOTA multi-source baselines, our method shows superior zero-shot transfer across datasets. Moreover, our method demonstrates notable robustness on challenging test mixtures with a higher number of sources than seen during training. In both single-source and multi-source localization, our method largely outperforms existing weakly-supervised and self-supervised baselines on three benchmark datasets.


This research was supported in part by ONR Minerva program, iMAGiNE - the Intelligent Machine Engineering Consortium at UT Austin, and a UT Cockrell School of Engineering Doctoral Fellowship.

6 Significant Differences Between Concurrent CLIP-SSL and T-VSL↩︎

In comparison with the concurrent CLIP-based sound source localization framework CLIP-SSL [42], our proposed T-VSL has several significant differences for multi-source localization from mixtures. We highlight the major differences as follows:

1) Use of Self-supervised Pre-trained Encoders. CLIP-SSL uses an off-the-shelf pre-trained mask generator that relies on large-scale, densely supervised pre-training on image segmentation datasets. However, sound source localization often demands complex spatio-temporal reasoning across the audio and vision modalities, which is absent in the image segmentation task. In contrast, we directly use the self-supervised pre-trained AudioCLIP [16] model as our backbone and introduce a weakly-supervised sound source localization framework. Therefore, our method inherently learns audio-visual correspondence for sound source localization without being constrained, as CLIP-SSL is, by large-scale supervised pre-training on image segmentation.

2) Text Guidance as Weak Supervision to Noisy Audio and Vision. The primary focus of CLIP-SSL is to replace the text query encoder of the supervised baseline with an audio encoder. However, environmental audio contains significant noise from background sources, in contrast to the cleaner text modality. In addition, the presence of silent objects in visual scenes makes audio-visual correspondence more challenging. Instead of replacing one modality (text with audio), the proposed T-VSL introduces joint learning across the audio, vision, and text modalities, utilizing the grounded tri-modal embedding space of AudioCLIP. Our approach specifically leverages the text modality as weak supervision to learn audio-visual correspondence in noisy mixtures, which effectively exploits all three modalities.

3) Disentanglement of Multi-Source Mixtures. The proposed T-VSL attempts to solve the multi-source localization problem in a weakly supervised manner without access to single-source audio-visual cues. Since both the audio and visual modalities contain noise from background sources, it is particularly challenging to learn their correspondence in multi-source scenarios. By leveraging the text representation of single-source sounds, we introduce multi-source audio-visual feature disentanglement for enhanced localization performance. In contrast, CLIP-SSL focuses on directly learning audio-visual correspondence following existing single-source baselines, without explicitly tackling the multi-source localization problem. Such an approach often struggles to learn audio-visual correspondence in challenging multi-source mixtures when single-source sounds and the corresponding visual objects are not available.

Figure 4: We present additional qualitative comparisons on challenging multi-source localization with SOTA single and multi-source baseline methods. Here, blue color represents high-attention values to the sounding object, and red color represents low-attention values. Similar to our prior observation, the proposed T-VSL generates more precise localization maps for sounding sources by selectively isolating the sounding regions from the background.

Table 7: Quantitative comparison of single-stage and two-stage architectures on VGGSound-Single and VGGSound-Duet datasets. The same AudioCLIP encoders are used. The proposed two-stage method achieves superior performance compared to its single-stage counterpart.

| Stage Number | Method | VGGSound-Single AP(%) | VGGSound-Single IoU@0.5(%) | VGGSound-Duet CAP(%) | VGGSound-Duet CIoU@0.3(%) |
|---|---|---|---|---|---|
| Single-Stage | AVGN (CVPR23) | 44.1 | 49.6 | 31.9 | 37.8 |
| Single-Stage | T-VSL | 46.8 | 51.5 | 33.7 | 38.7 |
| Two-Stage | T-VSL | 48.1 | 53.7 | 35.7 | 40.1 |

7 Comparison between one-stage and two-stage architectures in T-VSL↩︎

We introduce a two-stage architecture in the proposed T-VSL that consists of audio-visual class instance detection followed by iterative localization of each sounding source present in the multi-source mixture. In general, this two-stage approach simplifies the challenging multi-source localization problem by first detecting the common audio-visual instances present in both the audio and visual modalities. We also study a one-stage alternative of the proposed T-VSL following AVGN [1], obtained by removing the audio-visual class instance detection stage and incorporating all \(N\)-class text embeddings of sounding sources in both the audio and visual conditioning blocks, without explicitly separating the \(K (K \leq N)\) class instances. We present the quantitative comparisons on VGGSound-Single and VGGSound-Duet datasets in Table 7. The two-stage method in T-VSL achieves improvements of \(+2.2\) IoU@0.5 on VGGSound-Single and \(+1.4\) CIoU@0.3 on VGGSound-Duet compared to its single-stage counterpart. We hypothesize that the proposed two-stage method greatly reduces the effect of background noise when learning audio-visual correspondence from natural mixtures. In contrast, the single-stage method does not explicitly suppress background classes, and therefore introduces additional noise from background sources into the audio-visual conditioning blocks.
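The difference between the two variants can be sketched as a choice of which class prompts reach the conditioning blocks. The detection rule below (keeping classes whose audio and image scores both exceed a threshold) is an assumption made for illustration; the function names, the threshold, and the scores are hypothetical and not taken from the paper.

```python
def detect_common_classes(audio_scores, image_scores, threshold=0.5):
    """Keep classes that score above `threshold` in both audio and image.

    Assumed rule: a class counts as a sounding instance only if it is
    supported by both modalities.
    """
    return [
        name
        for name, audio_score in audio_scores.items()
        if audio_score >= threshold and image_scores.get(name, 0.0) >= threshold
    ]


def conditioning_classes(audio_scores, image_scores, two_stage=True, threshold=0.5):
    """Return the class prompts passed to the audio and image conditioning blocks."""
    if two_stage:
        # Two-stage: only the K detected classes are used (K <= N).
        return detect_common_classes(audio_scores, image_scores, threshold)
    # One-stage: all N class prompts are used, including background classes.
    return list(audio_scores)


# Toy scores (assumed): "dog" is heard but not seen, "drum" is seen but silent.
audio_scores = {"guitar": 0.9, "violin": 0.8, "dog": 0.6, "drum": 0.1}
image_scores = {"guitar": 0.8, "violin": 0.7, "dog": 0.2, "drum": 0.7}

two_stage_classes = conditioning_classes(audio_scores, image_scores, two_stage=True)
one_stage_classes = conditioning_classes(audio_scores, image_scores, two_stage=False)
```

In this toy case the two-stage variant keeps only "guitar" and "violin", while the one-stage variant also conditions on the off-screen "dog" and the silent "drum", which is the kind of background condition the hypothesis above refers to.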

8 Qualitative Comparisons↩︎

We present additional qualitative comparisons of the proposed T-VSL with SOTA methods on challenging multi-source localization in Figure 4. Consistent with our prior observation, the proposed T-VSL generates more precise localization maps than the other SOTA baselines. In addition, T-VSL can selectively exclude non-sounding silent objects, whereas other baselines struggle to separate the silent objects present in the surroundings. These qualitative results demonstrate the effectiveness of T-VSL in disentangling multi-source mixtures as well as in learning audio-visual correspondence for sound source localization.



  1. Since many videos are no longer available for public use, these datasets become smaller. We use the same data split for all baselines and reproduce them under the same setting for a fair comparison.↩︎