4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Wanhua Li\(^{1,\ast}\), Renping Zhou\(^{1, 2,\ast}\), Jiawei Zhou\(^{3}\), Yingwei Song\(^{1,4}\), Johannes Herter\(^{1,5}\),
Minghan Qin\(^{2}\), Gao Huang\(^{2}\), Hanspeter Pfister\(^{1}\)
\(^1\)Harvard University \(^2\)Tsinghua University \(^3\)Stony Brook University \(^4\)Brown University \(^5\)ETH Zürich
Project page: https://4d-langsplat.github.io/


Abstract

Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.

1 Introduction↩︎

The ability to construct a language field [1], [2] that supports open-vocabulary queries holds significant promise for various applications such as robotic navigation [3], 3D scene editing [4], and interactive virtual environments [5]. Due to the scarcity of large-scale 3D datasets with rich language annotations, current methods [1], [5], [6] leverage pre-trained models like CLIP [7] to extract pixel-wise features, which are then mapped to 3D space. Among them, LangSplat [2] has received increasing attention for its efficiency and accuracy: it grounds the precise masks generated by the Segment Anything Model (SAM) [8], together with their CLIP features, into 3D Gaussians, achieving an accurate and efficient 3D language field by leveraging 3D Gaussian Splatting (3D-GS) [9]. LangSplat supports open-vocabulary queries at multiple granularities by learning three SAM-defined semantic levels.

Nothing endures but change. Real-world 3D scenes are rarely static; they continuously change and evolve. To enable open-vocabulary queries in dynamic 4D scenes, it is crucial to consider that target objects may be in motion or transformation. For instance, querying a dynamic environment for “dog” may involve the dog running, jumping, or interacting with other elements. Beyond spatial changes, users may also issue time-related queries, such as “running dog”, which should only return results during the time segments when the dog is indeed running. Therefore, supporting both time-agnostic and time-sensitive queries within a 4D language field is essential for realistic applications.

A straightforward approach to extending LangSplat to a 4D scene is to learn a deformable Gaussian field [10]–[12] with CLIP features. However, this cannot model dynamic, time-evolving semantics, as CLIP, designed for static image-text matching [13], [14], struggles to capture temporal information such as state changes, actions, and object conditions [15], [16]. Learning a precise 4D language field would require pixel-aligned, object-level video features as 2D supervision to capture the spatiotemporal semantics of each object in a scene, yet current vision models [17], [18] predominantly extract global, video-level features. One could instead crop the objects of interest and extract patch features, but this inevitably includes background information, leading to imprecise semantic features [2]. Removing the background and extracting vision features only from the foreground object with accurate object masks creates ambiguity between object and camera motion, since the foreground object is then visible without any reference to the background context. These issues pose significant challenges for building an accurate and efficient 4D language field.

To address these challenges, we propose 4D LangSplat, which constructs a precise and efficient 4D language Gaussian field to support time-agnostic and time-sensitive open-vocabulary queries. We first train a 4D Gaussian Splatting (4D-GS) [10] model to reconstruct the RGB scene, which is represented by a group of Gaussian points and a deformable decoder defining how each Gaussian point changes its location and shape over time. Our 4D LangSplat then enhances each Gaussian in 4D-GS with two language fields: one learns a time-invariant semantic field with CLIP features, as in LangSplat, and the other learns a time-varying semantic field to capture dynamic semantics. The time-invariant semantic field encodes semantic information that does not change over time, such as “human”, “cup”, and “dog”. It is learned with CLIP features on three SAM-defined semantic levels.

For the time-varying semantic field, instead of learning from vision features, we propose to learn directly from textual features to capture temporally dynamic semantics. Recent years have witnessed huge progress [19], [20] in Multimodal Large Language Models (MLLMs), which take multimodal input, including images, video, and text, and generate coherent responses. Encouraged by the success of MLLMs, we propose a multimodal object-wise video prompting method that combines visual and text prompts to guide MLLMs in generating detailed, temporally consistent, high-quality captions for each object throughout a video. We then encode these captions using a large language model (LLM) to extract sentence embeddings, creating pixel-aligned, object-level features that serve as supervision for the 4D language field. Recognizing the smooth transitions exhibited by objects across states in 4D scenes, we further introduce a status deformable network to model these continuous state changes effectively over time. Our network captures the gradual transitions across object states, enhancing the model’s temporal consistency and improving its handling of dynamic scenes. Figure [fig:teaser] visualizes the learned time-varying semantic field. Our experiments across multiple benchmarks validate that 4D LangSplat achieves precise and efficient results, supporting both time-agnostic and time-sensitive open-vocabulary queries in dynamic, real-world environments.

In summary, our contributions are threefold:

  • We introduce 4D LangSplat for open-vocabulary 4D spatial-temporal queries. To the best of our knowledge, we are the first to construct 4D language fields with object textual captions generated by MLLMs.

  • To model the smooth transitions across states for objects in 4D scenes, we further propose a status deformable network to capture continuous temporal changes.

  • Experimental results show that our method attains state-of-the-art performance for both time-agnostic and time-sensitive open-vocabulary queries.

2 Related Work↩︎

3D Gaussian Splatting. 3D-GS [9] is a powerful volumetric rendering technique that has gained attention for its real-time, high-quality rendering ability. It represents complex surfaces and scenes by projecting 3D Gaussian distributions into 2D image space. It has been widely used for many applications such as human reconstruction [21], [22], 3D editing [23], [24], mesh extraction [25], [26], and autonomous driving [27], [28]. Recent work [11], [12], [29], [30], including 4D Gaussian Splatting (4D-GS) [10], has extended Gaussian Splatting to 4D by introducing deformable fields, allowing for dynamic scenes where Gaussian parameters evolve over time to capture both spatial and temporal transformations. However, 4D-GS primarily focuses on visual fidelity rather than semantic understanding, which limits its applicability to open-vocabulary language queries.

3D Language Field. Early works [4], [31] typically ground 2D foundation model features [7], [32], [33] into a neural radiance field (NeRF) [34]. For example, Distilled Feature Fields (DFFs) distill CLIP-LSeg [33] into NeRF for semantic scene editing, and LERF [1] distills CLIP [7] features into NeRF to support open-vocabulary 3D querying. With the emergence of 3D-GS, many methods [35]–[38] adopt 3D-GS as the 3D scene representation and lift 2D foundation model features into 3D Gaussians. Among them, LangSplat [2] attains precise and efficient language fields due to the introduction of SAM masks. By incorporating multiple levels of semantic granularity, LangSplat effectively supports open-vocabulary queries across whole objects, parts, and subparts. Although significant advances have been made in 3D language fields, 4D language fields for dynamic scenes remain largely unexplored, which is the focus of this paper.

Multimodal Large Language Models. The remarkable success of LLMs [39]–[42] has shown their ability to perform new tasks [43] following human instructions. Built on LLMs, research on MLLMs [44]–[46] explores multimodal chat ability [47], which represents a significant step forward in integrating visual and textual modalities for complex scene understanding. MLLMs usually employ a vision encoder to extract visual features and learn a connector to align the visual features with the LLM. Recent models [48]–[50] demonstrate remarkable capabilities in generating coherent captions from multimodal inputs, including images and videos. In this paper, we propose to utilize the powerful multimodal processing ability of MLLMs to convert video data into object-level captions, which are then used to train a 4D language field.

3 Method↩︎

3.1 Preliminaries↩︎

3D Gaussian Splatting. In 3D-GS [9], a scene is represented as a set of 3D Gaussian points. Each pixel in a 2D image is computed by blending \(N\) sorted 3D Gaussian points that overlap the pixel: \[C = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \label{eq:rendering_3dgs}\tag{1}\] where \(c_i\) and \(\alpha_i\) are the color and density of the \(i\)-th Gaussian.

LangSplat. Building upon 3D-GS, LangSplat [2] grounds 2D CLIP features into 3D Gaussians. To obtain a precise field, SAM is used to generate accurate object masks, and CLIP features are then extracted from the masked objects. LangSplat adopts feature splatting to train the 3D language field: \[\boldsymbol{F} = \sum_{i =1}^{N} \boldsymbol{f}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \label{eq:rendering_langsplat}\tag{2}\] where \(\boldsymbol{f}_i\) is the language feature of the \(i\)-th Gaussian and \(\boldsymbol{F}\) is the rendered feature on the 2D image.
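Equations (1) and (2) share the same front-to-back alpha compositing; only the blended attribute differs (color versus language feature). Below is a minimal PyTorch sketch of this compositing for a single pixel, assuming the \(N\) overlapping Gaussians are already depth-sorted; it is illustrative only and not the tile-based CUDA rasterizer used by 3D-GS.

```python
import torch

def composite_front_to_back(attrs: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Blend per-Gaussian attributes along a ray: colors c_i for Eq. (1),
    language features f_i for Eq. (2). attrs: (N, D) depth-sorted, alphas: (N,)."""
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumulative product.
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance                 # blending weight of each Gaussian
    return (weights[:, None] * attrs).sum(dim=0)     # rendered color C or feature F

# Toy usage: four Gaussians overlapping one pixel with 3-channel colors.
colors = torch.rand(4, 3)
alphas = torch.tensor([0.6, 0.3, 0.5, 0.2])
pixel_rgb = composite_front_to_back(colors, alphas)
```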

4D Gaussian Splatting. 4D-GS [10] extends 3D-GS to dynamic scenes by introducing a deformable Gaussian field. Here, Gaussian parameters, including position, rotation, and scaling, are allowed to vary over time: \[(\mathcal{X}', r', s')= (\mathcal{X} + \Delta\mathcal{X}, r + \Delta r, s + \Delta s), \label{eq:rendering_4dgs}\tag{3}\] where \(\mathcal{X}\), \(r\), and \(s\) represent the position, rotation, and scaling parameters, respectively, and \(\Delta\mathcal{X}\), \(\Delta r\), and \(\Delta s\) denote the offsets predicted by the corresponding deformation networks, implemented as lightweight MLPs. The HexPlane [51], [52] representation is used to obtain rich spatio-temporal features for each Gaussian.
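The deformation in Eq. (3) can be read as three small prediction heads on top of a shared spatio-temporal feature. The sketch below assumes the per-Gaussian HexPlane feature \(h(\mathcal{X}, t)\) has already been queried; the layer sizes and the quaternion parameterization of rotation are illustrative assumptions, not the exact 4D-GS architecture.

```python
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """Minimal sketch of 4D-GS-style deformation heads (Eq. 3), given a
    per-Gaussian spatio-temporal feature h(X, t) queried from HexPlane."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.head_xyz = nn.Linear(hidden, 3)    # Delta X (position offset)
        self.head_rot = nn.Linear(hidden, 4)    # Delta r (quaternion offset, assumed)
        self.head_scale = nn.Linear(hidden, 3)  # Delta s (scale offset)

    def forward(self, h, xyz, rot, scale):
        z = self.trunk(h)
        return (xyz + self.head_xyz(z),
                rot + self.head_rot(z),
                scale + self.head_scale(z))
```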

A straightforward approach to adapting LangSplat to 4D scenes is to extend its static 3D language Gaussian field with a deformable Gaussian field, as done in 4D-GS. However, this approach faces significant limitations due to the nature of CLIP features. CLIP [7] is designed primarily for static image-text alignment, making it ill-suited for capturing dynamic, time-evolving semantics in video. Recent research [15], [16], [53] further confirms that it struggles with understanding state changes, actions, object conditions, and temporal context. A precise 4D language field requires pixel-aligned, object-level features that track temporal semantics with fine-grained detail for each object in a scene. However, existing vision models [17], [18] primarily offer global, video-level features that overlook object-level information, making it difficult to represent spatiotemporal semantics at the object level. While cropping objects and extracting patch-based features is possible, doing so includes background information and leads to inaccurate language fields. Further cropping objects with accurate masks makes it difficult for vision models to distinguish between object movement and camera motion, as there is no background reference.

Figure 1: The framework of constructing a time-varying semantic field in 4D LangSplat. We first use multimodal object-wise prompting to convert a video into pixel-aligned object-level caption features. Then, we learn a 4D language field with a status deformable network.

3.2 4D LangSplat Framework↩︎

To address these challenges, we introduce 4D LangSplat, which constructs accurate and efficient 4D language fields to support both time-sensitive and time-agnostic open-vocabulary queries in dynamic scenes. We first reconstruct the 4D dynamic RGB scene using 4D-GS [10]. In this stage, the RGB scene is represented by a set of deformable Gaussian points, each with parameters that adjust over time to capture object movement and shape transformations within the scene. Building on the learned 4D-GS model, we extend each Gaussian point with language embeddings to learn 4D language fields. To capture both temporal and spatial details, and to handle both time-sensitive and time-agnostic queries effectively, we simultaneously construct two types of semantic fields: a time-agnostic semantic field and a time-varying semantic field. The time-agnostic semantic field captures semantic information that does not change over time. Although objects in the scene are dynamic, they still exhibit attributes that remain constant across time, such as static properties of entities like “dog”, “human”, and other objects within the environment. This semantic field emphasizes the spatial details of these time-agnostic semantics. Conversely, the time-varying semantic field captures temporally dynamic semantics, such as “a running dog”, emphasizing semantic transitions over time.

For the time-agnostic semantic field, we still use CLIP features and lift them to 4D space, as they are sufficient for capturing time-agnostic semantics. Specifically, we learn a static language embedding for each deformable Gaussian point in the 4D-GS model. Similar to LangSplat, we utilize SAM’s hierarchical segmentation masks, learning three distinct time-agnostic semantic fields corresponding to the three levels of semantic granularity provided by SAM. Although each Gaussian point’s position and shape dynamically change over time, its semantic feature remains static. These static embeddings ensure spatial accuracy while focusing on stable semantic information derived from CLIP features. On the other hand, to learn the time-varying semantic field, we propose a novel approach that bypasses the limitations of vision-based feature supervision. Instead, visual data is converted into object-level captions by leveraging MLLMs. These captions are then encoded using an LLM to extract sentence embeddings, which are used as pixel-aligned, object-level features for training the semantic field. To effectively model the smooth, continuous transitions of Gaussian points between a limited set of states, we further introduce a status deformable network to enhance reconstruction quality. The framework of training time-varying 4D fields is illustrated in Figure 1.

3.3 Multimodal Object-Wise Video Prompting↩︎

Constructing a high-quality, dynamic 4D semantic field requires detailed, pixel-aligned object-level features that capture time-evolving semantics in video data. However, obtaining these fine-grained visual features is challenging due to the limitations of current vision models in distinguishing object-level details over time. To overcome this, we propose converting video segments into object-wise captions and extracting sentence embeddings from these captions to serve as precise, temporally consistent features.

Advances in MLLMs like GPT-4o [44], LLaVA-OneVision [50], and Qwen2-VL [48] enable high-quality language generation from multimodal inputs. These models process video, image, and text inputs to generate temporally consistent responses. Leveraging these capabilities, we propose a multimodal object-wise video prompting method, which combines visual and textual prompts to guide the MLLM in generating temporally consistent, object-specific, high-quality captions across video frames, encapsulating both spatial and temporal details.

Formally, let \(V = \{I_1, I_2, \dots, I_T\}\) be a video segment of \(T\) frames. For each frame, we apply SAM [8] in conjunction with DEVA tracking [54] to segment objects and maintain consistent object identities over time. This process yields temporally consistent masks for the \(n\) objects present in the video, denoted as \(\{M_1, M_2, \dots, M_n\}\), where each mask \(M_i\) represents a specific object tracked across frames. Each frame \(I_t\) is thus associated with the object masks \(\{M_{1,t}, M_{2,t}, \dots, M_{n,t}\}\) at time step \(t\).

To generate instance-wise, object-specific captions while preserving the broader scene context, we need to guide the MLLM through precise prompting. Our goal is for the MLLM to generate captions focused solely on the target object without introducing details of other objects. However, the presence of other objects as background reference remains essential; without this context, the MLLM may lose track of spatial relationships and environmental context, which are critical for understanding the action and status of the target object. Thus, our approach employs prompting techniques to direct the MLLM’s attention to each object, enabling region-specific captioning that maintains overall scene awareness. Inspired by recent progress in visual prompting [55]–[57], we first use visual prompts to highlight the object of interest. Specifically, we build a visual prompt \(\mathcal{P}_{i,t}\) for each object \(i\) in frame \(I_t\): \[\mathcal{P}_{i,t} = \operatorname{Contour}(M_{i,t}) \cup \operatorname{Gray}( M_{i,t}) \cup \operatorname{Blur}( M_{i,t}), \label{eq:visualprompt}\tag{4}\] where \(\operatorname{Contour}(M_{i,t})\) highlights \(M_{i,t}\) with a red contour, \(\operatorname{Gray}(M_{i,t})\) converts the non-object area to grayscale, and \(\operatorname{Blur}( M_{i,t})\) applies a Gaussian blur to the background pixels. This prompt preserves essential background information while keeping the focus on the object of interest, improving the MLLM’s attention to the relevant target.
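A possible implementation of the visual prompt in Eq. (4) with OpenCV is sketched below. The contour thickness, blur strength, and highlight color are assumptions here (the actual settings appear in the Appendix); the mask is a boolean array from SAM/DEVA.

```python
import cv2
import numpy as np

def build_visual_prompt(frame_bgr: np.ndarray, mask: np.ndarray,
                        blur_sigma: float = 10.0, thickness: int = 2) -> np.ndarray:
    """Eq. (4) sketch: red contour around the target object, grayscale and
    Gaussian-blurred background. frame_bgr: (H, W, 3) uint8, mask: (H, W) bool."""
    # Gray() + Blur(): gray-and-blurred version of the frame, used outside the mask.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    background = cv2.GaussianBlur(cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR), (0, 0), blur_sigma)
    out = np.where(mask[..., None], frame_bgr, background)
    # Contour(): draw a red outline (BGR order) around the object mask.
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(out, contours, -1, color=(0, 0, 255), thickness=thickness)
    return out
```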

For temporal coherence, we first generate a high-level, video-level motion description for object \(i\), denoted as \(\mathcal{D}_i\), which summarizes the motion dynamics over the \(T\) frames. This description is obtained by prompting the MLLM with the entire video sequence \(V\) to capture object motion and interactions: \[\mathcal{D}_i = \operatorname{MLLM}(\{\mathcal{P}_{i,1},..., \mathcal{P}_{i,T}\},\mathcal{T}_{\text{video}}, V ), \label{eq:videocaption}\tag{5}\] where \(\mathcal{T}_{\text{video}}\) denotes the textual prompt that instructs the MLLM to generate video-level motion descriptions based on the visual prompts. This description \(\mathcal{D}_i\) is then used as context for generating frame-specific captions. For each frame \(I_t\), we combine \(\mathcal{D}_i\) with the visual prompt \(\mathcal{P}_{i,t}\) to generate a time-specific caption \(C_{i,t}\), capturing both the temporal and contextual details for object \(i\) in frame \(I_t\): \[C_{i,t} = \operatorname{MLLM}(\mathcal{D}_i, \mathcal{P}_{i,t}, \mathcal{T}_{\text{frame}}, I_t ), \label{eq:framecaption}\tag{6}\] where \(\mathcal{T}_{\text{frame}}\) denotes the textual prompt that instructs the MLLM to generate an object caption describing the object’s current action and status at the specific time step.
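The two prompting calls in Eqs. (5) and (6) amount to a simple two-pass loop per object. The sketch below illustrates only this control flow; `mllm_generate` is a hypothetical placeholder for whatever MLLM interface is used (e.g., a Qwen2-VL chat call), and the prompt strings stand in for \(\mathcal{T}_{\text{video}}\) and \(\mathcal{T}_{\text{frame}}\).

```python
from typing import List

def mllm_generate(images: List, text: str) -> str:
    """Hypothetical MLLM call; the real interface depends on the serving stack."""
    raise NotImplementedError

def caption_object(prompted_frames: List, video_text_prompt: str,
                   frame_text_prompt: str) -> List[str]:
    """Two-step prompting sketch for one object i (Eqs. 5-6)."""
    # Eq. (5): video-level motion description D_i from all visually prompted frames.
    description = mllm_generate(prompted_frames, video_text_prompt)
    captions = []
    for frame in prompted_frames:
        # Eq. (6): frame-level caption C_{i,t} conditioned on D_i.
        prompt = f"{frame_text_prompt}\nVideo-level context: {description}"
        captions.append(mllm_generate([frame], prompt))
    return captions
```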

Each caption \(C_{i,t}\) provides semantic information for object \(i\) at time \(t\). To encode this semantic information into features for training the 4D language field, we extract a sentence embedding \(\boldsymbol{e}_{i,t}\) for each caption \(C_{i,t}\). As LLMs exhibit strong processing ability for free-form text [40], [58], we use an LLM fine-tuned for sentence embedding tasks [59] to extract these features. This design choice allows our model to respond effectively to open-vocabulary queries, as the embeddings are generated within a shared language space that aligns with natural language queries. Thus, for every pixel \((x, y) \in M_{i,t}\) within object \(i\)’s mask in frame \(I_t\), the feature \(\boldsymbol{F}_{x,y,t}\) is given by: \[\boldsymbol{F}_{x,y,t} = \boldsymbol{e}_{i,t}, \label{eq:featurelabels}\tag{7}\] where the embeddings \(\boldsymbol{F}_{x,y,t}\) serve as 2D supervision for the time-varying semantic field, providing pixel-aligned, object-wise features across frames.
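Concretely, Eq. (7) paints each object's caption embedding onto every pixel of its mask to form a per-frame feature map. A minimal NumPy sketch, assuming the masks and embeddings for time step \(t\) are already stacked into arrays:

```python
import numpy as np

def captions_to_feature_map(masks_t: np.ndarray, embeddings_t: np.ndarray,
                            height: int, width: int) -> np.ndarray:
    """Eq. (7) sketch: assign each object's caption embedding e_{i,t} to every pixel
    of its mask M_{i,t}. masks_t: (n, H, W) bool, embeddings_t: (n, D)."""
    n, dim = embeddings_t.shape
    feat = np.zeros((height, width, dim), dtype=np.float32)
    for i in range(n):
        # Pixels not covered by any mask keep a zero feature; overlapping masks
        # (if any) are overwritten by the later object in this simple sketch.
        feat[masks_t[i]] = embeddings_t[i]
    return feat
```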

3.4 Status Deformable Network↩︎

With the 2D semantic feature supervision available, we use it to train a 4D field. A straightforward approach, analogous to the method used in 4D-GS, would be to directly learn a deformation field \(\Delta \boldsymbol{f}\) for the semantic features of deformable Gaussian points. However, this allows the semantic features of each Gaussian point to change to any arbitrary semantic state, increasing the learning complexity and compromising the temporal consistency of the features. In real-world dynamic scenes, each Gaussian point typically exhibits a gradual transition between a limited set of semantic states. For instance, an object like a person may transition smoothly among a finite set of actions (e.g., standing, walking, running), rather than shifting to entirely unrelated semantic states. To model these smooth transitions and maintain a stable 4D semantic field, we propose a status deformable network that restricts each Gaussian point’s semantic features to evolve within a predefined set of states.

Specifically, we represent the semantic feature of a Gaussian point \(i\) at any time \(t\) as a linear combination of \(K\) state prototype features, \(\{\boldsymbol{S}_{i,1}, \boldsymbol{S}_{i,2}, \dots, \boldsymbol{S}_{i,K}\}\), where each state captures a specific, distinct semantic meaning. The semantic feature \(\boldsymbol{f}_{i, t}\) of a Gaussian point \(i\) at time \(t\) is: \[\boldsymbol{f}_{i, t} = \sum_{k=1}^{K} w_{i,t,k} \boldsymbol{S}_{i,k}, \label{eq:status95deformable95combination}\tag{8}\] where \(w_{i,t,k}\) denotes the weighting coefficient for each state \(k\) at time \(t\), with \(\sum_{k=1}^{K} w_{i,t,k} = 1\). This linear combination ensures that each Gaussian point’s semantic features transition gradually between predefined states.

To determine the appropriate weighting coefficients \(w_{i,t,k}\) for each Gaussian point over time, we employ an MLP decoder \(\phi\). This MLP takes as input the spatial-temporal features from HexPlane [51] and predicts weighting coefficients that reflect the temporal progression of semantic states. The MLP decoder \(\phi\) and the per-Gaussian states \(\{\boldsymbol{S}_{i,1}, \boldsymbol{S}_{i,2}, \dots, \boldsymbol{S}_{i,K}\}\) are jointly trained. This design ensures that the status deformable network adapts to both the spatial and temporal context, enabling smooth, consistent transitions among semantic states.
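A minimal PyTorch sketch of this idea is given below. It assumes a softmax is used to satisfy the constraint \(\sum_k w_{i,t,k}=1\) and uses illustrative feature and hidden dimensions; the HexPlane query producing the per-Gaussian spatio-temporal feature is abstracted away.

```python
import torch
import torch.nn as nn

class StatusDeformableNetwork(nn.Module):
    """Sketch of Eq. (8): per-Gaussian state prototypes S_{i,k} and an MLP decoder
    that maps HexPlane features to weights w_{i,t,k}."""
    def __init__(self, num_gaussians: int, num_states: int = 3,
                 feat_dim: int = 64, embed_dim: int = 6, hidden: int = 128):
        super().__init__()
        # K learnable state prototypes per Gaussian (compressed language features).
        self.states = nn.Parameter(torch.randn(num_gaussians, num_states, embed_dim))
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_states))

    def forward(self, hexplane_feat: torch.Tensor) -> torch.Tensor:
        # hexplane_feat: (num_gaussians, feat_dim) queried at (x_i, t).
        weights = torch.softmax(self.mlp(hexplane_feat), dim=-1)   # (N, K), rows sum to 1
        # f_{i,t} = sum_k w_{i,t,k} * S_{i,k}
        return torch.einsum("nk,nkd->nd", weights, self.states)    # (N, embed_dim)
```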

3.5 Open-vocabulary 4D Querying↩︎

After training, 4D LangSplat enables both time-agnostic and time-sensitive open-vocabulary queries. For time-agnostic queries, we utilize only the time-agnostic semantic field. We first render a feature image and then compute the relevance score [1] between this rendered feature image and the query. Following the post-processing strategy in LangSplat [2], we obtain the segmentation mask for each frame from the relevance score maps.

For time-sensitive queries, we combine both the time-agnostic and time-sensitive semantic fields. First, the time-agnostic semantic field is used to derive an initial mask for each frame, following the same procedure described above. This mask identifies where the queried object or entity exists, irrespective of time. To refine the query to the specific time segments where the queried term is active (e.g., an action occurring within a particular timeframe), we calculate the cosine similarity between the time-sensitive semantic field on the initial mask region and the query text. This similarity is computed within the masked region of each frame to determine when the time-sensitive characteristics of the query term are most strongly represented. Using the mean cosine similarity across the entire video as a threshold, we identify the frames that exceed this threshold, indicating relevant time segments. The spatial mask obtained with the time-agnostic field is retained as the final mask prediction for the identified time segments.
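This selection step reduces to per-frame cosine similarities thresholded by their video-wide mean. A sketch, assuming the rendered time-varying feature maps and the time-agnostic masks are already available and every frame has a non-empty mask:

```python
import torch

def select_relevant_frames(field_feats: torch.Tensor, query_feat: torch.Tensor,
                           masks: torch.Tensor) -> torch.Tensor:
    """Per-frame cosine similarity between rendered time-varying features inside the
    time-agnostic mask and the query embedding, thresholded by the video-wide mean.
    field_feats: (T, H, W, D), query_feat: (D,), masks: (T, H, W) bool."""
    sims = []
    for t in range(field_feats.shape[0]):
        region = field_feats[t][masks[t]]                                   # (P_t, D)
        sims.append(torch.cosine_similarity(region, query_feat[None], dim=-1).mean())
    sims = torch.stack(sims)                                                # (T,)
    return sims > sims.mean()            # boolean indicator of relevant time segments
```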

This combination of time-agnostic and time-sensitive semantic fields enables accurate and efficient spatiotemporal querying, allowing 4D LangSplat to capture both the persistent and dynamic characteristics of objects in the scene.

4 Experiment↩︎

Table 1: Quantitative comparisons of time-sensitive querying on the HyperNeRF [60] dataset. Each cell reports Acc(%) / vIoU(%).

| Method | americano | chickchicken | split-cookie | espresso | Average |
|---|---|---|---|---|---|
| LangSplat [2] | 45.19 / 23.16 | 53.26 / 18.20 | 73.58 / 33.08 | 44.03 / 16.15 | 54.01 / 22.65 |
| Deformable CLIP | 60.57 / 39.96 | 52.17 / 42.77 | 89.62 / 75.28 | 44.85 / 20.86 | 61.80 / 44.72 |
| Non-Status Field | 83.65 / 59.59 | 94.56 / 86.28 | 91.50 / 78.46 | 78.60 / 47.95 | 87.58 / 68.57 |
| Ours | 89.42 / 66.07 | 96.73 / 90.62 | 95.28 / 83.14 | 81.89 / 49.20 | 90.83 / 72.26 |
Table 2: Quantitative comparisons of time-agnostic querying on the HyperNeRF [60] and Neu3D [61] datasets (numbers in %).

| Method | HyperNeRF mIoU | HyperNeRF mAcc | Neu3D mIoU | Neu3D mAcc |
|---|---|---|---|---|
| Feature-3DGS [36] | 36.63 | 74.02 | 34.96 | 87.12 |
| Gaussian Grouping [37] | 50.49 | 80.92 | 49.93 | 95.05 |
| LangSplat [2] | 74.92 | 97.72 | 61.49 | 91.89 |
| Ours | 82.48 | 98.01 | 85.11 | 98.32 |
[Image: ablation results referenced in Section 4.3 (Tables [tab:aba-visual-prompt] and [tab:aba-text-prompt]); original caption: “Comparisons of Text prompts.”]

Table 3: Results for different state numbers on chickchicken.

| K | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| Acc (%) | 94.56 | 97.82 | 95.65 | 94.56 | 94.56 |
| vIoU (%) | 88.05 | 91.93 | 89.11 | 88.98 | 86.28 |

4.1 Setup↩︎

Datasets. We conduct evaluations using two widely adopted datasets: HyperNeRF [60] and Neu3D [61]. Given the absence of semantic segmentation annotations for dynamic scenes in these datasets, we perform manual annotations to facilitate evaluation. More details regarding this process are provided in the Appendix (Section 6).

Implementation Details. All experiments are conducted on a single Nvidia A100 GPU. For extracting CLIP features, we use the OpenCLIP ViT-B/16 model. For dynamic semantics, we leverage the Qwen2-VL-7B model as the backbone MLLM to generate time-varying captions, and use e5-mistral-7b [59] to encode them into embeddings. Following LangSplat [2], we also train an autoencoder to compress the feature dimension: the CLIP features and the text features are compressed into 3 and 6 dimensions, respectively.

Figure 2: Visualization of time-sensitive querying results for Deformable CLIP and ours. The bottom row depicts the cosine similarity across frames, rescaled to (0,1) for direct comparison, while the horizontal bars indicate frames identified as relevant time segments. The CLIP-based method cannot correctly understand dynamic semantics, while our method recognizes them.

Figure 3: Comparison of time-sensitive query masks between Deformable CLIP and ours. The CLIP-based method fails to identify time segments accurately, especially at the demarcation points during state transitions.

Baselines. Since no models for 4D language feature rendering are publicly available, we use several 3D language feature rendering methods as baselines for evaluating time-agnostic querying, including LangSplat [2] and Feature-3DGS [36]. We also include segmentation-based techniques, such as Gaussian Grouping [37], to assess the quality of semantic mask generation in our approach. Inspired by Segment Any 4D Gaussians [62], we enhance Gaussian Grouping to adapt it to dynamic scenes.

Given the lack of dynamic language field rendering methods, we consider two additional baselines besides LangSplat for time-sensitive querying: Deformable CLIP and Non-Status Field. Deformable CLIP only utilizes the time-agnostic semantic fields of our method: it first trains a 4D-GS model to learn the dynamic RGB field and then learns static CLIP fields on top of the pre-trained RGB field. The Non-Status Field baseline utilizes both the time-agnostic and the time-sensitive semantic fields of our method but removes the status deformable network; instead, it directly learns a deformation field \(\Delta \boldsymbol{f}\).

Metrics. For time-agnostic querying, we evaluate performance using mean accuracy (mAcc) and mean intersection over union (mIoU), calculated across all frames in the test set. For time-sensitive querying, we evaluate temporal performance using an accuracy metric, defined as \(\mathrm{Acc} = {n_{correct}}/{n_{all}}\), where \(n_{correct}\) and \(n_{all}\) denote the number of correctly predicted frames and the total number of frames in the test set, respectively. To assess segmentation quality, we adopt the metric from [63] and define \(\mathrm{vIoU} = \frac{1}{|S_u|}\sum_{t\in S_i}\mathrm{IoU}(\hat{s}_t,s_t)\), where \(\hat{s}_t\) and \(s_t\) are the predicted and ground-truth masks at time \(t\), and \(S_u\) and \(S_i\) are the sets of frames in the temporal union and intersection of the predicted and ground-truth segments.
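For reference, a direct NumPy implementation of these two metrics (with masks stored per frame index, so that \(S_u\) and \(S_i\) are the set union and intersection of the predicted and ground-truth time segments) might look as follows:

```python
import numpy as np

def temporal_accuracy(pred_active: np.ndarray, gt_active: np.ndarray) -> float:
    """Acc = n_correct / n_all over per-frame active/inactive predictions."""
    return float((pred_active == gt_active).mean())

def video_iou(pred_masks: dict, gt_masks: dict) -> float:
    """vIoU: average spatial IoU over frames in the temporal intersection S_i,
    normalized by |S_u|. pred_masks/gt_masks map frame index -> boolean HxW mask."""
    s_union = set(pred_masks) | set(gt_masks)
    s_inter = set(pred_masks) & set(gt_masks)
    if not s_union:
        return 0.0
    total = 0.0
    for t in s_inter:
        inter = np.logical_and(pred_masks[t], gt_masks[t]).sum()
        union = np.logical_or(pred_masks[t], gt_masks[t]).sum()
        total += inter / union if union > 0 else 0.0
    return total / len(s_union)
```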

4.2 Main Results↩︎

Time-Agnostic Querying. Table 2 shows our results on two datasets. Our approach achieves the highest mIoU and mAcc scores, demonstrating strong segmentation performance across both datasets. In contrast, other methods struggle to capture object movement and shape changes, leading to worse performance on dynamic objects.

Time-Sensitive Querying. We perform dynamic querying on the HyperNeRF dataset, with Acc and vIoU results presented in Table 1. Our approach outperforms not only LangSplat but also the Deformable CLIP and Non-Status Field baselines. Specifically, our method achieves accuracy improvements of 29.03% and 3.25% and vIoU gains of 28.04% and 4.19% over these two baselines, respectively. Our multimodal object-wise video prompting method surpasses traditional CLIP-based techniques: compared to Deformable CLIP, our time-varying semantic field effectively integrates spatial and temporal information, ensuring fluid and coherent semantic state transitions and underscoring the importance of MLLM video prompting (Section 3.3). Compared to the Non-Status Field baseline, our approach highlights the significance of status modeling through the status deformable network (Section 3.4), which enhances the model’s capability to handle complex, evolving states and further solidifies the robustness and versatility of our method in capturing nuanced dynamics.

Visualization. To illustrate the learned time-sensitive language field, we apply PCA to reduce the dimensionality of the learned semantic features, producing a 3D visualization as shown in Figure [fig:teaser]. Our method better captures the dynamic semantic features of objects and renders consistent features accurately. In Figure 2, we illustrate the change in query-frame similarity scores over time for time-sensitive queries, comparing our approach to a CLIP-based method. As shown, CLIP, which is optimized for static image-text alignment, struggles to capture the most relevant time segments within dynamic video semantics, whereas our method successfully identifies these segments. In Figure 3, we present specific query masks. We observe that the CLIP-based approach fails to accurately capture time segments, especially at transition points in object states. For example, CLIP cannot reliably detect subtle transitions, such as when a cookie has just cracked or when a glass cup has started dripping coffee. In contrast, our method effectively identifies these nuanced changes, demonstrating its capability to handle dynamic state transitions accurately.

4.3 Ablation Studies↩︎

Multimodal Prompting. We evaluate the quality of the generated captions under different combinations of textual and visual prompting methods. To quantify this, we define a metric \(\Delta_{\mathrm{sim}}=\overline{score}_{\mathrm{pos}} - \overline{score}_{\mathrm{neg}}\), where \(\overline{score}_{\mathrm{pos}}\) and \(\overline{score}_{\mathrm{neg}}\) are the average cosine similarity scores between query and caption features, encoded by the e5 model, for positive and negative samples, respectively. A higher \(\Delta_{\mathrm{sim}}\) indicates a stronger distinction between positive and negative examples, suggesting that the generated captions more effectively capture the spatiotemporal dynamics and semantic features of objects in the scene. Table [tab:aba-visual-prompt] shows that utilizing all three visual prompting strategies maximizes the MLLM’s focus on target objects. As shown in Table [tab:aba-text-prompt], incorporating pre-generated video-level motion descriptions results in a 0.87% improvement. Furthermore, adding image prompts enables more accurate descriptions.
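The \(\Delta_{\mathrm{sim}}\) metric itself is a small computation; a PyTorch sketch, assuming the query and caption embeddings have already been produced by the e5 encoder:

```python
import torch

def delta_sim(query_emb: torch.Tensor, pos_caption_embs: torch.Tensor,
              neg_caption_embs: torch.Tensor) -> float:
    """Delta_sim = mean cosine similarity of the query to positive-sample caption
    embeddings minus that to negative-sample caption embeddings."""
    pos = torch.cosine_similarity(pos_caption_embs, query_emb[None], dim=-1).mean()
    neg = torch.cosine_similarity(neg_caption_embs, query_emb[None], dim=-1).mean()
    return float(pos - neg)
```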

State Numbers. Table 3 shows the ablation results for the number of states \(K\). We observe that an appropriate increase in \(K\) leads to better results, with \(K=3\) achieving the best performance; we therefore adopt \(K=3\) in our experiments.

5 Conclusion↩︎

We present 4D LangSplat, a novel approach to constructing a dynamic 4D language field that supports both time-agnostic and time-sensitive open-vocabulary queries within evolving scenes. Our method leverages MLLMs to produce high-quality, object-specific captions that capture temporally consistent semantics across video frames. This enables 4D LangSplat to overcome the limitations of traditional vision feature-based approaches, which struggle to generate precise, object-level features in dynamic contexts. By incorporating multimodal object-wise video prompting, we obtain pixel-aligned language embeddings as training supervision. Furthermore, we introduce a status deformable network, which enforces smooth, structured transitions across limited object states. Our experimental results across multiple benchmarks demonstrate that 4D LangSplat achieves state-of-the-art performance in dynamic scenarios.

Acknowledgements↩︎

The work is supported in part by the National Key R&D Program of China under Grant 2024YFB4708200 and National Natural Science Foundation of China under Grant U24B20173, and in part by US NIH grant R01HD104969.

6 Datasets↩︎

Since there are no publicly available ground truth segmentation mask labels for the HyperNeRF [60] and Neu3D [61] datasets, nor annotations tailored for time-sensitive querying, we adopt the annotation pipeline outlined in Segment Any 4D Gaussians [62] and manually annotate the mask labels ourselves. Specifically, we leverage the Roboflow platform alongside the SAM (Segment Anything Model) framework for interactive annotation.

For the HyperNeRF dataset, where data is captured with a monocular camera, we select one frame every four frames as the training set. From the remaining data, we annotate a subset as the test set to ensure no overlap between the two sets. For the Neu3D dataset with 21 camera views, one is reserved for testing, and the remaining 20 are used for training, aligning with the 4D-GS [10] setting. To evaluate on the Neu3D dataset, we annotate every 20 frames from the test views.

7 Implementation Details↩︎

Multimodal Object-Wise Video Prompting. For multimodal object-wise video prompting, we use the largest SAM-defined semantic level to produce the mask inputs for the Multimodal Large Language Model (MLLM). The prompting process is outlined in Table 4, which lists the specific prompts used. For visual prompting, we draw a red contour line with a radius of 2 to delineate object boundaries, apply a Gaussian blur with a radius of 10 to the background pixels, and convert the background to grayscale. These techniques enhance the effectiveness of the visual input during prompting.

Table 4: Details of text prompts.

Video prompt: “I highlighted the objects I want you to describe in red outline and blurred the objects that don’t need you to describe. First please determine the object highlighted in red line in the video. Then briefly summarize the transformation process of this object.”

Image prompt: “You have an understanding of the overall transformation process of the object: {video prompt}. Now, I have provided you with images extracted from this process. Please describe the specific state of the object(s) in the given image, without referring to the entire video process. Avoid describing states that you can’t infer directly from the picture. Avoid repeating descriptions in context. For example, if the context suggests the object is moving up and down but the image shows it is just moving down, explicitly only state that the object is in a moving down state. If the context suggests the object is breaking but the image shows it is complete right now, explicitly only state that the object appears to be complete. If context tells you something changes from green to blue, but it’s blue in this image, just state that the object is blue.”

Autoencoder. Following LangSplat [2], we employ two autoencoders to separately compress the high-dimensional CLIP features (512-dimensional) and LLM caption features (4096-dimensional). Specifically, two MLPs compress the 512-dimensional CLIP features and the 4096-dimensional caption features to 3 and 6 dimensions, respectively. The autoencoders are optimized with an L2 loss; to enhance stability, a cosine similarity loss is also included as regularization.
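A minimal sketch of such a compression autoencoder and its loss is shown below; the hidden width and the weight of the cosine regularizer are illustrative assumptions rather than the exact values used.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    """Sketch of the compression autoencoder: an MLP maps high-dimensional features
    (512-d CLIP or 4096-d caption embeddings) to a low-dimensional code (3-d or 6-d)
    and back."""
    def __init__(self, in_dim: int = 4096, code_dim: int = 6, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return code, self.decoder(code)

def reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """L2 reconstruction term plus a cosine-similarity regularizer (weight lam assumed)."""
    l2 = ((x - x_hat) ** 2).mean()
    cos = 1.0 - torch.cosine_similarity(x, x_hat, dim=-1).mean()
    return l2 + lam * cos
```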

Table 5: Query performance comparison.

| Method | FPS |
|---|---|
| Gaussian Grouping [37] | 1.47 |
| Ours-agnostic | 5.24 |
| Ours-sensitive | 4.05 |
Table 6: Comparison of mean IoU and mean accuracy for various methods on the HyperNeRF [60] dataset. Each cell reports mIoU(%) / mAcc(%).

| Method | americano | chickchicken | split-cookie |
|---|---|---|---|
| Feature-3DGS [36] | 34.65 / 62.96 | 47.21 / 87.22 | 47.03 / 68.25 |
| Gaussian Grouping [37] | 61.77 / 71.31 | 34.65 / 75.52 | 72.71 / 96.56 |
| LangSplat [2] | 72.08 / 97.61 | 75.98 / 97.86 | 76.54 / 97.32 |
| Ours | 83.48 / 98.77 | 86.50 / 98.81 | 90.04 / 98.67 |

| Method | espresso | keyboard | torchocolate |
|---|---|---|---|
| Feature-3DGS [36] | 24.04 / 80.13 | 42.14 / 80.98 | 24.71 / 64.58 |
| Gaussian Grouping [37] | 32.45 / 82.46 | 42.44 / 74.15 | 58.95 / 85.52 |
| LangSplat [2] | 82.93 / 98.66 | 72.42 / 96.75 | 69.55 / 98.09 |
| Ours | 83.52 / 97.95 | 79.53 / 95.71 | 71.79 / 98.10 |
Table 7: Comparison of mean IoU and mean accuracy for various methods on the Neu3D [61] dataset. Each cell reports mIoU(%) / mAcc(%).

| Method | coffee martini | cook spinach | cut roasted beef |
|---|---|---|---|
| Feature-3DGS [36] | 30.23 / 84.74 | 41.50 / 95.59 | 31.66 / 91.07 |
| Gaussian Grouping [37] | 71.37 / 97.34 | 46.45 / 93.79 | 54.70 / 93.25 |
| LangSplat [2] | 67.97 / 98.47 | 78.29 / 98.60 | 36.53 / 97.04 |
| Ours | 85.16 / 99.23 | 85.09 / 99.38 | 85.32 / 99.28 |

| Method | flame salmon | flame steak | sear steak |
|---|---|---|---|
| Feature-3DGS [36] | 54.33 / 77.13 | 27.27 / 88.23 | 24.78 / 85.94 |
| Gaussian Grouping [37] | 35.72 / 94.69 | 36.92 / 95.96 | 54.44 / 95.27 |
| LangSplat [2] | 66.01 / 82.16 | 64.05 / 97.77 | 78.29 / 98.60 |
| Ours | 89.88 / 94.35 | 88.44 / 98.27 | 76.78 / 99.38 |

Training Details. Our training pipeline is structured into four stages, progressively refining the model for robust performance in dynamic 4D language field construction. 1) In the initial stage, we train a static Gaussian field to reconstruct the RGB channels of the static scene, providing a foundation for modeling its visual appearance. 2) Next, we incorporate semantic information into the static Gaussian field without introducing deformable networks. Semantic features are embedded into the scene by minimizing an \(L_1\) loss, ensuring accurate representations of the static scene’s semantics. 3) In the third stage, we extend the model to dynamic RGB scenes by introducing non-semantic deformation fields. Following the approach of 4D-GS [10], we employ deformable networks to learn temporal and motion-based deformations that capture the spatial and temporal dynamics of the RGB scene. 4) For time-agnostic semantic rendering, we refine the semantic features from the second stage while keeping the deformable network parameters fixed. For time-sensitive semantic rendering, we jointly train the status deformable network and the state prototype features to model dynamic semantics effectively. For all datasets, the numbers of iterations for the four stages are 3000, 1000, 10000, and 10000. The learning rates for the deformable network and the state prototype features are set to \(1.6 \times 10^{-4}\) and \(2.5 \times 10^{-3}\), respectively. Other training parameters remain consistent with those used in 4D-GS.

8 More Quantitative Results↩︎

In Table 6 and Table 7, we present a detailed evaluation of time-agnostic querying performance on the HyperNeRF and Neu3D datasets, respectively. Our method achieves the highest mean IoU in nearly all scenes, outperforming the baseline methods in most scenes for both mean IoU and mean accuracy. These results underscore the robustness of our approach, demonstrating its ability to deliver superior segmentation accuracy and reliability compared to existing methods, even in dynamic scenes.

Table 5 further compares the runtime efficiency of our method with the baseline on the HyperNeRF dataset. The comparison encompasses the total time required for rendering semantic features and conducting open-vocabulary queries. Our method demonstrates significant advantages over the Gaussian Grouping approach, achieving faster runtime for both time-agnostic and time-sensitive queries. These findings validate our method as an efficient and scalable solution for handling open-vocabulary queries in dynamic 4D scenes.

Figure 4: Visualization of time-agnostic querying results on HyperNeRF [60] and Neu3D [61] datasets.

9 More Visualization Results↩︎

Figure 4 illustrates visualization results for time-agnostic querying. As depicted, our method demonstrates superior accuracy in capturing objects that correspond to semantic descriptions, compared to other methods. Furthermore, it effectively tracks the spatial dynamics of these objects across different temporal steps, showcasing its effectiveness in handling dynamic scenarios.

10 MLLM-based Embeddings↩︎

Since our method utilizes MLLMs to generate captions, the feature representation capability of the obtained embeddings is inherently limited by the capacity of the MLLMs, which constitutes a limitation of our approach. To verify that our MLLM-based embeddings indeed encode spatial-temporal information, we directly apply the MLLM-based embeddings, without any fine-tuning, to video classification and spatial-temporal action localization tasks using 2D videos. As shown in Tables 8 and 9, our results demonstrate that, even in a zero-shot setting, the MLLM-based embeddings achieve competitive performance compared to state-of-the-art (SOTA) methods specifically designed for these tasks. This indicates that MLLM-based embeddings inherently capture some spatial-temporal information. However, we also acknowledge that the performance of our approach is ultimately constrained by the representational capacity of the MLLMs.

Table 8: Accuracy results (%) on the video classification task.

| Method | HMDB51 [64] | UCF101 [65] | Kinetics400 [66] |
|---|---|---|---|
| MLLM | 58.34 | 78.97 | 55.14 |
| IMP [67] | 59.1 | 91.5 | 77.0 |

Table 9: Spatial-temporal action localization results (%) on UCF101 [65].

| Method | VmAP@0.1 | VmAP@0.2 | VmAP@0.5 |
|---|---|---|---|
| MLLM | 78.13 | 75.78 | 64.38 |
| HIT [68] | 86.1 | 88.8 | 74.3 |

References↩︎

[1]
J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: Language embedded radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19729–19739.
[2]
M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “LangSplat: 3D language Gaussian splatting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20051–20060.
[3]
C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 10608–10615.
[4]
S. Kobayashi, E. Matsumoto, and V. Sitzmann, “Decomposing nerf for editing via feature field distillation,” Advances in Neural Information Processing Systems, vol. 35, pp. 23311–23330, 2022.
[5]
K. Liu et al., “Weakly supervised 3d open-vocabulary segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 53433–53456, 2023.
[6]
W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in Conference on Robot Learning, PMLR, 2023, pp. 405–424.
[7]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[8]
A. Kirillov et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[9]
B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
[10]
G. Wu et al., “4d gaussian splatting for real-time dynamic scene rendering,” 2024, pp. 20310–20320.
[11]
Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, “Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20331–20341.
[12]
Z. Li, Z. Chen, Z. Li, and Y. Xu, “Spacetime Gaussian feature splatting for real-time dynamic view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8508–8520.
[13]
W. Li et al., “Ordinalclip: Learning rank prompts for language-guided ordinal regression,” Advances in Neural Information Processing Systems, vol. 35, pp. 35313–35325, 2022.
[14]
T. Ding, W. Li, Z. Miao, and H. Pfister, “Tree of attributes prompt learning for vision-language models,” arXiv preprint arXiv:2410.11201, 2024.
[15]
S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie, “Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9568–9578.
[16]
S. Tong et al., “Cambrian-1: A fully open, vision-centric exploration of multimodal llms,” arXiv preprint arXiv:2406.16860, 2024.
[17]
Y. Wang et al., “Internvid: A large-scale video-text dataset for multimodal understanding and generation,” arXiv preprint arXiv:2307.06942, 2023.
[18]
H. Xu et al., “Videoclip: Contrastive pre-training for zero-shot video-text understanding,” arXiv preprint arXiv:2109.14084, 2021.
[19]
OpenAI, “GPT-4V,” https://openai.com/index/gpt-4v-system-card/, 2023.
[20]
G. Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.
[21]
M. Li, S. Yao, Z. Xie, and K. Chen, “Gaussianbody: Clothed human reconstruction via 3d gaussian splatting,” arXiv preprint arXiv:2401.09720, 2024.
[22]
Z. Shao et al., “Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting,” 2024, pp. 1606–1616.
[23]
Y. Chen et al., “Gaussianeditor: Swift and controllable 3d editing with gaussian splatting,” 2024, pp. 21476–21485.
[24]
J. Wang, J. Fang, X. Zhang, L. Xie, and Q. Tian, “GaussianEditor: Editing 3D Gaussians delicately with text instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20902–20911.
[25]
J. Waczyńska, P. Borycki, S. Tadeja, J. Tabor, and P. Spurek, “Games: Mesh-based adapting and modification of gaussian splatting,” arXiv preprint arXiv:2402.01459, 2024.
[26]
L. Gao et al., “Mesh-based gaussian splatting for real-time large-scale deformation,” arXiv preprint arXiv:2402.04796, 2024.
[27]
X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang, “DrivingGaussian: Composite Gaussian splatting for surrounding dynamic autonomous driving scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21634–21643.
[28]
H. Zhou et al., “Hugs: Holistic urban 3d scene understanding via gaussian splatting,” 2024, pp. 21336–21345.
[29]
Y. Lin, Z. Dai, S. Zhu, and Y. Yao, “Gaussian-Flow: 4D reconstruction with dynamic 3D Gaussian particle,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21136–21145.
[30]
J. Bae, S. Kim, Y. Yun, H. Lee, G. Bang, and Y. Uh, “Per-gaussian embedding-based deformation for deformable 3D gaussian splatting,” arXiv preprint arXiv:2404.03613, 2024.
[31]
V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, “Neural feature fusion fields: 3D distillation of self-supervised 2D image representations,” in International Conference on 3D Vision (3DV), IEEE, 2022, pp. 443–453.
[32]
M. Caron et al., “Emerging properties in self-supervised vision transformers,” 2021, pp. 9650–9660.
[33]
B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” arXiv preprint arXiv:2201.03546, 2022.
[34]
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[35]
J.-C. Shi, M. Wang, H.-B. Duan, and S.-H. Guan, “Language embedded 3D Gaussians for open-vocabulary scene understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5333–5343.
[36]
S. Zhou et al., “Feature 3DGS: Supercharging 3D Gaussian splatting to enable distilled feature fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 21676–21685.
[37]
M. Ye, M. Danelljan, F. Yu, and L. Ke, “Gaussian grouping: Segment and edit anything in 3D scenes,” in European Conference on Computer Vision, Springer, 2025, pp. 162–179.
[38]
J. Yu et al., “Language-embedded gaussian splats (legs): Incrementally building room-scale representations with a mobile robot,” arXiv preprint arXiv:2409.18108, 2024.
[39]
W.-L. Chiang et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” https://vicuna.lmsys.org, 2023.
[40]
H. Touvron et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[41]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[42]
J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[43]
W. Li, Z. Meng, J. Zhou, D. Wei, C. Gan, and H. Pfister, “SocialGPT: Prompting LLMs for social relation reasoning via greedy segment optimization,” in Advances in Neural Information Processing Systems, vol. 37, 2024, pp. 2267–2291.
[44]
OpenAI, “Hello GPT-4o,” https://openai.com/index/hello-gpt-4o/, 2024.
[45]
J. Bai et al., “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, vol. 1, no. 2, p. 3, 2023.
[46]
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306.
[47]
G. Huang, “Dynamic neural networks: Advantages and challenges,” National Science Review, vol. 11, no. 8, p. nwae088, 2024.
[48]
P. Wang et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.
[49]
Z. Cheng et al., “VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs,” arXiv preprint arXiv:2406.07476, 2024.
[50]
B. Li et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[51]
A. Cao and J. Johnson, “HexPlane: A fast representation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 130–141.
[52]
S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-Planes: Explicit radiance fields in space, time, and appearance,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12479–12488.
[53]
M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021.
[54]
H. K. Cheng, S. W. Oh, B. Price, A. Schwing, and J.-Y. Lee, “Tracking anything with decoupled video segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1316–1326.
[55]
S. Subramanian, W. Merrill, T. Darrell, M. Gardner, S. Singh, and A. Rohrbach, “Reclip: A strong zero-shot baseline for referring expression comprehension,” arXiv preprint arXiv:2204.05991, 2022.
[56]
A. Shtedritski, C. Rupprecht, and A. Vedaldi, “What does CLIP know about a red circle? Visual prompt engineering for VLMs,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11987–11997.
[57]
L. Yang, Y. Wang, X. Li, X. Wang, and J. Yang, “Fine-grained visual prompting,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[58]
G. Team et al., “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024.
[59]
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei, “Improving text embeddings with large language models,” arXiv preprint arXiv:2401.00368, 2023.
[60]
K. Park et al., “Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields,” arXiv preprint arXiv:2106.13228, 2021.
[61]
T. Li et al., “Neural 3d video synthesis from multi-view video,” 2022, pp. 5521–5531.
[62]
S. Ji et al., “Segment any 4D gaussians,” arXiv preprint arXiv:2407.04504, 2024.
[63]
Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao, “Where does it exist: Spatio-temporal video grounding for multi-form sentences,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10668–10677.
[64]
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in IEEE International Conference on Computer Vision, 2011, pp. 2556–2563.
[65]
K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[66]
W. Kay et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[67]
H. Akbari, D. Kondratyuk, Y. Cui, R. Hornung, H. Wang, and H. Adam, “Alternating gradient descent and mixture-of-experts for integrated multimodal perception,” NeurIPS, 2023.
[68]
G. J. Faure, M.-H. Chen, and S.-H. Lai, “Holistic interaction transformer network for action detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.