Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Zongshang Pang\(^{1,2}\)  Mayu Otani\(^{2}\) Yuta Nakashima\(^{1}\)
\(^1\) Osaka University          \(^2\) CyberAgent, Inc.
{pangzs@is.ids, n-yuta@ids}.osaka-u.ac.jp
otani_mayu@cyberagent.co.jp


Abstract

Localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly handle temporal localization by adapting video LLMs to generate event boundary timestamps, an approach that struggles to leverage LLMs’ powerful semantic understanding. In this work, we introduce MeCo, a novel timestamp-free framework that enables video LLMs to fully harness their intrinsic semantic capabilities for temporal localization tasks. Rather than outputting boundary timestamps, MeCo partitions videos into holistic event and transition segments based on the proposed structural token generation and grounding pipeline, derived from video LLMs’ temporal structure understanding capability. We further propose a query-focused captioning task that compels the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and enhancing localization performance. Extensive experiments on diverse temporal localization tasks show that MeCo consistently outperforms boundary-centric methods, underscoring the benefits of a semantic-driven approach for temporal localization with video LLMs.\(^{1}\)

1 Introduction

Localizing important temporal events based on user interests is an essential capability for video recognition systems to handle practical video tasks such as moment retrieval [1][4], action localization [5][8], video summarization [9][12], and dense video captioning [13][16]. While specialist models were traditionally designed for each specific task, recent efforts have started leveraging video LLMs [17][24] to integrate these temporal localization tasks into a single framework [25][30].

Figure 1: In contrast to previous semantically poor, boundary-centric approaches [25][30], MeCo leverages video LLMs to capture the temporal structure and segment the video into transition and event segments. We also tune the model to perform query-focused captioning to scrutinize the detailed event semantics for more precise localization.

To enable video LLMs to perform temporal localization, current efforts primarily focus on adapting them to capture event boundaries through intricately designed timestamp representations and compatibility mechanisms [25][28], [30], [31]. However, this one-shot boundary timestamp generation approach overlooks the holistic temporal structure of videos, which provides essential context for forming event segments [32][34]. Moreover, it neglects a detailed analysis of the targets’ semantic content, which is crucial for successful visual search [35], [36]. Technically, LLMs are intrinsically ill-suited to generating highly uninformative outputs, such as boundary timestamps in temporal localization or mask polygons in image segmentation [37], and thus cannot fully address the target tasks unless they produce more semantically driven outputs [38][42]. In this work, we enhance temporal localization for video LLMs by leveraging their semantic understanding to capture the holistic temporal structure and inspect the target events for precise localization, rather than merely adapting them to output boundary timestamps. An intuitive illustration of the differences between our methodology and previous works is provided in Figure 1.

Specifically, we propose structural tokens that leverage video LLMs’ temporal structure understanding [17], [21], [23] for holistic segmentation. In our approach, the LLM generates two types of tokens, an event token and a transition token, in the temporal order of significant event segments and background transitions. This process requires the model to distinguish semantic differences between events and transitions and to capture the overall narrative flow for accurate event localization. To map the structural tokens to their corresponding video segments, we propose a structural token grounding module to directly manipulate the semantic similarities between their LLM hidden states via contrastive learning [43][45], building on recent findings that LLM hidden states contain rich discriminative information [40], [41], [46], [47].

While the structural tokens partition the video into consecutive segments, enabling straightforward localization of queried events, we hypothesize that making the model aware of the fine-grained semantic details of such events can further enhance performance. Moreover, many downstream tasks, such as grounded question answering, require deep semantic understanding as well. To facilitate this, we design a query-focused captioning task that requires the model to generate detailed captions for the queried event segments to refine structural token grounding. These captions capture rich, query-specific semantics that not only improve temporal localization but also bolster the LLM’s complex reasoning capabilities.

The proposed framework, named MeCo, enables video LLMs to Measure twice, by considering both holistic video structure and fine-grained event semantics, before Cutting once to extract all the queried event segments. This design fundamentally differs from boundary-centric approaches, making MeCo the first video LLM-based temporal localization framework that eliminates the need for generating any boundary-related information. Extensive comparisons show that MeCo outperforms previous video LLM-based temporal localization methods across grounding [4], [9], [10], [48][53], dense video captioning [54][57], and complex reasoning [4], [28], [50], [58] tasks.

2 Related Work

Video Temporal Localization Tasks. Video temporal localization tasks, such as moment retrieval [4], video summarization [9], [10], [12], [59], action localization [5][8], and dense video captioning [13], [16], require localizing salient event segments in response to a user-specified query or prompt, often with the need for precise event boundary timestamps. Furthermore, tasks such as dense video captioning and grounded video question answering [58], [60][62] also involve generating captions and performing complex reasoning about these localized events. Traditionally, these tasks have been addressed by specialist models with task-specific designs and domain-specific training data. Although unified models for localization-only tasks have been proposed [1][3], they cannot handle generative tasks like captioning.

Video LLMs. Early efforts to enable LLMs to perform video-level tasks used LLMs as agents built on chain-of-thought reasoning and tool-use mechanisms [63][66]. Advances in end-to-end multimodal pretraining [43], [67], [68] and instruction tuning [69][72] have led to the development of powerful video LLMs [17][24], [73][76]. Recent studies have shown that these models excel at temporal reasoning over very long videos, benefiting from LLMs’ long-context semantic retrieval and structural understanding abilities [21][24]. However, while they are effective for general video understanding tasks such as captioning and question answering, they do not address tasks that require event temporal localization.

Temporal Localization Video LLMs. Recent developments in temporal localization video LLMs have enabled unified approaches for both localization and generation tasks. Models such as TimeChat [25] and VTimeLLM [29] fine-tune pre-trained video LLMs to output numeric tokens that represent event boundary timestamps. Subsequent works [26], [27], [30], [31] augment the LLM’s vocabulary with learnable timestamp tokens. For example, VTGLLM unifies the timestamp token lengths via padding [27]. Building on this, TRACE utilizes a specialized timestamp encoder and decoder [30]. Observing that LLMs struggle with numeric tokens and a large number of newly introduced tokens, E.T.Chat [28] instead fine-tunes LLMs on a boundary embedding matching task using a single newly introduced token. As a result, current methods invariably concentrate on event segment boundaries, which, especially for longer events, fail to capture the rich semantic content essential for precise localization.

Building on these insights, we propose a novel framework that leverages LLMs’ intrinsic semantic understanding to overcome the limitations of boundary-centric methods. Instead of generating explicit timestamp boundaries, the proposed structural tokens partition videos into consecutive segments, which are further refined by query-focused captioning. This dual strategy, capturing both the overall temporal structure and fine-grained event semantics, represents a significant departure from existing approaches and delivers superior temporal localization performance.

3 Preliminary

Figure 2: An overview of the proposed MeCo framework. Given an input video and a localization-aware user prompt, MeCo generates structural tokens, including the event token <ent> and the transition token <tst>, to facilitate holistic temporal segmentation via structural token grounding. MeCo also generates query-focused captions, right before generating the <ent> token, to retrieve the semantic details in the queried segments for improving structural token-based localization performance.

Video temporal localization involves understanding user-specified events and determining their temporal boundaries. In simpler tasks [4], [6], [53], the model outputs start and end timestamps for each event, denoted as \(\xtime = \{(t_{i}^{s}, t_{i}^{e})\}^{M}_{i=1}\), where the value of \(M\) varies by task and dataset; e.g., some temporal grounding tasks [48] require \(M=1\), while extractive video summarization [9], [10] usually requires \(M>1\). More complicated tasks like dense video captioning [13] require the model to reason about which events to localize and generate responses \(\xtext = \{\vx_n\}_{n=1}^{N}\) to the user prompt [16], [62], where \(N\) is the number of tokens in the response.

Traditionally, \(\xtime\) and \(\xtext\) are handled by their corresponding specialized models. With the advent of video LLMs designed for temporal localization tasks [16], [25][31], \(\xtime\) can be generated in the form of LLM tokens via numeric or specific timestamp tokens and can thus share the same output space with \(\xtext\). Specifically, such temporal localization video LLMs usually start with a visual encoder and a resampler that extract a set of frame feature maps \(\{\mF_t\}_{t=1}^{T}\) from a \(T\)-frame video, where a frame feature map \(\mF_t \in \Rbb^{P \times C}\) has \(P\) token embedding vectors, each of which is of \(C\) dimensions. An LLM decoder takes \(\{\mF_t\}_{t=1}^{T}\) and the tokenized user prompt \(\{\vq_l\}_{l=1}^{L}\) as input and generates timestamp tokens as well as textual tokens \(\mX=\{\xtime, \xtext\}\). Given temporal localization tuning data, a pre-trained video LLM is fine-tuned with the language modeling loss: \[\cL_{\text{LM}}(\mX) = -\frac{1}{N}\sum_{n=1}^{N} \log p(\mX_n| \{\mF_t\}_{t=1}^{T}, \{\vq_l\}_{l=1}^{L}, \mX_{<n}), \label{eq:lm}\tag{1}\] where \(\mX_n\) is the \(n\)-th token in \(\mX\), \(\mX_{<n} = \{\mX_{n'}\}^{n-1}_{n'=1}\), and \(N\) is now the total number of tokens in the combined sequence of timestamp and textual tokens.
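For concreteness, the objective in Eq. 1 can be sketched in PyTorch as below; a HuggingFace-style causal LM interface is assumed, and the tensor names and the simple concatenation of visual, prompt, and response embeddings are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(llm, frame_feats, prompt_embeds, target_ids):
    """Sketch of Eq. 1: next-token prediction over the timestamp/text targets X.

    frame_feats:   (T, P, C) frame feature maps from the visual encoder/resampler
    prompt_embeds: (L, C)    embedded user prompt tokens
    target_ids:    (N,)      token ids of the supervised response X
    """
    # Flatten the frame tokens and prepend them, together with the prompt,
    # to the embedded targets (teacher forcing).
    visual = frame_feats.flatten(0, 1)                       # (T*P, C)
    target_embeds = llm.get_input_embeddings()(target_ids)   # (N, C)
    inputs = torch.cat([visual, prompt_embeds, target_embeds], dim=0).unsqueeze(0)

    logits = llm(inputs_embeds=inputs).logits[0]              # (T*P + L + N, V)
    # Only the response positions are supervised; shift by one for next-token prediction.
    n_ctx = visual.size(0) + prompt_embeds.size(0)
    pred = logits[n_ctx - 1 : n_ctx - 1 + target_ids.size(0)]
    return F.cross_entropy(pred, target_ids)                  # averages over the N tokens
```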

4 Method

LLMs struggle with numeric tokens [28], [77], [78] and require extensive pre-training to adapt to new tokens [16], [41], [42]. Moreover, fixating on event boundaries, which lack inherent semantic content, fails to leverage LLMs’ semantic understanding.

To fully exploit video LLMs’ potential in temporal localization, we propose to fine-tune them on a structural token generation task that induces video LLMs’ temporal structure understanding. The generated structural tokens enable holistic segmentation through a structural token grounding step, which readily yields the queried event segments. However, generating structural tokens alone does not capture the fine-grained semantics of the events, potentially bottlenecking the localization performance. To overcome this limitation, we introduce a query-focused captioning task that compels the LLM to extract detailed event semantics for more precise localization. An overview of the proposed pipeline is shown in Figure 2.

4.1 Structural Token Generation

While watching a video, we naturally segment its narrative into distinct events, which facilitates efficient retrieval of the content we care about [32][36]. This holistic structural understanding, especially when transition segments that carry only background information are explicitly modeled, has been crucial for specialist action localization methods [79][82]. However, these approaches often represent all transition segments with a single prototype due to limited model capacity, thereby overlooking their temporal dynamics and the subtle semantic differences from the key events. Although video LLMs have demonstrated excellent temporal structure understanding for general video tasks [21][24], current video LLM-based temporal localization methods still fail to fully exploit this capability.

To induce the holistic structural understanding from video LLMs for temporal localization tasks, we propose to fine-tune them on the structural token generation task, which enables them to distinguish between event and transition segments, represented by their corresponding structural tokens <ent> and <tst>, which are newly introduced into the LLM vocabulary. Specifically, during training, given a \(T\)-frame video together with a set of \(M\) ground-truth queried event segments depicted by their boundary timestamps \(\{(t_{i}^{s}, t_{i}^{e})\}^{M}_{i=1}\), we can augment them with their neighboring transition segments to form an augmented set of segments \(\{(t_{i}^{s}, t_{i}^{e})\}^{M'}_{i=1}\), where \(M'\) is the total number of both the event and the transition segments. It holds that \(t_{1}^{s}=1\), \(t^{e}_{M'}=T\) and \(t^{e}_{i}+1=t^{s}_{i+1}\), such that the segments cover the entire video. Let \(\cI_{\text{ent}}\) be a set of indices of the queried event segments; the sequence of structural tokens can be defined as \(\xst=\{\text{ST}(i)\}_{i=1}^{M'}\) with \[\text{ST}(i) = \begin{cases} \texttt{<ent>} & \text{if } i \in \cI_{\text{ent}}, \\ \texttt{<tst>} & \text{otherwise }, \end{cases}\] and we call <ent> the event token and <tst> the transition token from now on.
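The augmented segment set and the structural token sequence can be constructed as in the following Python sketch; the helper is illustrative and assumes 1-indexed, inclusive frame boundaries as in the notation above.

```python
def build_structural_sequence(event_segments, num_frames):
    """Augment ground-truth event segments with their neighboring transition segments.

    event_segments: sorted, non-overlapping list of (t_start, t_end), 1-indexed, inclusive.
    Returns (segments, tokens), where tokens[i] is "<ent>" or "<tst>" and the M' segments
    cover frames 1..num_frames contiguously.
    """
    segments, tokens = [], []
    cursor = 1
    for t_s, t_e in sorted(event_segments):
        if cursor < t_s:                        # gap before the event -> transition segment
            segments.append((cursor, t_s - 1))
            tokens.append("<tst>")
        segments.append((t_s, t_e))             # the queried event itself
        tokens.append("<ent>")
        cursor = t_e + 1
    if cursor <= num_frames:                    # trailing background -> transition segment
        segments.append((cursor, num_frames))
        tokens.append("<tst>")
    return segments, tokens

# e.g., two events in a 100-frame video:
# build_structural_sequence([(10, 30), (55, 70)], 100)
# -> segments [(1, 9), (10, 30), (31, 54), (55, 70), (71, 100)]
#    tokens   ["<tst>", "<ent>", "<tst>", "<ent>", "<tst>"]
```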

4.2 Structural Token Grounding

To ground the structural tokens to their corresponding video segments, we maximize the log-likelihood of each segment’s frames conditioned on its structural token. Formally, given the projected LLM hidden states of the segment frames \(\{\mH_{t}\}_{t=1}^{T}\) and the structural tokens \(\{\mathbf{s}_{i}\}_{i=1}^{M'}\) from two learnable MLP projectors [28], where \(\mH_{t}\in\Rbb^{P \times C}\) and \(\mathbf{s}_{i} \in \Rbb^{C}\), the structural token grounding loss can be formulated as: \[\begin{align} \cL_{\text{ST}} = -\frac{1}{M'}\sum_{i=1}^{M'} \displaystyle\sum_{t=t_{i}^{s}}^{t_{i}^{e}} \frac{\log p(\vh_{t}| \mathbf{s}_{i})}{t_{i}^e - t_{i}^s},\label{eq:st} \end{align}\tag{2}\] where \(\vh_{t}\in \Rbb^{C}\) is spatially average-pooled from \(\mH_{t}\), and both \(\vh_{t}\) and \(\mathbf{s}_{i}\) are normalized to the unit sphere following [43], [44]. \(p(\vh_{t}| \mathbf{s}_{i})\) is the conditional probability of frame \(t\) given the structural token \(i\) and is computed as: \[p(\vh_{t}| \mathbf{s}_{i}) = \frac{\exp(\mathbf{s}_{i} {\cdot} \vh_{t} / \tau)}{\sum_{t'=1}^{T} \exp( \mathbf{s}_{i} {\cdot} \vh_{t'} / \tau)}, \label{eq:cond}\tag{3}\] where \(\tau\) is a learnable temperature parameter [43]. This essentially makes Eq. 2 a contrastive learning objective [43], [44] that pulls together the structural tokens and their corresponding segment frames. In preliminary experiments, we also attempted a symmetric version of Eq. 2 that additionally includes \(p(\mathbf{s}_{i}|\vh_{t})\), similar to [43], but observed unstable training, likely due to the problem setting and optimization factors; we leave this for future exploration.
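A minimal PyTorch sketch of Eqs. 2 and 3 is given below, assuming the frame and structural token hidden states have already passed through the two MLP projectors and that segment boundaries are given as 0-indexed, inclusive frame spans; the small guard on the segment length is our addition.

```python
import torch
import torch.nn.functional as F

def structural_token_grounding_loss(frame_hidden, token_hidden, segments, log_tau):
    """Sketch of Eqs. 2-3.

    frame_hidden: (T, P, C) projected LLM hidden states of the frame tokens
    token_hidden: (M', C)   projected hidden states of the structural tokens
    segments:     list of (t_s, t_e) frame spans, 0-indexed and inclusive, of length M'
    log_tau:      learnable scalar; the temperature is exp(log_tau)
    """
    h = F.normalize(frame_hidden.mean(dim=1), dim=-1)    # spatial average pooling -> (T, C)
    s = F.normalize(token_hidden, dim=-1)                 # (M', C)
    logits = s @ h.t() / log_tau.exp()                    # (M', T) scaled similarities
    log_p = F.log_softmax(logits, dim=-1)                 # log p(h_t | s_i) over all frames

    loss = 0.0
    for i, (t_s, t_e) in enumerate(segments):
        span = log_p[i, t_s : t_e + 1]                    # frames belonging to segment i
        loss = loss - span.sum() / max(t_e - t_s, 1)      # length-normalized, as in Eq. 2
    return loss / len(segments)
```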

During inference, we compute Eq. 3 for all frames with respect to each structural token. We then obtain a holistic temporal segmentation by assigning each frame to the structural token with the highest conditional probability, which directly yields the queried event segments via the event tokens <ent>. The pseudocode of MeCo inference is provided in Figure 3. Consequently, our approach enables video LLMs to perform temporal localization by leveraging semantic structure, eliminating the need to generate uninformative boundary timestamps, which instead arise as by-products of the structural token-based localization.

Figure 3: Pseudocode of MeCo Inference.
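Complementing the pseudocode in Figure 3, the frame-to-token assignment described above can be sketched as follows; the function and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment_by_structural_tokens(frame_hidden, token_hidden, tokens):
    """Assign each frame to its most similar structural token (argmax of Eq. 3).

    tokens: generated structural tokens in order, e.g. ["<tst>", "<ent>", "<tst>"].
    Returns the (start, end) frame spans assigned to <ent> tokens, 0-indexed and inclusive.
    """
    h = F.normalize(frame_hidden.mean(dim=1), dim=-1)     # (T, C)
    s = F.normalize(token_hidden, dim=-1)                  # (M', C)
    assign = (s @ h.t()).argmax(dim=0).tolist()            # best token index per frame

    events, start = [], 0
    for t in range(1, len(assign) + 1):
        if t == len(assign) or assign[t] != assign[start]:  # a run of frames ends here
            if tokens[assign[start]] == "<ent>":
                events.append((start, t - 1))
            start = t
    return events
```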

4.3 Query-focused Captioning

Just as humans re-watch a clip to pinpoint details of interest, we believe that relying solely on structural tokens without scrutinizing event semantics hinders the model’s localization and reasoning capabilities. While “think step by step” strategies have boosted performance in tasks like math and coding [38], [83], how to leverage such strategies for temporal localization tasks remains unexplored.

To overcome this limitation, we introduce a query-focused captioning task that trains the model to generate detailed captions focusing on the queried segments. The task applies to a suite of temporal localization tasks, such as temporal grounding [4] and extractive video summarization [9], [10], to exploit the synergistic benefits of their respective data. An example of a query-focused caption is provided in Figure 2, and more are available in the supplementary material.

Essentially, query-focused captioning enables LLMs to extract detailed semantic information from the queried events, which we consider the key to improving temporal localization performance. We leverage query-focused captioning by having the LLM generate the detailed caption for a queried event right before it produces the corresponding event token (i.e., <ent>). This effectively guides the <ent> token to encode more semantic information about the event, so that it can more precisely localize the event segment.

The overall tokens that the LLM needs to generate now become the interleaved sequence of the structural tokens and the query-focused caption tokens \(\mX_{\text{MeCo}}=\{\text{QFC}(i), \text{ST}(i)\}_{i=1}^{M'}\) with \[\text{QFC}(i) = \begin{cases} \texttt{[Cap]}_{i} & \text{if } i \in \cI_{\text{ent}}, \\ \varnothing & \text{otherwise}, \end{cases}\] where \(\texttt{[Cap]}_{i}\) encloses all the query-focused caption tokens for the \(i\)-th event segment, and \(\varnothing\) means no token is placed (we omit the end-of-sequence token for notational clarity). Overall, the LLM training objective is the combination of the structural token grounding loss and the auto-regressive generation loss, i.e., \(\mathcal{L}_{\text{ST}} + \mathcal{L}_{\text{LM}}(\mX_{\text{MeCo}})\), where \(\mathcal{L}_{\text{LM}}\) is defined in Eq. 1.
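A sketch of how the interleaved targets \(\mX_{\text{MeCo}}\) could be assembled from the structural token sequence and the per-event captions; the string-level formatting is an assumption for illustration.

```python
def build_meco_targets(tokens, captions):
    """Interleave query-focused captions with structural tokens (Section 4.3).

    tokens:   structural token sequence, e.g. ["<tst>", "<ent>", "<tst>", "<ent>", "<tst>"]
    captions: dict mapping the index of each <ent> token to its caption string
    Returns the target string {QFC(i), ST(i)}_{i=1..M'} supervised by the LM loss.
    """
    parts = []
    for i, tok in enumerate(tokens):
        if tok == "<ent>":
            parts.append(captions[i])   # the caption precedes its event token
        parts.append(tok)
    return " ".join(parts)

# build_meco_targets(["<tst>", "<ent>", "<tst>"], {1: "A man lights the grill."})
# -> "<tst> A man lights the grill. <ent> <tst>"
```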

Interestingly, we find that although the query-focused captioning task prominently contributes to the structural-token-based temporal localization, it can barely be exploited by boundary-centric timestamp generation methods [25], [28]. This indicates the importance of focusing on the semantic understanding capabilities of LLMs to explore their full potential in temporal localization tasks.

5 Experiments

[TABLE]

Zero-shot performance comparisons on E.T.Bench [28] with previous methods. The full names of each task appear in Section 5.1. For general video LLMs, the reported results come from timestamp-aware prompting in [28]. “I.T. Data” refers to instruction tuning data used in the temporal localization tuning stage, which may include both localization-specific and other datasets. For temporal localization video LLMs, methods marked with \(\dagger\) are evaluated using their officially released checkpoints, while those marked with \(\ddag\) are fine-tuned on E.T.Instruct [28] (LoRA rank 128) for one epoch. When trained on E.T.Instruct, all models start from their official checkpoints at the final pre-training stage, before any temporal localization tuning. Metrics shown in gray are not zero-shot results, indicating that the model accessed the training data of the corresponding evaluation dataset in E.T.Bench. The best metrics are highlighted in green, and the second-best metrics in blue.

5.1 Benchmarks

We focus on evaluating MeCo’s zero-shot performance on three benchmarks: E.T.Bench [28], Charades-STA [48] and QVHighlights [4].

E.T.Bench is a comprehensive benchmark comprising a suite of event-level and time-sensitive tasks, spanning four domains: referring, grounding, dense captioning, and complex temporal reasoning. Since the referring domain does not involve temporal localization, we focus on the other three domains where temporal localization serves either as the primary or an auxiliary task. Specifically, the grounding domain includes five tasks: Temporal Video Grounding (TVG) [4], [48], Episodic Memory (EPM) [49], Temporal Action Localization (TAL) [50][52], Extractive Video Summarization (EVS) [9], [10], and Video Highlight Detection (VHD) [9], [10]. The dense captioning domain consists of two tasks: Dense Video Captioning (DVC) [54], [55] and Step Localization and Captioning (SLC) [56], [57]. Finally, the complex temporal reasoning domain involves two tasks: Temporal Event Matching (TEM) [4], [50] and Grounded Video Question Answering (GVQ) [58]. We directly apply the evaluation metrics provided in E.T.Bench.

Charades-STA and QVHighlights are widely adopted benchmarks for evaluating temporal moment retrieval and video highlight detection. Although E.T.Bench tasks such as TVG, VHD, and TEM include data from these two benchmarks, we report additional results directly on the original benchmarks to facilitate comparisons with existing methods. Specifically, the simplified single-segment evaluation used by E.T.Bench reduces difficulty for tasks such as temporal grounding/moment retrieval and highlight detection, especially on QVHighlights, which often contains samples with multiple ground-truth event segments.

We adopt the standard metrics for these benchmarks: for Charades-STA [48], we use recall at temporal Intersection over Union thresholds of 0.5 (R@1\(_{0.5}\)) and 0.7 (R@1\(_{0.7}\)); for QVHighlights, we use mean Average Precision (mAP) for moment retrieval (MR) and highlight detection (HL), along with HIT@0.1, indicating the accuracy of retrieving the top relevant highlights.
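For reference, the temporal IoU and the R@1 recall used for Charades-STA can be computed as below; this is the standard formulation rather than code taken from the benchmark toolkits.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans, given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold=0.5):
    """R@1_{threshold}: fraction of queries whose top-1 prediction reaches the tIoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, ground_truths))
    return hits / len(ground_truths)
```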

5.2 Implementation Details

Fine-tuning dataset. Though different works often collect their own data for temporal localization instruction fine-tuning [7], [25][27], [29], [30], we choose to utilize the E.T.Instruct dataset with 164K samples [28] as it covers a wide range of temporal localization tasks, including grounding, dense captioning, and complex reasoning tasks. More importantly, training on E.T.Instruct allows us to guarantee zero-shot evaluations on E.T.Bench, as data leakage has been avoided in the data collection process for E.T.Instruct.

Query-focused caption generation. As query-focused captioning is a novel task and there is currently no such dataset available, we leverage the ground-truth event timestamps in E.T.Instruct to extract event clips, which are then sent to a pre-trained video captioning model, MiniCPM-V-2.6 [75], to generate detailed clip captions. Because these initial captions are often very detailed and contain redundant information, we employ GPT-4o-mini [84] to summarize them into concise versions that preserve key details not mentioned in the original queries (if provided). Further details regarding the generation pipeline are presented in the supplementary material.
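A high-level sketch of this caption generation step is given below; caption_clip and summarize stand in for calls to MiniCPM-V-2.6 and GPT-4o-mini and are hypothetical wrappers, not actual APIs.

```python
def generate_query_focused_caption(video_frames, event_span, query, caption_clip, summarize):
    """Produce one concise query-focused caption for a ground-truth event segment.

    caption_clip(frames) -> str    : hypothetical wrapper around MiniCPM-V-2.6
    summarize(text, query) -> str  : hypothetical wrapper around GPT-4o-mini
    """
    t_s, t_e = event_span
    clip = video_frames[t_s : t_e + 1]        # crop the event clip by its GT boundaries
    detailed = caption_clip(clip)             # often long and partly redundant
    # Summarize into a concise caption that keeps key details the query does not mention.
    return summarize(detailed, query)
```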

Model Architecture. We develop and evaluate MeCo using the E.T.Chat architecture [28], which employs a pre-trained ViT-G/14 from EVA-CLIP [85] as the visual encoder and a resampler consisting of a pre-trained Q-Former [68] followed by a frame compressor [28] that produces one token per video frame. When using Phi-3-Mini-3.8B [86] as the base LLM, we build on the pre-trained E.T.Chat-Stage-2 model [28] for instruction fine-tuning. We also adopt the Qwen2 model [73] from MiniCPM-V-2.6 [75] as the base LLM, following the Stage 1 & 2 training in [28] to pre-train it before temporal localization fine-tuning. During fine-tuning, we follow [28] to apply LoRA adapters [87] to the LLM and train them together with the resampler for one epoch, while freezing all other parameters. Additionally, we test MeCo on other architectures, such as those used in TimeChat [25] and VTGLLM [27], for controlled comparisons. Only minimal changes are made to replace the original timestamp generation modules with MeCo, without further developmental enhancements. Additional details about the E.T.Chat-based model and other architectures appear in the supplementary material.

5.3 Main Results

E.T.Bench: Comprehensive comparisons. As shown in Table 1, although previous temporal localization video LLMs demonstrate promising zero-shot results compared to general video LLMs after temporal localization fine-tuning, they still underperform on most tasks compared to MeCo. Specifically, MeCo (3.8B) achieves substantial gains in all domains, for example, 59.1% vs. 44.3% on TVG\(_{F1}\) (grounding), 43.4% vs. 39.7% on DVC\(_{F1}\) (dense captioning), and 9.6% vs. 3.7% on GVQ\(_{Rec}\) (complex reasoning). Notably, many competing models use larger base LLMs (e.g., 7B or 13B) and train for considerably more steps (e.g., VTG-LLM [27] on 217K samples for 10 epochs, TRACE [30] on 900K samples for 2 epochs with full-parameter fine-tuning). This outcome demonstrates that MeCo better leverages video LLMs’ semantic understanding for temporal localization than boundary-centric methods. Furthermore, when MeCo uses a more powerful base LLM (Qwen2-7B), its performance consistently improves on most tasks, reinforcing its scalability and effectiveness.

E.T.Bench: Comparisons with E.T.Instruct fine-tuning. When fine-tuned on E.T.Instruct, VTG-LLM [27] retains performance levels comparable to its original setting. Although TimeChat [25] shows notable gains on most tasks, it still underperforms MeCo by a substantial margin. Meanwhile, TRACE [30] suffers the greatest performance drop, likely because its specialized timestamp encoder/decoder and newly introduced timestamp tokens demand extensive tuning for effective adaptation with the LLM. In contrast, MeCo relies only on two structural tokens, <ent> and <tst>, and emphasizes semantic understanding, an area where LLMs naturally excel, making it more amenable to efficient fine-tuning. Although E.T.Chat’s boundary matching mechanism does not require the LLM to generate timestamps, it overemphasizes uninformative boundaries and therefore fails to sufficiently capture the semantic information of events, resulting in inferior performance compared to MeCo.

[TABLE]

Zero-shot Performance on Charades-STA and QVHighlights. Detailed information on the E.T.Instruct fine-tuning setting, as well as the meanings of \(\dagger\) and \(\ddag\), can be found in the caption of Table 1. “MR” denotes Moment Retrieval, and “HL” stands for Highlight Detection.

Moment retrieval. As shown in Table 2, MeCo outperforms most existing methods on the Charades-STA benchmark, with only R@1\(_{0.7}\) on par with E.T.Chat [28] and TRACE [30]. However, Charades-STA contains relatively short videos (an average duration of 30 seconds [48]) and each video features only a single ground-truth segment, making it less challenging.

In contrast, QVHighlights consists of longer videos (about 150 s) and may contain one or more ground-truth segments per video. As MeCo is trained on E.T.Instruct, which does not include multi-segment moment retrieval data, we generalize MeCo to multi-segment retrieval by thresholding the semantic similarities between <ent> and frame tokens (Eq. 3) to produce one or more connected segments. As seen in Table 2 (E.T.Instruct fine-tuning setting), previous methods that have not encountered multi-segment retrieval data fail to generalize to this task, evidenced by their low mAP\(_{\text{MR}}\), because they generate discrete timestamp tokens rather than leveraging semantic similarities.
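The threshold-and-merge generalization described above can be sketched as follows; the per-frame scores come from Eq. 3 for the <ent> token, and the threshold value shown is a tunable assumption.

```python
def multi_segment_retrieval(ent_frame_scores, threshold=0.5):
    """Merge contiguous above-threshold frames into one or more retrieved segments.

    ent_frame_scores: list of per-frame similarities/probabilities for the <ent> token (Eq. 3).
    """
    segments, start = [], None
    for t, score in enumerate(ent_frame_scores):
        if score >= threshold and start is None:
            start = t                              # a segment opens
        elif score < threshold and start is not None:
            segments.append((start, t - 1))        # the segment closes
            start = None
    if start is not None:                          # a segment runs until the last frame
        segments.append((start, len(ent_frame_scores) - 1))
    return segments
```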

Highlight detection. For the QVHighlights highlight detection evaluation, we directly use the continuous semantic similarities derived from Eq. 3 for scoring. It is therefore unsurprising that MeCo achieves much higher performance in mAP\(_{\text{HL}}\) and HIT@0.1 than previous methods, which generate numeric tokens to approximate highlight scores without leveraging the underlying semantic information.

[TABLE]

Zero-shot Comparisons Between Contrastive Vision–Language Models and Video LLMs. For each contrastive model, we compute the cosine similarities between the localization query feature and the frame features (sampled at 1 fps). We then apply a threshold to these similarity scores and merge contiguous points above the threshold as localized segments.

[TABLE]

Replace Boundary-Centric Localization with MeCo. The original boundary-centric strategies in representative methods are replaced with MeCo for a more controlled comparison. Both the MeCo-adapted models and the original models are trained on E.T.Instruct for one epoch with LoRA adapters.

[TABLE]

Ablations on the necessity of <tst> token and query-focused captioning (QFC). Here, “Query Copying” indicates that the model is trained to replicate the localization query from the prompt instead of performing QFC. Whenever <tst> is omitted, we identify segments by applying a fixed threshold (same for all tasks) to the cosine similarities between <ent> and frame tokens, and then merging contiguous points above this threshold.

[TABLE]

Investigation on the compatibility of QFC and different localization strategies. Timestamp Tokens: The LLM is trained to generate numeric timestamp tokens [25], [28]. Boundary Matching: The LLM is trained to generate a special token to match the boundary embeddings of event segments [28].

5.4 Detailed Analysis

In this section, all experiments are conducted on E.T.Bench, and for all related analyses we use the MeCo (3.8B) variants trained on E.T.Instruct, unless otherwise specified. Except in Table 3, all metrics are reported as the average across all tasks within the corresponding domain.

Semantic-based methods excel; LLMs amplify. MeCo leverages LLMs’ semantic understanding to capture video temporal structure and fine-grained semantics for localization. However, contrastively trained vision-language models without generation capability, such as CLIP [43] and EVA-CLIP [85], also exhibit strong semantic discriminative power. As shown in Table 3, these models perform impressively on grounding tasks without any additional training, even surpassing the previous best temporal localization video LLMs on the EPM and EVS tasks while performing competitively on the other tasks. This provides strong evidence that semantic-based approaches are highly effective for temporal localization, and by harnessing LLMs’ capabilities, MeCo further amplifies this strength.

Replacing boundary-centric methods with MeCo yields consistent benefits. To isolate the benefits of MeCo, we compare it against boundary-centric methods under each of their settings. As shown in Table 4, MeCo consistently outperforms the original methods across all tasks. Note that aside from the E.T.Chat setting, we have not yet fully explored MeCo’s potential in each configuration; we leave the investigation of its compatibility with various base LLMs for future work.

The necessity of both holistic and localized understanding. As shown in Table 5, optimizing the structural token grounding loss without the transition tokens (<tst>), with segments derived via thresholding, yields significantly poorer performance than when <tst> is used. Notably, <ent> tokens begin to take effect once query-focused captioning is introduced; however, replacing query-focused captioning with an uninformative query copying task [30] reduces performance to the level achieved using only <ent> tokens. By combining holistic structural information via <tst> tokens with localized details from query-focused captioning, MeCo achieves the best performance.

Boundary-centric methods fail to leverage query-focused captioning. As shown in Table 6, boundary-centric methods that focus solely on generating numeric timestamps cannot exploit the rich semantic cues contained in the query-focused captions. In contrast, the structural tokens of MeCo effectively leverage this detailed information to enhance performance in both localization and complex reasoning. Therefore, the success of MeCo demonstrates that discriminative and generative learning can interact effectively to facilitate temporal localization tasks.

Figure 4: Visualizations of MeCo’s temporal localization results.

Qualitative analysis. As shown in Figure 4, MeCo generates detailed query-focused captions and accurately localizes event segments in both single-event and multi-event cases. However, it is also observable that there is still much room for improvement.

6 Conclusion

In this work, we introduce MeCo, a novel timestamp-generation-free framework that equips video LLMs with temporal localization capabilities. By leveraging structural tokens to capture holistic video structure and query-focused captioning to extract fine-grained semantic details, MeCo outperforms traditional boundary-timestamp-centric methods across a suite of temporal localization tasks. Our results demonstrate that exploiting LLMs’ intrinsic semantic understanding can be a more effective approach for temporal localization. For future work, it could be interesting to explore how MeCo can scale with synthetic localization data.

Figure 5: Query-focused captioning pipeline and examples.

Figure 6: Evaluation prompt templates.

A.1. Additional Implementation Details

In Table 7, we provide the hyperparameters used for the structural tokens’ MLP projectors, LoRA, and model training. All training was done on 4 NVIDIA A100 (80GB) GPUs.

Table 7: Hyperparameters for MLP Projectors, LoRA, and Model Training.
MLP Projectors
Number of Layers 2
Hidden Size 1536
Output Size 3072
LoRA
LoRA \(r\) 128
LoRA \(\alpha\) 256
LoRA Dropout 0.05
LoRA Modules QVO Layers
Model Training
Max Number of Tokens 2048
Number of Epochs 1
Batch Size 2
Learning Rate for LoRA 5e-5
LR Decay Type Cosine
Warmup Ratio 0.03
Optimizer AdamW
AdamW \(\beta_1, \beta_2\) 0.9, 0.997
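The LoRA settings in Table 7 roughly correspond to the following peft configuration; the target module names depend on the base LLM and are shown here as a typical assumption for the attention Q/V/O projection layers.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,                      # LoRA rank, as in Table 7
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "o_proj"],  # assumed names for the Q/V/O layers
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_llm, lora_config)  # wrap the frozen base LLM with LoRA adapters
```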

A.2. Details of Query-Focused Captioning

Based on the temporal localization data in E.T.Instruct [28], we extract event segments and send them to a video captioning model, MiniCPM-V-2.6 [75], to generate detailed captions. The captions often contain redundant information and are summarized using GPT-4o-mini [84]. Some QFC examples are shown in Figure 5.

A.3. Adapting TimeChat and VTGLLM to Work with MeCo

In Table 4 of the main text, we show the results of plugging MeCo into the TimeChat [25] and VTGLLM [27] models. TimeChat and VTGLLM share the same architecture, with a ViT-G/14 from EVA-CLIP [85] as the visual encoder, a pre-trained Q-Former [68] as the visual resampler, and a base LLM initialized from the pre-trained Video-LLaMA [20]. The difference is that TimeChat applies a sliding video Q-Former to compress the visual tokens to 96, whereas VTGLLM applies a slot-based visual compressor to obtain 256 tokens; both use 96 as the maximum number of sampled frames.

To plug MeCo into the architecture, we directly modify the video Q-Former in TimeChat into a standard image Q-Former, which resamples 32 tokens into 1 token per frame. Moreover, we apply bi-directional self-attention to the visual tokens, following [28]. Other components remain unchanged. We then train TimeChat, VTGLLM, and the MeCo-adapted model on E.T.Instruct with the same hyperparameters as in [25], except that LoRA \(r\) and \(\alpha\) are both set to 128.

A.4. Evaluation and Training Prompt Templates

For evaluation, we modify the E.T.Bench templates to work with MeCo, and example templates are provided in Figure 6. For training, we manually craft a query-focused captioning-aware instruction template for each task domain in E.T.Instruct and diversify it with GPT-4o [84] to obtain four more templates. The instruction templates for all domains are provided in Figure 7.

Figure 7: Instruction templates for different task domains: (a) temporal grounding, (b) dense video captioning, and (c) complex reasoning.

References

[1]
K. Q. Lin et al., “UniVTG: Towards unified video-language temporal grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2794–2804.
[2]
S. Yan et al., “UnLoc: A unified framework for video localization tasks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 13623–13633.
[3]
Y. Liu et al., “R2-tuning: Efficient image-to-video transfer learning for video temporal grounding,” Springer, 2024, pp. 421–438.
[4]
J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,” Advances in Neural Information Processing Systems, vol. 34, pp. 11846–11858, 2021.
[5]
Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, “Rethinking the Faster R-CNN architecture for temporal action localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1130–1139.
[6]
F. Cheng and G. Bertasius, “TallFormer: Temporal action localization with a long-memory transformer,” in European Conference on Computer Vision, Springer, 2022, pp. 503–521.
[7]
X. Liu et al., “End-to-end temporal action detection with transformer,” IEEE Transactions on Image Processing, vol. 31, pp. 5427–5441, 2022.
[8]
Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.
[9]
Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179–5187.
[10]
M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in ECCV, 2014, pp. 505–520, doi: 10.1007/978-3-319-10584-0_33.
[11]
K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” AAAI, vol. 32, no. 1, 2018, doi: 10.1609/aaai.v32i1.12255.
[12]
Z. Pang, Y. Nakashima, M. Otani, and H. Nagahara, “Contrastive losses are natural criteria for unsupervised video summarization,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2010–2019.
[13]
R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 706–715.
[14]
T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, “End-to-end dense video captioning with parallel decoding,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 6847–6857.
[15]
L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8739–8748.
[16]
A. Yang et al., “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” 2023, pp. 10714–10726.
[17]
M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” arXiv preprint arXiv:2306.05424, 2023.
[18]
B. Lin et al., “Video-llava: Learning united visual representation by alignment before projection,” arXiv preprint arXiv:2311.10122, 2023.
[19]
Y. Li, C. Wang, and J. Jia, “LLaMA-VID: An image is worth 2 tokens in large language models,” in European Conference on Computer Vision, Springer, 2025, pp. 323–340.
[20]
H. Zhang, X. Li, and L. Bing, “Video-llama: An instruction-tuned audio-visual language model for video understanding,” arXiv preprint arXiv:2306.02858, 2023.
[21]
E. Song et al., “Moviechat: From dense token to sparse memory for long video understanding,” 2024, pp. 18221–18232.
[22]
X. Wang, D. Song, S. Chen, C. Zhang, and B. Wang, “Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,” arXiv preprint arXiv:2409.02889, 2024.
[23]
P. Zhang et al., “Long context transfer from language to vision,” arXiv preprint arXiv:2406.16852, 2024.
[24]
F. Xue et al., “Longvila: Scaling long-context visual language models for long videos,” arXiv preprint arXiv:2408.10188, 2024.
[25]
S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “TimeChat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14313–14323.
[26]
D.-A. Huang et al., “LITA: Language instructed temporal-localization assistant,” Springer, 2025, pp. 202–218.
[27]
Y. Guo, J. Liu, M. Li, X. Tang, X. Chen, and B. Zhao, “VTG-LLM: Integrating timestamp knowledge into video LLMs for enhanced video temporal grounding,” arXiv preprint arXiv:2405.13382, 2024.
[28]
Y. Liu, Z. Ma, Z. Qi, Y. Wu, Y. Shan, and C. W. Chen, “ET bench: Towards open-ended event-level video-language understanding,” arXiv preprint arXiv:2409.18111, 2024.
[29]
B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “VTimeLLM: Empower LLM to grasp video moments,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14271–14280.
[30]
Y. Guo, J. Liu, M. Li, X. Tang, Q. Liu, and X. Chen, “TRACE: Temporal grounding video LLM via causal event modeling,” arXiv preprint arXiv:2410.05643, 2024.
[31]
L. Qian et al., “Momentor: Advancing video large language model with fine-grained temporal reasoning,” arXiv preprint arXiv:2402.11435, 2024.
[32]
J. M. Zacks and K. M. Swallow, “Event segmentation,” Current directions in psychological science, vol. 16, no. 2, pp. 80–84, 2007.
[33]
J. M. Zacks and B. Tversky, “Event structure in perception and conception.” Psychological bulletin, vol. 127, no. 1, p. 3, 2001.
[34]
C. A. Kurby and J. M. Zacks, “Segmentation in the perception and memory of events,” Trends in cognitive sciences, vol. 12, no. 2, pp. 72–79, 2008.
[35]
J. M. Wolfe and T. S. Horowitz, “Five factors that guide attention in visual search,” Nature human behaviour, vol. 1, no. 3, p. 0058, 2017.
[36]
J. H. Maunsell and S. Treue, “Feature-based attention in visual cortex,” Trends in neurosciences, vol. 29, no. 6, pp. 317–322, 2006.
[37]
W. Wang et al., “Visionllm: Large language model is also an open-ended decoder for vision-centric tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 61501–61513, 2023.
[38]
J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
[39]
S. Yao et al., “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[40]
P. Wu and S. Xie, “V*: Guided visual search as a core mechanism in multimodal LLMs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13084–13094.
[41]
X. Lai et al., “LISA: Reasoning segmentation via large language model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9579–9589.
[42]
R. Pi, L. Yao, J. Gao, J. Zhang, and T. Zhang, “PerceptionGPT: Effectively fusing visual perception into LLM,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27124–27133.
[43]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[44]
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[45]
M. Caron et al., “Emerging properties in self-supervised vision transformers,” 2021, pp. 9650–9660.
[46]
P. BehnamGhader, V. Adlakha, M. Mosbach, D. Bahdanau, N. Chapados, and S. Reddy, “LLM2Vec: Large language models are secretly powerful text encoders,” 2024.
[47]
W. Huang et al., “LLM2CLIP: Powerful language model unlocks richer visual representation,” arXiv, Nov. 2024.
[48]
J. Gao, C. Sun, Z. Yang, and R. Nevatia, “TALL: Temporal activity localization via language query,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5267–5275.
[49]
K. Grauman et al., “Ego4d: Around the world in 3,000 hours of egocentric video,” 2022, pp. 18995–19012.
[50]
V. Patraucean et al., “Perception test: A diagnostic benchmark for multimodal video models,” Advances in Neural Information Processing Systems, vol. 36, pp. 42748–42761, 2023.
[51]
A. Gorban et al., “Thumos challenge: Action recognition with a large.” 2015.
[52]
Y.-G. Jiang et al., “Thumos challenge: Action recognition with a large.” 2014.
[53]
M. Sun, A. Farhadi, and S. Seitz, “Ranking domain-specific highlights by analyzing edited videos,” in Computer Vision – ECCV 2014, Springer, 2014, pp. 787–802.
[54]
A. Zala et al., “Hierarchical video-moment retrieval and step-captioning,” 2023, pp. 23056–23065.
[55]
L. Zhou, C. Xu, and J. Corso, “Towards automatic learning of procedures from web instructional videos,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32.
[56]
D. Zhukov, J.-B. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and J. Sivic, “Cross-task weakly supervised learning from instructional videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3537–3545.
[57]
T. Afouras, E. Mavroudi, T. Nagarajan, H. Wang, and L. Torresani, “Ht-step: Aligning instructional articles with how-to videos,” Advances in Neural Information Processing Systems, vol. 36, pp. 50310–50326, 2023.
[58]
L. Bärmann and A. Waibel, “Where did I leave my keys? Episodic-memory-based question answering on egocentric videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1560–1568.
[59]
M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila, “Rethinking the evaluation of video summaries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7596–7604.
[60]
J. Xiao, A. Yao, Y. Li, and T.-S. Chua, “Can I trust your answer? Visually grounded video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13204–13214.
[61]
J. Lei, L. Yu, T. L. Berg, and M. Bansal, “Tvqa+: Spatio-temporal grounding for video question answering,” arXiv preprint arXiv:1904.11574, 2019.
[62]
S. Di and W. Xie, “Grounded question-answering in long egocentric videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12934–12943.
[63]
K. Lin et al., “Mm-vid: Advancing video understanding with gpt-4v (ision),” arXiv preprint arXiv:2310.19773, 2023.
[64]
Z. Yang et al., “Mm-react: Prompting chatgpt for multimodal reasoning and action,” arXiv preprint arXiv:2303.11381, 2023.
[65]
A. Zeng et al., “Socratic models: Composing zero-shot multimodal reasoning with language,” arXiv preprint arXiv:2204.00598, 2022.
[66]
D. Surís, S. Menon, and C. Vondrick, “ViperGPT: Visual inference via Python execution for reasoning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11888–11898.
[67]
J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning, PMLR, 2022, pp. 12888–12900.
[68]
J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning, PMLR, 2023, pp. 19730–19742.
[69]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[70]
Y. Wang et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” arXiv preprint arXiv:2204.07705, 2022.
[71]
Y. Wang et al., “Self-instruct: Aligning language models with self-generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
[72]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024.
[73]
P. Wang, S. Bai, S. Tan, et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024. Available: https://arxiv.org/abs/2409.12191.
[74]
Z. Chen et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024.
[75]
Y. Yao, T. Yu, A. Zhang, et al., “MiniCPM-V: A GPT-4V level MLLM on your phone,” arXiv preprint arXiv:2408.01800, 2024. Available: https://arxiv.org/abs/2408.01800.
[76]
B. Li, Y. Zhang, D. Guo, et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[77]
N. Dziri et al., “Faith and fate: Limits of transformers on compositionality,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[78]
S. Frieder et al., “Mathematical capabilities of chatgpt,” Advances in neural information processing systems, vol. 36, 2024.
[79]
P. Lee, Y. Uh, and H. Byun, “Background suppression network for weakly-supervised temporal action localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 11320–11327.
[80]
P. X. Nguyen, D. Ramanan, and C. C. Fowlkes, “Weakly-supervised action localization with background modeling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5502–5511.
[81]
T. Yu, Z. Ren, Y. Li, E. Yan, N. Xu, and J. Yuan, “Temporal structure mining for weakly supervised action detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5522–5531.
[82]
Q. Liu, Z. Wang, S. Rong, J. Li, and Y. Zhang, “Revisiting foreground and background separation in weakly-supervised temporal action localization: A clustering-based approach,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10433–10443.
[83]
OpenAI: A. Jaech, A. Kalai, A. Lerer, et al., “OpenAI o1 system card,” arXiv preprint arXiv:2412.16720, 2024. Available: https://arxiv.org/abs/2412.16720.
[84]
OpenAI: A. Hurst, A. Lerer, A. P. Goucher, et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024. Available: https://arxiv.org/abs/2410.21276.
[85]
Y. Fang et al., “Eva: Exploring the limits of masked visual representation learning at scale,” 2023, pp. 19358–19369.
[86]
M. Abdin et al., “Phi-3 technical report: A highly capable language model locally on your phone,” arXiv preprint arXiv:2404.14219, 2024.
[87]
E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021. Available: https://arxiv.org/abs/2106.09685.

  1. Code available at https://github.com/pangzss/MeCo

  2. Work done during an internship at CyberAgent, Inc.