Video-XL: One 80G Graphics Card Is Enough for a Large Model to Understand Hour-Long Videos
2024-10-28 17:30:03 Source: Qubits
Yunzhong, from Aofeisi
Qubits | Official account QbitAI
With just a single 80G graphics card, a large model can understand hour-long videos.
Zhiyuan Research Institute, together with Shanghai Jiao Tong University, Renmin University of China, Peking University, Beijing University of Posts and Telecommunications and other universities, has released its latest achievement: Video-XL, a large model for ultra-long video understanding.

It uses the native ability of the language model (LLM) to compress long visual sequences, retaining short-video understanding while showing excellent generalization in long-video understanding.
Compared with models of the same parameter scale, Video-XL ranks first on multiple tasks across several mainstream long-video understanding benchmarks.
It also strikes a good balance between efficiency and performance: a single graphics card with 80G of video memory is enough to process 2048 frames of input (sampled from hour-long videos) and reach close to 95% accuracy on the video "needle in a haystack" task.

△Figure 1: The maximum number of frames supported by different long video models on a single 80G graphics card and their performance on Video-MME
Long video understanding is one of the core capabilities of multi-modal large models and a key step towards artificial general intelligence (AGI).
However, existing multi-modal large models still face the dual challenges of poor performance and low efficiency when processing ultra-long videos of more than 10 minutes.
Video-XL was built to address exactly this, and its model code has been open-sourced.
In the future, it is expected to show broad application value in scenarios such as movie summarization, video anomaly detection, and ad placement detection, becoming a powerful assistant for long video understanding.
Using MLLMs for long video understanding has great research and application prospects. However, current video understanding models can usually only handle short videos and struggle with videos longer than ten minutes.
Although some long video understanding models have recently emerged in the research community, these works mainly suffer from the following problems:
Information loss caused by compressing visual tokens: To fit the large number of visual tokens brought by long videos into the fixed context window of the language model, many methods design mechanisms to compress visual tokens. For example, LLaMA-VID reduces the number of tokens per frame, while MovieChat and MALMM design memory modules to compress frame information. However, compressing visual information inevitably leads to information loss and performance degradation.
Imbalance between performance and efficiency: LongVA extends the context window by fine-tuning the language model and successfully generalizes short-video understanding capabilities to long videos. LongVila optimizes the cost of long-video training and proposes a paradigm for efficient long-video training. However, these works do not consider the computational overhead caused by the growing number of video frames at inference time.

△Figure 2: Video-XL model structure diagram
As shown in Figure 2, the overall model structure of Video-XL is similar to the structure of mainstream MLLMs, consisting of a visual encoder (CLIP), a visual-language mapper (2-layer MLP) and a language model (Qwen-7B).
What is special is that, in order to handle multi-modal data in various formats (single image, multi-image, and video), Video-XL establishes a unified visual encoding mechanism.
For multi-image and video data, each frame is fed into CLIP separately; a single image is divided into multiple image blocks, which are fed into CLIP for encoding. As a result, an N-frame video or an image split into N blocks is uniformly represented by N × M visual tokens (M being the number of tokens produced per frame or block), as sketched below.
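A minimal sketch of this unified encoding, assuming a CLIP-style encoder and a 2-layer MLP projector are available as callables; the names `clip_encoder`, `projector`, and the per-frame token count are illustrative assumptions, not Video-XL's actual API:

```python
# Minimal sketch of the unified visual encoding, assuming PyTorch-style callables.
import torch

def encode_visual(frames_or_blocks: torch.Tensor, clip_encoder, projector) -> torch.Tensor:
    """frames_or_blocks: (N, 3, H, W) -- N video frames, or N blocks cut from one image.

    Each frame/block is encoded independently by CLIP and mapped into the LLM
    embedding space by the 2-layer MLP projector, giving a unified N * M token sequence.
    """
    feats = clip_encoder(frames_or_blocks)      # (N, M, d_vision): M patch tokens per frame/block
    tokens = projector(feats)                   # (N, M, d_llm)
    n, m, d = tokens.shape
    return tokens.reshape(n * m, d)             # flattened (N*M, d_llm) visual sequence
```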
Compared with previous long video models that directly compress visual tokens, Video-XL attempts to use the context modeling capability of the language model to losslessly compress long visual sequences. The visual signal sequence output by the visual-language connector is

X = [x_1, x_2, …, x_n],

where n is the number of visual tokens. The goal of Video-XL is to compress X into a more compact visual representation C with |C| ≪ n.
Inspired by Activation Beacon, Video-XL introduces a new special token, the Visual Summary Token (VST). Based on this, the hidden features of the visual signal can be compressed into the VST's activations inside the LLM (the Key and Value values of each layer). Specifically, the visual signal sequence X is first divided into windows of size w (the default window length is 1440):

X = [X_1, X_2, …, X_k], where each window X_i contains at most w visual tokens.
Next, a compression ratio α is determined for each window, and a group of VST tokens is interleaved into the visual token sequence, one VST after every α visual tokens. During this process, the change in the visual token sequence can be expressed as

X_i = [x_1, …, x_α, v_1, x_{α+1}, …, x_{2α}, v_2, …],

where v_j denotes the j-th VST token, so a window of w visual tokens gains roughly w / α VST tokens.
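The interleaving step can be illustrated with the following minimal sketch; the "<vst>" marker and the function name are placeholders for illustration, not Video-XL's actual implementation:

```python
# Illustrative sketch of interleaving VST tokens into one window at compression ratio alpha.
VST = "<vst>"   # placeholder marker; the real special-token string is an assumption

def interleave_vst(window_tokens: list, alpha: int) -> list:
    """Insert one VST token after every `alpha` visual tokens.

    A window of w visual tokens thus gains roughly w / alpha VST tokens,
    whose LLM activations later stand in for the whole window."""
    out = []
    for i, tok in enumerate(window_tokens, start=1):
        out.append(tok)
        if i % alpha == 0:
            out.append(VST)
    if len(window_tokens) % alpha != 0:   # close a trailing partial group
        out.append(VST)
    return out

# Example: a 1440-token window at ratio 16 -> 1440 visual tokens plus 90 VST tokens.
```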
The LLM processes each window in turn for encoding, using an additional projection matrix to handle the VST hidden states in each layer's self-attention module. After a window is encoded, the activations of ordinary visual tokens are discarded, while the activations of the VSTs are retained and accumulated, serving as a proxy for the visual signal when subsequent windows are processed. The training objective is

min_θ  −∑_t log P(x_t | C_prev, x_<t; θ),

where C_prev denotes the accumulated VST activations of the preceding windows and x_<t the tokens before position t in the current window.
Here θ represents all of the model's optimized parameters, including the language model, visual encoder, visual-language connector, VST projection matrix, and VST token embedding; the model is trained by minimizing this standard autoregressive loss. Losses are not computed on VST tokens during training (their labels are set to -100), since they are only used for compression. Meanwhile, to flexibly support different compression granularities, the compression ratio of each window is randomly sampled from {2, 4, 8, 12, 16} during training. At inference time, a compression ratio can be chosen based on the desired efficiency and applied to all windows, as in the sketch below.
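A hedged sketch of how the window-by-window compression and the training-time label masking could be wired up, using the Hugging Face-style causal LM interface; `insert_vst`, `select_kv_at_positions`, and the other names are assumptions for illustration, not Video-XL's actual code:

```python
# Sketch (under stated assumptions) of streaming window encoding with VST KV accumulation,
# random compression ratios at training time, and -100 label masking for VST positions.
import random
import torch

COMPRESSION_RATIOS = [2, 4, 8, 12, 16]
WINDOW_SIZE = 1440
IGNORE_INDEX = -100                      # loss is not computed at VST positions

def encode_long_video(visual_embeds, llm, training=True, fixed_ratio=16):
    """Process the visual sequence window by window.

    After each window is encoded, activations of ordinary visual tokens are dropped;
    only the per-layer Key/Value activations at VST positions are kept and carried
    over as a compact proxy for all earlier windows."""
    vst_kv_cache = None                                        # accumulated VST activations
    for start in range(0, visual_embeds.shape[0], WINDOW_SIZE):
        window = visual_embeds[start:start + WINDOW_SIZE]
        alpha = random.choice(COMPRESSION_RATIOS) if training else fixed_ratio
        window_with_vst, is_vst = insert_vst(window, alpha)    # hypothetical helper
        out = llm(inputs_embeds=window_with_vst.unsqueeze(0),
                  past_key_values=vst_kv_cache,
                  use_cache=True)
        vst_kv_cache = select_kv_at_positions(out.past_key_values, is_vst)  # hypothetical helper
    return vst_kv_cache

def mask_vst_labels(labels: torch.Tensor, is_vst: torch.Tensor) -> torch.Tensor:
    """Set labels at VST positions to -100 so they are ignored by the autoregressive loss."""
    labels = labels.clone()
    labels[is_vst] = IGNORE_INDEX
    return labels
```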
In the pre-training stage, Video-XL uses the Laion-2M dataset to optimize the visual-language connector. In the fine-tuning stage, Video-XL fully exploits the MLLM's capabilities on various multi-modal datasets. For single-image data, 57k images from Bunny 695k and Sharegpt-4o were used; for multi-image data, 5k samples extracted from MMDU; for video data, video samples of different durations were collected, including 32k samples from NExT-QA, 2k video samples from Sharegpt-4o, 10k samples from CinePile, and 11k in-house samples with GPT-4V video caption annotations.
To enhance long video understanding and unleash the potential of the visual compression mechanism, this work develops an automated long-video data production pipeline and builds a high-quality dataset, Visual Clue Ordering (VICO). The pipeline first obtains long videos from the CinePile data or from video platforms such as YouTube, covering open-domain content such as movies, documentaries, games, and sports. Each long video is split into 14-second segments, and for each segment the VILA-1.5 40B model generates a detailed description covering action sequences and key events. Based on these captions, ChatGPT is used to arrange the clues in chronological order. By requiring the model to retrieve key frames and detect temporal changes, the VICO dataset improves its long-video understanding capabilities.
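A rough sketch of this data pipeline, assuming hypothetical helpers for video splitting, clip captioning, and ChatGPT-based ordering (none of these names come from the Video-XL codebase):

```python
# Rough sketch of the automated VICO data pipeline; all helper names are hypothetical.
CLIP_SECONDS = 14

def build_vico_sample(long_video_path: str) -> dict:
    """Split a long video into 14-second clips, caption each clip with VILA-1.5 40B,
    then ask ChatGPT to arrange the extracted clues in chronological order."""
    clips = split_video(long_video_path, seconds=CLIP_SECONDS)            # hypothetical helper
    captions = [caption_clip(c, model="VILA-1.5-40B") for c in clips]     # per-clip descriptions
    ordering_task = order_clues_with_chatgpt(captions)                    # chronological ordering QA
    return {"video": long_video_path, "captions": captions, "task": ordering_task}
```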
Video-XL is evaluated on multiple mainstream video understanding benchmarks: VNBench, LongVideoBench, MLVU, and Video-MME for long video understanding, and MVBench and Next-QA for short video understanding.
1. Long video understanding:
As shown in Table 1 and Table 2, Video-XL demonstrates excellent performance on multiple mainstream long-video benchmarks. Its accuracy on VNBench exceeds the current best long video model by about 10%.
On the MLVU validation set, Video-XL, with only 7B parameters, even surpasses GPT-4o on the single-choice task. On datasets such as Video-MME and LongVideoBench, Video-XL also ranks first among long video understanding models of the same scale.
2. Super long video understanding:
Video-XL was put through a video "needle in a haystack" test to evaluate its ability to handle very long contexts. LLaVA-NexT-Video and LongLLaVA both adopt simple positional extrapolation schemes, yet still struggle to capture key information as more context is fed in.
Although LongVA handles longer inputs by fine-tuning the LLM, the high computational cost limits it to roughly 400 frames on a single 80G GPU. In contrast, under the same hardware conditions, Video-XL can take 2048 frames as input at a 16x compression ratio and reach an accuracy of nearly 95%, showing that it achieves the best balance between accuracy and computational efficiency.
3. Short video understanding:
Although Video-XL is designed primarily for long videos, it retains the ability to understand short videos. On the MVBench and Next-QA evaluations, Video-XL achieves results comparable to current SOTA models.

△Table 3 Ablation experiment of Video-XL
Video-XL conducted ablation experiments on the proposed visual compression mechanism and the VICO dataset, as shown in Table 3.
1. Effectiveness of visual compression:
Video-XL trained two models on the Bunny 695k dataset: one without compression and one with a randomly chosen compression ratio (from {2, 8, 16}). For the compressed model, different compression ratios were applied when testing on the video benchmark MLVU and the image benchmarks MME and MMBench. Notably, even at a compression ratio of 16, the compressed model still approaches or even surpasses the uncompressed baseline.
2. Validity of VICO data set:
Video-XL trained five models with different datasets: (a) Bunny 695k only; (b) Bunny 695k combined with NeXTQA 32k; (c) Bunny 695k combined with CinePile 10k; (d) Bunny 695k combined with 5k long-video captions; (e) Bunny 695k combined with VICO 5k. Notably, even using only 5k VICO samples, Video-XL outperforms the model trained with NeXTQA 32k. Furthermore, the main event/action ordering task brings more significant improvements than the caption generation task, because it forces the model to extract key segments from long sequences and understand them.

△Figure 3 Visual results of Video-XL on long video understanding tasks
At present, the model code of Video-XL has been open-sourced to promote cooperation and technology sharing in the global multi-modal video understanding research community.
Paper link: https://arxiv.org/abs/2409.14485
Model link: https://huggingface.co/sy1998/Video_XL
Project link: https://github.com/VectorSpaceLab/Video-XL
All rights reserved. Any reproduction or use in any form without authorization is prohibited. Violators will be prosecuted.
