Image-to-Video Goes Open Source: Zhipu Releases Its Video Model, Leaving Netizens in Awe
- Just now, CogVideoX-5B-I2V, the image-to-video model behind Zhipu's Qingying, was open-sourced! (Playable online)
- In practice, CogVideoX-5B-I2V generates a video from a single image plus a text prompt.
- Its companion model cogvlm2-llama3-caption handles converting video content into text descriptions.
2024-09-19 14:26:35 Source: Quantum Bit
Taole from Aofei Temple
Quantum Bit | Public Account QbitAI
Just now, CogVideoX-5B-I2V, the image-to-video model behind Zhipu's Qingying, was open-sourced! (Playable online)
Its captioning model, cogvlm2-llama3-caption, has also been open-sourced.

In practice, CogVideoX-5B-I2V generates a video from a single image plus a text prompt.
cogvlm2-llama3-caption, in turn, converts video content into text descriptions.
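If you want a feel for how "one picture + one prompt" looks in code, here is a minimal sketch using the diffusers library's CogVideoX image-to-video pipeline; the model id, input image, and parameter values are our illustrative assumptions, not an official demo:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed Hugging Face model id; check the official repo for the exact name.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # stream weights in to cut VRAM use
pipe.vae.enable_slicing()             # decode latents slice by slice

image = load_image("input.png")       # hypothetical starting frame
prompt = "A coffee-shop clerk smiles and welcomes guests."

# "one picture" + "prompt" -> video frames
frames = pipe(image=image, prompt=prompt, num_frames=49).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```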

However, netizens who have tried it are split on its performance:
Some came away calling it amazing.

Others fiddled with it for a while but ultimately went back to the previous version of CogVideoX, raving: that's the model I'm most bullish on!

So let's put it to the test and see how it performs!
The test begins~ Input prompt: A coffee-shop clerk clasps his hands, smiles to welcome the guests, and moves naturally while speaking. (Still the same old problem with hands.)
For the second test, we tried a short prompt: Maluo crosses his legs and makes a phone call. (The result is underwhelming; the subject stays static and barely moves.)
The third prompt: "The moon is bright and full; a few people sit by the river, chatting and singing." A result was generated, but the link at the end simply showed NaN. (Sob.)

The overall results are hard to pin down, and generation is on the slow side.

Let’s take a look at some of the successful works released by the team:
Prompt: The garden comes alive as a kaleidoscope of butterflies flutter among the flowers, their delicate wings casting shadows on the petals below.
Prompt: An astronaut in a suit, his boots stained with red Martian dust, reaches out to shake the hand of an alien under the pink sky of the fourth planet.
Prompt: The lakeshore is lined with willow trees, their slender branches swaying gently in the breeze. The calm lake reflects a clear blue sky, and several swans glide gracefully across the water.
It is worth mentioning that the code of the CogVideoX-5B-I2V model is now open source and supports deployment via Hugging Face.
The related research paper has also been made public, and it contains three technical highlights worth talking about~

First, the team developed an efficient three-dimensional variational autoencoder (3D VAE) that compresses the original video space to 2% of its size, greatly reducing the training cost and difficulty of the video diffusion model.
The architecture consists of an encoder, a decoder, and a latent-space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporally causal convolution preserves the causal ordering of frames and reduces communication overhead, and the team uses context parallelism to handle large-scale video processing.
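To make "temporal causal convolution" concrete, here is a minimal PyTorch sketch of our own (not the paper's implementation): the 3D convolution becomes causal in time by padding only the "past" side of the temporal axis, so no frame can see the future.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Toy causal 3D conv: symmetric spatial padding, past-only temporal padding."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                                # pad past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x):                                     # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_l, W_r, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

clip = torch.randn(1, 3, 8, 32, 32)                           # 8-frame toy clip
out = CausalConv3d(3, 16)(clip)
print(out.shape)                                              # (1, 16, 8, 32, 32)
```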
In experiments, the team found that encoding at higher resolutions generalizes easily, while increasing the number of frames is more challenging.
The model was therefore trained in two stages: first at a lower frame rate with small batches, then fine-tuned at a higher frame rate via context parallelism. The training loss combines an L2 term, an LPIPS perceptual loss, and the GAN loss from a 3D discriminator.
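A hedged sketch of that combined objective (the loss weights and the discriminator output here are placeholders of our own, not the paper's values):

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance on RGB images

def vae_loss(recon, target, disc_logits_fake, w_lpips=0.1, w_gan=0.05):
    """recon/target: (B, 3, T, H, W) videos in [-1, 1]."""
    l2 = F.mse_loss(recon, target)
    # LPIPS works on images, so fold the time axis into the batch axis
    b, c, t, h, w = recon.shape
    perc = lpips_fn(
        recon.transpose(1, 2).reshape(b * t, c, h, w),
        target.transpose(1, 2).reshape(b * t, c, h, w),
    ).mean()
    # non-saturating generator loss on the 3D discriminator's fake logits
    gan = F.softplus(-disc_logits_fake).mean()
    return l2 + w_lpips * perc + w_gan * gan
```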

Next comes the expert Transformer.
The team uses the VAE encoder to compress the video into a latent space, then splits the latent into patches and unfolds it into a long sequence embedding z_vision.
Meanwhile, a T5 encoder turns the text input into an embedding z_text, and z_text and z_vision are concatenated along the sequence dimension. The concatenated embedding is fed into a stack of expert Transformer blocks.
Finally, the output embedding is split back apart to restore the original latent shape and decoded with the VAE to reconstruct the video.
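At the shape level, the whole flow can be sketched as below; a vanilla TransformerEncoder stands in for the expert blocks, random tensors stand in for the VAE latent and the T5 output, and the 1x1 "patching" is a simplification of the real patch embedding:

```python
import torch
import torch.nn as nn

B, C, T, H, W = 1, 16, 4, 8, 8          # toy latent shape from the 3D VAE
d_model, n_text = 128, 20               # toy embed width / text length

z_latent = torch.randn(B, C, T, H, W)
z_text = torch.randn(B, n_text, d_model)              # stands in for T5 output

# "patchify": flatten the latent into tokens and project to model width
z_vision = nn.Linear(C, d_model)(z_latent.flatten(2).transpose(1, 2))

# concatenate text and vision tokens along the sequence dimension
tokens = torch.cat([z_text, z_vision], dim=1)

blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
out = blocks(tokens)

# slice the vision tokens back out and restore the latent layout for the
# VAE decoder (a real model would also project back to C channels)
z_out = out[:, n_text:].transpose(1, 2).reshape(B, d_model, T, H, W)
print(z_out.shape)
```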

The final highlight is the data.
The team devised negative labels to identify and exclude low-quality videos: over-edited clips, choppy motion, low visual quality, lecture-style footage, text-dominated frames, and screen-capture noise.
Using filters trained with video-llama, they labeled and screened 20,000 video samples. They also computed optical-flow and aesthetic scores, dynamically adjusting the thresholds to safeguard the quality of the generated videos.
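As a rough illustration of this style of filtering (not the team's actual code), the sketch below scores mean optical-flow magnitude with OpenCV and gates on thresholds; the threshold values and the aesthetic scorer are hypothetical placeholders:

```python
import cv2
import numpy as np

def mean_flow_magnitude(frames):
    """frames: list of HxW grayscale uint8 arrays from one clip."""
    mags = []
    for prev, curr in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))

def score_aesthetics(frames):
    # placeholder for a learned aesthetic predictor (e.g., a CLIP-based scorer)
    return 5.0

def keep_video(frames, flow_thresh=0.3, aes_thresh=4.5):
    if mean_flow_magnitude(frames) < flow_thresh:    # near-static clip
        return False
    if score_aesthetics(frames) < aes_thresh:        # low visual quality
        return False
    return True
```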
Video data usually comes without textual descriptions, yet text-to-video training requires them, and existing video-caption datasets offer captions too short to fully describe the content.
To address this, the team proposed a pipeline that builds video captions out of image captions, and fine-tuned an end-to-end video-captioning model to obtain denser captions.
The pipeline generates short captions with the Panda70M model and dense image captions with the CogView3 model, then has GPT-4 summarize them into the final video caption.
They also fine-tuned a CogVLM2-Caption model, built on CogVLM2-Video and Llama 3 and trained on the dense caption data, to speed up video caption generation.
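The control flow of that captioning pipeline might look like the sketch below, where every helper is a hypothetical stand-in for the real component (the Panda70M captioner, the dense image captioner, and GPT-4 summarization):

```python
def panda70m_short_caption(frames):
    return "a short clip-level caption"           # placeholder

def dense_image_caption(frame):
    return "a dense per-frame caption"            # placeholder

def gpt4_summarize(short_caption, frame_captions):
    return short_caption + " | " + "; ".join(frame_captions)  # placeholder

def caption_video(frames):
    short = panda70m_short_caption(frames)
    dense = [dense_image_caption(f) for f in frames[::16]]    # sample frames
    return gpt4_summarize(short, dense)
```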

It's also worth noting that CogVideoX hasn't been idle over the past month; it has been a diligent updater and shipped plenty of changes!
September 17, 2024: released SAT weight inference and fine-tuning code plus installation commands, and added GLM-4-based prompt optimization.
September 16, 2024: users can now combine local open-source models + FLUX + CogVideoX to automatically generate high-quality videos.
September 15, 2024: the LoRA fine-tuning weights of CogVideoX were exported and successfully tested in the diffusers library.
August 29, 2024: pipe.enable_sequential_cpu_offload() and pipe.vae.enable_slicing() were added to the CogVideoX-5B inference code, cutting video-memory usage to 5 GB (see the sketch after this changelog).
August 27, 2024: the open-source license of the CogVideoX-2B model was changed to Apache 2.0.
On the same day, Zhipu AI open-sourced the larger CogVideoX-5B model, which significantly improves video quality and visual fidelity; it also optimizes inference performance, letting users run inference on desktop GPUs such as the RTX 3060 and lowering the hardware bar.
August 20, 2024: the VEnhancer tool added support for enhancing videos generated by CogVideoX, improving resolution and quality.
August 15, 2024: the SwissArmyTransformer library that CogVideoX depends on was upgraded to 0.4.12, so fine-tuning no longer requires installing it from source; the Tied VAE technique was also introduced to improve generation.
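As promised above, here is a short sketch of the memory-saving calls from the August 29 entry, applied to the text-to-video pipeline in diffusers (the 5 GB figure is the team's claim; actual savings depend on hardware):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # stream weights from CPU on demand
pipe.vae.enable_slicing()             # decode the latent in slices

frames = pipe(prompt="A swan glides across a calm lake.", num_frames=49).frames[0]
export_to_video(frames, "swan.mp4", fps=8)
```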
With CogVideoX-5B-I2V now open-sourced, the CogVideoX family supports three tasks: text-to-video, video extension, and image-to-video.

All rights reserved. No reproduction or use in any form without authorization. Violators will be prosecuted.
