Revolutionizing AI: MiniMax Unveils Groundbreaking Text-to-Video Model in Low-Key Launch
MiniMax Unveils Its First Large Model for Video Generation
On August 31, MiniMax released its first large model for video generation, accompanied by "Magic Coin", a 2-minute video generated by the model.
Yan Junjie, the founder of MiniMax, said in an interview: "We have indeed made great progress in video generation models. According to internal evaluations and benchmark scores, the quality of our generated video is better than Runway's."
The current video generation model is only a first version; a new version will be released soon, and the model will continue to iterate on data, algorithms, and usage details. For now only text-to-video is available, with image-to-video and combined text-plus-image-to-video to follow.
Yan Junjie explained that the team has been tackling the harder technical problems, such as training with much more compute. The difficulty of training video generation lies in turning videos into tokens: the resulting token sequences are very long, and the longer the sequence, the higher the computational complexity.
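To give a sense of the scale involved, here is a rough back-of-the-envelope sketch in Python; the tokenizer design, patch size, and sampling rate are illustrative assumptions, not MiniMax's actual figures:

```python
# Rough token-count estimate for a video, assuming a ViT-style tokenizer
# that splits each sampled frame into fixed-size spatial patches.
# All parameters below are illustrative assumptions, not MiniMax's figures.

def video_token_count(seconds: int, fps: int = 24, height: int = 480,
                      width: int = 854, patch: int = 16,
                      temporal_stride: int = 4) -> int:
    """One token per (patch x patch) spatial patch, keeping one frame
    out of every `temporal_stride` frames."""
    frames = seconds * fps // temporal_stride
    tokens_per_frame = (height // patch) * (width // patch)
    return frames * tokens_per_frame

print(video_token_count(5))  # ~47,700 tokens for a 5-second 480p clip
# By comparison, 100 words of text is on the order of 100-150 tokens,
# and standard attention cost grows quadratically with sequence length.
```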
The MiniMax team kept cutting this complexity through algorithmic work, pushing the compression rate higher, which delayed the release by one to two months.
Yan Junjie emphasized that the MiniMax team's core R&D principle is not to squeeze a 5% or 10% gain out of an algorithm but to pursue step-change improvements: "If it can be improved several times over, it must be done. If it only improves by 5%, it is not worth doing."
When asked why text-to-video matters, Yan Junjie argued that most of the content people consume every day consists of images, text, and video, with text accounting for only a small share. To achieve broader user coverage and usage, a model must output multimodal content rather than text alone.
Building large models for video generation brings its own difficulties. Yan Junjie explained that video is inherently more complex to handle than text: the context of a video is naturally very long and hard to process.
Video data is also far bulkier: a 5-second clip takes several megabytes, while 100 words of text take less than 1 KB, a storage gap of several thousand times.
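Taken at face value, the arithmetic supports that figure; a minimal check, with byte counts assumed within the ranges the article states:

```python
# The storage gap as simple arithmetic. Exact sizes are assumptions
# chosen within the ranges the article states.
video_bytes = 3 * 1024 * 1024    # a 5-second clip at "several megabytes" (~3 MB)
text_bytes = 600                 # 100 words at "less than 1 KB" (~600 bytes)
print(round(video_bytes / text_bytes))  # ~5243, i.e. "several thousand times"
```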
The challenge of building video models is that infrastructure originally built around text does not suit video generation: how data is processed, cleaned, and labeled all changes, so the infrastructure itself must be upgraded.
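As a toy illustration of that shift: a text-era pipeline mostly filters and deduplicates strings, whereas a video pipeline must filter clips on properties like duration, resolution, and caption quality. The fields and thresholds below are hypothetical, not MiniMax's actual pipeline:

```python
# Hypothetical sketch of a video data-cleaning step; fields and
# thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    duration_s: float
    height: int
    caption: str  # label produced by an annotation pass

def keep(clip: Clip) -> bool:
    """Basic cleaning rules a video corpus might apply before training."""
    return (2.0 <= clip.duration_s <= 60.0       # drop stills and very long footage
            and clip.height >= 360               # drop very low-resolution clips
            and len(clip.caption.split()) >= 5)  # drop near-empty labels

corpus = [Clip("a.mp4", 5.0, 480, "a coin spins slowly on a wooden table")]
cleaned = [c for c in corpus if keep(c)]
print(len(cleaned))  # 1
```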
At the launch event, Yan Junjie emphasized the importance of speed: in the long run, the faster the progress, the better. Whether it is MoE, linear attention, or other explorations, the essence is making a model of the same quality run faster.
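Linear attention, one of the techniques Yan mentions, reorders the attention computation so that cost grows linearly rather than quadratically with sequence length. A minimal single-head sketch with a generic positive feature map, not MiniMax's implementation:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear attention: apply a positive feature map phi, then reassociate
    the matmuls as phi(Q) @ (phi(K).T @ V). Cost is O(n * d^2) in sequence
    length n, versus O(n^2 * d) for standard softmax attention."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                     # (d x d) summary, independent of n
    normalizer = Qf @ Kf.sum(axis=0)  # per-query normalization term
    return (Qf @ kv) / normalizer[:, None]

n, d = 1024, 64  # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)  # never materializes an (n x n) matrix
print(out.shape)  # (1024, 64)
```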
Wei Weiye, head of the MiniMax open platform, noted that large models still face challenges in quality, cost, and multimodality. Hallucinations are unavoidable, and output can fall short of expectations because of weak instruction following and language comprehension.
Cost has also been a major challenge, but since May this year a price war has swept the large-model field, driving API prices down to "dirt-cheap" levels. Wei Weiye believes low cost can spark more application scenarios, and that API costs will fall further.
Multimodality will also unlock more application scenarios. For example, combining text and voice lets large models recognize and express emotion better, while combining voice and video can produce dubbed short videos and clips.
Despite the challenges in the large-model field, Yan Junjie expressed optimism about technological progress, users, and the efficiency of product iteration.
