ByteDance releases large models for video generation: from price competition to performance breakthroughs (EastMoney.com)
The field of AI video generation has reached another major milestone. On September 24, the Volcano Engine AI Innovation Tour was held in Shenzhen. At the conference, Volcano Engine released two large video generation models, PixelDance and Seaweed, and launched invitation tests for the enterprise market.
In addition to the video generation models, Volcano Engine also released a music model and a simultaneous interpretation model, and comprehensively upgraded its general language model, text-to-image model, and speech model. The full-modal Doubao large model family was thus fully unveiled.
At the meeting, Tan Dai, president of Volcano Engine, said, “There are many difficulties in video generation that need to be overcome. Doubao’s two models will continue to evolve, explore more possibilities in solving key problems, and accelerate the expansion of AI video creation space and application implementation.”
ByteDance releases Doubao video generation model
At the conference, the Doubao video generation model released by ByteDance was undoubtedly the center of attention. Its video generation quality not only reaches an industry-leading level but also surpasses traditional video generation techniques across multiple dimensions.
One of the highlights of the Doubao video generation model is its precise semantic understanding. While most video generation models on the market can only follow simple instructions describing a single action, the Doubao model can follow more complex prompts, unlocking time-sequenced interactions among multiple subjects driven by multiple action instructions.
To overcome the difficulty of maintaining consistency across shot changes, the Doubao video generation model adopts a new diffusion-model training method. This technique preserves consistency of subject, style, atmosphere, and logic when a single prompt switches between multiple shots, allowing users to tell a story with a beginning, middle, and end in just 10 seconds.
For highly dynamic, complex scenes and diversely phrased text instructions, the Doubao video generation model, built on an efficient DiT (Diffusion Transformer) fusion computing unit, compresses and encodes video and text more fully. The result is more flexible motion in generated videos, more varied camera language, and richer expressions and details.
In terms of visual effects, the Doubao video generation model also performs well. It supports film-level video generation, with rich levels of detail and high realism. At the same time, the model also has professional-level color blending and light and shadow layout capabilities, which greatly improves the visual aesthetics of the picture.
In addition, the Doubao video generation model deeply optimizes the Transformer structure, improving the generalization ability of video generation. It supports a variety of styles, including black-and-white, 3D animation, 2D animation, Chinese painting, and thick-paint illustration, as well as multiple aspect ratios, to meet users' diverse creative needs.
When ByteDance launched the Doubao video generation model, it followed its usual path of developing large models: first honing the model’s capabilities through consumer-oriented products, and then expanding into the enterprise market once the model has a competitive advantage.
This strategy was already validated with the Doubao language model, which was first registered in China in August 2023 and, after nearly a year of quiet refinement, officially released in May 2024. Similarly, an early version of the Doubao video generation model had been used and iterated on platforms such as Dreamina for a long time before its official launch in the enterprise market.
In addition, ByteDance’s ability to achieve such results in the field of video generation models is inseparable from its rich accumulation of business scenarios.
ByteDance's business scenarios span short video, social media, online education, e-commerce, and many other fields. These diversified scenarios provide massive data and rich application settings for the research, development, and training of video generation models, enabling them to better understand and meet users' diverse needs.
At the same time, ByteDance has profound experience in the field of algorithms and has a strong R&D team that continuously promotes algorithm innovation and optimization, providing solid technical support for the excellent performance of the Doubao video generation model.

Doubao full-modal large model family debuts
Since the release of the Doubao model in May this year, daily average usage of language-model tokens has grown tenfold, and the processing volume of multimodal data such as images and voice has also increased significantly. According to QuestMobile data, as of July, Doubao had 30.42 million monthly active users, making it one of the largest AI-native applications in China.
In addition to the video generation model, Volcano Engine has also released the Doubao music model. Users can easily generate a 1-minute high-quality music work including melody, lyrics and singing by simply describing or uploading a picture.
The high-quality music generation capability of the Doubao music model rests on its advanced algorithms and rich music library. The model can accurately understand the emotions in the lyrics or image a user provides and generate a melody and rhythm that closely match them. It also supports more than 10 music styles and emotional expressions, such as folk, pop, rock, and Chinese style, to meet the diverse needs of different users.
In terms of singing, the Doubao music model also performs well. It can match an appropriate timbre to the style of a song and realistically render details such as breath and transitions between full voice and falsetto, making users feel as if they are in a professional recording studio. The model also supports a high-fidelity listening experience, letting users enjoy the music during the creation process.
As globalization deepens, the importance of cross-language communication is self-evident. The Doubao simultaneous interpretation model released by Volcano Engine was created to solve this problem. The model has the characteristics of ultra-low latency and translation while speaking, and can maintain the advantages of fluency, naturalness and high accuracy in the process of real-time translation. According to evaluations, in office, legal, educational and other scenarios, the translation level of the Doubao simultaneous interpretation model is close to or even exceeds the level of human simultaneous interpretation.
It is worth mentioning that the Doubao simultaneous interpretation model also supports the voice cloning function. This means that in the process of cross-language translation, the model can maintain the timbre and expressiveness of the original voice, thereby breaking down communication barriers and facilitating communication in scenarios such as multinational conferences, international forums, and online live broadcasts.
In addition to the three newly released models mentioned above, namely the video generation model, music model, and simultaneous interpretation model, Volcano Engine has also comprehensively upgraded its general language model, text-to-image model, and speech model.
The general language model has improved to varying degrees in comprehensive ability, mathematics, coding, professional knowledge, and more. The text-to-image model 2.0 achieves significant gains in inference efficiency and performance, rendering complex scenes more accurately and generating images at very high speed.
The upgraded speech model introduces a powerful voice-mixing feature that lets users freely combine different timbres to create a unique sound experience. This not only opens up new possibilities for audio creation but also markedly improves user experience in scenarios such as voice interaction and smart homes.
From “price war” to “performance war”
Currently, big models bring important changes and development opportunities to cloud services. Volcano Engine is becoming an important force in cloud services in the AI era: it has led to price cuts for big models, and has initiated a big model alliance for smart terminals, automobiles, and retail, promoting innovation in AI applications in the industry.
As product capabilities continue to improve, usage of the Doubao large models is also growing rapidly.
According to Volcano Engine, as of September, the average daily token usage of the Doubao language model exceeded 1.3 trillion, a tenfold increase compared to when it was first released in May. The multimodal data processing volume also reached 50 million images and 850,000 hours of voice per day, respectively.
In the early stage of large-model development, price competition was one of the market's focal points. Previously, the Doubao large models announced pricing 99% lower than the industry average, triggering a wave of price cuts among domestic large models.
Tan Dai believes that the price of large models is no longer a barrier to innovation. With large-scale applications in enterprises, large models supporting greater concurrent traffic are becoming a key factor in the development of the industry.
According to Tan Dai, many large models in the industry currently support a maximum TPM (tokens per minute) of only 300K or even 100K, which can hardly carry the traffic of enterprise production environments. For example, the TPM peak in a research institution's literature-translation scenario is 360K, a car's smart cockpit peaks at 420K, and an AI education company peaks as high as 630K. For this reason, the Doubao large model supports an initial TPM of 800K by default, far above the industry average, and customers can flexibly expand capacity on demand.
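The capacity comparison above can be sketched in a few lines of code. This is a minimal, hypothetical illustration using only the TPM figures cited in the article; the helper function is not an actual Volcano Engine API.

```python
# Hypothetical capacity check: does a model endpoint's TPM (tokens-per-minute)
# quota cover the peak traffic of several enterprise scenarios?
# Quota and peak figures are the ones cited in the article.

DEFAULT_TPM_QUOTA = 800_000  # Doubao's stated default initial TPM

# Peak TPM figures cited for three enterprise scenarios
scenario_peaks = {
    "literature translation (research institution)": 360_000,
    "car smart cockpit": 420_000,
    "AI education company": 630_000,
}

def fits_quota(peak_tpm: int, quota: int = DEFAULT_TPM_QUOTA) -> bool:
    """Return True if the scenario's peak token rate stays within the quota."""
    return peak_tpm <= quota

for name, peak in scenario_peaks.items():
    status = "OK" if fits_quota(peak) else "needs capacity expansion"
    print(f"{name}: peak {peak:,} TPM vs quota {DEFAULT_TPM_QUOTA:,} -> {status}")
```

Under the 300K or 100K quotas Tan Dai mentions for other models, all three scenarios would fail this check, which is the article's point about concurrency becoming the new competitive axis.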
“Thanks to our efforts, the application cost of large models is no longer an obstacle. Large models must move from competing on price to competing on performance, on better model capabilities and services,” said Tan Dai.
(Source: 21st Century Business Herald)
Original title: ByteDance releases video generation model: from price competition to performance breakthrough
Solemn declaration: Eastmoney publishes this content to disseminate more information. It has nothing to do with the position of this website and does not constitute investment advice. Act accordingly at your own risk.
