Snap, Smile, Repeat: The Game-Changing Tech That Brings Your Emotions to Life

November 7, 2024 · Catherine Williams, Chief Editor · Entertainment

No need to study "expression management"? ByteDance's new technology lets you instantly "transfer" your emotions by uploading a single picture.

Zhidongxi (official account: zhidxcom)
Author | Cheng Qian
Editor | Mo Ying

Competition in video generation is growing increasingly fierce, yet rendering the details of human facial expressions delicately and accurately remains a major difficulty.

In film and television, as in everyday communication, the facial expressions that accompany speech are key to conveying information accurately. For video generation to make a character's overall performance smooth and natural, it must render details such as movement, skin texture, and muscle motion with great delicacy.

This is not easy for AI. A recent research breakthrough in portrait generation offers a solution to this problem.

This is X-Portrait 2, a single-image, video-driven portrait animation technology recently proposed by ByteDance's intelligent creation team. With just one static picture and one driving video, users can obtain high-quality, film-grade video clips.

For example, if I upload a clip of actor Jin Shijie in "Empire of Silver" along with an AI-generated still of a purple-haired girl, the girl directly copies the performance from the movie clip.

As the image below shows, the character in the static picture and the one in the driving video look very different, and even though the driving expressions include laughing, opening the mouth, and so on, X-Portrait 2's final output is unaffected: it transfers only the changes in facial expression and head movement.

Vivid, rich expressions are key to shaping a character's personality, and portrait generation technology is clearly advancing toward ever more precise simulation of human micro-expressions.

1. Classic shots can be reproduced in seconds, and faces stay undistorted when laughing or turning the head

Meticulous expressions are often the key for actors to convey emotions, and now this job can also be taken over by AI.

To start my hands-on test of this technology, I set the difficulty low with expressions that involve only a few facial features, such as blinking and smiling. The test here was whether X-Portrait 2 could make the character in the still image engage the right facial features and convey the emotion accurately.

I am sure many people still remember the scene in "A Chinese Odyssey" where Fairy Zixia blinks, widely considered a blink that is hard to surpass. What happens if this expression is moved onto the face of the famous meme character "Curator Jin"?

In the final generated video, Curator Jin's eyes widen, and he goes from pursed lips to blinking in one smooth motion, with no facial distortion at all, directly recreating the classic scene.

So what about putting Curator Jin's classic laughing meme on someone else's face? I used Doubao to generate an image of a distinctly sci-fi character, then uploaded a clip of Curator Jin going from laughing to talking.

The character in the static picture not only imitates Curator Jin's laughing expression but also reproduces the wrinkles on his face and the slight bobbing of his head as he laughs.

After testing a single expression, let’s look at the advanced difficulty.

The characters in the original videos at this level change emotion while speaking. For example, the next video is a behind-the-scenes clip of Zhang Yi performing, going from just beginning to speak to turning his head and laughing.

Then I uploaded a still photo of the American actor Ben Affleck. In the generated video, Ben's mouth matches Zhang Yi's at exactly the same angle when they laugh, and the turn from profile to frontal view is also very smooth.

2. A fantasy crossover between Avatar and Thanos: anyone can pull off Disney-princess expressions

Beyond making a picture move in the style you want, X-Portrait 2 can also transfer the same expression directly onto characters of very different styles.

Building on this, I created a dream crossover between the classic science-fiction film "Avatar" and Thanos from the Marvel series.

I uploaded a video of the heroine Neytiri in a heated argument with someone, together with a still picture of Thanos. In the video, Neytiri looks sorrowful as she backs away.

Thanos shows the same emotion, the wrinkles on his forehead gradually deepening as the feeling builds.

The expressions and movements of Disney princesses in animated films have become a style of their own; one glance puts viewers in the "Disney universe." Online, some bloggers have taken up the challenge of imitating Disney princesses with lifelike expressions. Now X-Portrait 2 lets anyone pick up this skill in seconds.

Here I uploaded an AI-generated animated-character image along with a parody video posted by a blogger on a short-video platform. Even though the blogger's eyes, mouth, and overall expression in the original video are highly exaggerated, X-Portrait 2 does not fall apart at this difficulty level.

I also uploaded imitation videos from other bloggers. The result: the princess, originally just a static picture, seemed to step straight into a fairy tale, with an adorably lifelike look of curiosity and delight.

Nowadays many animated films are adapted into live-action movies, but such adaptations leave fans of the original uneasy about casting, plot changes, and performances, because many scenes are hard for real actors to play; some expressions, movements, and even plot points end up being rewritten.

Now, with X-Portrait 2, you can directly "copy" the expressions of anime characters and "paste" them onto other characters. I uploaded a clip of the Beast from "Beauty and the Beast"; in the video, the Beast has human-like facial features and roars as he moves.

This performance was copied accurately onto a picture I generated with AI. X-Portrait 2's expression recognition was not thrown off; the eye and mouth movements changed smoothly, replicating the Beast's anger.

As these examples show, X-Portrait 2's realism in expression generation comes through in eye and mouth movements, expression transitions, coordinated motion, and more, allowing the expressions of a static image to move in concert with other actions.

3. Expression encoder + generative diffusion model: a leap in expression "reproduction"

The stunning portrait animations above were all produced by X-Portrait 2.

In March this year, ByteDance released its first-generation portrait animation model, X-Portrait, which generates expressive, temporally coherent portrait animations. X-Portrait 2 is an iteration of that model that faithfully reproduces rapid head movements, subtle expression changes, and strong personal emotions.

To make the expressions in the final generated video smoother and more realistic, X-Portrait 2 combines an expression encoder model with a generative diffusion model. It can capture the actor's subtle expressions in the driving video; even expressions that engage several facial features at once, such as pouting or sticking out the tongue, are conveyed accurately.

This expression encoder is trained on a large dataset and implicitly encodes every tiny expression in the input, enabling accurate expression transfer.
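The two-stage design described above can be sketched as a toy pipeline: an encoder compresses each driving frame into a small implicit expression latent, and a generator (standing in for the diffusion model) renders the reference portrait conditioned on that latent. All class names, shapes, and the linear maps below are illustrative assumptions for exposition, not ByteDance's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class ExpressionEncoder:
    """Maps a driving-video frame to an implicit expression latent.

    A real encoder would be a trained network; a random linear
    projection stands in for it here.
    """
    def __init__(self, frame_dim=64 * 64, latent_dim=32):
        self.W = rng.standard_normal((latent_dim, frame_dim)) * 0.01

    def encode(self, frame):
        # Implicit encoding: no explicit landmarks, just a learned
        # projection into a compact expression code.
        return np.tanh(self.W @ frame.ravel())

class ConditionalGenerator:
    """Stand-in for the diffusion model: renders the reference
    identity under the expression latent from the encoder."""
    def __init__(self, frame_dim=64 * 64, latent_dim=32):
        self.U = rng.standard_normal((frame_dim, latent_dim)) * 0.01

    def generate(self, reference, expr_latent):
        # Identity comes from `reference`; only the expression
        # offset is driven by the latent code.
        return reference + (self.U @ expr_latent).reshape(reference.shape)

encoder = ExpressionEncoder()
generator = ConditionalGenerator()

reference = rng.standard_normal((64, 64))          # static portrait
driving_frames = rng.standard_normal((5, 64, 64))  # driving video clip

# One rendered frame per driving frame: same identity, new expression.
output = [generator.generate(reference, encoder.encode(f))
          for f in driving_frames]
print(len(output), output[0].shape)
```

The key property the sketch illustrates is the split of responsibilities: appearance is carried only by the reference image, while the per-frame latent carries only expression and motion.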

For the driving video, the encoder also achieves a strong separation between a character's appearance and their facial expressions, letting it focus on expression-related information in the video and thus transfer expressions accurately.

A filter layer designed into the model lets the encoder effectively filter identity-related signals out of the motion representation, so even when the identity picture and the driving video differ greatly in appearance and style, the model can still transfer motion across identities and styles, covering both realistic portraits and cartoon images.
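One simple way to picture this filtering idea: if certain directions of the motion embedding are known to carry identity information, projecting the embedding onto the orthogonal complement of those directions leaves only expression- and motion-related content. The learned filter layer in the paper is surely more sophisticated; the random basis and linear projection below are a minimal, purely illustrative sketch of the disentanglement principle.

```python
import numpy as np

rng = np.random.default_rng(1)

def filter_id_signal(motion_embed, id_basis):
    """Remove the component of `motion_embed` lying in the span of
    `id_basis` (columns = identity-related directions)."""
    Q, _ = np.linalg.qr(id_basis)            # orthonormalize the ID subspace
    return motion_embed - Q @ (Q.T @ motion_embed)

latent_dim = 32
id_basis = rng.standard_normal((latent_dim, 4))  # 4 identity directions
motion = rng.standard_normal(latent_dim)         # raw motion embedding

filtered = filter_id_signal(motion, id_basis)

# The filtered embedding is numerically orthogonal to every ID direction,
# so no identity signal survives to leak into the generated portrait.
print(np.abs(id_basis.T @ filtered).max() < 1e-8)
```

Because the filtered code contains no component along the identity directions, the same motion signal can drive a realistic portrait or a cartoon face without dragging the driving actor's appearance along with it.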

Currently, besides X-Portrait 2, the video-generation startup Runway launched a similar feature, Act-One, last month, which lets users record a video of themselves and transfer the performance onto an AI-generated character.

By comparison, X-Portrait 2 conveys head movements, changing smiles, and personal emotional expression more accurately. The video generated by Act-One can also convey expressions, but when the character's emotions shift and the head moves rapidly, the motion may not be "reproduced" accurately.

In the comparison video below, the character in the original footage is very sad and turns his head slightly while speaking, but the videos generated by X-Portrait and Act-One do not reflect this. The X-Portrait video reproduces the range of the head swing, yet in both videos the characters wear slight smiles, completely at odds with the emotion of the original.

Restoring facial details and coordinating head movements and posture are key to accurate expression generation, and this is where X-Portrait 2 currently holds its advantage.

Conclusion: helping video generation break through the difficulty of facial-expression detail

Among the many aspects of video generation, expression generation is particularly challenging: compared with generating a character's overall body movement, generating nuanced expressions is far harder. A subtle change in the facial muscles can convey a completely different emotion.

Although this technology is still at the academic-research stage, ByteDance's active exploration here is significant. Through continuous optimization of algorithms and model structure, X-Portrait 2 has demonstrated the ability to capture and reproduce subtle changes in human expression, an advance that will further expand the application boundaries of video generation.
