

SenseTime develops an AI capable of creating realistic deepfake videos from an audio clip

Amelia Fort


Artificial intelligence can be applied in countless fields, and one of the most controversial is video manipulation. These manipulated clips, known as deepfakes, pose a challenge for large social platforms such as Facebook, and they keep improving and becoming harder to detect. Proof of this is the new AI from SenseTime, the Hong Kong technology giant, which is capable of creating realistic deepfakes from an audio clip.

Summarizing its operation, the AI detects elements such as expression, geometry and face pose in each frame of a video. Then, as the authors of the paper explain, "a recurrent network is introduced to translate the source audio into expression parameters that are related to the audio content." These expression parameters are used to synthesize a "photo-realistic human" in each frame of the video, "with the movement of the regions of the mouth accurately mapped to the source audio."
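The idea of a recurrent network mapping per-frame audio features to expression parameters can be sketched in a few lines. This is only an illustrative toy, not SenseTime's model: the dimensions, feature types (e.g. MFCC-like audio features, 3DMM-style expression coefficients) and the randomly initialised weights standing in for a trained network are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-frame audio features -> expression coefficients.
AUDIO_DIM, HIDDEN_DIM, EXPR_DIM = 13, 32, 10

# Random weights stand in for a trained recurrent network.
W_xh = rng.normal(scale=0.1, size=(HIDDEN_DIM, AUDIO_DIM))
W_hh = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
W_hy = rng.normal(scale=0.1, size=(EXPR_DIM, HIDDEN_DIM))

def audio_to_expression(audio_frames):
    """Map a sequence of audio feature frames to per-frame expression parameters."""
    h = np.zeros(HIDDEN_DIM)
    expressions = []
    for x in audio_frames:
        h = np.tanh(W_xh @ x + W_hh @ h)  # recurrent state carries temporal context
        expressions.append(W_hy @ h)      # per-frame expression coefficients
    return np.array(expressions)

# 25 frames of 13-dimensional audio features -> 25 sets of expression parameters.
audio = rng.normal(size=(25, AUDIO_DIM))
expr = audio_to_expression(audio)
print(expr.shape)  # (25, 10)
```

The recurrence is what makes the mapping temporal rather than frame-by-frame: each output depends on the audio heard so far, which is how mouth motion can anticipate and follow the speech signal.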

What does this mean in practice? The generated video emulates the facial expressions interpreted from the original audio clip while respecting the pose and characteristics of the subject's face, resulting in a realistic video that, as the authors of the study were able to verify, is difficult for viewers to detect at a glance.

Mapping a video using audio as a source

Diagram of the SenseTime AI pipeline. As you can see, the AI links the audio clip with facial expressions and then applies those expressions to a video.

The methodology the researchers followed is relatively simple. You can see an outline in the image above, and it can be summarized in three steps:

  1. Register a 3D parametric facial model that captures, as mentioned, the geometry of the face, the pose and the expression parameters in each frame of the video.
  2. An audio-to-expression translation network "learns" the mapping from the source audio to the expression parameters. An Audio ID-Removing Network is applied on top of this, which serves to eliminate the large variations that appear when using audio from different people. This matters because the few video datasets available include different subjects, each with their own accent and tone.
  3. Finally, a reconstructed 3D facial mesh is generated using the landmarks of the mouth region in each frame. Put another way, the generated face moves its features and mouth to simulate saying what is said in the original audio, making the video photo-realistic.
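The intuition behind the ID-removing step in step 2 can be illustrated with a simple per-speaker normalisation: subtracting each speaker's mean feature vector strips out a static "identity" component (overall tone, pitch baseline) while keeping the frame-to-frame variation driven by content. The paper's actual network is learned end to end; this hand-built mean-centering is only an analogy, and the feature shapes are made up.

```python
import numpy as np

def remove_speaker_identity(features_by_speaker):
    """Center each speaker's audio features on that speaker's own mean.

    The per-speaker mean captures static traits of the voice; what remains
    is the frame-to-frame variation driven by what is being said.
    """
    return {spk: feats - feats.mean(axis=0)
            for spk, feats in features_by_speaker.items()}

rng = np.random.default_rng(1)
# Two hypothetical speakers producing the same content variation
# on top of different baseline "voices".
content = rng.normal(size=(50, 13))
speakers = {
    "low_voice": content + np.full(13, -2.0),
    "high_voice": content + np.full(13, +2.0),
}
normalized = remove_speaker_identity(speakers)

# After centering, the two speakers' features coincide:
# identity removed, content preserved.
diff = np.abs(normalized["low_voice"] - normalized["high_voice"]).max()
print(diff)  # effectively zero
```

This is why the step matters for training: with identity variation removed, audio clips from many different subjects can supervise a single audio-to-expression mapping.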

In other words, SenseTime's artificial intelligence can take a clip of anyone and make them appear to say anything, respecting the subject's pose and movements while applying facial expressions drawn from the audio clip. Notably, the AI even works across different poses: at 2:36 in the video below you can see an example. The results are remarkably realistic, to the point that the AI can make a person sing (3:26 in the video below).

The video above shows several examples, and the detail in the texture of the face, the teeth, the lip movement, the facial lines and even the dimples is striking. The model, however, is not perfect: it cannot imitate emotions or estimate the feelings expressed in the source audio clip; it only reproduces the associated facial expressions.

Likewise, the model ignores language, which means that some phonemes, such as "z" (whose pronunciation requires putting the tongue between the teeth), are not emulated naturally. Finally, the researchers note that the model tends to produce worse results when the original audio clip has a strong accent. They give the example of an English speaker with a Russian accent, whose audio clip does not sync well with the AI-synthesized 3D mesh.

In this GIF, the generated subject is saying "many to one results," and you can see how the mouth and face gestures match the original audio. You can almost read the subject's lips to know what he is saying – VentureBeat

Be that as it may, the clips were evaluated by showing them to 100 volunteers, who had to mark whether each video was real or synthesized. In total there were 168 videos, half fake and half real; AI-generated videos were tagged as real 55% of the time, while real videos were tagged as real 70.1% of the time.
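A quick back-of-the-envelope check puts those figures in perspective. The per-class rates and totals come from the study as reported above; the overall accuracy is derived, assuming the fakes and real videos were split evenly.

```python
total = 168
fake = real = total // 2   # 84 synthesized, 84 genuine

fake_tagged_real = 0.55    # synthesized videos that fooled viewers
real_tagged_real = 0.701   # genuine videos correctly identified as real

# A viewer answers "correctly" on fake->fake and real->real.
correct = fake * (1 - fake_tagged_real) + real * real_tagged_real
accuracy = correct / total
print(round(accuracy, 3))
```

Overall, viewers were right only about 58% of the time, and on the synthesized clips alone they did worse than a coin flip.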

This is an interesting project that could be used, as the researchers say, "to advance video editing." However, they are also aware that it has "potential" to be "misused or abused" for different purposes, such as media manipulation or the spread of malicious propaganda. Precisely for this reason, they state, "we defend and strongly support all safeguard measures against such exploitation practices" and "we welcome the enactment and application of legislation that requires all edited videos to be clearly labeled as such."