Audio-Driven Video Synthesis of Personalized Moderations

Duration

18 months

Status

Completed

MTC Team

Alberto Pennino, Dr. Clara Fernandez Labrador, Marc Willhaus, Dr. Severin Klingler, Prof. Markus Gross

Collaborators

Christian Vogg (SRG), Florian Notter (SRF), Prof. Martin Zimper (ZHdK)


Photo by Christian Gertenbach on Unsplash

Recent advances in image and video synthesis have enabled the digital modeling and reconstruction of humans. An exciting use case is creating videos of a person speaking content they have never said, which requires a faithful reproduction of their appearance, including novel viewpoints, expressions, and head poses. These technologies have great potential for creating personalized moderations (e.g. a personal news or weather anchor) and for updating moderations faster and at lower cost. In this project we explore this technology to generate synthetic moderations relevant to the viewer's location, with a special focus on modeling the dynamics of the human face and the transitions between different camera viewpoints in the scene.

Goals

The overall goal of this project is to generate synthetic moderations from input audio and a set of sample videos of a moderator. To obtain convincing results, we pay particular attention to lip synchronization and to generating natural, coherent head movements that match the spoken text. We further synthesize different camera positions to provide a more realistic and dynamic experience for the viewer. Additionally, we explore novel neural rendering techniques with the goal of synthesizing high-resolution photorealistic videos. The project can be divided into the following steps:

Building a baseline model for a specific speaker. We aim to improve existing lip-synchronization technologies and focus on a single-speaker model to achieve higher-quality videos. This first step serves as an initial estimate of the quality that can be expected in terms of resolution and lip movement. However, synchronizing only the lip movement results in videos whose motion closely follows the original input videos.
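
As a rough illustration of what such a baseline involves, the sketch below maps a short window of audio features (e.g. mel-spectrogram frames) to per-frame mouth parameters with a small temporal network. The architecture, feature dimensions, and parameter counts are placeholder assumptions for illustration, not the model used in the project.

```python
# Minimal sketch of a single-speaker lip-sync baseline (an illustration only).
# A short window of audio features around each video frame is mapped to mouth
# parameters (e.g. blendshape or landmark coefficients) that would later drive
# the mouth-region renderer.

import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, n_mels: int = 80, n_mouth_params: int = 20):
        super().__init__()
        # Temporal convolutions over the audio window, then a per-frame regressor.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # summarize the window
        )
        self.head = nn.Linear(128, n_mouth_params)

    def forward(self, mel_window: torch.Tensor) -> torch.Tensor:
        # mel_window: (batch, n_mels, window) -> (batch, n_mouth_params)
        feat = self.encoder(mel_window).squeeze(-1)
        return self.head(feat)

# Dummy usage: one window of 16 mel frames around the target video frame.
model = AudioToMouth()
mouth_params = model(torch.randn(1, 80, 16))
print(mouth_params.shape)  # torch.Size([1, 20])
```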

Supporting the synthesis of head movements, including rotations. This creates a more realistic and varied performance than rendering only a fixed frontal view, and it can also be used to synthesize transitions between different camera positions. To achieve this, we explore neural rendering techniques as well as a 3D face model of the speaker, which lets us modify the movements in 3D space before rendering back to image space.
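
The snippet below sketches this 3D-manipulation idea under simple pinhole-camera assumptions: the face points are rotated and translated in 3D and then projected back to image space, where they could condition a neural renderer or an edge map. The camera intrinsics, landmark count, and angles are made up for illustration.

```python
# Minimal sketch of applying a new head pose to a 3D face model before
# rendering back to image space (illustrative values, not project parameters).

import numpy as np

def euler_to_rotation(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Build a rotation matrix from yaw/pitch/roll angles (radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    return Rz @ Ry @ Rx

def project(points_3d: np.ndarray, pose_R: np.ndarray, pose_t: np.ndarray,
            focal: float = 1000.0, center: tuple = (512, 512)) -> np.ndarray:
    """Apply a head pose to 3D face points and project them with a pinhole camera."""
    posed = points_3d @ pose_R.T + pose_t            # (N, 3) in camera coordinates
    x = focal * posed[:, 0] / posed[:, 2] + center[0]
    y = focal * posed[:, 1] / posed[:, 2] + center[1]
    return np.stack([x, y], axis=1)                  # (N, 2) pixel coordinates

# Dummy usage: a stand-in 3D face shape rotated 10 degrees to the side.
face_points = np.random.randn(68, 3) * 0.05
R = euler_to_rotation(yaw=np.deg2rad(10), pitch=0.0, roll=0.0)
t = np.array([0.0, 0.0, 1.5])                        # place the face in front of the camera
print(project(face_points, R, t).shape)              # (68, 2)
```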

Combining the intermediate results into a full framework. This combined framework can then generate a moderation from given audio across various camera positions.
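
The following hypothetical sketch shows how such a combination could be orchestrated: per-frame audio features drive mouth and head-pose predictors, and a renderer produces each frame for the requested camera position. All function names, shapes, and the stand-in components are placeholders, not the project's actual interfaces.

```python
# Hypothetical orchestration of the three steps into one framework.

from typing import Callable, List, Sequence
import numpy as np

def synthesize_moderation(
    audio_features: Sequence[np.ndarray],               # one feature window per video frame
    camera_ids: Sequence[int],                           # desired camera per frame
    audio_to_mouth: Callable[[np.ndarray], np.ndarray],
    audio_to_head_pose: Callable[[np.ndarray], np.ndarray],
    render_frame: Callable[[np.ndarray, np.ndarray, int], np.ndarray],
) -> List[np.ndarray]:
    frames = []
    for window, cam in zip(audio_features, camera_ids):
        mouth = audio_to_mouth(window)                    # lip synchronization (step 1)
        pose = audio_to_head_pose(window)                 # head motion (step 2)
        frames.append(render_frame(mouth, pose, cam))     # view-dependent rendering
    return frames

# Dummy usage with stand-in components producing placeholder outputs.
frames = synthesize_moderation(
    audio_features=[np.random.randn(80, 16) for _ in range(4)],
    camera_ids=[0, 0, 1, 1],                              # switch camera halfway
    audio_to_mouth=lambda w: np.random.randn(20),
    audio_to_head_pose=lambda w: np.random.randn(6),      # rotation + translation
    render_frame=lambda m, p, c: np.zeros((512, 512, 3), dtype=np.uint8),
)
print(len(frames), frames[0].shape)                       # 4 (512, 512, 3)
```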

Outcomes

During this project we explored multiple techniques for lip synchronization and head-motion generation. We present two different approaches to the audio-driven video synthesis of personalized moderations.

  1. Our first implementation is an end-to-end pipeline based on the paper Neural Voice Puppetry. This approach overwrites the original video by generating novel mouth regions from the new input audio signal.
  2. Our second implementation tackles head-motion generation similarly to Live Speech Portraits. The new input audio signal is used to generate novel mouth and head movements, and the final video is then generated entirely via image-to-image translation from an animated edge map; a rough sketch of this pipeline follows below.
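
To make the second pipeline concrete, the sketch below rasterizes (randomly generated) face landmarks into a conditioning edge map and passes it through a tiny stand-in generator. In practice this role would be played by a trained pix2pix/U-Net-style image-to-image translation network; the network, landmark count, and image size here are assumptions for illustration only.

```python
# Minimal sketch of the edge-map -> frame step (not the project's trained networks).

import numpy as np
import torch
import torch.nn as nn

def rasterize_landmarks(landmarks_2d: np.ndarray, size: int = 256) -> torch.Tensor:
    """Stamp 2D landmarks into a one-channel conditioning map."""
    canvas = np.zeros((size, size), dtype=np.float32)
    pts = np.clip(landmarks_2d.astype(int), 0, size - 1)
    canvas[pts[:, 1], pts[:, 0]] = 1.0
    return torch.from_numpy(canvas)[None, None]           # (1, 1, H, W)

class TinyGenerator(nn.Module):
    """Stand-in for an image-to-image translation generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),     # RGB output in [-1, 1]
        )

    def forward(self, edge_map: torch.Tensor) -> torch.Tensor:
        return self.net(edge_map)

# Dummy usage: landmarks predicted from audio (random here) -> edge map -> frame.
landmarks = np.random.rand(68, 2) * 256
frame = TinyGenerator()(rasterize_landmarks(landmarks))
print(frame.shape)                                        # torch.Size([1, 3, 256, 256])
```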

Additional Project Resources

Resources for Industry Partners

Additional project resources for our industry partners are only available to registered users and can be found here.

Project Demo Applications

Project demos are hosted on our project demo dashboard and are accessible to registered users only.
