filename     : Ott22a.pdf
entry        : article
conference   : Pacific Graphics 2022, Kyoto, Japan, October 5-8, 2022
pages        : 611-622
year         : 2022
month        : October
title        : Learning Dynamic 3D Geometry and Texture for Video Face Swapping
subtitle     : 
author       : C. Otto, J. Naruniec, L. Helminger, T. Etterlin, G. Mignone, P. Chandran, G. Zoss, C. Schroers, M. Gross, P. Gotardo, D. Bradley, R. Weber
booktitle    : 
ISSN/ISBN    : 1467-8659
editor       : N. Umetani, E. Vouga, C. Wojtan
publisher    : The Eurographics Association and John Wiley & Sons Ltd.
publ.place   : Computer Graphics Forum
volume       : 41
issue        : 7
language     : English
keywords     : Image manipulation, Rendering, Neural Networks
abstract     : Face swapping is the process of applying a source actor's appearance to a target actor's performance in a video. This is a challenging visual effect that has seen increasing demand in film and television production. Recent work has shown that data-driven methods based on deep learning can produce compelling effects at production quality in a fraction of the time required for a traditional 3D pipeline. However, the dominant approach operates only on 2D imagery without reference to the underlying facial geometry or texture, resulting in poor generalization under novel viewpoints and little artistic control. Methods that do incorporate geometry rely on pre-learned facial priors that do not adapt well to particular geometric features of the source and target faces. We approach the problem of face swapping from the perspective of learning simultaneous convolutional facial autoencoders for the source and target identities, using a shared encoder network with identity-specific decoders. The key novelty in our approach is that each decoder first lifts the latent code into a 3D representation, comprising a dynamic face texture and a deformable 3D face shape, before projecting this 3D face back onto the input image using a differentiable renderer. The coupled autoencoders are trained only on videos of the source and target identities, without requiring 3D supervision. By leveraging the learned 3D geometry and texture, our method achieves face swapping with higher quality than when using off-the-shelf monocular 3D face reconstruction, and overall lower FID score than state-of-the-art 2D methods. Furthermore, our 3D representation allows for efficient artistic control over the result, which can be hard to achieve with existing 2D approaches.
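
The shared-encoder, identity-specific-decoder arrangement described in the abstract can be illustrated with a minimal structural sketch. This is not the authors' implementation: the class name `CoupledAutoencoders`, the helpers `make_linear` and `apply_linear`, and the use of small random linear maps in place of trained convolutional networks are all assumptions made for illustration. The sketch also omits the paper's key step, in which each decoder first produces a dynamic texture and deformable 3D shape and then projects them through a differentiable renderer. It shows only the data flow: reconstruction decodes a frame with its own identity's decoder, while a swap encodes a target frame and decodes it with the source decoder.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random n_out x n_in weight matrix (hypothetical stand-in for a trained network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class CoupledAutoencoders:
    """One encoder shared by all identities, one decoder per identity.

    Assumption: toy linear layers replace the convolutional networks, and
    the 3D lifting plus differentiable rendering stage is left out.
    """

    def __init__(self, identities, n_pixels=8, n_latent=3, seed=0):
        rng = random.Random(seed)
        self.encoder = make_linear(n_pixels, n_latent, rng)
        self.decoders = {
            name: make_linear(n_latent, n_pixels, rng) for name in identities
        }

    def encode(self, frame):
        return apply_linear(self.encoder, frame)

    def decode(self, latent, identity):
        return apply_linear(self.decoders[identity], latent)

    def reconstruct(self, frame, identity):
        # Training-time path: the frame is decoded with its own identity's decoder.
        return self.decode(self.encode(frame), identity)

    def swap(self, target_frame, source_identity):
        # Inference-time path: the target performance is encoded, then
        # decoded with the source identity's decoder.
        return self.decode(self.encode(target_frame), source_identity)


model = CoupledAutoencoders(["source", "target"])
target_frame = [0.1 * i for i in range(8)]  # flattened toy "image"
swapped = model.swap(target_frame, "source")
```

Because the encoder is shared, the latent code carries the performance (expression, pose) in a form that either decoder can consume, which is what makes the swap a single change of decoder at inference time.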