EmoSpaceTime: Decoupling Emotion and Content through Contrastive Learning for Expressive 3D Speech Animation
P. Witzig, B. Solenthaler, M. Gross, R. Wampfler
The 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (Arlington, VA, USA, November 21-23, 2024), pp. 1-12
Abstract
Equipping stylized conversational characters with facial animations tailored to specific emotions enhances coherence and authenticity. Many data-driven speech animation methods rely on explicit semantic control signals and therefore produce largely static emotional expressions rather than dynamic ones. We present a Transformer autoencoder (Transformer-AE) that disentangles emotion and content within the facial motion latent space. Our method processes animation control parameters in the frequency domain, enabling a finer-grained, frequency-based separation of emotion and content. Through contrastive learning, the model is encouraged to learn similar representations for clips that share an emotional state and for clips with the same linguistic content. By capturing the full spatial and temporal dynamics of an emotional episode, this approach enables emotion swapping, enhances expressiveness, and gives artists fine control over emotion, e.g., through emotion interpolation. Our analyses show that the Transformer-AE effectively separates emotion from content, enabling more nuanced and realistic facial animation for conversational characters.
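To make the decoupling idea concrete, the sketch below illustrates the ingredients named in the abstract: encoding animation control parameters in the frequency domain with a Transformer autoencoder, splitting the latent into emotion and content parts, a contrastive term that pulls together clips with the same linguistic content, and recombining latents to swap or interpolate emotions. This is not the authors' implementation; all module sizes, layer counts, the InfoNCE-style loss, and names such as EmotionContentAE, z_emo, and z_con are assumptions made for illustration only.

```python
# Illustrative sketch (assumptions throughout, not the paper's code): a Transformer
# autoencoder over frequency-domain animation parameters with an emotion/content
# latent split, a contrastive content loss, and latent recombination at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionContentAE(nn.Module):
    """Transformer autoencoder whose latent is split into emotion and content parts."""

    def __init__(self, n_params=64, d_model=128, d_emo=32, d_con=96, n_layers=4):
        super().__init__()
        self.inp = nn.Linear(2 * n_params, d_model)     # real + imaginary parts per parameter
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_emo = nn.Linear(d_model, d_emo)          # emotion sub-space
        self.to_con = nn.Linear(d_model, d_con)          # content sub-space
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)  # self-attention stack as decoder
        self.from_lat = nn.Linear(d_emo + d_con, d_model)
        self.out = nn.Linear(d_model, 2 * n_params)

    def encode(self, x):
        # x: (batch, frames, n_params) animation control parameters.
        spec = torch.fft.rfft(x, dim=1)                          # move to frequency domain
        feats = torch.cat([spec.real, spec.imag], dim=-1)        # (batch, freq_bins, 2*n_params)
        h = self.encoder(self.inp(feats))
        return self.to_emo(h), self.to_con(h), x.shape[1]

    def decode(self, z_emo, z_con, n_frames):
        h = self.decoder(self.from_lat(torch.cat([z_emo, z_con], dim=-1)))
        real, imag = self.out(h).chunk(2, dim=-1)
        return torch.fft.irfft(torch.complex(real, imag), n=n_frames, dim=1)


def nt_xent(a, b, tau=0.1):
    """Simple InfoNCE-style loss: matching rows of a and b are treated as positives."""
    a = F.normalize(a.mean(dim=1), dim=-1)   # pool over frequency bins
    b = F.normalize(b.mean(dim=1), dim=-1)
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))


if __name__ == "__main__":
    model = EmotionContentAE()
    # Two hypothetical mini-batches: the same sentences performed with two emotions.
    x_happy = torch.randn(4, 120, 64)        # (clips, frames, control parameters)
    x_sad = torch.randn(4, 120, 64)

    e1, c1, T = model.encode(x_happy)
    e2, c2, _ = model.encode(x_sad)

    recon = model.decode(e1, c1, T)
    loss = (
        F.mse_loss(recon, x_happy)           # reconstruction term
        + nt_xent(c1, c2)                    # same content -> similar content latents
        # an analogous contrastive term would pair clips sharing an emotional state
    )
    print("loss:", loss.item())

    # Emotion swapping: sad emotion latent combined with the happy clips' content.
    swapped = model.decode(e2, c1, T)
    # Emotion interpolation: blend emotion latents for artist control.
    blended = model.decode(0.5 * e1 + 0.5 * e2, c1, T)
```

In this sketch, only the emotion latent is swapped or blended while the content latent stays fixed, which is the mechanism that would give artists the kind of control over emotion described in the abstract.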