PhonemeNet: A Transformer Pipeline for Text-Driven Facial Animation
Philine Witzig, B. Solenthaler, M. Gross, R. Wampfler
The 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games (Zurich, Switzerland, December 03-05, 2025), pp. 1-11
Abstract
We present a fully text-driven framework for 3D facial animation that eliminates the need for audio input or explicit prosodic cues. Our architecture extracts rich phoneme embeddings from text using a pre-trained TTS encoder, aligns them with quantized motion embeddings via a transformer decoder, and maps the result to mesh deformations with a pre-trained transformer decoder. We explore our pipeline in two settings: (1) In the single-subject setting, we find that phoneme embeddings alone can yield accurate lip motion. (2) In a multi-subject setting, where speaker articulation varies widely, we introduce stochastic latent modulation to model residual variability conditioned on both phoneme context and speaker identity. We evaluate our approach quantitatively and qualitatively: we demonstrate accurate lip sync in the single-subject case and compare against audio-driven baselines on a large multi-subject dataset. Our results show that PhonemeNet not only achieves competitive lip sync and motion quality but also offers a flexible, modular, and scalable alternative to audio-driven facial animation.
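The pipeline the abstract describes (phoneme encoder, cross-attending transformer decoder over quantized motion embeddings, mesh decoder) can be pictured roughly as in the minimal PyTorch sketch below. This is not the authors' implementation: the class and module names, dimensions, and the greedy codebook lookup are illustrative assumptions; in the paper the TTS encoder and the mesh decoder are pre-trained, and the quantization would typically be trained with a VQ-style objective rather than a plain argmax.

```python
# Minimal sketch (illustrative, not the authors' code) of a text-driven
# facial-animation pipeline: phoneme embeddings -> transformer decoder
# over quantized motion embeddings -> per-vertex mesh offsets.
import torch
import torch.nn as nn


class PhonemeToMotionSketch(nn.Module):
    def __init__(self, n_phonemes=80, d_model=256, n_codes=512,
                 n_vertices=5023, n_heads=4, n_layers=4):
        super().__init__()
        # Stand-in for a pre-trained TTS text encoder producing phoneme embeddings.
        self.phoneme_encoder = nn.Sequential(
            nn.Embedding(n_phonemes, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
                num_layers=2,
            ),
        )
        # Transformer decoder whose per-frame queries cross-attend to phoneme embeddings.
        self.motion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Learned codebook of quantized motion embeddings (VQ-style, simplified here).
        self.codebook = nn.Embedding(n_codes, d_model)
        self.to_code_logits = nn.Linear(d_model, n_codes)
        # Stand-in for a pre-trained mesh decoder: code embeddings -> per-vertex offsets.
        self.mesh_decoder = nn.Linear(d_model, n_vertices * 3)
        self.n_vertices = n_vertices

    def forward(self, phoneme_ids, motion_queries):
        # phoneme_ids: (B, T_text) phoneme indices from a grapheme-to-phoneme step.
        # motion_queries: (B, T_frames, d_model) learned per-frame query embeddings.
        phone_emb = self.phoneme_encoder(phoneme_ids)             # (B, T_text, d)
        hidden = self.motion_decoder(motion_queries, phone_emb)   # (B, T_frames, d)
        # Greedy codebook lookup; real VQ training would use a straight-through estimator.
        code_ids = self.to_code_logits(hidden).argmax(-1)
        quantized = self.codebook(code_ids)                       # (B, T_frames, d)
        offsets = self.mesh_decoder(quantized)                    # (B, T_frames, V*3)
        return offsets.view(*offsets.shape[:2], self.n_vertices, 3)


# Toy usage: 12 phonemes driving 30 animation frames.
model = PhonemeToMotionSketch()
phonemes = torch.randint(0, 80, (1, 12))
queries = torch.randn(1, 30, 256)
print(model(phonemes, queries).shape)  # torch.Size([1, 30, 5023, 3])
```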