filename : Wit25a.pdf
entry : inproceedings
conference : MIG '25 Proceedings of the 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games
pages : 1-11
year : 2025
month : December
title : PhonemeNet: A Transformer Pipeline for Text-Driven Facial Animation
subtitle :
author : Philine Witzig, Barbara Solenthaler, Markus Gross, and Rafael Wampfler
booktitle : The 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games
ISSN/ISBN : 9798400722363
editor :
publisher : Association for Computing Machinery
publ.place : New York, NY, USA
volume :
issue :
language : English
keywords : Facial animation, Transformers, contrastive learning, emotion-content disentanglement, stylized characters
abstract : We present a fully text-driven framework for 3D facial animation that eliminates the need for audio input or explicit prosodic cues. Our architecture extracts rich phoneme embeddings from text using a pre-trained TTS encoder, aligns them with quantized motion embeddings via a transformer decoder, and decodes the result into mesh deformations through a pre-trained transformer decoder. We explore two scenarios for our pipeline: (1) In the single-subject setting, we find that phoneme embeddings alone can yield accurate lip motion. (2) In a multi-subject setting, where speaker articulation varies widely, we introduce stochastic latent modulation to model residual variability conditioned on both phoneme context and speaker identity. We evaluate our approach quantitatively and qualitatively: we demonstrate accurate lip sync in the single-subject case, and compare against audio-driven baselines on a large multi-subject dataset. Our results show that PhonemeNet not only achieves competitive lip sync and motion quality, but also offers flexibility, modularity, and scalability as an alternative to audio-driven facial animation.
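The abstract describes a pipeline in which phoneme embeddings from a pre-trained TTS encoder are aligned with motion embeddings by a transformer decoder and then decoded into mesh deformations. The following is a minimal PyTorch sketch of that dataflow only; all module names, dimensions, and the vertex count are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a text-driven facial animation pipeline:
# phoneme embeddings -> transformer-decoder alignment -> mesh deformations.
# Module names and sizes are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class PhonemePipelineSketch(nn.Module):
    def __init__(self, phoneme_dim=256, motion_dim=256, n_vertices=5023):
        super().__init__()
        # Stand-in for the pre-trained TTS phoneme encoder's output projection.
        self.phoneme_proj = nn.Linear(phoneme_dim, motion_dim)
        # Transformer decoder cross-attending from motion queries to phoneme context.
        layer = nn.TransformerDecoderLayer(d_model=motion_dim, nhead=8,
                                           batch_first=True)
        self.aligner = nn.TransformerDecoder(layer, num_layers=4)
        # Stand-in for the pre-trained motion decoder producing vertex offsets.
        self.mesh_head = nn.Linear(motion_dim, n_vertices * 3)

    def forward(self, phoneme_emb, motion_queries):
        # phoneme_emb:    (B, T_text, phoneme_dim)   - text-side memory
        # motion_queries: (B, T_frames, motion_dim)  - one query per animation frame
        memory = self.phoneme_proj(phoneme_emb)
        aligned = self.aligner(tgt=motion_queries, memory=memory)
        # Reshape flat offsets to per-vertex 3D displacements: (B, T_frames, V, 3)
        return self.mesh_head(aligned).view(*aligned.shape[:2], -1, 3)
```

Cross-attention here lets each animation frame attend over the whole phoneme sequence, which is one plausible way to realize the text-to-motion alignment the abstract names; the paper's quantized motion embeddings and stochastic latent modulation are not modeled in this sketch.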