The interest in conversational digital characters has grown significantly in recent years, spanning industries such as healthcare, education, and entertainment. Thanks to progress in natural language processing and machine learning, these characters can participate in meaningful conversations with users, offering a more engaging and immersive experience. Our research develops data-driven methods for affect-aware conversational agents, speech synthesis, and animation to make interactions with digital characters more natural, engaging, and rewarding. Affect awareness is at the core of all our projects: by relying on affective states, interactions with digital characters can be tailored to the emotions of the user.
The Digital Einstein platform is a cutting-edge setup that merges conversational digital characters with a physical environment. This setup enables users to converse interactively with a digital, stylized representation of Albert Einstein, featuring dynamic expressions and body language for a more engaging and immersive experience. Styled to echo the early 20th century, the platform includes an armchair, a carpet, a table, and various decorative items such as a lamp, books, a Mozart bust, and a pocket watch. It also incorporates a screen framed in wood, several speakers, and a media box, all discreetly placed behind the screen. Users can interact with the Digital Einstein using a microphone cleverly hidden in the top book on the table, while a camera concealed in the frame's top scans their movements and reactions. The arrangement of speakers (two in the chair, one below the table, and two behind the screen) creates a comprehensive spatial audio experience. Speech input is processed into text using Microsoft Azure's speech recognition technology.
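For illustration, a minimal sketch of how such a transcription step could look with Azure's Speech SDK for Python is given below; the subscription key, region, and language are placeholders rather than the platform's actual configuration.

```python
# Minimal sketch, assuming the Azure Speech SDK for Python; key, region, and
# language are placeholders, not the platform's actual configuration.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"

# Capture audio from the default microphone (the microphone hidden in the book).
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()  # blocks until one utterance is finished
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    user_text = result.text  # handed to the dialog system as the user's turn
    print(user_text)
```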
Digital Einstein 1.0
The software uses machine learning to understand the user's intent and employs a dialog algorithm to generate the most appropriate response. This system is enhanced with motion-captured animations for facial expressions and body movements, along with pre-recorded speech mimicking Einstein's distinctive German accent. The responses stem from a dialog tree, which spans topics such as Einstein's famous theories, his academic pursuits, and his personal relationships. The algorithm introduces a degree of randomness to keep conversations dynamic and unpredictable. For example, if Einstein notices a lack of engagement, he will quickly ask if the user is distracted or uninterested. If he doesn't understand something, he will ask, with both courtesy and humor, for the question to be rephrased.
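The sketch below illustrates the idea of a dialog tree with randomized phrasings; the topics, responses, and structure are invented for illustration and do not reflect the production dialog tree.

```python
# Illustrative sketch of a dialog tree with randomized phrasings; topics and
# responses are invented and do not reflect the production dialog tree.
import random

DIALOG_TREE = {
    "theory_of_relativity": [
        "Ah, relativity! Time passes quite differently next to good company.",
        "Space and time are not as absolute as they appear, my friend.",
    ],
    "academic_life": [
        "The patent office in Bern left me plenty of time for thought experiments.",
    ],
    "fallback": [
        "I beg your pardon, would you kindly rephrase that for an old physicist?",
    ],
}

def respond(intent: str) -> str:
    """Select the node matching the classified intent and sample one phrasing."""
    phrasings = DIALOG_TREE.get(intent, DIALOG_TREE["fallback"])
    return random.choice(phrasings)

print(respond("theory_of_relativity"))
```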
Digital Einstein 2.0
To make the conversation more open and engaging, users can choose between a chatbot based on GPT-4o and one based on Llama-3 8B. We guide these chatbots toward the topics of the dialog tree through prompt engineering for the GPT-4o chatbot and by fine-tuning the Llama-3 chatbot on synthetic Einstein conversations created with GPT-4. User characteristics (age, gender, appearance, re-identification) and behavior (attention) are analyzed from webcam data and used by the chatbot during the conversation. The responses are vocally rendered by a neural model from Microsoft Azure, fine-tuned on over 2,000 recordings of an artist emulating Einstein's voice and capable of expressing eight emotional tones, such as anger, excitement, or sadness. The facial animations are generated by a data-driven model, while the corresponding body animations blend motion-captured movements based on the avatar's status (idle, listening, or speaking). Additionally, to visually represent the topic of the conversation, we continually generate and display an image, using GPT-4o to formulate a prompt and Midjourney to render the image. The latest version of Digital Einstein is now also accessible on iPads as a mobile app and on the web at http://digital-einstein.ethz.ch.
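As a rough illustration of the prompt-engineering approach, the sketch below constrains a GPT-4o chatbot to Einstein's persona and the dialog-tree topics via a system prompt; the prompt text and the way user context is injected are simplified assumptions, not the deployed implementation.

```python
# Hedged sketch of the prompt-engineering idea; the system prompt, topic list,
# and user-context injection are simplified assumptions, not the deployed system.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are Albert Einstein. Answer in character, with courtesy and humor, and "
    "steer the conversation toward your theories, your academic pursuits, and "
    "your personal relationships. Known user context: {user_context}"
)

def einstein_reply(history: list[dict], user_context: str) -> str:
    """history holds prior turns as {'role': 'user'|'assistant', 'content': ...}."""
    messages = [{"role": "system",
                 "content": SYSTEM_PROMPT.format(user_context=user_context)}]
    messages += history
    completion = client.chat.completions.create(model="gpt-4o", messages=messages)
    return completion.choices[0].message.content

reply = einstein_reply([{"role": "user", "content": "What is time?"}],
                       user_context="adult visitor, attentive")
```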
Awareness of affective states makes it possible to use emotional information as additional context when designing emotionally sentient systems.
Applications of such systems are manifold. For example, learning gains can be increased in educational settings through targeted interventions that adjust to the affective states of students.
Another application is enabling digital characters and smartphones to support enriched interactions that are sensitive to the user's context.
In our projects, we focus on data-driven models relying on lightweight data collection tailored to mobile settings.
We have developed systems for affective state prediction based on camera recordings (i.e., action units, eye gaze, eye blinks, and head movement), low-cost mobile biosensors (i.e., skin conductance, heart rate, and skin temperature), handwriting data, and smartphone touch and sensor data.
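The sketch below illustrates the general recipe of feature-level fusion followed by a lightweight classifier; the feature dimensions, labels, and classifier choice are placeholders rather than our actual models.

```python
# Illustrative sketch of feature-level fusion for affective state prediction;
# feature dimensions, labels, and the classifier are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_windows = 500

# Stand-in features per time window: facial action units, eye gaze and blinks,
# head movement (camera); skin conductance, heart rate, skin temperature
# (biosensors); touch and motion statistics (smartphone).
camera = rng.normal(size=(n_windows, 20))
biosensors = rng.normal(size=(n_windows, 3))
smartphone = rng.normal(size=(n_windows, 6))
X = np.concatenate([camera, biosensors, smartphone], axis=1)  # simple fusion
y = rng.choice(["engaged", "bored", "confused"], size=n_windows)  # stand-in labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))
```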
We have also explored the complexities of human-chatbot emotion recognition in real-world contexts.
By collecting multimodal data from 99 participants interacting with a GPT-3-based chatbot over three weeks, we identified a significant domain gap between human-human and human-chatbot interactions.
This gap arises from subjective emotion labels, reduced facial expressivity, and the subtlety of emotions.
Using transformer-based multimodal emotion recognition networks, we found that personalizing models to individual users improved performance by up to 38% for user emotions and 41% for perceived chatbot emotions.
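The following sketch conveys the personalization idea under simplified assumptions: a shared emotion classification head is fine-tuned on one user's labeled interactions with a small learning rate; the architecture, dimensions, and hyperparameters are illustrative only.

```python
# Hedged sketch of per-user personalization: a shared emotion classification head
# is fine-tuned on one user's labeled interactions; architecture, dimensions, and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, feat_dim: int = 256, n_emotions: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_emotions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def personalize(model: nn.Module, feats: torch.Tensor, labels: torch.Tensor,
                epochs: int = 5) -> nn.Module:
    """Continue training the shared model on a single user's data."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # small learning rate
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(feats), labels)
        loss.backward()
        opt.step()
    return model

shared_model = EmotionHead()               # pretrained on all users in practice
user_feats = torch.randn(64, 256)          # fused multimodal embeddings
user_labels = torch.randint(0, 7, (64,))   # the user's self-reported emotions
personal_model = personalize(shared_model, user_feats, user_labels)
```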
Understanding and integrating personality traits into conversational digital characters is essential for enhancing user interaction and engagement.
Large language models (LLMs) like GPT-4 exhibit distinct personality structures during human interactions.
In our studies, we collected 147 chatbot personality descriptors from 86 participants and further validated them with 425 new participants. Our exploratory factor analysis revealed that while there is overlap, human personality models do not fully transfer to chatbot personalities.
The perceived personalities of chatbots differ significantly from those of virtual personal assistants, which are typically viewed in terms of serviceability and functionality.
This finding underscores the evolving nature of language models and their impact on user perceptions of agent personalities.
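A minimal sketch of the exploratory factor analysis step is given below, assuming a rating matrix of 425 participants by 147 descriptors; the rating scale, the number of extracted factors, and the use of scikit-learn's FactorAnalysis are illustrative choices, not our exact analysis pipeline.

```python
# Illustrative sketch of the factor analysis step; the rating scale, the number of
# factors, and the use of scikit-learn's FactorAnalysis are assumptions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Stand-in data: 425 raters x 147 personality descriptors on a 7-point scale.
ratings = rng.integers(1, 8, size=(425, 147)).astype(float)

fa = FactorAnalysis(n_components=6, random_state=0)
fa.fit(ratings)
loadings = fa.components_.T                       # (147 descriptors, 6 factors)
top_descriptors = np.argsort(-np.abs(loadings[:, 0]))[:10]
print(top_descriptors)                            # descriptors loading highest on factor 1
```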
To enhance chatbot interactions, we introduced dynamic personality infusion, a method that adjusts a chatbot's responses using a dedicated personality model and GPT-4, without altering its semantic capabilities.
Through human-chatbot conversations collected from 33 participants and subsequent ratings by 725 participants, we analyzed the impact of personality infusion on perceived trustworthiness and suitability for real-world use cases.
Our work highlights the potential of dynamic, personalized chatbots in transforming user interactions, enhancing satisfaction, and building trust, thereby paving the way for more engaging and applicable real-world chatbot experiences.
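The sketch below conveys the core idea of personality infusion under simplified assumptions: a second GPT-4 call rewrites a draft reply in a target personality style while preserving its content; the trait profile and prompt wording are illustrative.

```python
# Hedged sketch of dynamic personality infusion: a second GPT-4 call rewrites the
# draft reply in a target personality style while keeping its content; the trait
# profile and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the assistant reply below so that it expresses the following "
    "personality profile: {profile}. Keep the meaning and factual content "
    "unchanged.\n\nReply: {reply}"
)

def infuse_personality(draft_reply: str, profile: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(profile=profile, reply=draft_reply)}],
    )
    return completion.choices[0].message.content

styled = infuse_personality("The lecture starts at 10 am in room HG F 30.",
                            "warm, witty, highly conscientious")
```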
A dialog act is a label that captures the semantic and structural function of an utterance in a conversation, e.g., inform, question, commissive, or directive. It is also commonly interpreted as a lower-level representation of the speaker's intent. Determining dialog acts in interactions with digital characters is essential for enabling agents to understand and respond effectively to user intents, leading to more engaging and seamless interactions. We have developed a system for classifying dialog acts in conversations from multimodal data (text, audio, and video), which can be leveraged for online applications. Our research further focuses on leveraging intent for character response generation as well as for speech and animation synthesis.
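As an illustration, the sketch below covers only the text branch of such a classifier; the label set is taken from the examples above, while the base model is a placeholder that would need to be fine-tuned on dialog-act-annotated conversations and fused with audio and video features in the full system.

```python
# Minimal sketch of the text branch of a dialog act classifier; the base model is
# a placeholder that would be fine-tuned on dialog-act-annotated conversations and
# fused with audio and video features in the full system.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["inform", "question", "commissive", "directive"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS)
)

def classify_dialog_act(utterance: str) -> str:
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify_dialog_act("Could you explain general relativity to me?"))
```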
Data-Driven Facial Animation Synthesis
When replacing fixed dialog systems with LLMs, pre-recorded MoCap does not suffice for animating conversational characters. Instead, the character's lip-sync, facial expressions, eye gaze, gestures, etc. need to flexibly adapt to new speech contents. Hence, we are performing research in the field of animation synthesis, developing deep sequence models for synthesizing animation control parameters on the fly based on the current conversational context.
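A minimal sketch of this idea, mapping per-frame speech features to facial control parameters with a recurrent sequence model, is shown below; the feature dimensions, architecture, and output parameterization (e.g., blendshape weights) are assumptions for illustration.

```python
# Hedged sketch of on-the-fly animation synthesis: a recurrent sequence model maps
# per-frame speech features to facial control parameters; dimensions, architecture,
# and the blendshape parameterization are assumptions.
import torch
import torch.nn as nn

class AnimationSynthesizer(nn.Module):
    def __init__(self, speech_dim: int = 80, hidden_dim: int = 256,
                 n_controls: int = 52):
        super().__init__()
        self.encoder = nn.GRU(speech_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_controls)    # e.g., blendshape weights

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(speech_features)        # (batch, frames, hidden_dim)
        return torch.sigmoid(self.head(hidden))          # control values in [0, 1]

model = AnimationSynthesizer()
speech = torch.randn(1, 90, 80)   # e.g., 3 seconds of mel features at 30 fps
controls = model(speech)          # (1, 90, 52) facial control parameters
```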
Emotion-Content Disentanglement of Facial Animations
Equipping stylized conversational characters with facial animations tailored to specific emotions and artistic preferences enhances the coherence and authenticity of the characters. Within this project, we developed a novel Transformer-Autoencoder that disentangles emotion and content (the lip-sync required for speech animation) in facial animation sequences. Our method captures the full dynamics of an emotional episode, including temporal changes in intensity and subject-specific differences in the emotive expression. Hence, it provides a valuable tool for artists to easily manipulate the type and intensity of an emotion while maintaining its dynamics.
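The sketch below illustrates the disentanglement principle under simplified assumptions: one encoder produces a per-frame content code, another a sequence-level emotion code, and a decoder recombines them, so swapping emotion codes changes the expressed emotion but not the lip-sync; dimensions and layer counts are illustrative.

```python
# Hedged sketch of emotion-content disentanglement: one encoder yields a per-frame
# content code (lip-sync), another a sequence-level emotion code, and a decoder
# recombines them; swapping emotion codes changes the expressed emotion but not the
# speech content. Dimensions and layer counts are illustrative.
import torch
import torch.nn as nn

class DisentanglingAutoencoder(nn.Module):
    def __init__(self, n_controls: int = 52, d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(n_controls, d_model)
        self.content_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.emotion_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, n_controls))

    def forward(self, seq: torch.Tensor, emotion_source: torch.Tensor = None):
        content = self.content_enc(self.embed(seq))                  # per-frame content
        emo_in = seq if emotion_source is None else emotion_source
        emotion = self.emotion_enc(self.embed(emo_in)).mean(dim=1)   # pooled emotion code
        emotion = emotion.unsqueeze(1).expand(-1, seq.shape[1], -1)
        return self.decoder(torch.cat([content, emotion], dim=-1))

model = DisentanglingAutoencoder()
neutral = torch.randn(1, 90, 52)                # neutral speech animation
angry = torch.randn(1, 90, 52)                  # angry reference sequence
swapped = model(neutral, emotion_source=angry)  # neutral lip-sync, angry expression
```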
For an immersive augmented reality experience, it's essential that digital characters operate autonomously and interact dynamically with users. Our project focuses on crafting autonomous agents that can understand their environment and engage users in meaningful, goal-oriented interactions. By integrating LLMs and augmented reality (AR) technologies, users can interact directly with a digital Albert Einstein, gaining interactive and educational insights from his scientific legacy in a contemporary, user-friendly format. This endeavor not only honors Einstein's contributions to science but also repurposes his teachings for the modern digital age.
Our research aims to redefine behavioral standards for autonomous agents, allowing them to act autonomously in both casual social interactions and complex tasks. By integrating LLMs with cognitive theories, we demonstrate how agents can mimic realistic behaviors while aligning with human values and expectations. The profiling methods developed in this study will help with character design, enhancing applications in creative writing and film production. Furthermore, the same cognitive framework will be used to develop the agents' theory of mind, improving their understanding of users and thus tailoring the interactive experience to individual needs. This approach will enable creators to develop more nuanced and credible characters, leveraging the frameworks established through this research.
We are actively working on applying digital characters in diverse fields, offering innovative solutions for personalized and interactive experiences. In education, virtual teachers provide customized tutoring and support, adapting to students' learning styles and emotional states to enhance educational outcomes. In healthcare, virtual doctors assist in diagnosing conditions, offering medical advice, and monitoring patient progress, making healthcare more accessible and efficient. Virtual psychotherapists provide mental health support through therapeutic conversations and emotional assistance, making mental health care more approachable and scalable. These applications demonstrate the potential of digital characters to revolutionize traditional practices, delivering personalized and effective interactions across various domains.
The research is supported by ETH Zurich Research Grants and the Swiss National Science Foundation.