Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
Abstract: On this page, we showcase the generative capabilities of InteractAvatar. All videos presented here were generated by our model and include corresponding audio. The content is organized into eight sections, demonstrating capabilities ranging from multi-object scenarios to fine-grained multi-step control.
Recommended: Click the sound icon on videos for the full audio-visual experience.
We propose InteractAvatar, a novel dual-stream DiT generation model centered on Grounded Human-Object Interaction (GHOI) for Talking Avatars, which explicitly decouples perception planning from video synthesis. Beyond the core capability of GHOI, our method is a unified framework with flexible multimodal control, accepting any combination of text, audio, and motion as inputs.
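To make the flexible multimodal control concrete, the following minimal sketch shows how a caller might assemble any combination of text, audio, and motion signals alongside a reference frame; the `Condition` dataclass and `build_condition` helper are illustrative assumptions for exposition, not the actual InteractAvatar API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Condition:
    """Bundle of driving signals for one generation request (illustrative only)."""
    reference_image: str              # path to the non-interactive reference frame
    text: Optional[str] = None        # interaction / action instruction
    audio: Optional[str] = None       # driving speech or song audio
    motion: Optional[str] = None      # explicit motion signal, e.g. a pose sequence

def build_condition(reference_image: str,
                    text: Optional[str] = None,
                    audio: Optional[str] = None,
                    motion: Optional[str] = None) -> Condition:
    """Accept any combination of text, audio, and motion; at least one is required."""
    if not any([text, audio, motion]):
        raise ValueError("Provide at least one of text, audio, or motion.")
    return Condition(reference_image, text, audio, motion)

# Example: text + audio driven human-object interaction from a multi-object frame.
cond = build_condition(
    reference_image="reference_frame.png",
    text="Extend one hand to pick up the hat on the table.",
    audio="speech.wav",
)
```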
In this section, all videos begin from a non-interactive reference frame containing multiple objects. The model must interpret the text command to determine which object to interact with and how to do so, inferring the object's location and properties directly from the image. InteractAvatar successfully generates realistic videos from a static starting point while maintaining high audio-visual consistency.
We demonstrate more complex scenarios where text prompts involve fine-grained multi-step control. InteractAvatar successfully adheres to these detailed sequential instructions, showcasing its strong command-following capabilities.
First, gently touch the flower in front of you. Then, extend one hand to pick up the hat on the table.
First, extend both hands and pick up the bag on the table. Then, stand up and hold the bag up in front of your chest.
First, carry the bag and walk forward. Then, raise the bag and hold it in front of your chest for display.
First, extend one hand to hold the vase, then lift the vase and show it towards the window.
Here, we demonstrate the ability to follow commands containing multiple sequential actions. The model executes each action in the correct temporal order, showing a robust understanding of complex, time-ordered instructions.
We demonstrate the long-video generation and Chinese-language abilities of our InteractAvatar. We divide the long video into several segments, each with its own independent action description. InteractAvatar adheres to these segmented sequential instructions and maintains temporal coherence and ID consistency across the long video, showcasing its strong long-video generation capabilities.
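To illustrate what segmented action descriptions can look like, the sketch below chains per-segment generation, reusing the last frame of each clip as the reference for the next; the segment schema, the `generate_segment` callable, and the file names are illustrative assumptions, not the released InteractAvatar interface.

```python
from typing import Callable, Dict, List, Sequence

# Each segment carries its own independent action description and audio clip.
segments: List[Dict[str, str]] = [
    {"action": "Pick up the teacup on the table with one hand.", "audio": "part1.wav"},
    {"action": "Take a sip, then put the teacup back down.", "audio": "part2.wav"},
    {"action": "Wave toward the camera while continuing to talk.", "audio": "part3.wav"},
]

def generate_long_video(generate_segment: Callable[[object, str, str], Sequence],
                        reference_frame: object,
                        segments: List[Dict[str, str]]) -> List[Sequence]:
    """Chain per-segment generation, reusing the last frame of each clip as the
    reference of the next to keep temporal coherence and ID consistency."""
    clips, ref = [], reference_frame
    for seg in segments:
        clip = generate_segment(ref, seg["action"], seg["audio"])  # returns a frame sequence
        clips.append(clip)
        ref = clip[-1]  # hypothetical hand-off: last frame becomes the next reference
    return clips
```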
We demonstrate the song-driven generation ability of our InteractAvatar with segmented action descriptions. InteractAvatar generates high-quality lip dynamics and co-speech gestures in challenging singing scenarios while accurately following the action commands.
We compare our method with current SOTA models. Our method delivers powerful text-aligned human-object interaction generation while largely preserving audio-driven performance, whereas current SOTA audio-driven methods tend to generate only lip dynamics and ignore interaction instructions.
This section showcases interactions with 32 different types of common, everyday objects. InteractAvatar exhibits stable and consistent performance across all object categories, demonstrating the model's strong robustness and generalization.
This section illustrates the model's ability to naturally extrapolate from object interaction to human action control. InteractAvatar can precisely control body movements based on text commands.
This section showcases InteractAvatar's performance in an instruction-free, audio-only setting. This demonstrates that beyond interaction and action control, our model retains the core capabilities of generating high-quality lip dynamics and co-speech gestures.
This section highlights InteractAvatar's responsiveness to explicit motion signals. As a unified digital human model, it can be controlled by any combination of text, audio, and motion.