Cutting-Edge Technology Unveiled: Audio2Video Diffusion Model Creates Expressive Portrait Videos

Researchers at Alibaba Group's Institute for Intelligent Computing have introduced EMO (Emote Portrait Alive), an audio-driven portrait-video generation framework. From a single reference image and an audio clip, it creates lifelike vocal avatar videos with dynamic facial expressions and varied head poses, even under weak conditions, meaning the generation is guided by audio alone rather than by stronger control signals such as facial landmarks or 3D models.

The EMO framework operates in two stages. In the first, Frames Encoding, the ReferenceNet extracts features from a single reference image and from motion frames. In the second, the Diffusion Process, an audio encoder processes the vocal input into an audio embedding, and a facial region mask is combined with multi-frame noise to govern the generation of facial imagery. The Backbone Network then denoises these frames using two attention mechanisms: Reference-Attention, which preserves the character's identity, and Audio-Attention, which modulates the character's movements. Finally, Temporal Modules operate along the temporal dimension and adjust motion velocity.
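To make the data flow above more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names, the tensor sizes, the way the face mask is added to the noise, and the random vectors standing in for the ReferenceNet and audio encoder outputs are all assumptions chosen for illustration. It only shows the order of operations: the mask meets the noise, each frame attends to reference features and then to audio features, and a temporal step lets each token position attend across frames.

```python
import math
import random

def attention(queries, context):
    """Scaled dot-product attention over lists of vectors.

    queries: n vectors, context: m vectors, all of equal length.
    Returns n vectors, each a softmax-weighted mix of the context vectors.
    """
    dim = len(context[0])
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed.append([sum(w * c[j] for w, c in zip(weights, context)) / total
                      for j in range(dim)])
    return mixed

def add(xs, ys):
    """Residual connection: element-wise sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def backbone_pass(noisy_frames, face_mask, ref_feats, audio_feats):
    """One toy pass over a clip; noisy_frames[f][t] is a token vector."""
    # Combine the face-region mask with the multi-frame noise.
    # (Assumption: a simple broadcast add stands in for the real mechanism.)
    frames = [[[v + face_mask[t] for v in tok] for t, tok in enumerate(frame)]
              for frame in noisy_frames]
    # Reference-Attention (identity), then Audio-Attention (movement), per frame.
    for f, frame in enumerate(frames):
        frame = add(frame, attention(frame, ref_feats))
        frames[f] = add(frame, attention(frame, audio_feats[f]))
    # Temporal module: each token position attends across the frames of the clip.
    for t in range(len(frames[0])):
        track = [frame[t] for frame in frames]
        track = add(track, attention(track, track))
        for f, vec in enumerate(track):
            frames[f][t] = vec
    return frames

def random_vectors(rng, count, dim):
    """Hypothetical helper: random features in place of real encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
N_FRAMES, N_TOKENS, DIM = 4, 6, 8  # hypothetical toy sizes
noise = [random_vectors(rng, N_TOKENS, DIM) for _ in range(N_FRAMES)]
mask = [1.0 if t < 3 else 0.0 for t in range(N_TOKENS)]  # first 3 tokens = "face"
ref_feats = random_vectors(rng, 5, DIM)  # stand-in for ReferenceNet features
audio_feats = [random_vectors(rng, 2, DIM) for _ in range(N_FRAMES)]  # stand-in for audio embeddings

out = backbone_pass(noise, mask, ref_feats, audio_feats)
```

The output has the same shape as the input noise (frames, tokens, feature size), which is what a denoising backbone needs in order to be applied repeatedly. A real system would use learned projections and run many such passes inside a diffusion sampler.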

Dubbed "Make Portrait Sing," this cutting-edge technology can transform a static character image and vocal audio into expressive portrait videos, suitable for both talking and singing. Notably, the system can generate videos of varying durations, depending on the length of the input audio, while maintaining character identities over extended periods.
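As a rough illustration of how output length can follow audio length, the sketch below computes how many frames are needed to cover an audio clip and splits them into consecutive clips. The frame rate, the clip length, and the idea of generating clip by clip are assumptions made for this example, not figures taken from the article.

```python
import math

def plan_clips(audio_seconds, fps=30, clip_len=12):
    """Split the frames needed to cover the audio into consecutive clips.

    fps and clip_len are illustrative assumptions, not values from the article.
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    total_frames = math.ceil(audio_seconds * fps)
    return [(start, min(start + clip_len, total_frames))
            for start in range(0, total_frames, clip_len)]

clips = plan_clips(2.5)  # 2.5 s of audio at 30 fps needs 75 frames
print(len(clips), clips[0], clips[-1])
```

Under these assumptions, longer audio simply yields more clips, and the last clip is shortened to end with the audio.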

Moreover, the versatility of the EMO framework extends beyond vocal inputs from singing, accommodating spoken audio in multiple languages. Additionally, the technology can animate portraits from historical eras, paintings, and even 3D models and AI-generated content, imbuing them with realism and lifelike motion.

The introduction of the EMO framework marks a significant leap forward in the field of AI-driven video generation, offering unprecedented capabilities for creating expressive and dynamic portrait videos under diverse conditions. As this technology continues to evolve, it holds the potential to revolutionize various industries, from entertainment to education and beyond.

Download paper: https://arxiv.org/pdf/2402.17485.pdf
