A recent paper introduces IM-3D (Iterative Multiview Diffusion and Reconstruction), a text-to-3D method that departs from the score-distillation pipelines most current generators rely on. It uses a video generator rather than an image generator to synthesize multiple views of an object, then fits a 3D model directly to those views. According to the authors, this both shortens the pipeline and improves the quality of the generated 3D assets.

Because 3D training data is scarce, state-of-the-art text-to-3D generators typically build on off-the-shelf 2D generators trained on very large image collections. How best to turn such a 2D generator into a 3D one remains an open design question. One prevalent answer is 3D distillation: Score Distillation Sampling (SDS) and its variants. These techniques work, but they require many evaluations of the 2D generator, which makes generation slow, and they can suffer from convergence problems.
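
For context, the SDS gradient is commonly written as below. The notation follows the DreamFusion paper that introduced SDS, not anything stated in this summary: θ are the parameters of the 3D representation, g is a differentiable renderer, ε̂_φ is the frozen 2D denoiser conditioned on the prompt y, and w(t) is a timestep weighting. Each optimization step needs one evaluation of ε̂_φ, which is where the large number of generator evaluations comes from.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon
```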

IM-3D addresses these shortcomings by improving the quality of multi-view generation. It builds on Emu Video, a text-to-video generator network, to produce up to 16 high-resolution (512 × 512) consistent views of an object. Instead of relying on distillation or on a reconstruction network, IM-3D fits a 3D model directly to the generated views using a fast and robust reconstruction algorithm based on Gaussian splatting.
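
The idea of fitting a 3D model directly to a set of views, rather than predicting it with a network, can be illustrated with a deliberately tiny sketch. This is not Gaussian splatting and not the paper's loss: the "scene" is a single 3D point, the cameras are orthographic and evenly spaced on an orbit around the vertical axis, and the loss is a plain squared reprojection error. All function names, the camera model, and the numbers are assumptions made for illustration.

```python
import math

def render_point(point, azimuth):
    """Orthographic view of a 3D point from a camera rotated by `azimuth`
    around the vertical (y) axis. Returns 2D image coordinates (u, w)."""
    x, y, z = point
    return (x * math.cos(azimuth) + z * math.sin(azimuth), y)

def fit_point_to_views(observations, azimuths, steps=200, lr=0.05):
    """Gradient descent on 0.5 * sum of squared reprojection errors."""
    x = y = z = 0.0  # start from the origin
    for _ in range(steps):
        gx = gy = gz = 0.0
        for (u_obs, w_obs), a in zip(observations, azimuths):
            u, w = render_point((x, y, z), a)
            ru, rw = u - u_obs, w - w_obs  # per-view residuals
            gx += ru * math.cos(a)         # d(loss)/dx
            gz += ru * math.sin(a)         # d(loss)/dz
            gy += rw                       # d(loss)/dy
        x, y, z = x - lr * gx, y - lr * gy, z - lr * gz
    return (x, y, z)

# 16 views on an evenly spaced orbit (the spacing is an assumption of this toy).
N_VIEWS = 16
azimuths = [2.0 * math.pi * k / N_VIEWS for k in range(N_VIEWS)]
true_point = (0.3, -0.2, 0.5)
observations = [render_point(true_point, a) for a in azimuths]

fitted_point = fit_point_to_views(observations, azimuths)
```

Because every view constrains the same shared parameters, the optimizer recovers the point from the views alone. A real system would swap the point for a set of Gaussians and the reprojection error for an image-based loss.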

Central to IM-3D is an iterative refinement process. After the initial generation, the 3D reconstruction is fed back to the 2D generator, and the generate-and-reconstruct cycle repeats, improving the consistency and quality of the results with each pass. This loop needs far fewer generator evaluations than conventional SDS approaches, so the method gains efficiency without compromising quality.
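
The control flow of such a loop can be sketched as follows. Everything here is a hypothetical stand-in: each "view" is a single number, the "generator" just adds random noise, the "reconstruction" is an average, and the number of rounds is arbitrary. The sketch also assumes that feeding the reconstruction back means rendering it and letting the generator restart from those renders. Only the order of the steps (generate, fit, render, regenerate, refit) is meant to mirror the description above.

```python
import random
import statistics

N_VIEWS = 16  # number of views per generation pass

def make_toy_generator(seed=0):
    """Hypothetical stand-in for the multi-view generator.

    Each "view" is one float; the spread across views mimics
    view-to-view inconsistency. Also counts how often it is called."""
    rng = random.Random(seed)
    stats = {"calls": 0}

    def generate_views(prompt, init_views=None):
        stats["calls"] += 1
        if init_views is None:
            # First pass: sample from scratch, so views disagree a lot.
            return [1.0 + rng.gauss(0.0, 0.5) for _ in range(N_VIEWS)]
        # Later passes: start from renders of the current 3D model and
        # perturb them only slightly (assumed behaviour of this toy).
        return [v + rng.gauss(0.0, 0.1) for v in init_views]

    return generate_views, stats

def fit_3d(views):
    """Toy reconstruction: the single "3D" parameter is the view average."""
    return statistics.fmean(views)

def render_views(model):
    """Toy renderer: a consistent model looks the same from every view."""
    return [model] * N_VIEWS

def iterative_refinement(prompt, generate_views, rounds=2):
    views = generate_views(prompt)               # initial multi-view sample
    model = fit_3d(views)                        # fit 3D directly to the views
    spreads = [statistics.pstdev(views)]         # track view disagreement
    for _ in range(rounds):
        renders = render_views(model)            # consistent by construction
        views = generate_views(prompt, renders)  # feed reconstruction back
        model = fit_3d(views)
        spreads.append(statistics.pstdev(views))
    return model, spreads

generate_views, stats = make_toy_generator(seed=0)
model, spreads = iterative_refinement("a toy prompt", generate_views, rounds=2)
```

In this toy the generator is called 1 + rounds times, once per sampling pass. The round count is an illustrative choice, not a figure from the paper.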

The key advantages of IM-3D are speed and quality. Generating the multi-view images takes only a fraction of the generator evaluations that SDS-based methods need (the paper's abstract cites a 10-100x reduction), and the subsequent reconstruction is fast. IM-3D also mitigates common SDS problems, such as artifacts and lack of diversity, and it surpasses alternatives in quality without requiring a large reconstruction network.

In short, IM-3D shows that video generator networks can deliver state-of-the-art text-to-3D results efficiently. By removing the need for distillation and for reconstruction networks, it offers a simpler pipeline for applications in 3D modeling and synthesis.

Download paper: https://arxiv.org/abs/2402.08682