SceneFoundry: Generating Interactive Infinite 3D Worlds

ChunTeng Chen1, YiChen Hsu2, YiWen Liu1, WeiFang Sun3, TsaiChing Ni1, ChunYi Lee4, Min Sun2, YuanFu Yang1

1National Yang Ming Chiao Tung University 2National Tsing Hua University
3NVIDIA AI Technology Center 4National Taiwan University
Teaser Image

Overview of SceneFoundry. The framework generates apartment-scale 3D scenes from natural language prompts via LLM-guided floor plan generation, diffusion-based placement, and post-optimization that ensures articulated functionality and robot navigability.

Video Demo

Abstract

The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.
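To make the core mechanism concrete, the sketch below shows diffusion posterior sampling steered by a differentiable guidance function: each reverse step is nudged by the gradient of a constraint loss. The `denoiser`, `guidance_loss`, and `scale` names are illustrative assumptions, not SceneFoundry's released interfaces.

```python
import torch

def guided_reverse_step(x_t, t, denoiser, guidance_loss, scale=1.0):
    """One reverse-diffusion step steered by a differentiable guidance loss.

    Minimal sketch with assumed interfaces, not the paper's code:
    `denoiser` predicts the next (less noisy) layout from x_t, and
    `guidance_loss` scores constraint violations such as object count,
    articulation collisions, or insufficient walkable space.
    """
    x_t = x_t.detach().requires_grad_(True)
    x_prev = denoiser(x_t, t)                  # proposed layout at step t - 1
    loss = guidance_loss(x_prev)               # scalar, differentiable penalty
    grad = torch.autograd.grad(loss, x_t)[0]   # how x_t should move to reduce it
    return (x_prev - scale * grad).detach()    # nudge the sample down the gradient
```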

Pipeline

Pipeline

Overview of our apartment-scale generation pipeline. An LLM first guides procedural floor plan generation; diffusion posterior guidance then generates plausible room bounding boxes; finally, 3D assets from 3D-FRONT/GAPartNet are refined via post-optimization to complete the layout.
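Expressed as code, the three stages could be wired together as in the sketch below; every stage function is a hypothetical placeholder supplied by the caller, not the actual SceneFoundry API.

```python
def generate_scene(prompt, asset_db, llm_floor_plan, place_objects, post_optimize):
    """End-to-end sketch of the pipeline; all stage callables are assumed.

    llm_floor_plan : prompt -> procedural floor plan
    place_objects  : (floor_plan, asset_db) -> layout of bounding boxes
                     sampled via diffusion posterior guidance
    post_optimize  : layout -> layout with articulation clearance and
                     robot-navigable free space enforced
    """
    floor_plan = llm_floor_plan(prompt)            # Stage 1: LLM-guided layout
    layout = place_objects(floor_plan, asset_db)   # Stage 2: diffusion placement
    return post_optimize(layout)                   # Stage 3: post-optimization
```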

Model

Model Pipeline

Guidance scheduling during the reverse diffusion process. Object quantity control is applied at t < 100 and the articulated collision constraint at t < 10, followed by a final walkable-ratio optimization at t = 0 to produce a realistic scene.
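The caption's schedule can be read as the loop below, where each guidance term activates only inside its timestep window. Only the thresholds (t < 100, t < 10, t = 0) come from the figure; `denoise_step` and the three loss callables are assumptions for illustration.

```python
import torch

def sample_with_guidance_schedule(x_T, denoise_step, quantity_loss,
                                  collision_loss, walkable_loss, T=1000):
    """Reverse diffusion following the staged guidance in the figure.

    All callables are hypothetical stand-ins returning scalar losses;
    guidance is applied as a gradient nudge on each reverse step.
    """
    x = x_T
    for t in reversed(range(T)):
        x = x.detach().requires_grad_(True)
        x_prev = denoise_step(x, t)
        loss = 0.0
        if t < 100:
            loss = loss + quantity_loss(x_prev)    # object quantity control
        if t < 10:
            loss = loss + collision_loss(x_prev)   # articulation clearance
        if isinstance(loss, torch.Tensor):
            x_prev = x_prev - torch.autograd.grad(loss, x)[0]
        x = x_prev
    # Final walkable-ratio optimization at t = 0: a few direct gradient steps.
    for _ in range(10):
        x = x.detach().requires_grad_(True)
        x = x - torch.autograd.grad(walkable_loss(x), x)[0]
    return x.detach()
```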

Articulated Object Collision Prevention

Articulated Collision Image

Visualization of the Articulated Object Collision Constraint. Synthesized scenes without the constraint (top) show obstructed articulated furniture, such as drawers that cannot open, while applying the constraint (bottom) enables proper motion and functional layouts.
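One plausible way to make such a constraint differentiable is to penalize the soft overlap between the volume swept by a part's articulation and surrounding object boxes. The axis-aligned, prismatic-joint formulation below is an assumption for illustration, not the paper's exact constraint.

```python
import torch

def soft_aabb_overlap(min_a, max_a, min_b, max_b):
    """Differentiable overlap volume between axis-aligned boxes, shape (..., 3)."""
    extent = torch.minimum(max_a, max_b) - torch.maximum(min_a, min_b)
    return torch.clamp(extent, min=0.0).prod(dim=-1)

def articulation_collision_loss(part_min, part_max, open_dir, travel,
                                obstacle_min, obstacle_max):
    """Penalize obstacles inside the volume swept by an opening part.

    Hypothetical formulation: the part's box (e.g., a drawer) is extruded
    along its opening direction `open_dir` by the joint travel distance,
    and the swept box is compared against every obstacle box.
    """
    sweep = open_dir * travel                          # displacement when fully open
    swept_min = torch.minimum(part_min, part_min + sweep)
    swept_max = torch.maximum(part_max, part_max + sweep)
    return soft_aabb_overlap(swept_min, swept_max,
                             obstacle_min, obstacle_max).sum()
```

A revolute joint (e.g., a cabinet door) would instead sample several poses along the swing arc, but the same soft-overlap penalty applies to each sampled pose.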

Render Results

Teaser Image