- A diffusion model for indoor scene synthesis
Notes

- 3D Scene $s$ → fully-connected scene graph $x_0$
- each object is a graph node storing all object attributes, i.e. location, size, orientation, class label, and latent shape code.
- DiffuScene → a diffusion process defined over the set of all possible scene graphs $x_0$
- Forward: gradually add noise to $x_0$ until it becomes standard Gaussian noise $x_T$
- Reverse: a denoising network cleans the noisy graph using ancestral sampling
- use the denoised object features to perform shape retrieval
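
The forward and reverse steps above can be sketched with the standard DDPM formulation. This is a minimal illustration and not DiffuScene's actual code: the linear noise schedule, the number of steps `T`, the feature width `D`, and the `eps_model` callable (a stand-in for the denoising network) are all assumptions made for the example.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice, not taken from the notes)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of alphas up to step t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: jump from x_0 directly to the noisy graph x_t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def ancestral_sample(eps_model, shape, betas, alphas, alpha_bars, rng):
    """Reverse process: start from Gaussian noise x_T and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)  # predicted noise from the (hypothetical) network
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the final one.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical scene: N objects, each a D-dimensional attribute vector
# (location, size, orientation, class label, shape code concatenated).
rng = np.random.default_rng(0)
N, D = 12, 64
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((N, D))
xT, _ = q_sample(x0, len(betas) - 1, alpha_bars, rng)  # close to pure noise
```

With a trained network in place of `eps_model`, the rows of the sampled array would be split back into per-object attributes, and the shape-code part would be used for retrieval.
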
Graph Representation


Denoising Network

Text-Conditioned
- employ a pretrained BERT encoder to extract word embeddings
- inject the language guidance into the denoising network using cross-attention layers
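
A single-head cross-attention layer of the kind described above can be sketched as follows. Object features act as queries and word embeddings as keys and values. This is only a sketch under assumptions: the random arrays stand in for real BERT outputs and learned projection weights, and the dimensions (`d_obj`, `d_text`, `d_attn`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(obj_feats, word_embs, Wq, Wk, Wv):
    """Each object attends over all words of the text prompt.

    obj_feats: (N, d_obj)  noisy object features (queries)
    word_embs: (L, d_text) word embeddings (keys and values)
    Returns the attended features (N, d_attn) and the weights (N, L).
    """
    q = obj_feats @ Wq
    k = word_embs @ Wk
    v = word_embs @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1 over the words
    return weights @ v, weights

# Stand-in inputs; 768 matches the hidden size of BERT-base, but which
# BERT variant is used is an assumption here.
rng = np.random.default_rng(0)
N, L, d_obj, d_text, d_attn = 12, 9, 64, 768, 32
obj_feats = rng.standard_normal((N, d_obj))
word_embs = rng.standard_normal((L, d_text))
Wq = rng.standard_normal((d_obj, d_attn)) / np.sqrt(d_obj)
Wk = rng.standard_normal((d_text, d_attn)) / np.sqrt(d_text)
Wv = rng.standard_normal((d_text, d_attn)) / np.sqrt(d_text)
attended, weights = cross_attention(obj_feats, word_embs, Wq, Wk, Wv)
```

In a full denoiser the attended features would typically be added back to the object features through a residual connection before the next layer.
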