Notes

3.4. Text-Conditioned Scene Generation

In text-conditioned scene generation, a room is described by a list of sentences. We use the first 3 sentences, tokenize them, and pad the token sequence to a maximum length of 40 tokens. We then embed each word, experimenting with GloVe [24], ELMo [25], and BERT [11]. Using the Flair library [1], we obtain fixed-size word embeddings of dimension d (100, 1024, and 768 respectively), and a 2-layer MLP projects them from d to E dimensions, where E is the embedding dimension of SceneFormer.
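A minimal sketch of this embedding pipeline, using GloVe through Flair's documented API. The target dimension E, the MLP hidden width, and the zero-padding scheme are illustrative assumptions, not details given in the paper:

```python
import torch
import torch.nn as nn
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

MAX_TOKENS = 40   # pad/truncate the token sequence to this length (from the paper)
D = 100           # GloVe word-embedding dimension
E = 256           # SceneFormer embedding dimension (assumed value for illustration)

glove = WordEmbeddings("glove")  # fixed-size per-word embeddings via Flair

# 2-layer MLP projecting word embeddings from d to E dimensions.
# The paper only says "2-layer MLP"; the hidden width here is an assumption.
project = nn.Sequential(
    nn.Linear(D, E),
    nn.ReLU(),
    nn.Linear(E, E),
)

def embed_description(sentences: list[str]) -> torch.Tensor:
    """Embed the first 3 sentences of a room description -> (MAX_TOKENS, E)."""
    text = " ".join(sentences[:3])      # use only the first 3 sentences
    flair_sentence = Sentence(text)     # Flair tokenizes internally
    glove.embed(flair_sentence)
    vectors = [tok.embedding for tok in flair_sentence][:MAX_TOKENS]
    # Pad with zero vectors up to MAX_TOKENS (padding scheme is an assumption).
    while len(vectors) < MAX_TOKENS:
        vectors.append(torch.zeros(D))
    return project(torch.stack(vectors))
```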

For the text-conditional model, we use decoders only for the category and location models since our sentences only describe object classes and their spatial relations. The decoders for orientation and dimension models are replaced by encoders without cross-attention. We do not use an additional loss to align the transformer output with the input text; this relation is learned implicitly.
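A sketch of this asymmetry using PyTorch's stock transformer layers; the layer count, head count, and E are placeholder values. Decoder layers cross-attend into the embedded text, while encoder layers see only the scene token sequence:

```python
import torch
import torch.nn as nn

E, HEADS, LAYERS = 256, 8, 6  # assumed hyperparameters for illustration

# Category and location models: decoder layers with cross-attention, so each
# prediction can attend over the embedded text (passed as the "memory").
category_model = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=E, nhead=HEADS), num_layers=LAYERS
)

# Orientation and dimension models: plain encoder layers, i.e. self-attention
# only, with no cross-attention into the text.
orientation_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=E, nhead=HEADS), num_layers=LAYERS
)

text_memory = torch.zeros(40, 2, E)  # embedded text, hypothetical batch of 2
seq = torch.zeros(5, 2, E)           # partial scene token sequence

out_cat = category_model(tgt=seq, memory=text_memory)  # attends to the text
out_ori = orientation_model(seq)                       # text is ignored here
```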

Key takeaways

- The text condition uses only the first 3 sentences of the room description, tokenized and padded to 40 tokens.
- Per-word embeddings (GloVe, ELMo, or BERT, obtained via Flair) are projected to the SceneFormer dimension E by a 2-layer MLP.
- Cross-attention to the text is used only where the text is informative: the category and location models. Orientation and dimension use plain encoders.
- There is no explicit text-alignment loss; the correspondence between text and scene is learned implicitly.