Notes

3.4. Text-Conditioned Scene Generation

In text-conditioned scene generation, a room is described by a list of sentences. We use the first 3 sentences, tokenize them, and pad the token sequence to a maximum length of 40 tokens. We then embed each word, experimenting with GloVe [24], ELMo [25], and BERT [11]. Using the Flair library [1], we obtain fixed-size word embeddings of dimension d (100, 1024, and 768 respectively), and a 2-layer MLP projects them from d to E dimensions, where E is the embedding dimension of SceneFormer.
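A minimal sketch of this embedding pipeline, using GloVe through Flair's documented API. The target dimension E, the MLP hidden width, and the zero-padding scheme are illustrative assumptions, not details given in the paper:

```python
import torch
import torch.nn as nn
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

MAX_TOKENS = 40   # pad/truncate the token sequence to this length (from the paper)
D = 100           # GloVe word-embedding dimension
E = 256           # SceneFormer embedding dimension (assumed value for illustration)

glove = WordEmbeddings("glove")  # fixed-size per-word embeddings via Flair

# 2-layer MLP projecting word embeddings from d to E dimensions.
# The paper only says "2-layer MLP"; the hidden width here is an assumption.
project = nn.Sequential(
    nn.Linear(D, E),
    nn.ReLU(),
    nn.Linear(E, E),
)

def embed_description(sentences: list[str]) -> torch.Tensor:
    """Embed the first 3 sentences of a room description -> (MAX_TOKENS, E)."""
    text = " ".join(sentences[:3])      # use only the first 3 sentences
    flair_sentence = Sentence(text)     # Flair tokenizes internally
    glove.embed(flair_sentence)
    vectors = [tok.embedding for tok in flair_sentence][:MAX_TOKENS]
    # Pad with zero vectors up to MAX_TOKENS (padding scheme is an assumption).
    while len(vectors) < MAX_TOKENS:
        vectors.append(torch.zeros(D))
    return project(torch.stack(vectors))
```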

For the text-conditional model, we use decoders only for the category and location models since our sentences only describe object classes and their spatial relations. The decoders for orientation and dimension models are replaced by encoders without cross-attention. We do not use an additional loss to align the transformer output with the input text; this relation is learned implicitly.
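A sketch of this asymmetry using PyTorch's stock transformer layers; the layer count, head count, and E are placeholder values. Decoder layers cross-attend into the embedded text, while encoder layers see only the scene token sequence:

```python
import torch
import torch.nn as nn

E, HEADS, LAYERS = 256, 8, 6  # assumed hyperparameters for illustration

# Category and location models: decoder layers with cross-attention, so each
# prediction can attend over the embedded text (passed as the "memory").
category_model = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=E, nhead=HEADS), num_layers=LAYERS
)

# Orientation and dimension models: plain encoder layers, i.e. self-attention
# only, with no cross-attention into the text.
orientation_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=E, nhead=HEADS), num_layers=LAYERS
)

text_memory = torch.zeros(40, 2, E)  # embedded text, hypothetical batch of 2
seq = torch.zeros(5, 2, E)           # partial scene token sequence

out_cat = category_model(tgt=seq, memory=text_memory)  # attends to the text
out_ori = orientation_model(seq)                       # text is ignored here
```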

Key takeaways

- The text condition uses only the first 3 sentences of the room description, tokenized and padded to 40 tokens.
- Per-word embeddings (GloVe, ELMo, or BERT, obtained via Flair) are projected to the SceneFormer dimension E by a 2-layer MLP.
- Cross-attention to the text is used only where the text is informative: the category and location models. Orientation and dimension use plain encoders.
- There is no explicit text-alignment loss; the correspondence between text and scene is learned implicitly.