Overview

The dissertation focuses on the task of generating 3D scene representations from natural language, referred to as text-to-3D-scene generation. This task involves many challenging problems in natural language understanding, such as grounding language to visual representations and leveraging world knowledge to interpret language in context. The dissertation proposes a new text-to-scene framework that incorporates prior knowledge learned from data and demonstrates its effectiveness in generating high-quality 3D scenes from natural language input.

Challenges

Framework

Common-sense priors learned from datasets of 3D models and scenes are used to represent spatial knowledge, infer unstated facts, and resolve spatial constraints. The system also includes a method for learning groundings of lexical terms from a parallel corpus of 3D scenes and natural language descriptions, which improves the quality of the generated 3D scenes.
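As a toy illustration of how lexical groundings might be learned from a parallel corpus, the sketch below scores word/category associations by pointwise mutual information over (description, scene) pairs. The corpus, category names, and PMI scoring are illustrative assumptions, not the dissertation's actual model.

```python
from collections import Counter

def learn_groundings(parallel_corpus):
    """Score (word, category) pairs by pointwise mutual information
    over a parallel corpus of (description, scene categories) pairs."""
    word_docs = Counter()   # documents containing each word
    cat_docs = Counter()    # scenes containing each category
    pair_docs = Counter()   # joint (word, category) occurrences
    for description, scene_categories in parallel_corpus:
        words = set(description.lower().split())
        cats = set(scene_categories)
        word_docs.update(words)
        cat_docs.update(cats)
        pair_docs.update((w, c) for w in words for c in cats)
    n = len(parallel_corpus)
    # PMI ratio: p(w, c) / (p(w) * p(c)); > 1 means positive association.
    return {
        (w, c): (k / n) / ((word_docs[w] / n) * (cat_docs[c] / n))
        for (w, c), k in pair_docs.items()
    }

# Tiny hypothetical parallel corpus for illustration.
corpus = [
    ("a desk with a lamp on it", ["Desk", "Lamp"]),
    ("a lamp next to the bed", ["Lamp", "Bed"]),
    ("a wooden desk and chair", ["Desk", "Chair"]),
]
scores = learn_groundings(corpus)
```

On this corpus, "lamp" scores higher against the Lamp category than against Desk, which is the kind of signal a grounding model can exploit when choosing 3D models for mentioned objects.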

I propose viewing the problem as extracting a set of explicit constraints from input descriptions, combining them with learned common-sense priors for inferring implicit constraints, and then selecting objects and positioning them to satisfy the constraints and generate plausible scenes.
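The three steps above can be sketched as a minimal pipeline; the class, function names, toy parser, and support priors below are illustrative assumptions, not the dissertation's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class SceneTemplate:
    objects: list                                   # object categories, stated or inferred
    relations: list = field(default_factory=list)   # (subject, relation, object) triples

def parse(description):
    """Stage 1: extract explicit objects and spatial relations (stub).
    A real parser would use dependency parsing and pattern matching."""
    if description == "there is a plate on a table":
        return SceneTemplate(objects=["plate", "table"],
                             relations=[("plate", "on", "table")])
    return SceneTemplate(objects=[])

def infer(template, support_priors):
    """Stage 2: complete the template with implied support relations,
    e.g. a table is implicitly supported by the floor."""
    complete = SceneTemplate(objects=list(template.objects),
                             relations=list(template.relations))
    supported = {subj for subj, _, _ in complete.relations}
    for obj in template.objects:
        if obj not in supported and obj in support_priors:
            parent = support_priors[obj]
            if parent not in complete.objects:
                complete.objects.append(parent)
            complete.relations.append((obj, "on", parent))
    return complete

def generate(template):
    """Stage 3: select a model and position per object (placeholder layout)."""
    return [{"category": o, "position": (0.0, 0.0, 0.0)} for o in template.objects]

# Hypothetical learned priors: which surface typically supports each category.
priors = {"table": "floor", "plate": "table"}
t = parse("there is a plate on a table")
t_complete = infer(t, priors)
scene = generate(t_complete)
```

The key point is the staging: explicit constraints come from parsing, implicit ones from priors, and only then are objects instantiated and positioned.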

Problem Decomposition

Scene template parsing (§ 5.1): Parse the textual description of a scene $u$ into a scene template $t$ that represents the explicitly stated constraints on the objects present and the spatial relations between them.

Scene inference (§ 5.2): Expand the literal scene template $t$ into a complete scene template $t'$ by accounting for implicit constraints not specified in the text, using learned spatial priors.

Scene generation (§ 5.3): Given a completed scene template $t'$ with the constraints and priors on the spatial relations of objects, transform the scene template into a geometric 3D scene with a set of objects to be instantiated. The subproblem of scene generation is further decomposed into:

Prior Knowledge Incorporation

By incorporating common-sense priors learned from data into the text-to-scene framework, the system can leverage world knowledge to interpret natural language in context: unstated facts are inferred and spatial constraints are resolved, so the generated 3D scenes remain consistent with common-sense expectations.
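As a concrete (and hypothetical) example of resolving an underspecified spatial constraint with a learned prior, the sketch below scores candidate placements of a lamp on a desk under a Gaussian prior over relative offsets. The Gaussian form, the parameter values, and all names are assumptions for illustration, not the dissertation's actual representation.

```python
import math

# Hypothetical learned prior over a child's 2D offset on its parent's
# surface, e.g. lamps tend to sit toward the back edge of a desk.
PRIORS = {("lamp", "desk"): {"mean": (0.0, 0.3), "var": 0.05}}

def placement_score(child, parent, offset):
    """Log-probability of placing `child` at `offset` on `parent` under
    an isotropic Gaussian prior; 0.0 (uniform) if no prior was learned."""
    prior = PRIORS.get((child, parent))
    if prior is None:
        return 0.0
    (mx, my), var = prior["mean"], prior["var"]
    dx, dy = offset[0] - mx, offset[1] - my
    return -(dx * dx + dy * dy) / (2 * var) - math.log(2 * math.pi * var)

# "a lamp on the desk" says nothing about *where* on the desk;
# the prior prefers the placement common-sense expects.
candidates = [(0.0, 0.3), (0.0, -0.4)]
best = max(candidates, key=lambda o: placement_score("lamp", "desk", o))
```

Scoring candidate placements against such priors is one simple way a generator can fill in details the text leaves unstated.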
