Overview

The dissertation focuses on the task of generating 3D scene representations from natural language, referred to as text-to-3D-scene generation. This task involves many challenging problems in natural language understanding, such as grounding language to visual representations and leveraging world knowledge to interpret language in context. The dissertation proposes a new text-to-scene framework that incorporates prior knowledge learned from data and demonstrates its effectiveness in generating high-quality 3D scenes from natural language input.

Challenges

Framework

Common-sense priors learned from datasets of 3D models and scenes are used to represent spatial knowledge, infer unstated facts, and resolve spatial constraints. The system also includes a method for learning groundings of lexical terms from a parallel corpus of 3D scenes and natural language descriptions, which improves the quality of the generated 3D scenes.
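As a toy illustration of how lexical groundings might be learned from a parallel corpus, the sketch below scores word/category associations by pointwise mutual information over (description, scene) pairs. The corpus, category names, and PMI scoring are illustrative assumptions, not the dissertation's actual model.

```python
from collections import Counter

def learn_groundings(parallel_corpus):
    """Score (word, category) pairs by pointwise mutual information
    over a parallel corpus of (description, scene categories) pairs."""
    word_docs = Counter()   # documents containing each word
    cat_docs = Counter()    # scenes containing each category
    pair_docs = Counter()   # joint (word, category) occurrences
    for description, scene_categories in parallel_corpus:
        words = set(description.lower().split())
        cats = set(scene_categories)
        word_docs.update(words)
        cat_docs.update(cats)
        pair_docs.update((w, c) for w in words for c in cats)
    n = len(parallel_corpus)
    # PMI ratio: p(w, c) / (p(w) * p(c)); > 1 means positive association.
    return {
        (w, c): (k / n) / ((word_docs[w] / n) * (cat_docs[c] / n))
        for (w, c), k in pair_docs.items()
    }

# Tiny hypothetical parallel corpus for illustration.
corpus = [
    ("a desk with a lamp on it", ["Desk", "Lamp"]),
    ("a lamp next to the bed", ["Lamp", "Bed"]),
    ("a wooden desk and chair", ["Desk", "Chair"]),
]
scores = learn_groundings(corpus)
```

On this corpus, "lamp" scores higher against the Lamp category than against Desk, which is the kind of signal a grounding model can exploit when choosing 3D models for mentioned objects.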

I propose viewing the problem as extracting a set of explicit constraints from input descriptions, combining them with learned common-sense priors for inferring implicit constraints, and then selecting objects and positioning them to satisfy the constraints and generate plausible scenes.
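The three steps above can be sketched as a minimal pipeline; the class, function names, toy parser, and support priors below are illustrative assumptions, not the dissertation's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class SceneTemplate:
    objects: list                                   # object categories, stated or inferred
    relations: list = field(default_factory=list)   # (subject, relation, object) triples

def parse(description):
    """Stage 1: extract explicit objects and spatial relations (stub).
    A real parser would use dependency parsing and pattern matching."""
    if description == "there is a plate on a table":
        return SceneTemplate(objects=["plate", "table"],
                             relations=[("plate", "on", "table")])
    return SceneTemplate(objects=[])

def infer(template, support_priors):
    """Stage 2: complete the template with implied support relations,
    e.g. a table is implicitly supported by the floor."""
    complete = SceneTemplate(objects=list(template.objects),
                             relations=list(template.relations))
    supported = {subj for subj, _, _ in complete.relations}
    for obj in template.objects:
        if obj not in supported and obj in support_priors:
            parent = support_priors[obj]
            if parent not in complete.objects:
                complete.objects.append(parent)
            complete.relations.append((obj, "on", parent))
    return complete

def generate(template):
    """Stage 3: select a model and position per object (placeholder layout)."""
    return [{"category": o, "position": (0.0, 0.0, 0.0)} for o in template.objects]

# Hypothetical learned priors: which surface typically supports each category.
priors = {"table": "floor", "plate": "table"}
t = parse("there is a plate on a table")
t_complete = infer(t, priors)
scene = generate(t_complete)
```

The key point is the staging: explicit constraints come from parsing, implicit ones from priors, and only then are objects instantiated and positioned.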

Problem Decomposition

Scene template parsing (§ 5.1): Parse the textual description of a scene $u$ into a scene template $t$ that represents the explicitly stated constraints on the objects present and the spatial relations between them.

Scene inference (§ 5.2): Expand the literal scene template $t$ into a complete scene template $t'$ by accounting for implicit constraints not specified in the text, using learned spatial priors.

Scene generation (§ 5.3): Given a completed scene template $t'$ with the constraints and priors on the spatial relations of objects, transform the scene template into a geometric 3D scene with a set of objects to be instantiated. The subproblem of scene generation is further decomposed into:

Prior Knowledge Incorporation

By incorporating common-sense priors learned from data into the text-to-scene framework, the system can leverage world knowledge to interpret natural language in context: unstated facts are inferred and spatial constraints are resolved, so the generated 3D scenes remain consistent with common-sense expectations.
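As a concrete (and hypothetical) example of resolving an underspecified spatial constraint with a learned prior, the sketch below scores candidate placements of a lamp on a desk under a Gaussian prior over relative offsets. The Gaussian form, the parameter values, and all names are assumptions for illustration, not the dissertation's actual representation.

```python
import math

# Hypothetical learned prior over a child's 2D offset on its parent's
# surface, e.g. lamps tend to sit toward the back edge of a desk.
PRIORS = {("lamp", "desk"): {"mean": (0.0, 0.3), "var": 0.05}}

def placement_score(child, parent, offset):
    """Log-probability of placing `child` at `offset` on `parent` under
    an isotropic Gaussian prior; 0.0 (uniform) if no prior was learned."""
    prior = PRIORS.get((child, parent))
    if prior is None:
        return 0.0
    (mx, my), var = prior["mean"], prior["var"]
    dx, dy = offset[0] - mx, offset[1] - my
    return -(dx * dx + dy * dy) / (2 * var) - math.log(2 * math.pi * var)

# "a lamp on the desk" says nothing about *where* on the desk;
# the prior prefers the placement common-sense expects.
candidates = [(0.0, 0.3), (0.0, -0.4)]
best = max(candidates, key=lambda o: placement_score("lamp", "desk", o))
```

Scoring candidate placements against such priors is one simple way a generator can fill in details the text leaves unstated.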
