Generate Any Scene:
Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

University of Washington · Allen Institute for AI   *Equal Contribution

Abstract

Generative models like DALL-E and Sora have gained attention by producing implausible images, such as “astronauts riding a horse in space.” Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages Scene Graph Programming, a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. Additionally, we demonstrate the effectiveness of Generate Any Scene through three practical applications that leverage captions generated by Generate Any Scene:

  1. A self-improving framework where models iteratively enhance their performance using generated data.
  2. A distillation process to transfer specific strengths from proprietary models to open-source counterparts.
  3. Improvements in content moderation by identifying and generating challenging synthetic data.

Scene Graph Programming

Metadata

To construct a scene graph, we use three main metadata types: 28,787 objects, 1,494 attributes, and 10,492 relations. We also use 2,193 scene attributes that capture broader aspects of the caption, such as art style, to produce a complete visual caption.
Metadata Type       Number    Source
Objects             28,787    WordNet
Attributes           1,494    Wikipedia, etc.
Relations           10,492    Robin
Scene Attributes     2,193    Places365, etc.
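
As a rough illustration only (the file names and loading code below are hypothetical and not part of the released toolkit), this metadata can be thought of as a set of named pools that the scene graph sampler draws from:

    import json
    import random

    # Hypothetical on-disk layout: one JSON list of strings per metadata type.
    # The actual Generate Any Scene release may organize its metadata differently.
    METADATA_FILES = {
        "objects": "objects.json",               # ~28,787 entries (WordNet)
        "attributes": "attributes.json",         # ~1,494 entries (Wikipedia, etc.)
        "relations": "relations.json",           # ~10,492 entries (Robin)
        "scene_attributes": "scene_attrs.json",  # ~2,193 entries (Places365, etc.)
    }

    def load_metadata(files=METADATA_FILES):
        """Load each metadata pool as a list of strings."""
        return {name: json.load(open(path)) for name, path in files.items()}

    def sample_pool(metadata, name, k=1, rng=random):
        """Draw k distinct items from one metadata pool."""
        return rng.sample(metadata[name], k)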

Caption Generation Process

The generation pipeline of Generate Any Scene (a minimal code sketch follows the steps):
  • Step 1: The system enumerates scene graph structures containing objects, attributes, and relations at a requested complexity, and retrieves a structure that satisfies those requirements.
  • Step 2: It populates the structure with metadata, assigning specific content to each node; the scene graph is complete after this step.
  • Step 3: Scene attributes, such as art style and camera settings, are sampled to provide contextual depth beyond the scene graph itself.
  • Step 4: The system combines the scene graph with the sampled scene attributes and translates them into a coherent caption by organizing the elements into structured text.
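
The sketch below is a minimal, hypothetical rendering of these four steps (it assumes a metadata dict of string lists like the pools sketched above; the names and the caption template are illustrative, not the authors' implementation):

    import random
    from dataclasses import dataclass, field

    @dataclass
    class SceneGraph:
        objects: list = field(default_factory=list)      # node contents
        attributes: dict = field(default_factory=dict)   # object -> list of attributes
        relations: list = field(default_factory=list)    # (subject, relation, object) triples

    def build_scene_graph(metadata, n_objects=3, n_attributes=2, n_relations=2, rng=random):
        """Steps 1-2: choose a structure at the requested complexity and populate it."""
        graph = SceneGraph(objects=rng.sample(metadata["objects"], n_objects))
        for obj in rng.sample(graph.objects, min(n_attributes, n_objects)):
            graph.attributes[obj] = [rng.choice(metadata["attributes"])]
        for _ in range(n_relations):
            subject, obj = rng.sample(graph.objects, 2)
            graph.relations.append((subject, rng.choice(metadata["relations"]), obj))
        return graph

    def graph_to_caption(graph, metadata, rng=random):
        """Steps 3-4: sample a scene attribute and serialize the graph into text."""
        scene_attribute = rng.choice(metadata["scene_attributes"])  # e.g. an art style
        noun_phrases = [" ".join(graph.attributes.get(o, []) + [o]) for o in graph.objects]
        relation_clauses = [f"{s} {r} {o}" for s, r, o in graph.relations]
        return (", ".join(noun_phrases) + "; " + "; ".join(relation_clauses)
                + f", in the style of {scene_attribute}.")
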
Overall Results

Figure: Comparative evaluation of text-to-image models across different backbones (DiT and UNet) using multiple metrics: TIFA, PickScore, VQAScore, and ImageReward (evaluated on 10K Generate Any Scene captions).
Figure: Overall performance of text-to-video models on 10K Generate Any Scene captions. Red cells mark the highest score; yellow cells the second-highest.
Figure: Overall performance of text-to-video models on 10K Generate Any Scene captions with VBench metrics. Red cells mark the highest score; blue cells the lowest.
Figure: Overall performance of text-to-3D models on 10K Generate Any Scene captions with VBench metrics.
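
As a schematic of how such an evaluation might be wired up (the generator and metric callables below are placeholders, not the exact tooling behind these results), each model is scored by generating one image per caption and averaging each metric:

    from statistics import mean

    def evaluate_model(generate_image, metrics, captions):
        """Average each metric over all captions for one text-to-image model.

        generate_image: callable mapping a caption string to an image.
        metrics: dict of name -> callable(image, caption) -> float,
                 e.g. placeholders for TIFA, PickScore, VQAScore, ImageReward.
        """
        scores = {name: [] for name in metrics}
        for caption in captions:
            image = generate_image(caption)
            for name, metric_fn in metrics.items():
                scores[name].append(metric_fn(image, caption))
        return {name: mean(values) for name, values in scores.items()}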

Applications

We propose three applications of Generate Any Scene.
  • Application 1 (Self-improving): Iteratively enhances a model by generating images from Generate Any Scene captions, selecting the best, and fine-tuning on them, yielding a consistent performance boost.
  • Application 2 (Distilling strengths from proprietary models): Distills strengths of proprietary models, such as better compositionality and understanding of hard concepts, into open-source counterparts.
  • Application 3 (Generated-content detection): Robustifies AI-generated content detection by training on diverse synthetic data generated from Generate Any Scene's captions.

Application 1:

Figure: Average VQA score of Stable Diffusion v1.5 fine-tuned on different data, evaluated on the 1K Generate Any Scene image/video evaluation set and the GenAI-Bench image/video benchmarks.
We show that our diverse captions enable a framework that iteratively improves text-to-vision models using their own generations. Given a model, we generate multiple images per caption, identify the highest-scoring one, and use it as new fine-tuning data for the model itself. Fine-tuning Stable Diffusion v1.5 this way yields an average 5% performance boost over the original model, and it even outperforms fine-tuning on the same amount of real image-caption pairs from Conceptual Captions (CC3M) across benchmarks.
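
A minimal sketch of one round of this self-improving loop (best-of-N selection followed by fine-tuning; model.generate, model.finetune, and score_fn are placeholder interfaces, not the paper's code):

    def self_improve_round(model, captions, score_fn, n_samples=4):
        """Generate n_samples images per caption, keep the highest-scoring one,
        and fine-tune the model on the resulting (caption, image) pairs."""
        finetune_pairs = []
        for caption in captions:
            candidates = [model.generate(caption) for _ in range(n_samples)]
            best_image = max(candidates, key=lambda image: score_fn(image, caption))
            finetune_pairs.append((caption, best_image))
        model.finetune(finetune_pairs)  # e.g. LoRA or full fine-tuning on the selected pairs
        return model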

Application 2:

Figure: Examples of images generated by DALL-E 3, the original Stable Diffusion v1.5, and the fine-tuned versions. The left four columns demonstrate fine-tuning with multi-object captions generated by Generate Any Scene for better compositionality, while the right two columns focus on understanding hard concepts.
Using our evaluations, we identify capabilities where proprietary models excel but open-source models fall short, and then distill those specific capabilities from the proprietary models. For example, DALL-E 3 is particularly strong at generating compositional images with multiple parts. We distill this capability into Stable Diffusion v1.5, effectively narrowing the gap between DALL-E 3 and Stable Diffusion v1.5.
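
A rough sketch of this distillation recipe (all interfaces below are placeholders: a teacher call such as a DALL-E 3 API, a faithfulness scorer, and a student model with a fine-tuning hook):

    def distill_capability(teacher_generate, student, targeted_captions, score_fn, threshold=0.8):
        """Fine-tune the student on teacher images for captions that probe a
        capability the student lacks, e.g. multi-object compositional prompts."""
        pairs = []
        for caption in targeted_captions:
            image = teacher_generate(caption)          # proprietary teacher, e.g. DALL-E 3
            if score_fn(image, caption) >= threshold:  # keep only faithful teacher outputs
                pairs.append((caption, image))
        student.finetune(pairs)
        return student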

Application 3:

Figure: Comparison of detection performance across different data scales using D3 alone versus the combined D3 + Generate Any Scene training set in cross-model and cross-dataset scenarios.
Content moderation is a vital application, especially as text-to-vision models improve. We identify the kinds of generated content that detectors struggle to flag, generate more of such content with Generate Any Scene, and retrain the detectors. Training a ViT-Tiny on our generated data boosts its detection capabilities across benchmarks.
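
A schematic of the retraining step, assuming PyTorch data loaders of real and generated images (the loaders, hyperparameters, and the timm model name are assumptions; the paper's actual training setup may differ):

    import torch
    import timm

    def train_detector(real_loader, generated_loader, epochs=5, lr=1e-4, device="cuda"):
        """Train a binary real-vs-generated classifier; the generated side mixes an
        existing detection set (e.g. D3) with images rendered from Generate Any Scene captions."""
        model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=2).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for (real_images, _), (generated_images, _) in zip(real_loader, generated_loader):
                images = torch.cat([real_images, generated_images]).to(device)
                labels = torch.cat([torch.zeros(len(real_images)),
                                    torch.ones(len(generated_images))]).long().to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                optimizer.step()
        return model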

Please refer to the paper for more experiments, analysis, and takeaways!

BibTeX

    @misc{gao2024generatesceneevaluatingimproving,
          title={Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming},
          author={Ziqi Gao and Weikai Huang and Jieyu Zhang and Aniruddha Kembhavi and Ranjay Krishna},
          year={2024},
          eprint={2412.08221},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.08221},
    }