Generate Any Scene:
Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

University of Washington · Allen Institute for AI   *Equal Contribution

Abstract

Generative models like DALL-E and Sora have gained attention by producing implausible images, such as “astronauts riding a horse in space.” Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages Scene Graph Programming, a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. Additionally, we demonstrate the effectiveness of Generate Any Scene through three practical applications that leverage captions generated by Generate Any Scene:

  1. A self-improving framework where models iteratively enhance their performance using generated data.
  2. A distillation process to transfer specific strengths from proprietary models to open-source counterparts.
  3. Improvements in content moderation by identifying and generating challenging synthetic data.

Scene Graph Programming

Metadata

To construct a scene graph, we use three main metadata types: 28,787 objects, 1,494 attributes, and 10,492 relations. We also use 2,193 scene attributes that capture broader aspects of the caption, such as art style, to produce a complete visual caption.
Metadata Type       Number    Source
Objects             28,787    WordNet
Attributes           1,494    Wikipedia, etc.
Relations           10,492    Robin
Scene Attributes     2,193    Places365, etc.
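
As a rough illustration only (the file names and loading code below are hypothetical and not part of the released toolkit), this metadata can be thought of as a set of named pools that the scene graph sampler draws from:

    import json
    import random

    # Hypothetical on-disk layout: one JSON list of strings per metadata type.
    # The actual Generate Any Scene release may organize its metadata differently.
    METADATA_FILES = {
        "objects": "objects.json",               # ~28,787 entries (WordNet)
        "attributes": "attributes.json",         # ~1,494 entries (Wikipedia, etc.)
        "relations": "relations.json",           # ~10,492 entries (Robin)
        "scene_attributes": "scene_attrs.json",  # ~2,193 entries (Places365, etc.)
    }

    def load_metadata(files=METADATA_FILES):
        """Load each metadata pool as a list of strings."""
        return {name: json.load(open(path)) for name, path in files.items()}

    def sample_pool(metadata, name, k=1, rng=random):
        """Draw k distinct items from one metadata pool."""
        return rng.sample(metadata[name], k)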

Caption Generation Process

The generation pipeline of Generate Any Scene (a minimal code sketch follows the steps):
  • Step 1: The system enumerates scene graph structures containing objects, attributes, and relations at a requested complexity, and retrieves a structure that satisfies those requirements.
  • Step 2: It populates the structure with metadata, assigning specific content to each node; the scene graph is complete after this step.
  • Step 3: Scene attributes, such as art style and camera settings, are sampled to provide contextual depth beyond the scene graph itself.
  • Step 4: The system combines the scene graph with the sampled scene attributes and translates them into a coherent caption by organizing the elements into structured text.
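
The sketch below is a minimal, hypothetical rendering of these four steps (it assumes a metadata dict of string lists like the pools sketched above; the names and the caption template are illustrative, not the authors' implementation):

    import random
    from dataclasses import dataclass, field

    @dataclass
    class SceneGraph:
        objects: list = field(default_factory=list)      # node contents
        attributes: dict = field(default_factory=dict)   # object -> list of attributes
        relations: list = field(default_factory=list)    # (subject, relation, object) triples

    def build_scene_graph(metadata, n_objects=3, n_attributes=2, n_relations=2, rng=random):
        """Steps 1-2: choose a structure at the requested complexity and populate it."""
        graph = SceneGraph(objects=rng.sample(metadata["objects"], n_objects))
        for obj in rng.sample(graph.objects, min(n_attributes, n_objects)):
            graph.attributes[obj] = [rng.choice(metadata["attributes"])]
        for _ in range(n_relations):
            subject, obj = rng.sample(graph.objects, 2)
            graph.relations.append((subject, rng.choice(metadata["relations"]), obj))
        return graph

    def graph_to_caption(graph, metadata, rng=random):
        """Steps 3-4: sample a scene attribute and serialize the graph into text."""
        scene_attribute = rng.choice(metadata["scene_attributes"])  # e.g. an art style
        noun_phrases = [" ".join(graph.attributes.get(o, []) + [o]) for o in graph.objects]
        relation_clauses = [f"{s} {r} {o}" for s, r, o in graph.relations]
        return (", ".join(noun_phrases) + "; " + "; ".join(relation_clauses)
                + f", in the style of {scene_attribute}.")
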
Overall Results

Figure: Comparative evaluation of text-to-image models across different backbones (DiT and UNet) using multiple metrics: TIFA, PickScore, VQAScore, and ImageReward (evaluated on 10K Generate Any Scene captions).
Figure: Overall performance of text-to-video models on 10K Generate Any Scene captions. Red cells mark the highest score; yellow cells the second-highest.
Figure: Overall performance of text-to-video models on 10K Generate Any Scene captions with VBench metrics. Red cells mark the highest score; blue cells the lowest.
Figure: Overall performance of text-to-3D models on 10K Generate Any Scene captions with VBench metrics.
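
As a schematic of how such an evaluation might be wired up (the generator and metric callables below are placeholders, not the exact tooling behind these results), each model is scored by generating one image per caption and averaging each metric:

    from statistics import mean

    def evaluate_model(generate_image, metrics, captions):
        """Average each metric over all captions for one text-to-image model.

        generate_image: callable mapping a caption string to an image.
        metrics: dict of name -> callable(image, caption) -> float,
                 e.g. placeholders for TIFA, PickScore, VQAScore, ImageReward.
        """
        scores = {name: [] for name in metrics}
        for caption in captions:
            image = generate_image(caption)
            for name, metric_fn in metrics.items():
                scores[name].append(metric_fn(image, caption))
        return {name: mean(values) for name, values in scores.items()}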

Applications

We propose three applications of Generate Any Scene.
  • Application 1 (Self-improving): Iteratively enhances a model by generating images from Generate Any Scene captions, selecting the best, and fine-tuning on them, yielding a consistent performance boost.
  • Application 2 (Distilling strengths from proprietary models): Distills strengths of proprietary models, such as better compositionality and understanding of hard concepts, into open-source counterparts.
  • Application 3 (Generated-content detection): Robustifies AI-generated content detection by training on diverse synthetic data generated from Generate Any Scene's captions.

Application 1:

Figure: Average VQA score of Stable Diffusion v1.5 fine-tuned on different data, evaluated on the 1K Generate Any Scene image/video evaluation set and the GenAI-Bench image/video benchmarks.
We show that our diverse captions enable a framework that iteratively improves text-to-vision models using their own generations. Given a model, we generate multiple images per caption, identify the highest-scoring one, and use it as new fine-tuning data for the model itself. Fine-tuning Stable Diffusion v1.5 this way yields an average 5% performance boost over the original model, and it even outperforms fine-tuning on the same amount of real image-caption pairs from Conceptual Captions (CC3M) across benchmarks.
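
A minimal sketch of one round of this self-improving loop (best-of-N selection followed by fine-tuning; model.generate, model.finetune, and score_fn are placeholder interfaces, not the paper's code):

    def self_improve_round(model, captions, score_fn, n_samples=4):
        """Generate n_samples images per caption, keep the highest-scoring one,
        and fine-tune the model on the resulting (caption, image) pairs."""
        finetune_pairs = []
        for caption in captions:
            candidates = [model.generate(caption) for _ in range(n_samples)]
            best_image = max(candidates, key=lambda image: score_fn(image, caption))
            finetune_pairs.append((caption, best_image))
        model.finetune(finetune_pairs)  # e.g. LoRA or full fine-tuning on the selected pairs
        return model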

Application 2:

Figure: Examples of images generated by DALL-E 3, the original Stable Diffusion v1.5, and the fine-tuned versions. The left four columns demonstrate fine-tuning with multi-object captions generated by Generate Any Scene for better compositionality, while the right two columns focus on understanding hard concepts.
Using our evaluations, we identify capabilities where proprietary models excel but open-source models fall short, and then distill those specific capabilities from the proprietary models. For example, DALL-E 3 is particularly strong at generating compositional images with multiple parts. We distill this capability into Stable Diffusion v1.5, effectively narrowing the gap between DALL-E 3 and Stable Diffusion v1.5.
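
A rough sketch of this distillation recipe (all interfaces below are placeholders: a teacher call such as a DALL-E 3 API, a faithfulness scorer, and a student model with a fine-tuning hook):

    def distill_capability(teacher_generate, student, targeted_captions, score_fn, threshold=0.8):
        """Fine-tune the student on teacher images for captions that probe a
        capability the student lacks, e.g. multi-object compositional prompts."""
        pairs = []
        for caption in targeted_captions:
            image = teacher_generate(caption)          # proprietary teacher, e.g. DALL-E 3
            if score_fn(image, caption) >= threshold:  # keep only faithful teacher outputs
                pairs.append((caption, image))
        student.finetune(pairs)
        return student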

Application 3:

Figure: Comparison of detection performance across different data scales using D3 alone versus the combined D3 + Generate Any Scene training set in cross-model and cross-dataset scenarios.
Content moderation is a vital application, especially as text-to-vision models improve. We identify the kinds of generated content that detectors struggle to flag, generate more of such content with Generate Any Scene, and retrain the detectors. Training a ViT-Tiny on our generated data boosts its detection capabilities across benchmarks.
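
A schematic of the retraining step, assuming PyTorch data loaders of real and generated images (the loaders, hyperparameters, and the timm model name are assumptions; the paper's actual training setup may differ):

    import torch
    import timm

    def train_detector(real_loader, generated_loader, epochs=5, lr=1e-4, device="cuda"):
        """Train a binary real-vs-generated classifier; the generated side mixes an
        existing detection set (e.g. D3) with images rendered from Generate Any Scene captions."""
        model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=2).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for (real_images, _), (generated_images, _) in zip(real_loader, generated_loader):
                images = torch.cat([real_images, generated_images]).to(device)
                labels = torch.cat([torch.zeros(len(real_images)),
                                    torch.ones(len(generated_images))]).long().to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                optimizer.step()
        return model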

Please refer to the paper for more experiments, analysis, and takeaways!

BibTeX

    @misc{gao2024generatesceneevaluatingimproving,
          title={Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming},
          author={Ziqi Gao and Weikai Huang and Jieyu Zhang and Aniruddha Kembhavi and Ranjay Krishna},
          year={2024},
          eprint={2412.08221},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.08221},
    }