Guided Latent Slot Diffusion for Object-Centric Learning

1Technical University of Darmstadt, 2hessian.AI
arXiv 2024

TL;DR 🚀

We present GLASS, a slot attention method that uses language guidance to obtain slot embeddings better suited for various downstream tasks, such as object-level property prediction, conditional generation, and object discovery. GLASS sets a new state of the art (SOTA) on the object discovery and conditional generation tasks compared to existing slot attention-based methods.

Abstract

Network Diagram
Left: We utilize a pre-trained diffusion decoder to generate the guidance signal for obtaining better slot embeddings. Right: Our slot embeddings outperform previous slot attention methods on the tasks of object discovery and conditional generation, while being competitive on the property prediction task.
Slot attention aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable various downstream tasks. Yet, these slots often bind to object parts, not objects themselves, especially for real-world datasets. To address this, we introduce Guided Latent Slot Diffusion – GLASS, an object-centric model that uses generated captions as a guiding signal to better align slots with objects. Our key insight is to learn the slot-attention module in the space of generated images. This allows us to repurpose the pre-trained diffusion decoder model, which reconstructs the images from the slots, as a semantic mask generator based on the generated captions. GLASS learns an object-level representation suitable for multiple tasks simultaneously, e.g., segmentation, image generation, and property prediction, outperforming previous methods. For object discovery, GLASS achieves approx. a +35% and +10% relative improvement for mIoU over the previous SOTA method on the VOC and COCO datasets, respectively, and establishes a new SOTA FID score for conditional image generation amongst slot-attention-based methods. For the segmentation task, GLASS surpasses SOTA weakly-supervised and language-based segmentation models, which were specifically designed for the task.
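The key insight above is repurposing the diffusion decoder's attention maps as a semantic mask generator. A minimal sketch of that idea, with hypothetical names: it assumes per-class cross-attention maps and a pixel-level self-attention matrix have already been extracted from the decoder, and uses a simple affinity-propagation-plus-argmax rule as an illustrative stand-in, not the authors' exact implementation.

```python
import numpy as np

def pseudo_semantic_mask(cross_attn, self_attn, bg_thresh=0.3):
    """Turn diffusion attention maps into a pseudo semantic mask (sketch).

    cross_attn: (HW, C) cross-attention map per class token,
                averaged over heads/layers beforehand.
    self_attn:  (HW, HW) pixel-to-pixel self-attention affinities.
    Returns (HW,) integer class labels, with -1 marking background.
    """
    # Refine the per-class maps by propagating them along
    # self-attention affinities (sharpens object boundaries).
    refined = self_attn @ cross_attn
    # Normalize each class map to [0, 1] so classes are comparable.
    refined /= refined.max(axis=0, keepdims=True) + 1e-8
    # Assign each pixel to its strongest class ...
    labels = refined.argmax(axis=1)
    # ... unless no class responds strongly enough (background).
    labels[refined.max(axis=1) < bg_thresh] = -1
    return labels
```

With an identity self-attention matrix the refinement is a no-op, which makes the thresholding behavior easy to verify on a toy 4-pixel example.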

Method

Network Diagram
Network architecture of GLASS. (1) The input image is fed to a prompt generator to produce a prompt P, obtained by concatenating the generated caption with the class labels extracted from it. (2) A random noise vector, together with the generated prompt P, is used to generate an image with a pre-trained diffusion model. (3) The cross-attention layers of the diffusion model, together with its self-attention layers, are used by the pseudo ground-truth generation module to produce a semantic mask. (4) The generated image is passed through an encoder model (DINOv2) followed by a slot attention module to produce slots. (5) The slots are matched to their corresponding object masks by the Hungarian matcher module. (6) The slot attention module is trained end-to-end with the mean squared error between the reconstructed and the generated image, plus our guidance loss between the mask predicted from each slot and its matched object mask. GLASS is trained on generated images only; the real images are used solely for prompt generation.
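Steps (5) and (6) can be sketched as follows. This is a hedged illustration assuming soft slot masks and binary pseudo ground-truth masks as numpy arrays; the negative-soft-IoU matching cost and binary cross-entropy guidance loss are illustrative stand-ins, not necessarily the exact formulation used by GLASS.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots_to_masks(slot_masks, pseudo_masks):
    """Hungarian matching of slots to pseudo ground-truth masks (sketch).

    slot_masks:   (S, H, W) soft masks predicted from the slots.
    pseudo_masks: (M, H, W) binary pseudo ground-truth object masks.
    Returns (rows, cols): slot index s is matched to mask index cols[i]
    where rows[i] == s, minimizing the total matching cost.
    """
    S, M = slot_masks.shape[0], pseudo_masks.shape[0]
    cost = np.zeros((S, M))
    for s in range(S):
        for m in range(M):
            inter = (slot_masks[s] * pseudo_masks[m]).sum()
            union = slot_masks[s].sum() + pseudo_masks[m].sum() - inter
            cost[s, m] = -inter / (union + 1e-8)  # negative soft IoU
    return linear_sum_assignment(cost)

def guidance_loss(slot_masks, pseudo_masks, rows, cols, eps=1e-8):
    """Binary cross-entropy between each matched slot mask and its
    pseudo ground-truth mask (one illustrative choice of guidance loss)."""
    p = np.clip(slot_masks[rows], eps, 1.0 - eps)
    t = pseudo_masks[cols]
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())
```

In training, this guidance term would be added to the mean-squared reconstruction error between the decoded and the generated image to form the total loss of step (6).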

Results

GLASS, and its variant trained with ground-truth class labels, are representation learning methods that support multiple downstream tasks: object discovery (OD), property prediction (PP), and conditional generation (CG). We show results for all three tasks.

Object Discovery

We compare our method against language-based segmentation models (top partition), weakly-supervised models (middle partition), and OCL models (bottom partition) for the object discovery task. “Downstream Tasks” denotes a model’s capability for solving the following tasks: OD: object discovery, PP: object-level property prediction, and CG: conditional generation. Input denotes the input signal the model trains on, where I: image, C: captions, and L: image-level labels. The best value is highlighted in red, the second best in blue. We show the relative improvement (in parentheses) of GLASS compared to the best OCL method. Our method outperforms all models on the COCO dataset and is comparable to the best model on the VOC dataset (ToCo).

Qualitative Results

Quantitative Comparison with other Weakly-Supervised OCL Methods

GLASS outperforms other weakly-supervised variants of StableLSD, such as StableLSD-BBox (StableLSD using bounding-box information to initialize the slots) and StableLSD-Dynamic (StableLSD with a dynamic number of slots equal to the number of objects present in the scene).

Conditional Generation

GLASS and its ground-truth-label variant both outperform StableLSD on the task of conditional generation in terms of FID on the COCO and VOC datasets.

Object-level Property Prediction

An ideal model should have a high detection rate and high accuracy. GLASS and its ground-truth-label variant show a slight drop in accuracy (approx. 2%) but a larger gain in detection rate compared to StableLSD. Here, □ and ♢ denote results on the COCO and VOC datasets, respectively.

BibTeX

@article{singh2024synthetic,
  author  = {Krishnakant Singh and Simone Schaub-Meyer and Stefan Roth},
  title   = {Guided Latent Slot Diffusion for Object-Centric Learning},
  journal = {arXiv:2407.17929 [cs.CV]},
  year    = {2024},
}