We present GLASS, a slot attention method that uses language guidance to obtain slot embeddings better suited to downstream tasks such as object-level property prediction, conditional generation, and object discovery. GLASS sets a new state of the art (SOTA) for the object discovery and conditional generation tasks among existing slot attention-based methods.
GLASS and GLASS† (GLASS with ground-truth class labels) are representation learning methods that support multiple downstream tasks: object discovery (OD), property prediction (PP), and conditional generation (CG). We show results for all three tasks.
We compare our method against language-based segmentation models (top partition), weakly-supervised models (middle partition), and OCL models (bottom partition) for the object discovery task. “Downstream Tasks” denotes a model’s capability for solving the following tasks: OD: object discovery, PP: object-level property prediction, and CG: conditional generation. Input denotes the input signal the model trains on, where I: image, C: captions, and L: image-level labels. The best value is highlighted in red, the second best in blue. We show the relative improvement (in parentheses) of GLASS compared to the best OCL method. Our method outperforms all models on the COCO dataset and is comparable to the best model on the VOC dataset (ToCo).
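The relative improvement reported in parentheses can be reproduced with a small helper. This is a minimal sketch that assumes the common definition (ours minus baseline, divided by baseline); the exact formula used for the table is not stated here, the function name `relative_improvement` is hypothetical, and the scores below are made-up illustration values, not results.

```python
def relative_improvement(ours: float, baseline: float,
                         higher_is_better: bool = True) -> float:
    """Relative improvement of `ours` over `baseline`, in percent.

    Assumed definition: (ours - baseline) / |baseline| * 100.
    For metrics where lower is better (e.g. FID), the sign is flipped
    so that a positive result always means "ours is better".
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (ours - baseline) / abs(baseline) * 100.0
    return change if higher_is_better else -change


# Hypothetical scores, for illustration only (not taken from the results).
print(f"{relative_improvement(60.0, 50.0):+.1f}%")  # prints "+20.0%"
```

For a metric where lower values are better, pass `higher_is_better=False` so that a reduction is reported as a positive improvement.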
GLASS outperforms other weakly-supervised variants of StableLSD: StableLSD-BBox, which uses bounding box information to initialize the slots, and StableLSD Dynamic, which sets the number of slots dynamically to the number of objects present in the scene.
GLASS and GLASS† outperform StableLSD on the task of conditional generation in terms of FID on the COCO and VOC datasets.
An ideal model should have both a high detection rate and high accuracy. Compared to StableLSD, GLASS and GLASS† show a slight drop in accuracy (approx. 2%) but a larger gain in detection rate. Here, □ and ♢ show the results for the COCO and VOC datasets, respectively.
@inproceedings{singh2024synthetic,
author = {Krishnakant Singh and Simone Schaub-Meyer and Stefan Roth},
title = {Guided Latent Slot Diffusion for Object-Centric Learning},
booktitle = {arXiv:2407.17929 [cs.CV]},
year = {2024},
}