Evaluating Object-Centric Models beyond Object Discovery

TL;DR: We evaluate object-centric representations with VLMs for real visual reasoning and introduce a unified metric that jointly measures localization and representation usefulness.

Issues with existing evaluations

Disjoint Evaluation.

Disjoint evaluation. Separate metrics miss localization and representation fragmentation: (top) M1 and M2 score equally despite different localization quality; (bottom) VQA probing cannot distinguish correct-slot answers (M4) from wrong-slot answers (M3).

Limited Evaluation.

Limited evaluation. Existing probes are costly and limited to simple classification tasks; VLMs scale but still suffer from disjoint evaluation. VLMs + AwGA provide a unified, fragmentation-aware evaluation.

VLMs to the rescue

Training.
Our training procedure is akin to LLaVA's. In Stage I, only the MLP connector is trained, using the LLaVA pre-training dataset. This aligns the slot embeddings with the language model's embedding space. In Stage II, the MLP connector and the language model are both trained on the LLaVA instruction-tuning dataset. This enables the language model to follow instructions and perform tasks using slots as visual tokens.
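The two-stage schedule can be summarized as a small configuration sketch. This is only an illustration: the module names (`slot_encoder`, `mlp_connector`, `language_model`), the dataset labels, and the helper `requires_grad_flags` are hypothetical, and keeping the slot encoder frozen in both stages is our reading of the description above rather than a confirmed detail.

```python
from dataclasses import dataclass

# Hypothetical module names for the three parts of the pipeline.
MODULES = ("slot_encoder", "mlp_connector", "language_model")


@dataclass(frozen=True)
class StageConfig:
    name: str
    dataset: str          # illustrative label, not a real dataset identifier
    trainable: frozenset  # modules that receive gradient updates


STAGES = {
    # Stage I: only the MLP connector is trained (slot-to-language alignment).
    1: StageConfig("alignment", "llava_pretraining",
                   frozenset({"mlp_connector"})),
    # Stage II: the connector and the language model are trained together.
    2: StageConfig("instruction_tuning", "llava_instruction_tuning",
                   frozenset({"mlp_connector", "language_model"})),
}


def requires_grad_flags(stage: int) -> dict:
    """Map each module name to whether it is updated in the given stage."""
    cfg = STAGES[stage]
    return {module: module in cfg.trainable for module in MODULES}


stage1_flags = requires_grad_flags(1)
stage2_flags = requires_grad_flags(2)
```

In a real training script, these flags would typically be applied by toggling gradient tracking on the parameters of each module before building the optimizer.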

Evaluation.
We evaluate in a zero-shot fashion on various VQA benchmarks, which lets us assess the usefulness of OCL models for complex visual reasoning tasks.

AwGA: Enabling joint evaluation

Accuracy and mIoU evaluate usefulness and localization separately; G-Acc partially unifies them but misses representation fragmentation, while AwGA jointly evaluates both and penalizes both fragmentation types.
AwGA is computed in three steps: first, compute the attribution score of each slot with respect to the predicted answer; second, select the top K slots with the highest attributions; third, compute the mean intersection over union (mIoU) using the union of their predicted masks.
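The localization part of this procedure can be sketched for a single sample as follows. This is a minimal illustration, not the paper's implementation: the function name `topk_union_iou` and the toy masks are hypothetical, attribution scores are assumed to be given, and how the per-sample IoU is combined with answer correctness and averaged over a dataset is not covered here.

```python
import numpy as np


def topk_union_iou(attributions, slot_masks, target_mask, k):
    """IoU between the union of the top-k attributed slot masks and a target mask.

    attributions: (num_slots,) attribution of each slot w.r.t. the predicted answer
    slot_masks:   (num_slots, H, W) boolean mask predicted for each slot
    target_mask:  (H, W) boolean ground-truth mask of the queried region
    """
    attributions = np.asarray(attributions, dtype=float)
    slot_masks = np.asarray(slot_masks, dtype=bool)
    target_mask = np.asarray(target_mask, dtype=bool)

    # Indices of the k slots with the highest attribution scores.
    top = np.argsort(attributions)[::-1][:k]
    # Union of the selected slots' masks.
    union_mask = slot_masks[top].any(axis=0)

    intersection = np.logical_and(union_mask, target_mask).sum()
    union = np.logical_or(union_mask, target_mask).sum()
    return float(intersection / union) if union > 0 else 0.0


# Toy example on a 2x4 grid: slot 0 covers columns 0-1, slot 1 column 2, slot 2 column 3.
slot_masks = np.zeros((3, 2, 4), dtype=bool)
slot_masks[0, :, 0:2] = True
slot_masks[1, :, 2] = True
slot_masks[2, :, 3] = True
target = np.zeros((2, 4), dtype=bool)
target[:, 0:3] = True  # the object spans columns 0-2

iou_fragmented = topk_union_iou([0.6, 0.3, 0.1], slot_masks, target, k=1)
iou_complete = topk_union_iou([0.6, 0.3, 0.1], slot_masks, target, k=2)
iou_wrong_slot = topk_union_iou([0.1, 0.2, 0.7], slot_masks, target, k=1)
```

In the toy example, an object split across two slots is only fully covered when both slots are selected, and a prediction attributed to a slot outside the object gets zero overlap, which is the kind of failure a joint metric is meant to expose.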
Formally, it is written as:

Results

General Perception Benchmarks

Takeaway 1. Under our VLM evaluation, OCL models are competitive despite far fewer tokens, with feature-based and hybrid reconstructions performing best.

Robust Perception Benchmarks

Takeaway 2. Under VLM probing, OCL models perform well on OOD and numeric counterfactual tasks but trail DINOv2 on compositional and adversarial benchmarks, with feature reconstruction being more robust.

Mismatch between Object Discovery and Representation Usefulness

Takeaway 3. In our VLM-based setting, object discovery scores poorly reflect representation usefulness, motivating joint localization–utility metrics.

Joint evaluation of "what" and "where"

Takeaway 4. Object discovery or semantic accuracy alone is insufficient; AwGA jointly evaluates what and where and penalizes both localization and representation fragmentation.

Grounding Failure Examples

Grounding failures. DINOSAURv2 and StableLSD achieve high accuracy or G-Acc but low AwGA because answer-attributed slots poorly overlap with the grounded masks.

BibTeX

@inproceedings{singh2026evaluating,
  author    = {Krishnakant Singh and Simone Schaub-Meyer and Stefan Roth},
  title     = {Evaluating Object-Centric Models beyond Object Discovery},
  booktitle = {arXiv: [cs.CV]},
  year      = {2026},
}