Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection

Zhuo Xu1*, Xiang Xiang1,2*, Yifan Liang1
1 National Key Lab of Multi-Spectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China; 2 Peng Cheng National Laboratory, Shenzhen, China
CVPR 2025 (Highlight)

*Equal contribution; co-first author

Correspondence to xex@hust.edu.cn; also with School of CST

Abstract

Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background in images leads to significant shortcut issues that adversely affect out-of-distribution (OOD) detection ability. When confronted with a background OOD sample, VLMs are prone to misidentifying it as in-distribution (ID) data. In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp, which includes background decoupling and mask-guided region regularization. We first decouple images into ID-relevant and ID-irrelevant regions and utilize the latter to generate a large number of augmented OOD background samples as pseudo-OOD supervision. We then use the masks from background decoupling to adjust the model's attention, minimizing its focus on ID-irrelevant regions. To assess the model's robustness against background interference, we introduce a new OOD evaluation dataset, ImageNet-Bg, which consists solely of background images with all ID-relevant regions removed. Our method demonstrates exceptional performance in few-shot scenarios, achieving strong results even in the one-shot setting, and outperforms existing methods.

Pipeline

Method overview
Pipeline of OSPCoOp. We first decouple the images into foreground and background to provide pseudo-OOD supervision. Subsequently, we generate ID and pseudo-OOD augmented samples and then use the masks obtained during the background decoupling process to provide regional supervision for ID and OOD samples, guiding the model to ignore the ID-irrelevant regions.
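The decoupling and region-regularization steps above can be sketched in a few lines. This is a minimal illustrative sketch, assuming a binary segmentation mask (1 = ID-relevant) is available; the function names and the simple attention-mass penalty are our own illustration, not OSPCoOp's actual implementation.

```python
import numpy as np

def decouple(image, mask):
    """Split an image into ID-relevant (foreground) and ID-irrelevant
    (background) regions using a binary mask (1 = ID-relevant)."""
    fg = image * mask          # keep foreground pixels, zero the rest
    bg = image * (1.0 - mask)  # background-only view, usable as pseudo-OOD
    return fg, bg

def region_regularized_attention_loss(attn, mask, eps=1e-8):
    """Illustrative mask-guided penalty: the fraction of (nonnegative)
    attention mass falling on ID-irrelevant pixels. Minimizing it pushes
    the model to ignore background regions."""
    bg = 1.0 - mask
    return float((attn * bg).sum() / (attn.sum() + eps))
```

For a uniform attention map over an image whose left half is ID-relevant, the penalty is 0.5; as attention concentrates on the foreground, it approaches 0.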

Dataset ImageNet-Bg

ImageNet-Bg overview
ImageNet-Bg examples
To evaluate the robustness of the model against background interference, we propose ImageNet-Bg, an ImageNet background-interference test set of 48,285 images built from the ImageNet validation set. Every image in this dataset is generated by removing the ID-relevant regions from a validation sample. We further filter these images to obtain ImageNet-Bg(S), a subset of 24,863 images containing purer background information.
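The construction above can be sketched as follows. This is a simplified illustration assuming an ID-relevance mask per image: here the ID-relevant pixels are filled with a constant, and a background-area ratio stands in for the filtering criterion; the actual removal and the ImageNet-Bg(S) filter used in the paper may differ.

```python
import numpy as np

def make_background_image(image, id_mask, fill=0.0):
    """Remove ID-relevant regions (id_mask == 1), keeping background only.
    A constant fill is used here; inpainting could be substituted."""
    out = image.copy()
    out[id_mask.astype(bool)] = fill
    return out

def background_ratio(id_mask):
    """Fraction of pixels that are background. Thresholding this ratio
    is one plausible way to select 'purer background' images."""
    return 1.0 - float(id_mask.mean())
```

For example, an image whose top-left quadrant is ID-relevant has a background ratio of 0.75; a high threshold on this ratio keeps only images dominated by background.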

Results on ImageNet and ImageNet-Bg


(Left) Results on the ImageNet-1K benchmark with iNaturalist, SUN, Places, and Texture as OOD datasets. Our method achieves an average of 94.75% AUROC and 25.13% FPR, a 0.91% AUROC improvement and a 1.34% FPR reduction over the state-of-the-art few-shot training method. On OOD datasets with significant background interference (e.g., SUN and Places), our method shows notable gains, reaching 96.74% AUROC on SUN and 94.01% AUROC on Places.

(Right) Results on ImageNet-Bg. The results reveal that most methods underperform the training-free CLIP baselines (MCM and GL-MCM). Moreover, parameter-efficient fine-tuning shows limited effectiveness in improving robustness to background interference: extensive parameter tuning may inadvertently reintroduce shortcut learning behaviors, ultimately diminishing the model's robustness to background attacks.

Comparison Visualization

Visualization on the ImageNet-1K ID dataset, where gray regions indicate areas the model identifies as ID-relevant.
Visualization on the SUN (top) and ImageNet-Bg (bottom) OOD datasets, where darker colors indicate higher similarity to ID categories.

BibTeX

@InProceedings{Xu_2025_CVPR,
  author    = {Xu, Zhuo and Xiang, Xiang and Liang, Yifan},
  title     = {Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {15402-15412}
}