Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background in images leads to significant shortcut issues that adversely affect out-of-distribution (OOD) detection abilities. When confronted with a background OOD sample, VLMs are prone to misidentifying it as in-distribution (ID) data. In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp, which combines background decoupling with mask-guided region regularization. We first decouple images into ID-relevant and ID-irrelevant regions and utilize the latter to generate a large number of augmented OOD background samples as pseudo-OOD supervision. We then use the masks from background decoupling to adjust the model's attention, minimizing its focus on ID-irrelevant regions. To assess the model's robustness against background interference, we introduce a new OOD evaluation dataset, ImageNet-Bg, which consists solely of background images with all ID-relevant regions removed. Our method demonstrates exceptional performance in few-shot scenarios, achieving strong results even in the one-shot setting, and outperforms existing methods.
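The pseudo-OOD augmentation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the ID-relevant mask has already been obtained by some decoupling step (the paper's mask source is not specified here), and simply erases the foreground to leave a background-only sample.

```python
import numpy as np

def make_background_ood(image, id_mask, fill_value=0.0):
    """Turn an ID image into a pseudo-OOD background sample.

    image:     H x W x C float array
    id_mask:   H x W boolean array, True where the region is ID-relevant
    fill_value: value written into the erased foreground region
    """
    out = image.copy()
    out[id_mask] = fill_value  # erase ID-relevant pixels, keep only background
    return out

# Hypothetical usage: a 4x4 RGB image with a 2x2 "object" in the corner.
img = np.ones((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
bg_sample = make_background_ood(img, mask)
```

In practice many such samples, cropped or recombined from different images, would serve as the pseudo-OOD supervision signal during prompt tuning.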
(Left) Results on the ImageNet-1K benchmark with the iNaturalist, SUN, Places, and Texture datasets. Our method achieves an average of 94.75% AUR and 25.13% FPR, improving AUR by 0.91% and reducing FPR by 1.34% relative to the state-of-the-art few-shot training method. For OOD datasets with significant background interference (e.g., SUN and Places), our method shows notable gains, reaching 96.74% AUR on SUN and 94.01% AUR on Places.
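For readers unfamiliar with the two metrics reported above, a minimal sketch of how they are typically computed from per-sample ID scores (higher score = more likely ID) is given below; the function names are mine, and ties in the AUROC rank computation are ignored for brevity.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a random ID sample scores above a random OOD sample."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks (ties ignored)
    n_id, n_ood = len(id_scores), len(ood_scores)
    u = ranks[:n_id].sum() - n_id * (n_id + 1) / 2
    return u / (n_id * n_ood)

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps 95% of ID samples."""
    thresh = np.percentile(id_scores, 5)
    return float(np.mean(np.asarray(ood_scores) >= thresh))
```

Higher AUROC and lower FPR both indicate better separation between ID and OOD score distributions.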
(Right) Results on ImageNet-Bg. The results reveal that most methods underperform the training-free CLIP baselines (MCM and GL-MCM). Moreover, parameter-efficient fine-tuning demonstrates limited effectiveness in enhancing model robustness against background interference: extensive parameter tuning may inadvertently reintroduce shortcut learning behaviors, ultimately diminishing the model's robustness against background attacks.
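The MCM baseline mentioned above scores an image by the maximum softmax probability over temperature-scaled cosine similarities between the image embedding and the class-prompt text embeddings. A minimal sketch, assuming L2-normalized CLIP-style embeddings (the actual CLIP encoders are not invoked here):

```python
import numpy as np

def mcm_score(image_feat, text_feats, temperature=0.01):
    """Maximum Concept Matching (MCM) OOD score.

    image_feat: (d,)  L2-normalized image embedding
    text_feats: (K, d) L2-normalized class-prompt embeddings
    Returns the max softmax probability; high for ID, low for OOD.
    """
    sims = text_feats @ image_feat        # cosine similarities to each class prompt
    logits = sims / temperature
    probs = np.exp(logits - logits.max()) # numerically stable softmax
    probs /= probs.sum()
    return float(probs.max())
```

An image is flagged as OOD when its score falls below a threshold chosen on ID data; GL-MCM extends this with local (patch-level) alignment, which is not sketched here.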
@InProceedings{Xu_2025_CVPR,
author = {Xu, Zhuo and Xiang, Xiang and Liang, Yifan},
title = {Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {15402--15412}
}