响马读paper

An academic exchange community that requires members to read at least one paper per month and check in.

2023, arXiv. DOI: 10.48550/arXiv.2311.02612. arXiv ID: 2311.02612
Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection
Jiangning Zhang, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu
Abstract:
Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding
capabilities, making it possible to handle certain tasks through the Visual
Question Answering (VQA) paradigm. This paper explores the potential of
VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and
is the first to conduct qualitative and quantitative evaluations on the popular
MVTec AD and VisA datasets. Considering that this task requires both
image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three
components: 1) Granular Region Division, 2) Prompt Designing, 3)
Text2Segmentation for easy quantitative evaluation; we also make several
different attempts for comparative analysis. The results show that GPT-4V can
achieve certain results in the zero-shot AD task through a VQA paradigm, such
as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec
AD and VisA datasets, respectively. However, its performance still has a
certain gap compared to state-of-the-art zero-shot methods, e.g., WinCLIP
and CLIP-AD, and further research is needed. This study provides a baseline
reference for research on VQA-oriented LMMs in the zero-shot AD task, and we
also propose several possible future directions. Code is available at
\url{https://github.com/zhangzjn/GPT-4V-AD}.
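The abstract only names the three components, so the following is a minimal sketch of how such a pipeline could be wired together, assuming SLIC superpixels as one possible choice of region division and a hypothetical `query_gpt4v()` helper standing in for the actual VQA call; it is an illustration under those assumptions, not the authors' released code (see the linked repository for that).

```python
# Sketch of the three-stage GPT-4V-AD idea: region division -> prompt -> VQA -> mask.
# query_gpt4v() is a hypothetical stand-in for the LMM call; SLIC is an assumed
# region-division choice, not necessarily the paper's exact method.
import re
import numpy as np
from skimage.segmentation import slic  # superpixels as one possible region division


def granular_region_division(image: np.ndarray, n_regions: int = 50) -> np.ndarray:
    """Split the image into numbered regions (here: SLIC superpixels, labels 1..K)."""
    return slic(image, n_segments=n_regions, start_label=1)


def build_prompt(n_regions: int) -> str:
    """Prompt Designing: ask which numbered regions look anomalous."""
    return (
        f"The image is divided into {n_regions} numbered regions. "
        "Which region numbers contain defects or anomalies? "
        "Answer with a comma-separated list of region numbers, or 'none'."
    )


def text2segmentation(answer: str, regions: np.ndarray) -> np.ndarray:
    """Text2Segmentation: map the textual answer back to a pixel-level anomaly mask."""
    anomalous_ids = [int(x) for x in re.findall(r"\d+", answer)]
    return np.isin(regions, anomalous_ids).astype(np.uint8)


def gpt4v_ad(image: np.ndarray, query_gpt4v) -> np.ndarray:
    """End-to-end zero-shot AD for one image; query_gpt4v is a hypothetical VQA callable."""
    regions = granular_region_division(image)
    prompt = build_prompt(int(regions.max()))
    answer = query_gpt4v(image, regions, prompt)  # hypothetical LMM call
    return text2segmentation(answer, regions)
```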
2024-01-31 22:50:00
#paper arxiv.org//pdf/2311.026 2023 Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection. The Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V for the recently popular visual Anomaly Detection (AD) task and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image- and pixel-level evaluation, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation; several different attempts are made for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through the VQA paradigm, e.g., image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on the MVTec AD and VisA datasets, respectively. However, its performance still falls short of state-of-the-art zero-shot methods such as WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task.
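As a companion to the image-/pixel-level AU-ROC numbers quoted above, here is a small sketch of how such an evaluation is commonly computed with scikit-learn, assuming per-image anomaly score maps and ground-truth masks are already available (MVTec AD / VisA loading is omitted); taking the maximum pixel score as the image-level score is one common convention, not necessarily the paper's exact protocol.

```python
# Sketch of image-level and pixel-level AU-ROC evaluation, assuming
# pred_masks are HxW anomaly score maps in [0, 1] and gt_masks are HxW
# binary ground-truth masks for the same images.
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate(pred_masks, gt_masks):
    # Pixel-level AU-ROC: flatten all masks and score every pixel.
    pixel_auroc = roc_auc_score(
        np.concatenate([g.ravel() for g in gt_masks]),
        np.concatenate([p.ravel() for p in pred_masks]),
    )
    # Image-level AU-ROC: an image's score is its maximum pixel score;
    # an image is anomalous if its ground-truth mask has any positive pixel.
    image_scores = np.array([p.max() for p in pred_masks])
    image_labels = np.array([int(g.max() > 0) for g in gt_masks])
    image_auroc = roc_auc_score(image_labels, image_scores)
    return image_auroc, pixel_auroc
```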