姗姗来迟 (2023-02-16 20:44):
#paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4247187 Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering
- Inspired by the PAC mechanism, the authors design HIPA. It follows a hierarchical pattern: attention modules first interpret the visual and language features of the two modalities independently, and the aggregated features are then passed into a reasoning cycle.
- Inspired by the psychology of human perception, HIPA divides visual comprehension into three phases: attention, organization, and interpretation. This phased decomposition helps the framework make sense of the visual features.
- The standard deviation of cosine similarity and of Manhattan distance is used as an evaluation metric for the visual and language features (see the sketch after this list).
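The metric in the last bullet is concrete enough to sketch. Below is a minimal Python/NumPy illustration of the idea: group image-question feature vectors by their ground-truth answer, compute pairwise cosine similarity and Manhattan distance within each group, and report the standard deviation of each. The function name and the exact aggregation are assumptions for illustration; the paper's precise evaluation protocol may differ.

```python
import numpy as np

def feature_consistency(features, answers):
    """Sketch (not the authors' code) of the feature-evaluation metric.

    features: (N, D) array of fused image-question feature vectors.
    answers:  length-N array of ground-truth answer labels.

    Within each answer class, compute pairwise cosine similarity and
    Manhattan (L1) distance; return the standard deviation of each
    across all same-answer pairs. A lower spread suggests pairs with
    the same answer are mapped to more consistent features.
    """
    features = np.asarray(features)
    answers = np.asarray(answers)
    cos_vals, man_vals = [], []
    for ans in np.unique(answers):
        group = features[answers == ans]
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
                cos_vals.append((a @ b) / denom)        # cosine similarity
                man_vals.append(np.abs(a - b).sum())    # Manhattan distance
    return np.std(cos_vals), np.std(man_vals)
```

For example, calling `feature_consistency(model_features, answer_labels)` on two different VQA models and comparing the returned standard deviations would reproduce the spirit of the paper's comparison.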
Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering
Abstract:
Recent visual question answering (VQA) frameworks employ different combinations of attention techniques to derive a correct answer. Attention techniques in vision-language tasks have mostly succeeded by improving local features for both modalities. Attention as a concept is firmly grounded in the mechanisms of human cognition, but arbitrary combinations of attention techniques have no such grounding. Neural networks were originally inspired by the structure of the human brain, and many researchers have recently turned to brain-inspired frameworks whose models achieve high performance. To this end, we aim for a framework that draws on human biological and psychological concepts to achieve a good understanding of the vision and language modalities. In this view, we introduce the hierarchical reasoning based on perception action cycle (HIPA) framework to tackle VQA tasks. It integrates the reasoning process over multiple modalities with the perception action cycle (PAC), which explains how humans learn about the surrounding world. It comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. It comprehends the language modality through word-level attention, interpretation, and conditioning. Subsequently, the vision and language modalities are interpreted interdependently, in a cyclic and hierarchical way, throughout the entire framework. To further assess the visual and language features, we argue that image-question pairs with the same answer ought to end up with similar visual and language features. Accordingly, we conduct feature-evaluation experiments using the standard deviation of cosine similarity and of Manhattan distance as metrics. We show that employing PAC in our framework improves the standard deviation compared with other VQA frameworks. We also test the proposed HIPA on visual relationship detection (VRD) tasks. The proposed method achieves state-of-the-art results on the TDIUC and VRD datasets and competitive results on the VQA 2.0 dataset.
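The abstract's phase structure (attend, organize, interpret for vision; attend, interpret, condition for language, repeated cyclically) suggests a possible wiring. The PyTorch sketch below is a guess at that structure using standard multi-head attention blocks; all module names, dimensions, and the number of cycle steps are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class HIPACycleSketch(nn.Module):
    """Illustrative sketch of a PAC-style reasoning cycle.

    Phase names follow the abstract; the actual HIPA layers, fusion,
    and classifier head are not reproduced here.
    """
    def __init__(self, dim=512, heads=8, steps=3):
        super().__init__()
        self.steps = steps  # number of perception-action cycles (assumed)
        mha = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_attend = mha()     # object-level self-attention
        self.v_organize = mha()   # organize relations among objects
        self.q_attend = mha()     # word-level self-attention
        self.v_interpret = mha()  # interpret vision conditioned on language
        self.q_condition = mha()  # condition language on the updated vision

    def forward(self, v, q):
        # v: (B, num_objects, dim) object features
        # q: (B, num_words, dim) word features
        for _ in range(self.steps):  # cycle: each pass refines both modalities
            v, _ = self.v_attend(v, v, v)
            v, _ = self.v_organize(v, v, v)
            q, _ = self.q_attend(q, q, q)
            v, _ = self.v_interpret(v, q, q)   # cross-attention, language as keys/values
            q, _ = self.q_condition(q, v, v)   # cross-attention, vision as keys/values
        return v, q
```

The cyclic loop is the point of the sketch: instead of a single feed-forward pass, each modality is re-interpreted in light of the other on every iteration, mirroring how PAC describes perception feeding action and action reshaping perception.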