姗姗来迟 (2023-05-14 19:34):
#paper Multimodal Graph Transformer for Multimodal Question Answering https://arxiv.org/abs/2305.00581 This work aims to benefit from both worlds (Transformer models and structured graph learning) and proposes a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. It introduces a graph-involved, plug-and-play quasi-attention mechanism that incorporates multimodal graph information, derived from text and visual data, into vanilla self-attention as an effective prior. Specifically, the paper constructs a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, then composes them with the input vision and language features for downstream reasoning. Study notes: https://blog.csdn.net/weixin_44845357/article/details/130577459?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22130577459%22%2C%22source%22%3A%22weixin_44845357%22%7D
Multimodal Graph Transformer for Multimodal Question Answering
Abstract:
Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
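The abstract only describes the mechanism at a high level, so below is a minimal sketch of how "regularizing self-attention with graph information" might look in code. It assumes the graph prior enters as an additive bias on the attention logits, so that non-edge positions are strongly damped; the class name GraphQuasiAttention, the single-head layout, and the mask_value constant are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphQuasiAttention(nn.Module):
    """Sketch of a graph-involved quasi-attention head (assumed form).

    The adjacency matrix (built from the text / dense-region / semantic
    graphs) is turned into an additive bias on the attention logits, so
    attention concentrates on graph-connected token pairs.
    """

    def __init__(self, dim: int, mask_value: float = -1e4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.mask_value = mask_value

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, seq_len, dim) concatenated vision + language features
        # adj: (batch, seq_len, seq_len) 0/1 adjacency from the multimodal graphs
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        # Graph prior: keep logits on edges, push non-edge logits toward -inf.
        graph_bias = (1.0 - adj) * self.mask_value
        attn = F.softmax(logits + graph_bias, dim=-1)
        return torch.matmul(attn, v)

# Toy usage: 2 samples, 5 tokens, 16-dim features, fully connected graph.
x = torch.randn(2, 5, 16)
adj = torch.ones(2, 5, 5)
out = GraphQuasiAttention(16)(x, adj)  # -> shape (2, 5, 16)
```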