Papers shared by user 姗姗来迟.
A total of 8 paper shares are currently listed.
1.
姗姗来迟 (2023-06-30 13:25):
#paper Arabic Dialect Identification with a Few Labeled Examples Using Generative Adversarial Networks https://aclanthology.org/2022.aacl-main.16.pdf Given the challenges and complexity introduced by Dialect Arabic (DA) variations, Transformer-based models such as BERT outperform other models on the DA identification task. However, fine-tuning these models requires a large corpus, and obtaining a large number of high-quality labeled examples for some Arabic dialect classes is challenging and time-consuming. This paper extends the Transformer-based models ARBERT and MARBERT with unlabeled data in a generative adversarial setting, using a Semi-Supervised Generative Adversarial Network (SS-GAN). The model produces high-quality embeddings for Arabic dialect examples and helps the model generalize better on the downstream classification task.
ACL Anthology, 2022.
Abstract:
Given the challenges and complexities introduced while dealing with Dialect Arabic (DA) variations, Transformer based models, e.g., BERT, outperformed other models in dealing with the DA identification task. However, to fine-tune these models, a large corpus is required. Getting a large number high quality labeled examples for some Dialect Arabic classes is challenging and time-consuming. In this paper, we address the Dialect Arabic Identification task. We extend the transformer-based models, ARBERT and MARBERT, with unlabeled data in a generative adversarial setting using Semi-Supervised Generative Adversarial Networks (SS-GAN). Our model enabled producing high-quality embeddings for the Dialect Arabic examples and aided the model to better generalize for the downstream classification task given few labeled examples. Experimental results showed that our model reached better performance and faster convergence when only a few labeled examples are available.
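To make the SS-GAN setup above concrete, here is a minimal sketch in the GAN-BERT style: a generator produces fake "encoder-like" embeddings, and a discriminator classifies embeddings into K dialect classes plus one extra "fake" class. The layer sizes, the losses, and the random tensors standing in for ARBERT/MARBERT [CLS] embeddings are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, NOISE, NUM_DIALECTS = 768, 100, 5  # hypothetical sizes

class Generator(nn.Module):
    """Maps random noise to a fake 'encoder-like' embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE, HIDDEN), nn.LeakyReLU(0.2),
                                 nn.Linear(HIDDEN, HIDDEN))

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores an embedding as one of NUM_DIALECTS real classes or the extra 'fake' class."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.LeakyReLU(0.2), nn.Dropout(0.1))
        self.head = nn.Linear(HIDDEN, NUM_DIALECTS + 1)

    def forward(self, h):
        return self.head(self.body(h))

G, D = Generator(), Discriminator()
labeled = torch.randn(8, HIDDEN)               # stand-ins for [CLS] embeddings of labeled examples
labels = torch.randint(0, NUM_DIALECTS, (8,))
unlabeled = torch.randn(32, HIDDEN)            # stand-ins for unlabeled examples
fake = G(torch.randn(32, NOISE))

logits_lab = D(labeled)
logits_unl = D(unlabeled)
logits_fake = D(fake.detach())

# Supervised loss on the labeled dialect classes (first NUM_DIALECTS logits).
loss_sup = F.cross_entropy(logits_lab[:, :NUM_DIALECTS], labels)
# Unsupervised loss: unlabeled embeddings should not fall into the fake class,
# while generated embeddings should.
p_fake_unl = F.softmax(logits_unl, dim=-1)[:, -1]
p_fake_gen = F.softmax(logits_fake, dim=-1)[:, -1]
loss_D = loss_sup - torch.log(1 - p_fake_unl + 1e-8).mean() - torch.log(p_fake_gen + 1e-8).mean()
# The generator is trained so that its fakes are not recognized as the fake class.
loss_G = -torch.log(1 - F.softmax(D(fake), dim=-1)[:, -1] + 1e-8).mean()
```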
2.
姗姗来迟 (2023-05-14 19:34):
#paper Multimodal Graph Transformer for Multimodal Question Answering https://arxiv.org/abs/2305.00581 This work aims to get the best of both worlds (Transformers and graph neural networks) and proposes a new Multimodal Graph Transformer for question-answering tasks that require reasoning across multiple modalities. It introduces a graph-involved, plug-and-play quasi-attention mechanism that integrates multimodal graph information obtained from text and visual data into vanilla self-attention as an effective prior. Specifically, the paper constructs a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, which are then combined with the input visual and language features for downstream reasoning. Study notes: https://blog.csdn.net/weixin_44845357/article/details/130577459?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22130577459%22%2C%22source%22%3A%22weixin_44845357%22%7D
Abstract:
Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that requires performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, to the vanilla self-attention as effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
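As a rough illustration of the quasi-attention idea summarized above, the sketch below biases vanilla self-attention scores with an adjacency matrix so that attention concentrates on graph neighbours. The shapes, the random adjacency matrix, and the exact way the graph is combined with the scores are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def quasi_attention(q, k, v, graph_mask):
    """q, k, v: (batch, seq, dim); graph_mask: (batch, seq, seq) with 1 = edge, 0 = no edge."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Graph prior: positions with no edge receive a large negative bias before softmax,
    # so attention mass is pushed onto graph neighbours.
    scores = scores + (graph_mask - 1.0) * 1e4
    return F.softmax(scores, dim=-1) @ v

batch, seq, dim = 2, 6, 64
q = k = v = torch.randn(batch, seq, dim)
# Toy adjacency composed from (hypothetical) text / dense-region / semantic graphs.
adj = (torch.rand(batch, seq, seq) > 0.5).float()
adj = torch.clamp(adj + torch.eye(seq), max=1.0)  # keep self-loops
out = quasi_attention(q, k, v, adj)
print(out.shape)  # torch.Size([2, 6, 64])
```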
3.
姗姗来迟 (2023-04-19 13:44):
#paper arXiv:2103.00020 Learning Transferable Visual Models From Natural Language Supervision. Read the CLIP paper and looked into the prompt engineering it describes. Reading notes are in the blog post "CLIP论文拜读及理解": https://blog.csdn.net/weixin_44845357/article/details/130206779
Abstract:
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
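A short zero-shot classification example in the spirit of the paper's prompt-based evaluation. This uses the Hugging Face port of CLIP rather than the authors' original repository; the image path "cat.jpg" and the class names are placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]  # the prompt template idea discussed in the paper
image = Image.open("cat.jpg")                     # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Scaled similarity between the image embedding and each prompt embedding.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```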
4.
姗姗来迟 (2023-03-27 15:44):
#paper arXiv:2201.11903 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Reading notes are in my blog post: https://blog.csdn.net/weixin_44845357/article/details/129566376 The focus was on understanding chain-of-thought prompting, a technique that elicits complex multi-step reasoning by providing step-by-step worked answers as exemplars.
arXiv, 2022.
Abstract:
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
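To show what chain-of-thought prompting looks like in practice, here is a minimal prompt builder in the style of the paper's few-shot exemplars; the exemplars are paraphrased from the paper's examples, and the resulting prompt can be sent to any large language model.

```python
COT_EXEMPLARS = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A: They started with 23 and used 20, leaving 23 - 20 = 3. Buying 6 more gives 3 + 6 = 9. The answer is 9.
"""

def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplars so the model continues with step-by-step reasoning."""
    return f"{COT_EXEMPLARS}\nQ: {question}\nA:"

print(build_cot_prompt("A farmer has 15 sheep and sells 6. How many are left?"))
```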
5.
姗姗来迟 (2023-02-27 21:25):
#paper https://openaccess.thecvf.com/content_CVPR_2019/html/Tang_Learning_to_Compose_Dynamic_Tree_Structures_for_Visual_Contexts_CVPR_2019_paper.html Learning to Compose Dynamic Tree Structures for Visual Contexts. The paper proposes composing the objects in an image into dynamic tree structures that place them in a visual context, helping visual reasoning tasks such as scene graph generation and visual question answering. The visual context tree model, called VCTree, has two key advantages: 1) an efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects; 2) the dynamic structure varies from image to image and from task to task, allowing more content-/task-specific message passing.
Abstract:
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations including chains and fully-connected graphs: 1) The efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" are usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image and task to task, allowing more content-/task-specific message passing among objects. To construct a VCTree, we design a score function that calculates the task-dependent validity between each object pair, and the tree is the binary version of the maximum spanning tree from the score matrix. Then, visual contexts are encoded by bidirectional TreeLSTM and decoded by task-specific models. We develop a hybrid learning procedure which integrates end-task supervised learning and the tree structure reinforcement learning, where the former's evaluation result serves as a self-critic for the latter's structure exploration. Experimental results on two benchmarks, which require reasoning over contexts: Visual Genome for scene graph generation and VQA2.0 for visual Q&A, show that VCTree outperforms state-of-the-art results while discovering interpretable visual context structures.
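One concrete step from the abstract above, building the maximum spanning tree from the pairwise score matrix, can be sketched as follows. The random scores stand in for the task-dependent validity scores between object pairs; converting the tree into its binary form and the bidirectional TreeLSTM encoding are omitted.

```python
import numpy as np

def max_spanning_tree(scores):
    """Return parent-child edges of the maximum spanning tree rooted at node 0 (Prim's algorithm)."""
    n = scores.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the highest-scoring edge that crosses from the tree to a new node.
        best = max(((i, j, scores[i, j]) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda t: t[2])
        edges.append((best[0], best[1]))
        in_tree.add(best[1])
    return edges

rng = np.random.default_rng(0)
s = rng.random((5, 5))
s = (s + s.T) / 2            # symmetric pairwise validity scores between 5 objects
print(max_spanning_tree(s))  # e.g. [(0, 3), (3, 1), ...]
```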
6.
姗姗来迟 (2023-02-16 20:44):
#paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4247187 Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering
- Inspired by the perception action cycle (PAC), the paper designs HIPA. HIPA follows a hierarchical pattern: attention modules first interpret the visual and language features of the two modalities independently, and the aggregated features are then passed into a reasoning cycle.
- Inspired by the psychology of human perception, HIPA divides visual understanding into three phases: attention, organization, and interpretation, which gives the framework a structured understanding of visual features.
- The standard deviation of cosine similarity and of Manhattan distance is used as an evaluation metric for the visual and language features.
Abstract:
Recent visual question answering (VQA) frameworks employ different combinations of attention techniques to derive a correct answer. Attention techniques in vision-language tasks have mostly achieved success through the improvement of local features for both modalities. Attention as a concept is heavily established by human cognition mechanism. Different combinations of attention techniques are not well proven as a means of human cognition. Neural networks were originally inspired by the structure of the human brain. Many researchers have recently resorted to frameworks that resemble the human brain, and their models have achieved high performance. To this end, we aim to consider a framework that utilizes human biological and psychological concepts to achieve a good understanding of vision and language modalities. In this view, we introduce a hierarchical reasoning based on a perception action cycle (HIPA) framework to tackle VQA tasks. It integrates the reasoning process of multi-modalities with the perception action cycle (PAC), which explains the learning mechanism of humans about the surrounding world. It comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. It comprehends the language modality through word-level attention, interpretation, and conditioning. Subsequently, vision and language modalities are interpreted dependently in a cyclic and hierarchical way throughout the entire framework. For further assessment of the visual and language features, we argue that image-question pairs of the same answer ought to have similar visual and language features eventually. As a result, we conduct visual and language feature evaluation experiments using metrics such as standard deviation of cosine similarity and Manhattan distance. We show that employing PAC in our framework improves the standard deviation compared with other VQA frameworks. For further assessment, we also test the novel proposed HIPA on the visual relationship detection (VRD) tasks. The proposed method achieves the state-of-the-art results on the TDIUC and VRD datasets and obtains competitive results on the VQA 2.0 dataset.
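A small sketch of the feature-evaluation idea mentioned in the note above: image-question pairs that share the same answer should end up with similar features, so the standard deviation of their pairwise cosine similarities and Manhattan distances should be low. The random features below stand in for the framework's fused visual/language features; the exact computation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def group_dispersion(feats):
    """feats: (n, d) features of samples that share the same answer.
    Returns the std of pairwise cosine similarity and of pairwise Manhattan distance."""
    normed = F.normalize(feats, dim=-1)
    cos = normed @ normed.T                   # pairwise cosine similarity
    man = torch.cdist(feats, feats, p=1)      # pairwise Manhattan (L1) distance
    iu = torch.triu_indices(feats.size(0), feats.size(0), offset=1)  # upper triangle, no diagonal
    return cos[iu[0], iu[1]].std().item(), man[iu[0], iu[1]].std().item()

feats = torch.randn(16, 512)
cos_std, man_std = group_dispersion(feats)
print(f"std(cosine)={cos_std:.3f}, std(manhattan)={man_std:.3f}")
```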
7.
姗姗来迟 (2023-01-31 23:24):
#paper PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition https://link.springer.com/article/10.1007/s11263-022-01654-0?utm_source=xmol&utm_content=meta This work targets page-level handwritten Chinese text recognition and proposes PageNet, an end-to-end weakly supervised method. Its main advantages: (1) it tackles page-level Chinese text recognition from a new angle, detecting and recognizing individual characters and predicting the reading order between them; (2) the model can be trained with weak supervision, requiring only transcripts for real data and no bounding-box annotations, which greatly reduces annotation cost; (3) although only transcript annotations are needed, the model still outputs detection and recognition results at both the character level and the text-line level; (4) the method studies the reading-order problem in page-level text recognition in depth, and the proposed reading-order module handles complex reading orders such as multi-directional and curved text.
Abstract:
Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet.
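Purely as an illustration of the reading-order idea (not PageNet's actual module), the toy function below recovers line-level text from per-character recognition results plus predicted "next character" links; the data structures and the link dictionary are made-up assumptions.

```python
def recover_lines(chars, next_char):
    """chars: id -> recognized character; next_char: id -> id of the following character (None = line end)."""
    # Line starts are characters that no other character points to.
    targets = {v for v in next_char.values() if v is not None}
    lines = []
    for start in sorted(set(chars) - targets):
        text, cur = [], start
        while cur is not None:
            text.append(chars[cur])
            cur = next_char.get(cur)
        lines.append("".join(text))
    return lines

# Two detected lines: "你好" and "世界".
chars = {0: "你", 1: "好", 2: "世", 3: "界"}
next_char = {0: 1, 1: None, 2: 3, 3: None}
print(recover_lines(chars, next_char))  # ['你好', '世界']
```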
8.
姗姗来迟 (2022-12-31 17:48):
#paper https://link.springer.com/article/10.1007/s11263-022-01654-0?utm_source=xmol&utm_content=meta PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition This work targets page-level handwritten Chinese text recognition and proposes PageNet, an end-to-end weakly supervised method. Its main advantages: (1) it tackles page-level Chinese text recognition from a new angle, detecting and recognizing individual characters and predicting the reading order between them; (2) the model can be trained with weak supervision, requiring only transcripts for real data and no bounding-box annotations, which greatly reduces annotation cost; (3) although only transcript annotations are needed, the model still outputs detection and recognition results at both the character level and the text-line level; (4) the method studies the reading-order problem in page-level text recognition in depth, and the proposed reading-order module handles complex reading orders such as multi-directional and curved text.
Abstract:
Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet.