Literature shared by user 姗姗来迟.
8 paper shares found.
1.
姗姗来迟
(2023-06-30 13:25):
#paper Arabic Dialect Identification with a Few Labeled Examples Using Generative Adversarial Networks
https://aclanthology.org/2022.aacl-main.16.pdf
Given the challenges and complexity introduced by Dialectal Arabic (DA) variation, Transformer-based models such as BERT outperform other models on the DA identification task. However, fine-tuning these models requires a large corpus, and collecting a large number of high-quality labeled examples for some Arabic dialect classes is challenging and time-consuming.
The paper extends the Transformer-based models ARBERT and MARBERT with unlabeled data in a generative adversarial setting, using a Semi-Supervised Generative Adversarial Network (SS-GAN). The resulting model produces high-quality embeddings for Arabic dialect examples and generalizes better on the downstream classification task when only a few labeled examples are available.
ACL Anthology,
2022.
Abstract:
Given the challenges and complexities introduced while dealing with Dialect Arabic (DA) variations, Transformer based models, e.g., BERT, outperformed other models in dealing with the DA identification task. However, to fine-tune these models, a large corpus is required. Getting a large number high quality labeled examples for some Dialect Arabic classes is challenging and time-consuming. In this paper, we address the Dialect Arabic Identification task. We extend the transformer-based models, ARBERT and MARBERT, with unlabeled data in a generative adversarial setting using Semi-Supervised Generative Adversarial Networks (SS-GAN). Our model enabled producing high-quality embeddings for the Dialect Arabic examples and aided the model to better generalize for the downstream classification task given few labeled examples. Experimental results showed that our model reached better performance and faster convergence when only a few labeled examples are available.
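A minimal sketch of the SS-GAN idea described above, assuming stand-in sentence embeddings in place of the [CLS] vectors that ARBERT/MARBERT would produce: a generator maps noise into the encoder's feature space, and a discriminator classifies inputs into K dialect classes plus one extra "fake" class. The class count, network sizes, and loss weighting here are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, NOISE, NUM_CLASSES = 768, 100, 5  # 5 dialect classes is an arbitrary choice for the sketch

class Generator(nn.Module):
    """Maps noise vectors to fake sentence embeddings in the encoder's feature space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE, HIDDEN), nn.LeakyReLU(0.2),
                                 nn.Linear(HIDDEN, HIDDEN))
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """K real dialect classes plus one extra 'fake' class (the last index)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.LeakyReLU(0.2))
        self.head = nn.Linear(HIDDEN, NUM_CLASSES + 1)
    def forward(self, h):
        return self.head(self.body(h))

G, D = Generator(), Discriminator()

# Stand-ins for encoder embeddings of a few labeled and many unlabeled examples.
labeled = torch.randn(8, HIDDEN); labels = torch.randint(0, NUM_CLASSES, (8,))
unlabeled = torch.randn(32, HIDDEN)
fake = G(torch.randn(32, NOISE))

logits_lab = D(labeled)
logits_unl = D(unlabeled)
logits_fake = D(fake.detach())

# Discriminator: supervised loss on the labeled examples + real-vs-fake loss on the rest.
sup_loss = F.cross_entropy(logits_lab, labels)
p_fake_unl = F.softmax(logits_unl, dim=-1)[:, -1]
p_fake_gen = F.softmax(logits_fake, dim=-1)[:, -1]
d_loss = sup_loss - torch.log(1 - p_fake_unl + 1e-8).mean() - torch.log(p_fake_gen + 1e-8).mean()

# Generator: try to make the discriminator call its samples real.
p_fake_g = F.softmax(D(fake), dim=-1)[:, -1]
g_loss = -torch.log(1 - p_fake_g + 1e-8).mean()
print(d_loss.item(), g_loss.item())
```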
2.
姗姗来迟
(2023-05-14 19:34):
#paper Multimodal Graph Transformer for Multimodal Question Answering
https://arxiv.org/abs/2305.00581
This work aims to benefit from both worlds (Transformers and structured graph approaches) and proposes a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. It introduces a graph-involved, plug-and-play quasi-attention mechanism that incorporates multimodal graph information, obtained from text and visual data, into vanilla self-attention as an effective prior.
Specifically, the paper constructs a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, and then composes them with the input vision and language features for downstream reasoning.
Study notes: https://blog.csdn.net/weixin_44845357/article/details/130577459?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22130577459%22%2C%22source%22%3A%22weixin_44845357%22%7D
arXiv,
2023.
DOI: 10.48550/arXiv.2305.00581
Abstract:
Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that requires performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, to the vanilla self-attention as effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
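To make the graph-involved quasi-attention concrete, here is a toy sketch (my own illustration, not the released code): an adjacency matrix built from one of the constructed graphs is turned into an additive bias on the vanilla self-attention scores, so pairs of tokens connected in the graph attend to each other while unconnected pairs are suppressed.

```python
import torch
import torch.nn.functional as F

def graph_quasi_attention(x, adj, w_q, w_k, w_v, big_neg=-1e4):
    """Self-attention with a graph prior: adj[i, j] = 1 if tokens i and j are
    connected in the (text / dense region / semantic) graph, else 0."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    # Graph information as a prior: strongly down-weight unconnected pairs.
    bias = (1.0 - adj) * big_neg
    return F.softmax(scores + bias, dim=-1) @ v

d = 64
x = torch.randn(10, d)                      # 10 tokens (mixed text + region features)
adj = (torch.rand(10, 10) > 0.5).float()    # stand-in adjacency from a constructed graph
adj.fill_diagonal_(1.0)                     # every node attends to itself
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = graph_quasi_attention(x, adj, w_q, w_k, w_v)
print(out.shape)  # torch.Size([10, 64])
```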
3.
姗姗来迟
(2023-04-19 13:44):
#paper arXiv:2103.00020 Learning Transferable Visual Models From Natural Language Supervision
Read the CLIP paper a couple of days ago and looked into the prompts mentioned in the paper.
Reading notes are in the blog post 《CLIP论文拜读及理解》 (reading and understanding the CLIP paper):
https://blog.csdn.net/weixin_44845357/article/details/130206779
arXiv,
2021.
DOI: 10.48550/arXiv.2103.00020
Abstract:
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
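A minimal sketch of the prompt-based zero-shot classification discussed in the notes, using the Hugging Face transformers CLIP wrappers (the checkpoint name is one commonly available example; any CLIP checkpoint works the same way, and the dummy image stands in for real input):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; swap in any CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]   # the prompt template from the paper
image = Image.new("RGB", (224, 224))               # dummy image standing in for real input

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # image-text similarity scores -> probabilities
print(dict(zip(classes, probs[0].tolist())))
```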
4.
姗姗来迟
(2023-03-27 15:44):
#paper arXiv:2201.11903
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Reading notes are recorded in my blog post: https://blog.csdn.net/weixin_44845357/article/details/129566376
The main goal was to understand chain-of-thought prompting, a technique that elicits complex multi-step reasoning by providing step-by-step worked answers as exemplars.
arXiv,
2022.
Abstract:
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
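A minimal sketch of how a few-shot chain-of-thought prompt is assembled; the exemplar follows the format shown in the paper, and the resulting string would be sent to any LLM completion API:

```python
# Each exemplar pairs a question with a step-by-step rationale that ends in the answer.
exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
]

def build_cot_prompt(question: str) -> str:
    """Few-shot CoT format: Q/A pairs containing the reasoning, then the new question."""
    parts = [f"Q: {q}\nA: {rationale}" for q, rationale in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_cot_prompt(
    "A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. "
    "How many blue golf balls are there?"
))
```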
5.
姗姗来迟
(2023-02-27 21:25):
#paper
https://openaccess.thecvf.com/content_CVPR_2019/html/Tang_Learning_to_Compose_Dynamic_Tree_Structures_for_Visual_Contexts_CVPR_2019_paper.html
Title: Learning to Compose Dynamic Tree Structures for Visual Contexts
The paper proposes composing the objects in an image into dynamic tree structures that place them into a visual context, helping visual reasoning tasks such as scene graph generation and visual question answering. The visual context tree model, dubbed VCTree, has two key advantages:
1) the efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects;
2) the dynamic structure varies from image to image and from task to task, allowing more content-/task-specific message passing.
CVPR,
2019.
DOI: 10.48550/arXiv.1812.01880
Abstract:
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations including chains and fully-connected graphs: 1) The efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" are usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image and task to task, allowing more content-/task-specific message passing among objects. To construct a VCTree, we design a score function that calculates the task-dependent validity between each object pair, and the tree is the binary version of the maximum spanning tree from the score matrix. Then, visual contexts are encoded by bidirectional TreeLSTM and decoded by task-specific models. We develop a hybrid learning procedure which integrates end-task supervised learning and the tree structure reinforcement learning, where the former's evaluation result serves as a self-critic for the latter's structure exploration. Experimental results on two benchmarks, which require reasoning over contexts: Visual Genome for scene graph generation and VQA2.0 for visual Q&A, show that VCTree outperforms state-of-the-art results while discovering interpretable visual context structures.
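A minimal sketch of the tree-construction step (my own illustration): given a task-dependent score matrix over object pairs, take its maximum spanning tree, which becomes the backbone of the VCTree; the paper then binarizes this tree and runs bidirectional TreeLSTMs over it.

```python
import numpy as np

def max_spanning_tree(scores: np.ndarray) -> list[tuple[int, int]]:
    """Prim's algorithm on a symmetric score matrix; returns parent-child edges
    of the maximum spanning tree over the objects."""
    n = scores.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or scores[i, j] > scores[best[0], best[1]]):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges

rng = np.random.default_rng(0)
raw = rng.random((5, 5))            # stand-in for learned pairwise validity scores
scores = (raw + raw.T) / 2          # symmetrize
print(max_spanning_tree(scores))    # parent-child edges of the score-maximizing tree
```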
6.
姗姗来迟
(2023-02-16 20:44):
#paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4247187 Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering
- Inspired by the perception action cycle (PAC) mechanism, the authors design HIPA. HIPA follows a hierarchical pattern: attention modules interpret the visual and language features of the two modalities independently, and the aggregated features are then passed into a reasoning cycle.
- Inspired by the psychological process of human perception, HIPA divides visual understanding into three phases: attention, organization, and interpretation. This division helps the framework comprehend the visual features.
- The standard deviation of cosine similarity and of Manhattan distance are used as evaluation metrics for the visual and language features.
SSRN Electronic Journal,
2022.
DOI: 10.2139/ssrn.4247187
Abstract:
Recent visual question answering (VQA) frameworks employ different combinations of attention techniques to derive a correct answer. Attention techniques in vision-language tasks have mostly achieved success through the improvement of local features for both modalities. Attention as a concept is heavily established by human cognition mechanism. Different combinations of attention techniques are not well proven as a means of human cognition. Neural networks were originally inspired by the structure of the human brain. Many researchers have recently resorted to frameworks that resemble the human brain, and their models have achieved high performance. To this end, we aim to consider a framework that utilizes human biological and psychological concepts to achieve a good understanding of vision and language modalities. In this view, we introduce a hierarchical reasoning based on a perception action cycle (HIPA) framework to tackle VQA tasks. It integrates the reasoning process of multi-modalities with the perception action cycle (PAC), which explains the learning mechanism of humans about the surrounding world. It comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. It comprehends the language modality through word-level attention, interpretation, and conditioning. Subsequently, vision and language modalities are interpreted dependently in a cyclic and hierarchical way throughout the entire framework. For further assessment of the visual and language features, we argue that image-question pairs of the same answer ought to have similar visual and language features eventually. As a result, we conduct visual and language feature evaluation experiments using metrics such as standard deviation of cosine similarity and Manhattan distance. We show that employing PAC in our framework improves the standard deviation compared with other VQA frameworks. For further assessment, we also test the novel proposed HIPA on the visual relationship detection (VRD) tasks. The proposed method achieves the state-of-the-art results on the TDIUC and VRD datasets and obtains competitive results on the VQA 2.0 dataset.
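A minimal sketch of the feature-evaluation idea in the last bullet (illustrative, not the authors' code): image-question pairs that share the same answer should end up with similar features, so one can compute the standard deviation of pairwise cosine similarities and of pairwise Manhattan distances within each answer group; a lower spread indicates more consistent features.

```python
import itertools
import numpy as np

def pairwise_spread(features: np.ndarray) -> tuple[float, float]:
    """Std of cosine similarity and of Manhattan (L1) distance over all pairs of
    feature vectors belonging to image-question pairs with the same answer."""
    cos, l1 = [], []
    for a, b in itertools.combinations(features, 2):
        cos.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        l1.append(np.abs(a - b).sum())
    return float(np.std(cos)), float(np.std(l1))

rng = np.random.default_rng(0)
same_answer_feats = rng.normal(size=(6, 512))   # stand-in for fused features of one answer group
print(pairwise_spread(same_answer_feats))
```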
7.
姗姗来迟
(2023-01-31 23:24):
#paper
PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition
https://link.springer.com/article/10.1007/s11263-022-01654-0?utm_source=xmol&utm_content=meta
This work addresses page-level handwritten Chinese text recognition and proposes PageNet, an end-to-end weakly supervised method. Its main advantages: (1) it tackles page-level Chinese text recognition from a new angle, detecting and recognizing individual characters and predicting the reading order between them; (2) the model can be trained with weak supervision: for real data only the transcripts need to be annotated, with no bounding-box annotation of any kind, which greatly reduces labeling cost; (3) although only transcript annotations are required, the model still outputs character-level and line-level detection and recognition results; (4) the method studies the reading-order problem in page-level text recognition in depth, and the proposed reading-order module handles complex reading orders such as multi-directional and curved text.
Abstract:
Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet.
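For contrast with PageNet's learned reading-order module, here is the naive geometric ordering a rule-based pipeline might use (group detected characters into rows by vertical position, then read each row left to right). This is an illustrative baseline, not the paper's module; it is exactly what breaks down on the multi-directional and curved text that motivates learning the reading order.

```python
def naive_reading_order(boxes, row_tol=0.5):
    """boxes: list of (x_center, y_center, height) for detected characters.
    Groups characters into rows by y, then reads each row left to right.
    Fails on vertical, rotated, or curved lines, which is why a learned
    reading-order module is needed."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][1])  # top to bottom
    rows, current = [], [order[0]]
    for i in order[1:]:
        prev = current[-1]
        if abs(boxes[i][1] - boxes[prev][1]) <= row_tol * boxes[prev][2]:
            current.append(i)
        else:
            rows.append(current)
            current = [i]
    rows.append(current)
    return [i for row in rows for i in sorted(row, key=lambda j: boxes[j][0])]

chars = [(10, 5, 4), (30, 6, 4), (50, 5, 4), (10, 20, 4), (30, 21, 4)]
print(naive_reading_order(chars))  # [0, 1, 2, 3, 4]
```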
8.
姗姗来迟
(2022-12-31 17:48):
#paper https://link.springer.com/article/10.1007/s11263-022-01654-0?utm_source=xmol&utm_content=meta
PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition
This work addresses page-level handwritten Chinese text recognition and proposes PageNet, an end-to-end weakly supervised method. Its main advantages: (1) it tackles page-level Chinese text recognition from a new angle, detecting and recognizing individual characters and predicting the reading order between them; (2) the model can be trained with weak supervision: for real data only the transcripts need to be annotated, with no bounding-box annotation of any kind, which greatly reduces labeling cost; (3) although only transcript annotations are required, the model still outputs character-level and line-level detection and recognition results; (4) the method studies the reading-order problem in page-level text recognition in depth, and the proposed reading-order module handles complex reading orders such as multi-directional and curved text.
Abstract:
Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet.