Papers shared by user 王昊.
10 paper shares found.
1.
王昊 (2023-01-31 23:53):
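#paper Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules http://arxiv.org/abs/2001.01568 Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. 2020. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. Retrieved January 31, 2023. This is the baseline image-coding method for VCM (the cheng2020 network), used in the feature-extraction stage of coding for machine vision; it belongs to the family of image compression algorithms. The authors make two contributions. First, they parameterize the distribution of the latent representation y with a discretized Gaussian mixture likelihood for entropy estimation. The mixture offers several likely means for y, so each mixture component can have a smaller variance; the result is a more accurate and flexible probability model that needs fewer bits to encode y. Second, they add simplified attention modules, which make the network pay more attention to non-zero responses, i.e. complex regions of the image, without adding much training complexity.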
arXiv, 2020.
Abstract:
Image compression is a fundamental research field and many well-known compression standards have been developed for many decades. Recently, learned compression methods exhibit a fast development trend with promising results. However, there is still a performance gap between learned compression algorithms and reigning compression standards, especially in terms of widely used PSNR metric. In this paper, we explore the remaining redundancy of recent learned compression algorithms. We have found accurate entropy models for rate estimation largely affect the optimization of network parameters and thus affect the rate-distortion performance. Therefore, in this paper, we propose to use discretized Gaussian Mixture Likelihoods to parameterize the distributions of latent codes, which can achieve a more accurate and flexible entropy model. Besides, we take advantage of recent attention modules and incorporate them into network architecture to enhance the performance. Experimental results demonstrate our proposed method achieves a state-of-the-art performance compared to existing learned compression methods on both Kodak and high-resolution datasets. To our knowledge our approach is the first work to achieve comparable performance with latest compression standard Versatile Video Coding (VVC) regarding PSNR. More importantly, our approach generates more visually pleasant results when optimized by MS-SSIM. This project page is at this https URL
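The entropy-model idea in this post can be illustrated with a small sketch. It is a toy calculation under stated assumptions, not the cheng2020 implementation: the probability of a quantized latent value y is taken as the mixture's mass on the unit-width bin around y, and the latent values, weights, means, and scales below are hypothetical numbers chosen only to show why a mixture with narrow components can cost fewer bits than one wide Gaussian.

```python
import math


def normal_cdf(x: float, mean: float, scale: float) -> float:
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def discretized_gmm_likelihood(y, weights, means, scales):
    """Mass a Gaussian mixture assigns to the bin [y - 0.5, y + 0.5]."""
    total = 0.0
    for w, mu, sigma in zip(weights, means, scales):
        upper = normal_cdf(y + 0.5, mu, sigma)
        lower = normal_cdf(y - 0.5, mu, sigma)
        total += w * (upper - lower)
    return total


def bits(prob: float) -> float:
    """Ideal code length in bits for a symbol of probability prob."""
    return -math.log2(prob)


# Hypothetical quantized latents that cluster around -2 and +2.
latents = [-2, -2, 2, 2, -2, 2]
single = dict(weights=[1.0], means=[0.0], scales=[2.0])
mixture = dict(weights=[0.5, 0.5], means=[-2.0, 2.0], scales=[0.5, 0.5])

bits_single = sum(bits(discretized_gmm_likelihood(y, **single)) for y in latents)
bits_mixture = sum(bits(discretized_gmm_likelihood(y, **mixture)) for y in latents)
print(f"single Gaussian: {bits_single:.2f} bits, mixture: {bits_mixture:.2f} bits")
```

In a learned codec the weights, means, and scales would be predicted by the network for each latent element; here they are fixed by hand.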
2.
王昊 (2022-12-31 23:57):
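#paper https://arxiv.org/abs/2111.08687v2 Jing Shao, Siyu Chen, Yangguang Li, et al. 2021. INTERN: A New Learning Paradigm Towards General Vision. A paper on vision foundation models. INTERN ("书生") aims to systematically address a series of bottlenecks in current AI vision research: task generality, scene generalization, and data efficiency. INTERN consists of seven modules: three infrastructure modules (a general vision data system, a general vision network architecture, and a general vision benchmark) and four training-stage modules that separate upstream from downstream. The model acquires strong generalization ability by learning across multiple stages. It handles four categories of CV tasks on 26 datasets, and after fine-tuning with only 10% of the training data it outperforms, in most cases, the counterparts trained on the full dataset.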
Abstract:
Enormous waves of technological innovations over the past several years, marked by the advances in AI technologies, are profoundly reshaping the industry and the society. However, down the road, a key challenge awaits us, that is, our capability of meeting rapidly-growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained will develop strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform the counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which, together, form a general vision ecosystem to support its future development in an open and inclusive manner. See project website at this https URL.
3.
王昊 (2022-11-30 19:37):
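#paper https://cis.temple.edu/tagit/presentations/A%20Path%20Towards%20Autonomous%20Machine%20Intelligence.pdf Yann LeCun. A Path Towards Autonomous Machine Intelligence. Version 0.9.2, 2022-06-27. 62. Yann LeCun points to the next direction for AI: autonomous machine intelligence. In this paper LeCun proposes a cognitive architecture and a method for training its world model. The main modules are:
(1) The configurator module handles executive control: given a task to perform, it preconfigures the perception module, the world model, the cost, and the actor for that task by adjusting their parameters.
(2) The perception module receives signals from sensors and estimates the current state of the world. For a given task, only a small part of the perceived world state is relevant and useful, so the configurator primes the perception system to extract the information relevant to the task at hand.
(3) The world model has two roles: (a) estimating information about the world state that perception does not provide, and (b) predicting plausible future world states.
(4) The cost module computes a single scalar output that predicts the agent's level of discomfort.
(5) The actor module computes proposals for action sequences.
(6) The short-term memory module keeps track of the current and predicted world states and their associated costs.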
2022.
Abstract:
How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents. It combines concepts such as configurable predictive world model, behavior driven through intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning.
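How the modules listed in this post could interact can be sketched with a toy planning loop. Everything here is an illustrative assumption rather than LeCun's proposal: the world is a single number, the dynamics and cost are hand-written instead of learned, the configurator is reduced to choosing a target, and exhaustive search over short action sequences stands in for whatever planning procedure the actor would really use.

```python
from itertools import product


def perceive(observation):
    """Perception module: estimate the world state from a sensor reading."""
    return float(observation)


def world_model(state, action):
    """World model: predict the next state (toy additive dynamics)."""
    return state + action


def cost(state, target):
    """Cost module: one scalar, here the distance to the target state."""
    return abs(state - target)


def plan(state, target, horizon=3, actions=(-1, 0, 1)):
    """Actor: roll candidate action sequences through the world model and
    return the one with the lowest total predicted cost."""
    memory = []  # short-term memory: (sequence, predicted states, cost)
    for seq in product(actions, repeat=horizon):
        current, total, states = state, 0.0, []
        for action in seq:
            current = world_model(current, action)
            states.append(current)
            total += cost(current, target)
        memory.append((seq, states, total))
    return min(memory, key=lambda item: item[2])


# The configurator's role is reduced to setting the task target.
state = perceive(0)
best_seq, predicted_states, best_cost = plan(state, target=2)
print(best_seq, predicted_states, best_cost)
```

The point of the sketch is the data flow: perception produces a state, the world model predicts futures for each proposed action sequence, and the cost module ranks them.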
4.
王昊 (2022-10-25 10:11):
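#paper doi:10.48550/arXiv.2110.07342 So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. 2022. FILM: Following Instructions in Language with Modular Methods. Retrieved July 13, 2022 from http://arxiv.org/abs/2110.07342. An algorithm paper for vision-and-language navigation; as of this post it ranks 4th on the ALFRED dataset. The paper proposes a modular method with structured representations that (1) builds a semantic map of the scene and (2) explores with a semantic search policy in order to achieve the natural language goal. FILM has four components: 1. Language processing: converts the language instruction into a structured form. 2. Semantic mapping: converts egocentric visual input into a semantic metric map. 3. Semantic search policy: uses the semantic map to decide where to search for the goal. 4. Deterministic policy: outputs the subsequent navigation/interaction actions. FILM needs no input that provides sequential guidance, i.e. neither expert trajectories nor low-level language instructions (which spell out the order of steps).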
Abstract:
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions. Such approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy, to achieve the natural language goal. Our modular method achieves SOTA performance (24.46 %) with a substantial (8.17 % absolute) gap from previous work while using less data by eschewing both expert trajectories and low-level instructions. Leveraging low-level language, however, can further increase our performance (26.49 %). Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.
5.
王昊 (2022-09-01 14:36):
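#paper doi:10.1109/TNNLS.2022.3152527 Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. 2022. Learning From Noisy Labels With Deep Neural Networks: A Survey. IEEE Transactions on Neural Networks and Learning Systems: 1–19. This is a 2022 survey on noisy labels. It mainly covers methods for denoising structured data, image-classification datasets, and the like. The categories are summarized below.
[Robust Architecture] Clean and noisy samples are scored with an attention mechanism; the method is called Attention Feature Mixup. The final loss has two parts: the cross-entropy loss between each image of a class and its label, and the loss computed on the new data x' and label y' produced by mixup.
[Robust Regularization] Regularization tricks are added to keep the model from overfitting to the noisy data. Common choices include label smoothing, L1, L2, and MixUp.
[Sample Selection] Area Under the Margin metric (AUM): the data are filtered while training proceeds.
[Data Partitioning] Following a density-clustering idea, the data of one class are split into an easy dataset, a semi-hard dataset, and a hard dataset; noisy samples are generally the ones that are harder to train on. Each image is assigned a weight, for which the paper suggests 1.0, 0.5, and 0.5, and training borrows from curriculum learning.
[Semi-supervised Learning] Noise-robust learning built on semi-supervised learning. DivideMix is introduced first. It still follows the co-teaching idea, but after separating clean from noisy samples it treats the noisy ones as unlabeled and trains with the FixMatch method. FixMatch is probably still the SOTA for semi-supervised image classification, and this approach performs relatively well.
[Label Correction] In the "label correction phase", a pre-trained model extracts features for a few randomly chosen images per class, and clustering them gives the prototype cluster centers of each class. The distances between an input image's feature vector and the class centers yield a pseudo-label for the image. The final loss is the sum of the cross-entropy loss computed with the original label and the loss computed with the pseudo-label.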
Abstract:
Deep learning has achieved remarkable success in numerous domains with help from large amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we perform an in-depth analysis of noise rate estimation and summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies.
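Two of the regularizers named in this post, label smoothing and MixUp, are simple enough to sketch directly. The sketch below uses plain Python lists and shows only the label/input transformations, not a training loop; the helper names, the `epsilon` default, and the `alpha` value are illustrative assumptions rather than settings taken from the survey.

```python
import random


def one_hot(label, num_classes):
    """One-hot vector for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smooth_labels(target, epsilon=0.1):
    """Label smoothing: blend the target with the uniform distribution."""
    k = len(target)
    return [(1.0 - epsilon) * t + epsilon / k for t in target]


def mixup(x1, y1, x2, y2, lam):
    """MixUp: the same convex combination applied to inputs and labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# The mixing coefficient is commonly drawn from a Beta(alpha, alpha)
# distribution; alpha = 0.4 is an arbitrary choice for this example.
rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)

mixed_x, mixed_y = mixup([1.0, 0.0], one_hot(0, 2), [0.0, 1.0], one_hot(1, 2), lam)
print(smooth_labels(one_hot(1, 4)))
print(mixed_x, mixed_y)
```

Both transformations soften the training target, which is why they limit how strongly a model can fit a single, possibly wrong, label.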
6.
王昊 (2022-09-01 14:34):
#paper doi:10.1109/ICCV48922.2021.00014 ZHOU X, LIU X, WANG C, et al. Learning with Noisy Labels via Sparse Regularization[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 72-81. https://doi.org/10.1109/ICCV48922.2021.00014. This paper uses sparse regularization to push the network output as close to one-hot as possible, so that the output is sharpened (one entry is 1 and all others are 0, i.e. the model is highly confident in a single answer and assigns low probability to the rest). Concretely, this is done with the Lp norm. The paper belongs to the loss-function family of methods for learning with noisy labels. For a survey of the area, see: SONG H, KIM M, PARK D, et al. Learning From Noisy Labels With Deep Neural Networks: A Survey[J/OL]. IEEE Transactions on Neural Networks and Learning Systems, 2022: 1-19. https://doi.org/10.1109/TNNLS.2022.3152527
Abstract:
Learning with noisy labels is an important and challenging task for training accurate deep neural networks. Some commonly-used loss functions, such as Cross Entropy (CE), suffer from severe overfitting to noisy labels. Robust loss functions that satisfy the symmetric condition were tailored to remedy this problem, which however encounter the underfitting effect. In this paper, we theoretically prove that any loss can be made robust to noisy labels by restricting the network output to the set of permutations over a fixed vector. When the fixed vector is one-hot, we only need to constrain the output to be one-hot, which however produces zero gradients almost everywhere and thus makes gradient-based optimization difficult. In this work, we introduce the sparse regularization strategy to approximate the one-hot constraint, which is composed of network output sharpening operation that enforces the output distribution of a network to be sharp and the ℓp-norm (p ≤ 1) regularization that promotes the network output to be sparse. This simple approach guarantees the robustness of arbitrary loss functions while not hindering the fitting ability. Experimental results demonstrate that our method can significantly improve the performance of commonly-used loss functions in the presence of noisy labels and class imbalance, and outperform the state-of-the-art methods. The code is available at https://github.com/hitcszx/lnl_sr.
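The two ingredients described in this entry, output sharpening and an ℓp penalty with p ≤ 1, can be sketched as follows. This is a minimal illustration and not the authors' code (their repository is linked in the abstract): a temperature-scaled softmax is assumed as the sharpening operation, the penalty is taken as the sum of p-th powers of the output probabilities, and the values of `temperature`, `p`, and `lam` are arbitrary.

```python
import math


def sharpened_softmax(logits, temperature=0.5):
    """Softmax with temperature < 1, which makes the output more peaked."""
    peak = max(logits)
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lp_penalty(probs, p=0.5):
    """Sum of probs_i ** p. For p < 1 on a probability vector this is
    smallest (equal to 1) when the vector is one-hot."""
    return sum(q ** p for q in probs)


def regularized_loss(logits, label, lam=1.0, p=0.5, temperature=0.5):
    """Cross entropy on the sharpened output plus the sparsity penalty."""
    probs = sharpened_softmax(logits, temperature)
    cross_entropy = -math.log(probs[label])
    return cross_entropy + lam * lp_penalty(probs, p)


print(lp_penalty([1.0, 0.0, 0.0]), lp_penalty([0.25] * 4))
print(regularized_loss([5.0, 0.0, 0.0], 0), regularized_loss([0.0, 0.0, 0.0], 0))
```

With p = 1 the penalty is constant on probability vectors, so values of p strictly below 1 are what actually reward sparse, near one-hot outputs in this sketch.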
7.
王昊 (2022-08-10 11:27):
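#paper doi:10.48550/arXiv.2109.07872 TAN S, GE M, GUO D, et al. Knowledge-based Embodied Question Answering[J/OL]. 2021[2022-08-09]. https://arxiv.org/abs/2109.07872v1. A paper from Sun Fuchun's (孙富春) group at Tsinghua. It has an embodied agent in the AI2-THOR environment answer questions about its surroundings, where the questions can only be answered with support from an external knowledge base. Problems with prior work: embodied question answering (EQA) cannot answer questions that need an external knowledge graph (although this has already been done in KBVQA) and lacks reasoning ability (although what counts as reasoning is hard to pin down); multi-hop question answering is a hard problem; and current EQA systems cannot use past memory to save the agent the time of re-exploring. Contributions: 1. It proposes the knowledge-EQA task, built on the AI2-THOR virtual environment. 2. It builds a dataset (the question types are fairly simple and not very hard). 3. It proposes a method that combines neural program synthesis, 3D scene graphs, 3D reconstruction, question-to-SQL conversion, and Monte Carlo tree search to address the problems above.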
Abstract:
In this paper, we propose a novel Knowledge-based Embodied Question Answering (K-EQA) task, in which the agent intelligently explores the environment to answer various questions with the knowledge. Different from explicitly specifying the target object in the question as existing EQA work, the agent can resort to external knowledge to understand more complicated question such as "Please tell me what are objects used to cut food in the room?", in which the agent must know the knowledge such as "knife is used for cutting food". To address this K-EQA problem, a novel framework based on neural program synthesis reasoning is proposed, where the joint reasoning of the external knowledge and 3D scene graph is performed to realize navigation and question answering. Especially, the 3D scene graph can provide the memory to store the visual information of visited scenes, which significantly improves the efficiency for the multi-turn question answering. Experimental results have demonstrated that the proposed framework is capable of answering more complicated and realistic questions in the embodied environment. The proposed method is also applicable to multi-agent scenarios.
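The idea of answering a question by joining what the agent has seen with external knowledge can be sketched with an in-memory SQLite database. This is only a toy illustration of the question-to-SQL step: the table names, columns, rows, and the `answer` helper are all hypothetical, and the query is written by hand rather than generated from the question as in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Objects the agent has observed (a stand-in for the 3D scene graph).
    CREATE TABLE scene_objects (name TEXT, room TEXT);
    -- External knowledge about what objects are used for.
    CREATE TABLE knowledge (name TEXT, used_for TEXT);
    INSERT INTO scene_objects VALUES
        ('knife', 'kitchen'), ('apple', 'kitchen'), ('pillow', 'bedroom');
    INSERT INTO knowledge VALUES
        ('knife', 'cutting food'), ('scissors', 'cutting paper'),
        ('pillow', 'sleeping');
    """
)


def answer(room, purpose):
    """Names of observed objects in `room` that the knowledge table says
    are used for `purpose`."""
    rows = conn.execute(
        "SELECT s.name FROM scene_objects AS s "
        "JOIN knowledge AS k ON s.name = k.name "
        "WHERE s.room = ? AND k.used_for = ? ORDER BY s.name",
        (room, purpose),
    ).fetchall()
    return [name for (name,) in rows]


# "What objects used to cut food are in the kitchen?"
print(answer("kitchen", "cutting food"))
```

Because the observed objects are stored rather than re-detected, a second question about the same room can be answered from memory without exploring again.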
8.
王昊 (2022-08-05 23:08):
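#paper doi:10.1109/ICCV48922.2021.01307 ZHANG F Z, CAMPBELL D, GOULD S. Spatially Conditioned Graphs for Detecting Human–Object Interactions[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 13299-13307. https://doi.org/10.1109/ICCV48922.2021.01307. This paper uses a GNN for human-object interaction (HOI) detection in images. In conventional methods a node sends the same message to each of its neighbours; this paper instead conditions the message passed between a pair of nodes on their spatial relationship, so that different messages go to different neighbours of the same node. It relies on a bipartite graph of human-object pairs and an anisotropic message-passing algorithm, and fuses the different kinds of features with the MBF (multi-branch fusion) network. This is an ICCV 2021 paper, and its performance was decent for that year. HOI detection can be treated as a subtask of scene graph generation (SGG).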
Abstract:
We address the problem of detecting human–object interactions in images using graphical neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming state-of-the-art on fine-tuned detections.
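The contrast between identical and spatially conditioned messages can be shown with a few lines of list arithmetic. This is a toy sketch under a strong simplifying assumption: the pairwise spatial feature acts as an elementwise gate on the sender's feature vector, whereas the paper fuses appearance and spatial features with a learned multi-branch structure. The function names and numbers are hypothetical.

```python
def isotropic_messages(sender, weights):
    """Conventional scheme: each neighbour receives the same feature
    vector, only scaled by its adjacency weight."""
    return [[w * h for h in sender] for w in weights]


def conditioned_messages(sender, weights, spatial):
    """Spatially conditioned scheme: the message to neighbour j is also
    modulated elementwise by the pairwise spatial feature spatial[j]."""
    return [
        [w * h * s for h, s in zip(sender, pair_feature)]
        for w, pair_feature in zip(weights, spatial)
    ]


sender = [1.0, 2.0]                  # feature vector of one node
weights = [0.5, 0.5]                 # adjacency weights to two neighbours
spatial = [[1.0, 0.0], [0.0, 1.0]]   # one spatial feature per node pair

print(isotropic_messages(sender, weights))
print(conditioned_messages(sender, weights, spatial))
```

With equal adjacency weights the isotropic messages are identical for both neighbours, while the conditioned ones differ because each pair has its own spatial feature.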
9.
王昊 (2022-07-28 09:51):
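#paper doi:10.48550/arXiv.2207.04630 Yi Ma, Doris Tsao, and Heung-Yeung Shum. 2022. On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence. Author Yi Ma has a strong mathematical background; this is a collaboration with neuroscientist Doris Tsao that sets out what they regard as two fundamental principles of AI. The paper proposes a new framework for understanding deep neural networks, compressive closed-loop transcription, and answers two questions: what is the objective of learning from data and how can it be measured (information and coding theory), and how can that objective be achieved through efficient and effective computation (control)? It puts forward two basic principles for understanding AI: parsimony and self-consistency.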
Abstract:
Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that address two fundamental questions regarding Intelligence: what to learn and how to learn, respectively. We believe the two principles are the cornerstones for the emergence of Intelligence, artificial or natural. While these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable ways. More specifically, the two principles lead to an effective and efficient computational framework, compressive closed-loop transcription, that unifies and explains the evolution of modern deep networks and many artificial intelligence practices. While we mainly use modeling of visual data as an example, we believe the two principles will unify understanding of broad families of autonomous intelligent systems and provide a framework for understanding the brain.
10.
王昊 (2022-06-30 17:08):
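#paper doi:10.48550/arXiv.2201.12086 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv:2201.12086 [cs]. BLIP is a unified vision-language pre-training (VLP) framework that learns from noisy image-text pairs. By bootstrapping the captions, BLIP makes effective use of noisy web data: a captioner generates captions and a filter removes the noisy ones. As of June 2022 it is among the better-performing open-source vision-language models; it can be deployed directly with Docker and applied to a range of downstream vision-language tasks. In our own trials it achieved zero-shot behaviour to some extent, and it performs well on the VQA 2.0 dataset. As a next step we are considering using it as a pre-trained model and fine-tuning it for other downstream tasks in production.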
Abstract:
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at this https URL.
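The captioner-plus-filter loop described in this entry can be sketched as a plain data-cleaning function. This is a toy illustration and not BLIP's code or API: `captioner` and `is_match` are hypothetical callables standing in for the trained captioning and filtering models, and the sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
def bootstrap_captions(pairs, captioner, is_match):
    """Build a cleaned dataset from noisy (image, web_caption) pairs.

    For each image a synthetic caption is generated, and every candidate
    caption is kept only if the filter judges that it matches the image.
    """
    cleaned = []
    for image, web_caption in pairs:
        for caption in (web_caption, captioner(image)):
            if is_match(image, caption) and (image, caption) not in cleaned:
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins: an "image" is just the name of the thing it shows.
def toy_captioner(image):
    return f"a photo of a {image}"


def toy_filter(image, caption):
    return image in caption


web_pairs = [("dog", "buy cheap shoes"), ("cat", "my cat")]
cleaned_pairs = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
print(cleaned_pairs)
```

The noisy web caption for the dog image is dropped and replaced by a synthetic one, while the cat image keeps both its web caption and the synthetic caption.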