Papers from the journal arXiv.
112 shared papers found in total; this page shows entries 21 - 40.
21.
符毓 Yu
(2024-03-31 23:50):
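#paper doi.org/10.48550/arXiv.2403.16527, 2024, Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art. Intelligent control systems can be applied widely across scenarios through pre-training, but they perform poorly in scenarios outside their training distribution. Large models promise to supply the reasoning ability that existing training approaches lack, but they also "hallucinate" (produce decisions that sound reasonable but are in fact poor). This paper attempts to define "hallucination" and gives a taxonomy of methods for detecting and mitigating hallucinations in planning, along with evaluation metrics, datasets, and so on.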
arXiv,
2024.
DOI: 10.48550/arXiv.2403.16527
Abstract:
Autonomous systems are soon to be ubiquitous, from manufacturing autonomy to agricultural field robots, and from health care assistants to the entertainment industry. The majority of these systems are developed with modular sub-components for decision-making, planning, and control that may be hand-engineered or learning-based. While these existing approaches have been shown to perform well under the situations they were specifically designed for, they can perform especially poorly in rare, out-of-distribution scenarios that will undoubtedly arise at test-time. The rise of foundation models trained on multiple tasks with impressively large datasets from a variety of fields has led researchers to believe that these models may provide common sense reasoning that existing planners are missing. Researchers posit that this common sense reasoning will bridge the gap between algorithm development and deployment to out-of-distribution tasks, like how humans adapt to unexpected scenarios. Large language models have already penetrated the robotics and autonomous systems domains as researchers are scrambling to showcase their potential use cases in deployment. While this application direction is very promising empirically, foundation models are known to hallucinate and generate decisions that may sound reasonable, but are in fact poor. We argue there is a need to step back and simultaneously design systems that can quantify the certainty of a model's decision, and detect when it may be hallucinating. In this work, we discuss the current use cases of foundation models for decision-making tasks, provide a general definition for hallucinations with examples, discuss existing approaches to hallucination detection and mitigation with a focus on decision problems, and explore areas for further research in this exciting field.
22.
符毓 Yu
(2024-02-29 22:43):
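#paper doi.org/10.48550/arXiv.2304.09349
2023, LLM as A Robotic Brain: Unifying Egocentric Memory and Control. LLM agents draw on knowledge and reasoning ability acquired through pre-training to solve robotics and planning tasks. However, most effort so far has gone into teaching robots "what to do". This paper focuses on conveying what a robot must not do and on meeting safe-operation standards. For deploying LLM agents in collaborative environments, it proposes a way of imposing constraints that addresses the inherently probabilistic nature of LLMs and their inability to handle complex conditions. Experiments in the VirtualHome environment and on a real robot both show that safety constraints can be satisfied without hurting the goal completion rate.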
arXiv,
2023.
Abstract:
No abstract available.
23.
小W
(2024-02-29 20:28):
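#paper doi:arXiv:2203.13906 Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science. This paper introduces the Biolink Model, which captures the European Molecular Biology Laboratory's understanding of life processes. It uses LinkML (Linked data Modeling Language), a YAML variant, to define a set of hierarchical, interconnected classes and the relationships between them, in order to represent entities in translational science and the connections among them. The work comprises three model repositories (standard biological schemas, samples, and TranslatorMinimal) as well as methods for using the model to link data from different ontologies. Building on this model, other teams have developed the NIH Biomedical Data Translator project and BioCypher, published in Nat. Biotechnol. in 2023.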
arXiv,
2022.
DOI: 10.48550/arXiv.2203.13906
Abstract:
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness between core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally-accepted, open-access model for standardization across biomedical KGs has left the task of reconciling data sources to downstream consumers. Biolink Model is an open source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates), representing biomedical entities such as gene, disease, chemical, anatomical structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
24.
🐼太真实
(2024-02-29 10:04):
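#paper ProPainter: Improving Propagation and Transformer for Video Inpainting. This paper introduces ProPainter, a new video inpainting technique that achieves efficient and accurate video inpainting through dual-domain propagation and a mask-guided sparse video Transformer. It describes ProPainter's three key components in detail (recurrent flow completion, dual-domain propagation, and the mask-guided sparse video Transformer) and provides the corresponding technical details and experimental results.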
arXiv,
2023.
DOI: 10.48550/arXiv.2309.03897
Abstract:
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
25.
尹志
(2024-02-28 22:09):
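#paper An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists doi: https://doi.org/10.48550/arXiv.1710.04019 Generative AI is riding high and the hype around Sora is deafening. I cannot produce results like that myself yet (yes, I am jealous), but I still do not think it is the final answer, especially for the physical world and biological systems. The emphasis on, even faith in, scaling laws in The Bitter Lesson has its value in domains such as language and video, but life science and physical systems have billions of years of history (for physical systems it should really be since the beginning of the universe). The evolution of life, the origins of physical systems, and the first-principles understanding humans have accumulated over centuries should make for better priors. Oh, back to the topic of this paper. Topological data analysis is a tool that brings a system's topological and geometric properties into the analysis and modeling process in order to gain a deeper understanding of the system. This survey explains the tool in detail and offers an analysis of, and tutorial on, its application areas. It also gives a brief but careful introduction to the mathematical prerequisites, mainly algebraic topology and computational geometry. The reason for the rambling above is that, based on some recent hands-on work, I have come to feel strongly that combining abstract mathematical tools such as topology and geometry with generative AI may be a more efficient way to describe biological systems and the physical world than today's brute-force compute, one that reaches deeper into the physical essence of a system. If you also believe that physical systems and the living world are simple and efficient, beautiful and concise, I suggest giving these new techniques a try. By the way, the revision information for this survey reads [Submitted on 11 Oct 2017 (v1), last revised 25 Feb 2021 (this version, v2)]. Doesn't that tell us something?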
arXiv,
2017.
DOI: 10.48550/arXiv.1710.04019
Abstract:
Topological Data Analysis is a recent and fast growing field providing a set of new topological and geometric tools to infer relevant features for possibly complex data. This paper is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of TDA for non experts.
26.
前进
(2024-01-31 22:50):
#paper arxiv.org//pdf/2311.026 2023 Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection.
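The large multimodal model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the visual question answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular task of visual anomaly detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Because the task requires image-/pixel-level evaluation, the proposed GPT-4V-AD framework contains three components: 1) granular region division, 2) prompt design, and 3) Text2Segmentation for easy quantitative evaluation; the authors also try several variants for comparative analysis. The results show that GPT-4V can achieve reasonable results on the zero-shot AD task through the VQA paradigm, for example image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROC on the MVTec AD and VisA datasets, respectively. However, its performance still lags behind state-of-the-art zero-shot methods such as WinCLIP and CLIP-AD, and further research is needed. The study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task.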
arXiv,
2023.
DOI: 10.48550/arXiv.2311.02612
Abstract:
Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, 3) Text2Segmentation for easy quantitative evaluation, and have made some different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to the state-of-the-art zero-shot method, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also post several possible future works. Code is available at https://github.com/zhangzjn/GPT-4V-AD.
27.
尹志
(2024-01-31 10:39):
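#paper doi: https://doi.org/10.48550/arXiv.2304.02643 Segment Anything. A 2023 work from Meta that proposes a foundation model for computer vision. The goal of the paper is clear: to perform general-purpose segmentation through prompting. Although the buzz died down after a round of online hype, its impact on the CV community has been very large. Follow-up works such as Grounding-DINO and Grounded-SAM deliver good results and offer a different way of thinking about how to solve subsequent CV tasks. The work as a whole leans toward engineering; in other words, it has few highlights in terms of original ideas, and the network architecture borrows heavily from a large body of innovative Transformer-based work. What deserves mention is precisely the engineering approach, or solution. Meta poses a novel task, namely how to solve image segmentation through a single general task, and then designs the training pipeline and corresponding losses around it. Along the way it builds an effective data annotation engine that produces labeled data efficiently, which is highly instructive for industrial applications.
From a research perspective, how to make full use of the pre-trained SAM model and how to extract the priors held in large models, so as to support downstream tasks in specific domains, is an important research direction.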
arXiv,
2023.
DOI: 10.48550/arXiv.2304.02643
Abstract:
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
28.
🐼太真实
(2024-01-30 21:45):
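#paper: doi:2110.11316 The paper "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP" introduces a new self-supervised learning method called CLOOB (Contrastive Leave One Out Boost). The method combines modern Hopfield networks with the InfoLOOB (Leave One Out Bound) objective to improve the effectiveness of contrastive learning. In zero-shot transfer learning, CLOOB outperforms the earlier CLIP method regardless of architecture or dataset.
At the core of CLOOB is the use of modern Hopfield networks to enrich the co-occurrence and covariance structure of the data. Compared with classical Hopfield networks, these networks have higher storage capacity and faster retrieval. By using them, CLOOB strengthens the co-occurrence and covariance structure of features in the input samples, effectively extracting and reinforcing the important features in the data.
In addition, CLOOB adopts the InfoLOOB objective to avoid the saturation problem that arises with the InfoNCE objective. InfoLOOB is a contrastive learning objective that handles the relationship between matched and unmatched pairs so as to reduce saturation of the objective and make learning more efficient.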
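To make the difference between the two objectives concrete, here is a minimal sketch in plain Python. It assumes a precomputed similarity matrix in which entry [i][j] is the similarity between anchor i and candidate j, with matched pairs on the diagonal. The function names, the temperature `tau`, and the toy matrices are illustrative assumptions, not code from the CLOOB authors; only one direction of the loss is shown and the Hopfield retrieval step is left out.

```python
import math

def info_nce(sim, tau=1.0):
    """Mean InfoNCE loss: the matched pair also appears in the denominator."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        denom = sum(math.exp(sim[i][j] / tau) for j in range(n))
        total += -math.log(pos / denom)
    return total / n

def info_loob(sim, tau=1.0):
    """Mean InfoLOOB loss: the matched pair is left out of the denominator."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        denom = sum(math.exp(sim[i][j] / tau) for j in range(n) if j != i)
        total += -math.log(pos / denom)
    return total / n

# Toy similarity matrices (hypothetical values), matched pairs on the diagonal.
SIM = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
# Well-separated pairs: the case where InfoNCE saturates.
BIG = [
    [50.0, 0.0],
    [0.0, 50.0],
]
```

Because the matched pair sits in its own denominator, the InfoNCE ratio can never exceed 1, so the loss flattens out near zero once the matched pair dominates. With the matched pair left out, the InfoLOOB value keeps decreasing as the pairs separate, which is the sense in which it avoids saturation.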
arXiv,
2021.
DOI: 10.48550/arXiv.2110.11316
Abstract:
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered as a foundation model like BERT or GPT3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or few features, while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest to use modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments we compare CLOOB to CLIP after pre-training on the Conceptual Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
29.
尹志
(2023-12-31 14:32):
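#paper Consistency Models https://doi.org/10.48550/arXiv.2303.01469 Diffusion models are now the core technical approach in generative AI, but their iterative generation makes sampling slow, which gets in the way in real application scenarios. Consistency models (CM) are an efficient improvement over conventional diffusion models. Building on the trajectory of the PF (probability flow) ODE, they learn a mapping on the ODE trajectory (which can be thought of as the sequence of iterative steps) that takes any point on the trajectory, i.e. any timestep of the iteration, back to the starting point, i.e. the original image. CM raises the quality of single-step diffusion sampling and has thereby driven many practical applications, including image editing and image inpainting. Many practical applications built on diffusion models already use CM. This is work from early in the year by Yang Song together with Ilya Sutskever; all four authors are diffusion-model experts from OpenAI.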
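The "map any point on the trajectory back to the start" idea can be sketched with a skip-connection parameterization that forces the mapping to be the identity at the smallest timestep. The sketch below is a scalar toy under stated assumptions: the constants `SIGMA_DATA` and `EPS`, the specific forms of `c_skip` and `c_out`, and the stand-in `toy_network` are illustrative choices, not taken from the comment above, and a real model would use a trained neural network on image tensors.

```python
import math

SIGMA_DATA = 0.5   # assumed data standard deviation (illustrative)
EPS = 0.002        # assumed smallest timestep on the trajectory (illustrative)

def c_skip(t):
    """Weight on the input; equals 1 at t == EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)

def c_out(t):
    """Weight on the network output; equals 0 at t == EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)

def toy_network(x, t):
    """Hypothetical stand-in for a trained network F(x, t)."""
    return math.tanh(x) / (1.0 + t)

def consistency_fn(x, t, network=toy_network):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t); returns x itself when t == EPS."""
    return c_skip(t) * x + c_out(t) * network(x, t)

def one_step_sample(x_noisy, t_max=80.0, network=toy_network):
    """One-step generation: map a point at time t_max directly to a data estimate."""
    return consistency_fn(x_noisy, t_max, network)
```

With this construction the boundary condition holds no matter what the network outputs, so training only has to make the outputs agree across timesteps on the same trajectory. At large `t` the skip weight is close to zero and the result is determined almost entirely by the network.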
arXiv,
2023.
DOI: 10.48550/arXiv.2303.01469
Abstract:
Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.
30.
🐼太真实
(2023-12-28 20:39):
#paper https://doi.org/10.48550/arXiv.2312.03701 , Self-conditioned Image Generation via Generating Representations
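This paper introduces a new image generation framework called Representation-Conditioned image Generation (RCG). RCG does not rely on human annotations; instead it generates images conditioned on a self-supervised representation distribution. A pre-trained encoder maps the image distribution to a representation distribution, a representation diffusion model (RDM) samples from it, and a pixel generator then produces the image conditioned on the sampled representation. On ImageNet 256×256, RCG achieves a substantial performance gain, with an FID of 3.31 and an IS of 253.4. The method not only significantly advances class-unconditional image generation but is also competitive with the current leading class-conditional methods, closing the long-standing performance gap between the two tasks.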
arXiv,
2023.
DOI: 10.48550/arXiv.2312.03701
Abstract:
This paper presents Representation-Conditioned image Generation (RCG), a simple yet effective image generation framework which sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from such representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256×256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state-of-the-art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.
31.
前进
(2023-12-27 15:11):
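#paper arXiv:2312.11514v1, 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Large language models (LLMs) play an important role in modern natural language processing, but their heavy compute and memory requirements pose a challenge for devices with limited memory. To run LLMs that exceed the available DRAM capacity efficiently, the paper stores the model parameters in flash memory and brings them into DRAM on demand. The approach involves building an inference cost model matched to the behavior of flash memory and optimizing in two key areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this framework two main techniques are introduced: the "windowing" strategy reduces data transfer by reusing previously activated neurons, and "row-column bundling" exploits flash memory's strength in sequential access by increasing the size of the chunks read from flash. Together these methods make it possible to run models up to twice the size of the available DRAM, with inference 4-5x faster on CPU and 20-25x faster on GPU than naive loading.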
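The "windowing" idea can be sketched as simple bookkeeping: keep in DRAM the neurons that were active for the last few tokens, and for each new token load from flash only the neurons that are not already resident. The class below is a hypothetical illustration of that bookkeeping, not the paper's implementation; the class name, the `window` parameter, and the use of integer neuron ids are all assumptions.

```python
from collections import deque

class NeuronWindow:
    """Tracks which neurons stay in DRAM across the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.history = deque()   # active-neuron sets of the most recent tokens
        self.resident = set()    # neuron ids currently held in DRAM

    def step(self, active):
        """Process one token; return (ids to load from flash, ids to evict)."""
        active = set(active)
        # Only neurons missing from DRAM need a flash read.
        to_load = active - self.resident
        self.history.append(active)
        if len(self.history) > self.window:
            self.history.popleft()
        # Anything used within the window stays; the rest can be freed.
        needed = set().union(*self.history)
        to_evict = self.resident - needed
        self.resident = needed
        return to_load, to_evict
```

For example, with `window=2` and consecutive active sets `{1, 2, 3}`, `{2, 3, 4}`, `{3, 4, 5}`, the three steps load 3, 1, and 1 neurons respectively, and neuron 1 is evicted at the third step. The more consecutive tokens overlap in their active neurons, the less data has to be read from flash per token.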
arXiv,
2023.
DOI: 10.48550/arXiv.2312.11514
Abstract:
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
32.
符毓 Yu
(2023-11-30 23:11):
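#paper doi.org/10.48550/arXiv.2311.05332, 2023, On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving. A recent paper from the WeRide team that applies GPT to autonomous driving. Tests show that GPT achieves fairly high accuracy in image recognition, point cloud recognition, weather recognition, V2X images, simulated image recognition, and multi-view image recognition, but is prone to errors in traffic light recognition and in telling left from right.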
arXiv,
2023.
DOI: 10.48550/arXiv.2311.05332
Abstract:
The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuance of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It showcases the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. Project is now available on GitHub for interested parties to access and utilize: https://github.com/PJLab-ADG/GPT4V-AD-Exploration
33.
Vincent
(2023-11-30 16:34):
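#paper Contrastive Variational Autoencoder Enhances Salient Features, arxiv, 2019 https://arxiv.org/abs/1902.04601 The recent contrastive PCA adopts the idea of contrastive learning and can capture the differences between a target dataset and a background, yielding unsupervised dimensionality reduction that preserves the contrastive signal. Like PCA, however, contrastive PCA can only reduce dimensionality through linear combinations of variables and cannot capture nonlinear relationships among them. This paper extends contrastive PCA by using a variational autoencoder (VAE) to capture nonlinear relationships; the resulting method is called the contrastive VAE. By explicitly modeling both the features shared between the datasets and the features enriched in the target data, the contrastive VAE isolates and enhances the salient latent features of the target data. Its run-time is similar to that of a VAE, and it is robust to noise and dataset purity. The paper validates the method's effectiveness at capturing salient latent features on several datasets (for example, MNIST handwritten digits), with consistent improvement over the conventional VAE. As a generative learning tool, once trained it can also use these salient latent features to generate new data.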
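The split between shared and target-enriched features can be illustrated with a toy decoder. The sketch assumes a formulation in which both datasets share one decoder, target samples are decoded from a shared latent `z` plus a salient latent `s`, and background samples are decoded with `s` fixed to zero. The linear decoder, the weight values, and all function names are hypothetical; a real contrastive VAE would use neural encoders and decoders trained with a variational objective.

```python
def decode(z, s, weights_z, weights_s):
    """Toy linear decoder shared by both datasets: x_hat = Wz @ z + Ws @ s."""
    return [
        sum(wz * zi for wz, zi in zip(row_z, z))
        + sum(ws * si for ws, si in zip(row_s, s))
        for row_z, row_s in zip(weights_z, weights_s)
    ]

def reconstruct_target(z, s, weights_z, weights_s):
    """Target samples use both the shared latent z and the salient latent s."""
    return decode(z, s, weights_z, weights_s)

def reconstruct_background(z, s_dim, weights_z, weights_s):
    """Background samples use the same decoder with the salient latent zeroed."""
    return decode(z, [0.0] * s_dim, weights_z, weights_s)

# Hypothetical weights: two output dimensions, two shared and one salient latent.
W_Z = [[1.0, 0.0], [0.0, 1.0]]
W_S = [[2.0], [0.0]]
```

Because the background can only be explained through `z`, variation that exists only in the target data has to flow through `s`, which is what lets the salient latent isolate the target-specific structure.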
arXiv,
2019.
DOI: 10.48550/arXiv.1902.04601
Abstract:
Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background, e.g. enriched in patients compared to the general population. Contrastive learning is a principled framework to capture such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. The cVAE is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.
34.
Ricardo
(2023-10-31 22:15):
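#paper https://doi.org/10.48550/arXiv.2308.01316 Patched Denoising Diffusion Models For High-Resolution Image Synthesis. I have recently been looking into how to use generative models to map brain segmentation images back to T1w/T2w images. Most medical image generation algorithms are patch-based and stitch the patches back together in voxel space, but this produces discontinuities at patch boundaries. This paper proposes training a diffusion model on patches and removing the boundary effect in feature space, so I have been trying to work out how to apply the method to my own work. What I am currently working on is building brain template images across the full age range; if there is a chance, I can talk to everyone about this work.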
arXiv,
2023.
DOI: 10.48550/arXiv.2308.01316
Abstract:
We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024×512), trained on small-size image patches (e.g., 64×64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024×512), as well as on standard benchmarks of smaller sizes (256×256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all four datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models.
35.
Vincent
(2023-08-31 23:50):
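#paper https://doi.org/10.48550/arXiv.2306.03301. arxiv 2023, Estimating Conditional Mutual Information for Dynamic Feature Selection. Dynamic feature selection involves learning a feature selection policy and making predictions of the target from arbitrary feature subsets. Learning the selection policy is often very challenging. This paper introduces a method that prioritizes features by their conditional mutual information (CMI) with the prediction target: a neural network is trained to estimate, given the set of features already acquired, the predictive power (conditional mutual information) of each remaining feature, and at each step the most informative feature is added to the current set. This is iterated until a stopping condition is met (for example, a given number of features, a level of uncertainty, or a cost). The framework can also make use of prior information. The paper shows that the method performs well on both tabular and image datasets.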
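The selection loop described above can be sketched as a greedy procedure. In this sketch the CMI estimator is passed in as a plain callable; `toy_cmi`, the feature names, and the score values are made up for illustration and stand in for the trained estimator network, and the budget and `min_gain` threshold are two assumed examples of stopping conditions.

```python
def greedy_select(x, estimate_cmi, budget, min_gain=0.0):
    """Acquire features one at a time by highest estimated CMI with the target.

    x            : dict mapping feature name -> value for one sample
    estimate_cmi : callable(observed: dict, candidate: str) -> float
                   (hypothetical stand-in for the trained estimator)
    budget       : maximum number of features to acquire
    min_gain     : stop once the best estimated gain is at or below this value
    """
    observed = {}
    while len(observed) < budget:
        candidates = [f for f in x if f not in observed]
        if not candidates:
            break
        best = max(candidates, key=lambda f: estimate_cmi(observed, f))
        if estimate_cmi(observed, best) <= min_gain:
            break
        observed[best] = x[best]   # "acquire" the feature value
    return list(observed)

def toy_cmi(observed, candidate):
    """Toy estimator: fixed scores, with 'bmi' nearly redundant once 'weight' is known."""
    base = {"age": 0.30, "weight": 0.50, "bmi": 0.45, "zip": 0.0}
    if candidate == "bmi" and "weight" in observed:
        return 0.05
    return base[candidate]

SAMPLE = {"age": 61, "weight": 82.0, "bmi": 27.1, "zip": "00000"}
```

Calling `greedy_select(SAMPLE, toy_cmi, budget=3)` picks `weight` first and then `age` ahead of `bmi`, because the estimate for `bmi` drops once `weight` has been observed. Conditioning on what is already known is the point of using conditional rather than plain mutual information.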
arXiv,
2023.
DOI: 10.48550/arXiv.2306.03301
Abstract:
Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into the prediction process. The problem is challenging, however, as it requires both making predictions with arbitrary feature sets and learning a policy to identify the most valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is learning this selection policy, and we design a straightforward new modeling approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our learning approach, we introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform costs between features, incorporating prior information, and exploring modern architectures to handle partial input information. We find that our method provides consistent gains over recent state-of-the-art methods across a variety of datasets.
36.
符毓 Yu
(2023-08-31 22:39):
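#paper doi.org/10.48550/arXiv.2303.09165 2023, A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation. To address the cost of large-scale manual annotation in machine vision, the team tries using synthetic data. After generating synthetic data according to certain rules, the paper shows that pre-training on synthetic data can outperform pre-training on real data, and may also outperform the results of several data augmentation schemes. There is considerable room for imagination in future applications.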
arXiv,
2023.
DOI: 10.48550/arXiv.2303.09165
Abstract:
Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. However, exhaustive data annotation is impracticable for each task of all domains of interest, due to high labor costs and unguaranteed labeling accuracy. Besides, the uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. All these nuisances may hinder the verification of typical theories and exposure to new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. We in this work push forward along this line by doing profound and extensive research on bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify the typical, important learning insights, e.g., shortcut learning, and discover the new laws of various data regimes and network architectures in generalization. We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use the simulation-to-reality adaptation as a downstream task for comparing the transferability between synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising to improve real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which provides more significant challenges for transfer from simulation to reality. The code and datasets are available at this https URL.
37.
尹志
(2023-08-31 22:11):
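#paper https://doi.org/10.48550/arXiv.1812.07907 PnP-AdaNet: Plug-and-Play Adversarial Domain Adaptation Network at Unpaired Cross-Modality Cardiac Segmentation. I came across this paper while surveying efficient generative models and found it rather interesting. It proposes a network architecture, PnP-AdaNet, that achieves unsupervised domain adaptation for segmentation across different modalities. Given that it is an older paper from 2018, its ideas of swapping out part of the network and using adversarial learning are by now fairly common, but I think the idea of replacing part of a network carries deeper significance now that large models are everywhere. One of my own current research topics follows this thread, and some of the experimental results so far look quite good.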
arXiv,
2018.
DOI: 10.48550/arXiv.1812.07907
Abstract:
Deep convolutional networks have demonstrated the state-of-the-art performance on various medical image computing tasks. Leveraging images from different modalities for the same analysis task holds clinical benefits. However, the generalization capability of deep models on test data with different distributions remain as a major challenge. In this paper, we propose the PnPAdaNet (plug-and-play adversarial domain adaptation network) for adapting segmentation networks between different modalities of medical images, e.g., MRI and CT. We propose to tackle the significant domain shift by aligning the feature spaces of source and target domains in an unsupervised manner. Specifically, a domain adaptation module flexibly replaces the early encoder layers of the source network, and the higher layers are shared between domains. With adversarial learning, we build two discriminators whose inputs are respectively multi-level features and predicted segmentation masks. We have validated our domain adaptation method on cardiac structure segmentation in unpaired MRI and CT. The experimental results with comprehensive ablation studies demonstrate the excellent efficacy of our proposed PnP-AdaNet. Moreover, we introduce a novel benchmark on the cardiac dataset for the task of unsupervised cross-modality domain adaptation. We will make our code and database publicly available, aiming to promote future studies on this challenging yet important research topic in medical imaging.
<<<
翻译
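The plug-and-play idea in the abstract (domain-specific early encoder layers, shared higher layers) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the "layers" are simple affine maps and a threshold, and the CT encoder parameters are set by hand, whereas the paper learns them with adversarial training against two discriminators (omitted here).

```python
from typing import Callable, Dict, List

Vector = List[float]


def make_early_encoder(scale: float, shift: float) -> Callable[[Vector], Vector]:
    """Hypothetical domain-specific early layers: a per-element affine map."""
    def encode(image: Vector) -> Vector:
        return [scale * value + shift for value in image]
    return encode


def shared_higher_layers(features: Vector) -> List[int]:
    """Hypothetical shared layers: threshold features into a binary mask."""
    return [1 if feature > 0.5 else 0 for feature in features]


class PlugAndPlaySegmenter:
    """Swaps the early encoder per domain and reuses the higher layers."""

    def __init__(self,
                 early_encoders: Dict[str, Callable[[Vector], Vector]],
                 shared: Callable[[Vector], List[int]]) -> None:
        self.early_encoders = early_encoders
        self.shared = shared

    def segment(self, image: Vector, domain: str) -> List[int]:
        features = self.early_encoders[domain](image)  # domain-specific part
        return self.shared(features)                   # shared part


# Toy data: the same structure seen with different intensity ranges.
mri_image = [0.1, 0.9, 0.8, 0.2]
ct_image = [-800.0, 800.0, 600.0, -600.0]

model = PlugAndPlaySegmenter(
    early_encoders={
        "mri": make_early_encoder(1.0, 0.0),
        # Hand-set here; an adapted encoder would be learned in practice.
        "ct": make_early_encoder(1.0 / 2000.0, 0.5),
    },
    shared=shared_higher_layers,
)

mri_mask = model.segment(mri_image, "mri")
ct_mask = model.segment(ct_image, "ct")
print(mri_mask, ct_mask)
```

The point of the design is that only the small early module changes per modality, so the shared layers trained on the source domain are reused unchanged.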
38.
尹志
(2023-07-31 22:52):
#paper doi: https://doi.org/10.48550/arXiv.2210.13695
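Structure-based Drug Design with Equivariant Diffusion Models. I reread this paper. Using equivariant diffusion models for structure-based drug design is indeed an effective approach, and a growing body of work keeps demonstrating its value. This is something of a classic (although it was apparently rejected by ICLR). It generates molecules conditioned on protein pockets using an SE(3)-equivariant diffusion model. Extensive experiments show that it is efficient and effective at generating novel and diverse drug molecules. The paper also discusses optimizing existing molecules with the method and partial molecular design via inpainting. Although the results still have many shortcomings, these ideas are very valuable for small-molecule drug design and for improving existing methods.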
arXiv,
2023.
DOI: 10.48550/arXiv.2210.13695
Abstract:
Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Comprehensive in silico experiments demonstrate the efficiency and effectiveness of DiffSBDD in generating novel and diverse drug-like ligands with competitive docking scores. We further explore the flexibility of the diffusion framework for a broader range of tasks in drug design campaigns, such as off-the-shelf property optimization and partial molecular design with inpainting.
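To make "SE(3)-equivariant" concrete: a coordinate update f is equivariant if rotating and translating the input and then applying f gives the same result as applying f first and transforming afterwards. The sketch below checks this numerically for a toy update rule. The update rule, the function names, and the three-atom "ligand" are hypothetical stand-ins chosen for illustration and are not the DiffSBDD network.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def coordinate_update(points: List[Point]) -> List[Point]:
    """Toy update built only from relative vectors and distances.

    Each point moves along the difference vectors to the other points,
    weighted by a function of distance, so the update commutes with
    rotations and translations.
    """
    updated = []
    for i, p in enumerate(points):
        shift = [0.0, 0.0, 0.0]
        for j, q in enumerate(points):
            if i == j:
                continue
            diff = [p[k] - q[k] for k in range(3)]
            dist_sq = sum(d * d for d in diff)
            weight = 1.0 / (1.0 + dist_sq)  # depends on distance only
            for k in range(3):
                shift[k] += weight * diff[k]
        updated.append((p[0] + shift[0], p[1] + shift[1], p[2] + shift[2]))
    return updated


def rigid_transform(points: List[Point], angle: float,
                    translation: Point) -> List[Point]:
    """Rotate about the z axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + translation[0],
         s * x + c * y + translation[1],
         z + translation[2])
        for x, y, z in points
    ]


ligand = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.3), (0.4, 1.1, -0.5)]
angle, shift = 0.7, (2.0, -1.0, 0.5)

transform_then_update = coordinate_update(rigid_transform(ligand, angle, shift))
update_then_transform = rigid_transform(coordinate_update(ligand), angle, shift)

# Largest coordinate disagreement between the two orders of operations.
max_gap = max(
    abs(a - b)
    for p, q in zip(transform_then_update, update_then_transform)
    for a, b in zip(p, q)
)
print(max_gap)
```

A gap at the level of floating-point round-off indicates the two orders agree, which is the property that lets a generative model treat a pocket and ligand the same way regardless of their pose.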
39.
Ricardo
(2023-07-31 22:16):
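#paper doi: https://doi.org/10.48550/arXiv.2112.05149 DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model. Deformable image registration is one of the fundamental tasks in medical imaging. Classical registration algorithms usually need costly iterative optimization. Although deep-learning-based methods have been used for fast registration, obtaining realistic continuous deformations from the moving image to the fixed image with little topological folding remains challenging. To address this, the paper proposes DiffuseMorph, a new diffusion-model-based registration method. DiffuseMorph can not only generate synthetic deformed images through reverse diffusion but also register images via deformation fields. Specifically, the deformation field is generated from the conditional score function of the deformation between the moving and fixed images, and registration along a continuous deformation is obtained by simply scaling the latent feature of the score. Experiments on 2D facial and 3D medical image registration tasks show that the method provides flexible deformations while preserving topology.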
arXiv,
2022.
DOI: 10.48550/arXiv.2112.05149
Abstract:
Deformable image registration is one of the fundamental tasks in medical imaging. Classical registration algorithms usually require a high computational cost for iterative optimizations. Although deep-learning-based methods have been developed for fast image registration, it is still challenging to obtain realistic continuous deformations from a moving image to a fixed image with less topological folding problem. To address this, here we present a novel diffusion-model-based image registration method, called DiffuseMorph. DiffuseMorph not only generates synthetic deformed images through reverse diffusion but also allows image registration by deformation fields. Specifically, the deformation fields are generated by the conditional score function of the deformation between the moving and fixed images, so that the registration can be performed from continuous deformation by simply scaling the latent feature of the score. Experimental results on 2D facial and 3D medical image registration tasks demonstrate that our method provides flexible deformations with topology preservation capability.
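"Topological folding" can be made concrete with a small check. A common convention in the registration literature, assumed here rather than taken from the abstract, is to flag a location as folded when the Jacobian determinant of the mapping phi(x, y) = (x + u_x, y + u_y) is non-positive. The sketch below counts such cells on a 2D grid with forward differences; the function names and toy fields are hypothetical.

```python
from typing import List

Field = List[List[float]]  # indexed as field[row][col]


def count_folds(disp_x: Field, disp_y: Field) -> int:
    """Count grid cells where the deformation has a non-positive Jacobian.

    The mapping is phi(x, y) = (x + disp_x, y + disp_y), with x along
    columns and y along rows. Derivatives use forward differences.
    """
    rows, cols = len(disp_x), len(disp_x[0])
    folds = 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dux_dx = disp_x[r][c + 1] - disp_x[r][c]
            dux_dy = disp_x[r + 1][c] - disp_x[r][c]
            duy_dx = disp_y[r][c + 1] - disp_y[r][c]
            duy_dy = disp_y[r + 1][c] - disp_y[r][c]
            det = (1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx
            if det <= 0.0:
                folds += 1
    return folds


def horizontal_field(rows: int, cols: int, slope: float) -> Field:
    """Displacement that grows linearly with the column index."""
    return [[slope * c for c in range(cols)] for _ in range(rows)]


zeros = horizontal_field(3, 3, 0.0)

identity_folds = count_folds(zeros, zeros)                           # no motion
smooth_folds = count_folds(horizontal_field(3, 3, 0.2), zeros)       # mild stretch
folded_folds = count_folds(horizontal_field(3, 3, -1.5), zeros)      # order reversed
print(identity_folds, smooth_folds, folded_folds)
```

With a slope of -1.5, neighbouring columns swap order after warping, so every one of the four cells on the 3x3 grid is counted as folded, while the mild stretch keeps the determinant positive everywhere.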
40.
符毓 Yu
(2023-07-31 16:41):
#paper doi: 10.48550/arXiv.2307.05973 2023, Composable 3D Value Maps for Robotic Manipulation with Language Models.
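The latest paper from Fei-Fei Li's group, combining language models with robotic manipulation. Coupling with a large language model makes human-robot interaction more efficient and enables vision-based real-time trajectory planning. By eye, the arm moves at roughly one eighth of the usual working speed of a robot arm, and stability needs to improve further for real-world use (the error rate exceeds 25%).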
arXiv,
2023.
DOI: 10.48550/arXiv.2307.05973
Abstract:
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a visual-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Project website: this https URL
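The idea of composing value maps and planning over them can be sketched on a tiny voxel grid. This is only an illustration under stated assumptions: in the paper the maps come from LLM-written code that queries a VLM, and a model-based planner produces the trajectory, whereas here the two maps are written by hand, a simple greedy neighbour search stands in for the planner, and all names (`affordance_map`, `avoidance_map`, `greedy_plan`, the grid size) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
GRID = (5, 5, 1)  # assumed toy workspace, in voxels


def all_voxels() -> List[Voxel]:
    return [(x, y, z)
            for x in range(GRID[0])
            for y in range(GRID[1])
            for z in range(GRID[2])]


def affordance_map(target: Voxel) -> Dict[Voxel, float]:
    """Higher value closer to the target ("move to the object")."""
    return {v: -math.dist(v, target) for v in all_voxels()}


def avoidance_map(obstacle: Voxel, penalty: float = 100.0) -> Dict[Voxel, float]:
    """Large negative value at the voxel to stay away from."""
    return {v: (-penalty if v == obstacle else 0.0) for v in all_voxels()}


def compose(*maps: Dict[Voxel, float]) -> Dict[Voxel, float]:
    """Sum the individual maps voxel by voxel."""
    return {v: sum(m[v] for m in maps) for v in all_voxels()}


def neighbors(v: Voxel) -> List[Voxel]:
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    result = []
    for dx, dy, dz in steps:
        n = (v[0] + dx, v[1] + dy, v[2] + dz)
        if all(0 <= n[k] < GRID[k] for k in range(3)):
            result.append(n)
    return result


def greedy_plan(value: Dict[Voxel, float], start: Voxel, target: Voxel,
                max_steps: int = 50) -> List[Voxel]:
    """Repeatedly step to the best-valued unvisited neighbour."""
    path = [start]
    visited = {start}
    while path[-1] != target and len(path) <= max_steps:
        options = [n for n in neighbors(path[-1]) if n not in visited]
        if not options:
            break
        best = max(options, key=lambda n: value[n])
        path.append(best)
        visited.add(best)
    return path


start, target, obstacle = (0, 2, 0), (4, 2, 0), (2, 2, 0)
value_map = compose(affordance_map(target), avoidance_map(obstacle))
path = greedy_plan(value_map, start, target)
print(path)
```

Because the obstacle sits directly between start and target, the composed map steers the waypoints around it; a greedy search like this can get stuck in more cluttered scenes, which is one reason a real system would use a proper planner.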