Papers from the journal arXiv.
132 paper shares found in total; this page shows entries 1 - 20.
1.
符毓
(2025-05-31 22:59):
#paper doi: 10.48550/arXiv.2505.21906, 2025, Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge.
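Vision-language-action (VLA) models have become the next generation of models in robotics. However, although existing end-to-end VLA systems build on powerful pretrained vision-language models (VLMs), they often lose key capabilities during fine-tuning as the model adapts to specific robot tasks. We argue that a generalizable VLA model should retain and extend the VLM's core capabilities: 1) open-world embodied reasoning: the VLA should inherit the VLM's knowledge, that is, recognize anything the VLM can recognize, solve math problems, and possess visual-spatial intelligence; 2) reasoning following: effectively translating open-world reasoning into steps the robot can execute.
This paper introduces ChatVLA-2, which gives a VLA model the ability to perform a variety of tasks by exploiting, end to end, the innate reasoning and understanding abilities of a pretrained vision-language model. The core contribution is a dynamic Mixture-of-Experts (MoE) module integrated on top of the pretrained vision-language backbone. The module manages differing task demands: some experts share general multimodal features, while others specialize in task-specific representations. In addition, a two-stage training strategy is proposed: first, the VLA model is guided to connect pretrained multimodal knowledge with robot actions; then a reasoning-following stage is introduced so that the model can understand its reasoning output and translate it into the corresponding actions.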
arXiv,
2025-05-28T02:48:42Z.
DOI: 10.48550/arXiv.2505.21906
Abstract:
Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from VLM, i.e., recognize anything that the VLM can recognize, capable of solving math problems, possessing visual-spatial intelligence, 2) Reasoning following - effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-expert VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.
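The mixture-of-experts idea mentioned for ChatVLA-2 can be illustrated with a toy layer in which one shared expert always runs and a gate picks the best-scoring routed expert. This is only a sketch under assumptions: the function names, the toy experts, and the top-1 routing rule are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, shared_expert, routed_experts, gate_weights, top_k=1):
    """Add the output of an always-on shared expert to the gated top-k routed experts."""
    # One gate score per routed expert: dot product of its gate vector with x.
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    out = list(shared_expert(x))
    for i in chosen:
        y = routed_experts[i](x)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen

# Hypothetical toy experts: the shared expert is the identity map.
shared = lambda x: list(x)
experts = [lambda x: [2.0 * xi for xi in x], lambda x: [-1.0 * xi for xi in x]]
gates = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gate vectors
output, chosen = moe_forward([3.0, 1.0], shared, experts, gates)
```

With these toy values the gate prefers the first routed expert, so the result is the input plus twice the input.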
2.
刘馨云
(2025-05-31 21:32):
#paper https://arxiv.org/pdf/2505.20290
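Humans learn new tasks by watching others. Inspired by this, we propose the EgoZero framework, which learns closed-loop robot policies from first-person (egocentric) videos captured by humans wearing smart glasses. Smart glasses capture a rich, multimodal first-person view of human interaction: RGB video records the surrounding scene, the IMU (inertial measurement unit) provides head-motion information, and the microphone records conversation and ambient sound. Our method learns how to act solely by observing these first-person videos, without any robot demonstrations. Given a video of a human completing a task, EgoZero predicts a sequence of intermediate goals and language subgoals and uses them to execute the task in closed loop on a real robot. EgoZero compresses human observations into morphology-agnostic state representations that can be used for decision-making and closed-loop control. The learned policies generalize well across robot morphologies, environments, and tasks. We validate on a real Franka Panda arm; the results show that EgoZero completes a range of challenging manipulation tasks with a 70% zero-shot success rate, with only 20 minutes of data collection per task.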
arXiv,
2025-05-26T17:59:17Z.
DOI: 10.48550/arXiv.2505.20290
Abstract:
Despite recent progress in general purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans interact constantly with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, and zero robot data. EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a gripper Franka Panda robot and demonstrate zero-shot transfer with 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning - paving the way toward a future of abundant, diverse, and naturalistic training data for robots. Code and videos are available at https://egozero-robot.github.io.
3.
尹志
(2025-05-31 21:23):
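#paper https://doi.org/10.48550/arXiv.2012.07436 Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. A classic AAAI 2021 paper on long-sequence time-series modeling. It improves on the standard Transformer and proposes a new model, Informer: a modified self-attention with distilling, together with a generative-style decoder, reduces the time and space complexity problems of the standard Transformer. The work achieves good performance on several datasets. These ideas have been used frequently in later time-series modeling work and are very instructive.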
arXiv,
2020-12-14T11:43:09Z.
DOI: 10.48550/arXiv.2012.07436
Abstract:
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves $O(L \log L)$ in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
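The intuition behind sparse attention of this kind can be sketched by scoring each query with a max-minus-mean measure over its scaled dot products and keeping only the top-u "active" queries. This is a simplified dense version written for illustration: the function names are hypothetical, and it scores against every key rather than a random sample of keys, so it does not reach the $O(L \log L)$ cost.

```python
import math

def sparsity_scores(queries, keys):
    """Max-minus-mean of the scaled dot products between each query and all keys."""
    d = len(keys[0])
    scores = []
    for q in queries:
        dots = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        scores.append(max(dots) - sum(dots) / len(dots))
    return scores

def select_active_queries(queries, keys, u):
    """Indices of the u queries whose attention is least uniform."""
    scores = sparsity_scores(queries, keys)
    return sorted(range(len(queries)), key=lambda i: scores[i], reverse=True)[:u]

# Toy data: only the second query reacts strongly to the keys.
queries = [[0.0, 0.0], [4.0, 0.0], [0.1, 0.1]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
active = select_active_queries(queries, keys, u=1)
```

Queries that are not selected would receive a cheap default output (for example the mean of the values) instead of a full attention row.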
4.
符毓
(2025-04-30 22:15):
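#paper doi: arxiv.org/abs/2504.19193, 2025, Trajectory Planning with Model Predictive Control for Obstacle Avoidance Considering Prediction Uncertainty. This paper introduces a novel trajectory planner for autonomous robots that improves navigation by adding dynamic obstacle avoidance within the Robot Operating System 2 (ROS2) and Navigation 2 (Nav2) framework. The method uses Model Predictive Control (MPC) and focuses on handling the uncertainty in predicting the motion of dynamic obstacles. Unlike existing Nav2 trajectory planners, which mainly handle static obstacles or react to the current position of dynamic obstacles, this planner predicts future obstacle positions so that the robot avoids regions where obstacles are likely to be.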
arXiv,
2025-04-27T11:00:19Z.
DOI: 10.48550/arXiv.2504.19193
Abstract:
This paper introduces a novel trajectory planner for autonomous robots, specifically designed to enhance navigation by incorporating dynamic obstacle avoidance within the Robot Operating System 2 (ROS2) and Navigation 2 (Nav2) framework. The proposed method utilizes Model Predictive Control (MPC) with a focus on handling the uncertainties associated with the movement prediction of dynamic obstacles. Unlike existing Nav2 trajectory planners which primarily deal with static obstacles or react to the current position of dynamic obstacles, this planner predicts future obstacle positions using a stochastic Vector Auto-Regressive Model (VAR). The obstacles' future positions are represented by probability distributions, and collision avoidance is achieved through constraints based on the Mahalanobis distance, ensuring the robot avoids regions where obstacles are likely to be. This approach considers the robot's kinodynamic constraints, enabling it to track a reference path while adapting to real-time changes in the environment. The paper details the implementation, including obstacle prediction, tracking, and the construction of feasible sets for MPC. Simulation results in a Gazebo environment demonstrate the effectiveness of this method in scenarios where robots must navigate around each other, showing improved collision avoidance capabilities.
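A Mahalanobis-distance constraint of the kind the abstract mentions can be sketched for a 2D Gaussian obstacle prediction as follows. The function names, the covariance values, and the threshold are assumptions for illustration; the paper's actual constraint construction inside the MPC is not reproduced here.

```python
def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance of 2D point p from a Gaussian N(mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        # A valid 2x2 covariance has a strictly positive determinant.
        raise ValueError("covariance must be positive definite")
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

def is_safe(p, mean, cov, threshold_sq):
    """A candidate robot position is accepted if it lies outside the confidence ellipse."""
    return mahalanobis_sq(p, mean, cov) >= threshold_sq

# Hypothetical predicted obstacle: more uncertain along x than along y.
obstacle_mean = (2.0, 0.0)
obstacle_cov = ((1.0, 0.0), (0.0, 0.25))
threshold_sq = 4.0  # hypothetical "two sigma" margin
```

Because the distance is scaled by the covariance, the robot has to keep a larger Euclidean margin in the direction where the prediction is more uncertain.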
5.
尹志
(2025-04-30 15:56):
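#paper doi:10.48550/arXiv.2407.20516, Machine Unlearning in Generative AI: A Survey. A very interesting direction (the Chinese rendering is presumably 机器遗忘). As models keep growing, how to process a model so that specific information can be added and erased in a controlled way will be an important topic, for both privacy protection and model control.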
arXiv,
2024-07-30T03:26:09Z.
DOI: 10.48550/arXiv.2407.20516
Abstract:
Generative AI technologies have been deployed in many places, such as (multimodal) large language models and vision generative models. Their remarkable performance should be attributed to massive training data and emergent reasoning abilities. However, the models would memorize and generate sensitive, biased, or dangerous information originated from the training data especially those from web crawl. New machine unlearning (MU) techniques are being developed to reduce or eliminate undesirable knowledge and its effects from the models, because those that were designed for traditional classification tasks could not be applied for Generative AI. We offer a comprehensive survey on many things about MU in Generative AI, such as a new problem formulation, evaluation methods, and a structured discussion on the advantages and limitations of different kinds of MU techniques. It also presents several critical challenges and promising directions in MU research. A curated list of readings can be found: https://github.com/franciscoliu/GenAI-MU-Reading.
6.
Vincent
(2025-03-31 16:09):
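#paper doi: https://doi.org/10.48550/arXiv.2503.00096 BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology. Large language models show significant potential for accelerating scientific discovery. LLM agents in bioinformatics currently lack systematic evaluation, so this paper assembles over 50 real-world scenarios with nearly 300 open-answer questions to measure how well LLM-based agents solve complex bioinformatics problems. The authors test two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) and find that accuracy on open-answer questions is low (17%) and that performance on multiple-choice questions is no better than random guessing. The paper's contribution is the set of test cases and the evaluation framework, which lay the groundwork for building better-performing agents.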
arXiv,
2025-02-28T18:47:57Z.
DOI: 10.48550/arXiv.2503.00096
Abstract:
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.
7.
尹志
(2025-03-31 15:06):
#paper:doi:doi.org/10.48550/arXiv.2502.11974, Image Inversion: A Survey from GANs to Diffusion and Beyond(2025).
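A survey of the common algorithms and models for image inversion. It is very recent and mainly covers GAN and diffusion models, with some discussion of the DiT and Rectified Flow frameworks. The core problem in image inversion concerns the latent space, which matters a great deal for other generative AI problems as well.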
arXiv,
2025-02-17T16:20:48Z.
DOI: 10.48550/arXiv.2502.11974
Abstract:
Image inversion is a fundamental task in generative models, aiming to map images back to their latent representations to enable downstream applications such as editing, restoration, and style transfer. This paper provides a comprehensive review of the latest advancements in image inversion techniques, focusing on two main paradigms: Generative Adversarial Network (GAN) inversion and diffusion model inversion. We categorize these techniques based on their optimization methods. For GAN inversion, we systematically classify existing methods into encoder-based approaches, latent optimization approaches, and hybrid approaches, analyzing their theoretical foundations, technical innovations, and practical trade-offs. For diffusion model inversion, we explore training-free strategies, fine-tuning methods, and the design of additional trainable modules, highlighting their unique advantages and limitations. Additionally, we discuss several popular downstream applications and emerging applications beyond image tasks, identifying current challenges and future research directions. By synthesizing the latest developments, this paper aims to provide researchers and practitioners with a valuable reference resource, promoting further advancements in the field of image inversion. We keep track of the latest works at https://github.com/RyanChenYN/ImageInversion
8.
Kunji
(2025-02-28 23:59):
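#paper, https://arxiv.org/pdf/2410.05273, HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. VLAs rely on VLMs with billions of parameters; although they generalize strongly, their high computational cost and slow inference limit their use in dynamic tasks. To address these limitations, the paper proposes the HiRT framework (Hierarchical Robot Transformer framework), which borrows from the dual-process theory of human cognition and uses a dual-system architecture with asynchronous operation to balance frequency and performance. Experiments in simulation and in the real world show that HiRT brings significant improvements. On static tasks, the control frequency doubles while success rates remain comparable. In addition, on real-world dynamic manipulation tasks that previous VLA models struggle with, HiRT raises the success rate from 48% to 75%.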
arXiv,
2024-09-12T09:18:09Z.
DOI: 10.48550/arXiv.2410.05273
Abstract:
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Models (VLMs) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off. HiRT keeps VLMs running at low frequencies to capture temporarily invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experiment results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
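The slow/fast split described here can be mimicked with a single loop in which an expensive model refreshes a cached latent only every few steps while a cheap policy acts at every step. This is a synchronous toy simulation of the idea, not the asynchronous implementation from the paper; `slow_model`, `fast_policy`, and `slow_every` are hypothetical names.

```python
def run_dual_rate(observations, slow_model, fast_policy, slow_every):
    """Refresh the slow latent every `slow_every` steps; compute an action at every step."""
    latent = None
    actions = []
    slow_calls = 0
    for step, obs in enumerate(observations):
        if step % slow_every == 0:
            latent = slow_model(obs)  # expensive, low-frequency update
            slow_calls += 1
        actions.append(fast_policy(obs, latent))  # cheap, high-frequency control
    return actions, slow_calls

# Hypothetical stand-ins for the large VLM and the lightweight policy.
slow_model = lambda obs: obs * 10
fast_policy = lambda obs, latent: latent + obs
actions, slow_calls = run_dual_rate([0, 1, 2, 3, 4], slow_model, fast_policy, slow_every=2)
```

In this toy run the slow model is called three times for five control steps, which is the trade-off the framework tunes: fewer slow calls raise the control frequency, at the price of acting on a staler latent.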
9.
符毓
(2025-02-28 23:00):
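#paper doi.org/10.48550/arXiv.2411.13677, 2024, Bimanual Dexterity for Complex Tasks. Teleoperation is an important way for robots to obtain data. The paper presents a portable, low-cost (about 12k USD in total: 5k for the hands and 7k for the teleoperation system; it can additionally be paired with two robot arms for 16k), and extremely precise teleoperation method for a bimanual humanoid robot arm system. It shows that the system works in both tabletop and mobile settings and that it is more efficient on bimanual dexterous tasks than other methods (such as SteamVR and Vision Pro). However, because there is no haptic feedback, the operator can rely only on visual feedback during teleoperation and cannot feel what the robot arm senses.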
arXiv,
2024-11-20T19:53:35Z.
DOI: 10.48550/arXiv.2411.13677
Abstract:
To train generalist robot policies, machine learning methods often require a substantial amount of expert human teleoperation data. An ideal robot for humans collecting data is one that closely mimics them: bimanual arms and dexterous hands. However, creating such a bimanual teleoperation system with over 50 DoF is a significant challenge. To address this, we introduce Bidex, an extremely dexterous, low-cost, low-latency and portable bimanual dexterous teleoperation system which relies on motion capture gloves and teacher arms. We compare Bidex to a Vision Pro teleoperation system and a SteamVR system and find Bidex to produce better quality data for more complex tasks at a faster rate. Additionally, we show Bidex operating a mobile bimanual robot for in the wild tasks. The robot hands (5k USD) and teleoperation system (7k USD) is readily reproducible and can be used on many robot arms including two xArms (16k USD). Website at https://bidex-teleop.github.io/
10.
尹志
(2025-02-28 15:55):
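#paper doi:10.48550/arXiv.2205.15463 Few-Shot Diffusion Models. The paper proposes a technique for few-shot generation that combines a diffusion model with a set-based ViT. Experiments show that the model can generate samples of a new class from as few as 5 examples.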
arXiv,
2022-05-30T23:20:33Z.
DOI: 10.48550/arXiv.2205.15463
Abstract:
Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.
11.
刘昊辰
(2025-02-25 22:38):
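#paper Playing Hex and Counter Wargames using Reinforcement Learning and Recurrent Neural Networks. This research paper is about using Reinforcement Learning and Recurrent Neural Networks (RNNs) to play Hex and Counter Wargames. It proposes a new system that combines the AlphaZero reinforcement learning algorithm with recurrent neural networks to cope with the strategic complexity of these games. The system generalizes across different terrain and tactical situations, and the paper explores how it scales to larger maps. With limited training resources and compute, the system performs well in complex hex-and-counter wargames, demonstrating its ability to generalize in complex scenarios. Download: https://arxiv.org/abs/2502.13918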
arXiv,
2025-02-19T17:52:45Z.
DOI: 10.48550/arXiv.2502.13918
Abstract:
Hex and Counter Wargames are adversarial two-player simulations of real military conflicts requiring complex strategic decision-making. Unlike classical board games, these games feature intricate terrain/unit interactions, unit stacking, large maps of varying sizes, and simultaneous move and combat decisions involving hundreds of units. This paper introduces a novel system designed to address the strategic complexity of Hex and Counter Wargames by integrating cutting-edge advancements in Recurrent Neural Networks with AlphaZero, a reliable modern Reinforcement Learning algorithm. The system utilizes a new Neural Network architecture developed from existing research, incorporating innovative state and action representations tailored to these specific game environments. With minimal training, our solution has shown promising results in typical scenarios, demonstrating the ability to generalize across different terrain and tactical situations. Additionally, we explore the system's potential to scale to larger map sizes. The developed system is openly accessible, facilitating continued research and exploration within this challenging domain.
12.
惊鸿
(2025-02-15 00:02):
#paper DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Pub Date : 2024-05-07
DOI : arxiv-2405.04434
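We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It has 236B parameters in total, of which 21B are activated per token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA ensures efficient inference by significantly compressing the key-value (KV) cache into a latent vector, while DeepSeekMoE makes it possible to train strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 performs significantly better while saving 42.5% of training costs, reducing the KV cache by 93.3%, and raising the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens and then apply supervised fine-tuning (SFT) and reinforcement learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at "https://github.com/deepseek-ai/DeepSeek-V2".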
arXiv,
2024-05-07T15:56:43Z.
DOI: 10.48550/arXiv.2405.04434
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
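The headline figures in the abstract translate into a few ratios that are easy to check by hand. The short calculation below uses only numbers quoted above; the variable names are illustrative.

```python
# Parameter counts quoted in the abstract, in billions.
total_params_b = 236.0
active_params_b = 21.0
active_fraction = active_params_b / total_params_b  # share of weights used per token

# KV cache reduced by 93.3% relative to DeepSeek 67B.
kv_cache_reduction = 0.933
kv_cache_remaining = 1.0 - kv_cache_reduction
kv_cache_shrink_factor = 1.0 / kv_cache_remaining  # how many times smaller the cache is

# Training cost saved: 42.5%, so this share of the baseline cost remains.
training_cost_remaining = 1.0 - 0.425
```

So roughly 9% of the parameters are active per token, the KV cache is about 15 times smaller, and training costs a little under 60% of the baseline.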
13.
林海onrush
(2025-01-31 23:53):
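#paper, https://doi.org/10.48550/arXiv.2312.01156, Efficient Light Source Placement using Quantum Computing. An interesting small problem: how to use quantum computing to solve the torch placement problem in the game Minecraft. The problem is cast as a quadratic unconstrained binary optimization (QUBO) problem, and the constraints are handled by iteratively learning Lagrange multipliers. Experiments indicate that the method finds valid torch placements within a reasonable number of iterations, although current quantum hardware has limitations and classical methods perform somewhat better on larger maps. The torch placement problem is related to the set cover problem, which illustrates the value of quantum computing for resource-optimization problems.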
arXiv,
2023-12-02T15:28:59Z.
DOI: 10.48550/arXiv.2312.01156
Abstract:
NP-hard problems regularly come up in video games, with interesting connections to real-world problems. In the game Minecraft, players place torches on the ground to light up dark areas. Placing them in a way that minimizes the total number of torches to save resources is far from trivial. In this paper, we use Quantum Computing to approach this problem. To this end, we derive a QUBO formulation of the torch placement problem, which we uncover to be very similar to another NP-hard problem. We employ a solution strategy that involves learning Lagrangian weights in an iterative process, adding to the ever growing toolbox of QUBO formulations. Finally, we perform experiments on real quantum hardware using real game data to demonstrate that our approach yields good torch placements.
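To make the QUBO idea concrete, here is a toy one-dimensional version solved by brute force. It is a sketch under strong assumptions and not the paper's formulation: a torch lights its own cell and the two neighbours, the penalty asks for every cell to be lit exactly once (a simplification that keeps the objective quadratic), and the per-cell weights are fixed rather than learned iteratively.

```python
from itertools import product

def lit_cells(torch, n_cells):
    """Cells lit by a torch at position `torch` on a 1D strip (hypothetical light radius 1)."""
    return [c for c in (torch - 1, torch, torch + 1) if 0 <= c < n_cells]

def energy(x, n_cells, weights):
    """Torch count plus a weighted quadratic penalty for each cell not lit exactly once."""
    cover = [0] * n_cells
    for torch, placed in enumerate(x):
        if placed:
            for c in lit_cells(torch, n_cells):
                cover[c] += 1
    penalty = sum(w * (1 - k) ** 2 for w, k in zip(weights, cover))
    return sum(x) + penalty

def brute_force_minimum(n_cells, weights):
    """Enumerate all binary assignments; only feasible for tiny instances."""
    return min(product((0, 1), repeat=n_cells), key=lambda x: energy(x, n_cells, weights))

n_cells = 5
weights = [2.0] * n_cells  # fixed stand-ins for the Lagrangian weights
best = brute_force_minimum(n_cells, weights)
best_energy = energy(best, n_cells, weights)
```

On this strip two torches suffice, and the minimum-energy assignment lights every cell. An iterative scheme in the spirit of the abstract would raise the weight of any cell that is still dark after a solve and then solve again.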
14.
前进
(2025-01-31 22:31):
#paper 10.48550/arxiv.2408.10234 The Unbearable Slowness of Being: Why do we live at 10 bits/s? arXiv:2408.10234v2 [q-bio.NC] Jieyu Zheng, Markus Meiste
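The paper examines the paradoxical slowness of information processing in human behavior. Although the human sensory systems gather information at roughly 10⁹ bits per second (bits/s), overall human information throughput is only about 10 bits/s. This huge gap has not been adequately explained and touches on many fundamental aspects of brain function. Using a range of experiments and examples, the paper shows that the information rate of human behavior is about 10 bits/s and that this limit may be related to the serial nature of processing in the brain. Although the peripheral nervous system (for example cone photoreceptors and the optic nerve) handles information at very high rates, the central brain appears to process information serially, attending to only one task at a time. This serial mode may be a product of evolution: the main function of early nervous systems was to control movement, and movement decisions are typically local and singular. The paper further proposes that the brain may operate in two modes, an "outer brain" and an "inner brain": the outer brain handles high-dimensional sensory input and motor output at very high information rates, while the inner brain handles the low-dimensional information stream used for decisions and behavioral control at a very low rate (about 10 bits/s). This division of labor may be an important reason for the limit on information throughput. The paper suggests that future research further explore the differences between information processing in the inner and outer brain and how to optimize information-processing efficiency.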
arXiv,
2024-08-03T22:56:45Z.
DOI: 10.48550/arXiv.2408.10234
Abstract:
This article is about the neural conundrum behind the slowness of human behavior. The information throughput of a human being is about 10 bits/s. In comparison, our sensory systems gather data at ~10^9 bits/s. The stark contrast between these numbers remains unexplained and touches on fundamental aspects of brain function: What neural substrate sets this speed limit on the pace of our existence? Why does the brain need billions of neurons to process 10 bits/s? Why can we only think about one thing at a time? The brain seems to operate in two distinct modes: the "outer" brain handles fast high-dimensional sensory and motor signals, whereas the "inner" brain processes the reduced few bits needed to control behavior. Plausible explanations exist for the large neuron numbers in the outer brain, but not for the inner brain, and we propose new research directions to remedy this.
15.
尹志
(2025-01-31 17:05):
#paper https://doi.org/10.48550/arXiv.2403.07183 Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
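A paper on how large language models are being used, with peer review at top AI conferences as the concrete example (ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023). The study finds that between 6.5% and 16.9% of the review text may have been substantially modified by an LLM, and that these reviews share several interesting traits: lower reported confidence, submission close to the deadline, and reviewers who are less likely to respond to author rebuttals. See the original paper for more interesting observations. The paper lists the adjectives AI most likes to use, such as "commendable", "meticulous", and "intricate"; they really do sound like AI, hahaha. It looks like reviewers will need to be more responsible toward authors from now on.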
arXiv,
2024-03-11T21:51:39Z.
DOI: 10.48550/arXiv.2403.07183
Abstract:
We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
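The corpus-level estimate can be pictured as fitting a two-component mixture: observed word frequencies are modelled as (1 - alpha) times a human reference distribution plus alpha times an AI reference distribution, and alpha is chosen to maximise the likelihood. The sketch below is a heavily simplified illustration with a three-word vocabulary, made-up reference distributions, and a grid search; it is not the estimator used in the paper.

```python
import math

def log_likelihood(alpha, counts, p_human, p_ai):
    """Log-likelihood of word counts under the mixture (1 - alpha) * P_human + alpha * P_ai."""
    return sum(
        n * math.log((1 - alpha) * ph + alpha * pa)
        for n, ph, pa in zip(counts, p_human, p_ai)
    )

def estimate_alpha(counts, p_human, p_ai, steps=100):
    """Grid-search maximum likelihood estimate of the AI-written fraction alpha."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(a, counts, p_human, p_ai))

# Hypothetical reference distributions over three marker words.
p_human = [0.7, 0.2, 0.1]
p_ai = [0.2, 0.3, 0.5]
# Synthetic corpus counts built to match a 20% AI share exactly.
counts = [600, 220, 180]
alpha_hat = estimate_alpha(counts, p_human, p_ai)
```

Because the estimate works on aggregate counts, it can recover the overall share without deciding whether any individual document was machine-written, which is the corpus-level property the abstract emphasises.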
16.
Vincent
(2025-01-31 14:05):
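#paper https://doi.org/10.48550/arXiv.2111.06377 arxiv. 2021. Masked Autoencoders Are Scalable Vision Learners. A classic computer vision paper that proposes a simple, fast, and effective model, the masked autoencoder (MAE). The core idea is to mask random regions of an image and have the model reconstruct them. MAE consists of an asymmetric encoder and decoder: the encoder maps the visible regions of the image into a latent space, and the decoder reconstructs the original image from the latent representation and mask tokens. Notably, even when 75% of the image is masked, the reconstruction still closely resembles the original, which also indicates that the information in an image is very sparse. In addition, because the encoder processes only part of the original image, MAE greatly speeds up training; thanks to self-supervised learning and better representations, it also performs better on downstream tasks. It is worth noting that this "predict the masked region" technique had long been used in language models; this paper applies it to CV, showing that CV can also advance by borrowing research ideas from NLP.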
arXiv,
2021-11-11T18:46:40Z.
DOI: 10.48550/arXiv.2111.06377
Abstract:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
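The random masking step can be sketched in a few lines: shuffle the patch indices, keep the first quarter as the visible set for the encoder, and treat the rest as masked. The patch count of 196 assumes a 14 x 14 grid (for example a 224-pixel image with 16-pixel patches), which is an illustrative choice, and the helper name is hypothetical.

```python
import random

def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into a visible subset and a masked subset, uniformly at random."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])  # only these patches go through the encoder
    masked = sorted(order[num_keep:])   # these are reconstructed by the decoder
    return visible, masked

rng = random.Random(0)  # fixed seed so the split is reproducible
visible, masked = random_masking(196, 0.75, rng)
```

With a 75% mask ratio the encoder sees 49 of 196 patches, which is where most of the training speed-up comes from.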
17.
符毓
(2025-01-31 11:25):
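#paper doi.org/10.48550/arXiv.2405.18730, 2024, Development of a Novel Impedance-Controlled Quasi-Direct-Drive Robotic Hand. Beyond advantages such as low cost and ease of control, the paper argues that quasi-direct-drive actuators also offer unique advantages in dexterous-hand scenarios, such as picking up small objects like coins from the edge of a table, or grasping small objects quickly and dynamically in unstructured environments.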
arXiv,
2024-05-29T03:20:46Z.
DOI: 10.48550/arXiv.2405.18730
Abstract:
Most robotic hands and grippers rely on actuators with large gearboxes and force sensors for controlling gripping force. However, this might not be ideal for tasks that require the robot to interact with an unstructured and unknown environment. In this paper, we introduce a novel quasi-direct-drive two-fingered robotic hand with variable impedance control in the joint space and Cartesian space. The hand has a total of four degrees of freedom, backdrivable differential gear trains, and four brushless direct current (BLDC) motors. Motor torque is controlled through Field-Oriented Control (FOC) with current sensing. Variable impedance control enables the robotic hand to execute dexterous manipulation tasks safely during environment-robot and human-robot interactions. The quasi-direct-drive actuators eliminate the need for complex tactile/force sensors or precise motion planning when handling environmental contact. A majority-3D-printed assembly makes this a low-cost research platform built with affordable, readily available off-the-shelf components. Experimental validation demonstrates the robotic hand's capability for stable force-closure and form-closure grasps in the presence of disturbances, reliable in-hand manipulation, and safe dynamic manipulations despite contact with the environment.
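Joint-space impedance control is commonly written as a virtual spring and damper around a reference: tau = K (q_ref - q) + D (dq_ref - dq). The sketch below applies that textbook law per joint; the gains and joint values are made up, and the paper's controller (including its Cartesian-space variant and any feedforward terms) is not reproduced.

```python
def impedance_torque(q, dq, q_ref, dq_ref, stiffness, damping):
    """Per-joint torque from a virtual spring (stiffness) and damper (damping)."""
    return [
        k * (qr - qi) + d * (dqr - dqi)
        for qi, dqi, qr, dqr, k, d in zip(q, dq, q_ref, dq_ref, stiffness, damping)
    ]

# Hypothetical two-joint finger: joint 1 is off target, joint 2 is moving too fast.
tau = impedance_torque(
    q=[0.0, 0.5],          # measured joint angles (rad)
    dq=[0.0, 1.0],         # measured joint velocities (rad/s)
    q_ref=[0.1, 0.5],      # reference angles
    dq_ref=[0.0, 0.0],     # reference velocities
    stiffness=[2.0, 2.0],  # virtual spring gains, chosen for illustration
    damping=[0.5, 0.5],    # virtual damper gains, chosen for illustration
)
```

Lowering the stiffness makes the finger yield more on contact, which is why variable impedance suits backdrivable, quasi-direct-drive actuators where motor current is a usable proxy for torque.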
<<<
翻译
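As a rough illustration of the joint-space impedance control the abstract mentions, a single joint can be commanded to behave like a virtual spring-damper. This is the generic textbook law, not the authors' controller; the function name, gain values, and units below are assumptions made for the example.

```python
def impedance_torque(q, dq, q_des, dq_des=0.0, stiffness=2.0, damping=0.05):
    """Torque command (N*m) that makes one joint act like a virtual spring-damper.

    q, dq         : measured joint angle (rad) and velocity (rad/s)
    q_des, dq_des : desired joint angle and velocity
    stiffness     : virtual spring gain K (N*m/rad), hypothetical value
    damping       : virtual damper gain D (N*m*s/rad), hypothetical value
    """
    return stiffness * (q_des - q) + damping * (dq_des - dq)


# The same position error yields a gentler torque when the stiffness is lowered,
# which is the sense in which the impedance is "variable".
stiff = impedance_torque(q=0.0, dq=0.0, q_des=0.5, stiffness=2.0)
soft = impedance_torque(q=0.0, dq=0.0, q_des=0.5, stiffness=0.2)
print(stiff, soft)
```

With a backdrivable, low-ratio drive, a torque command like this can be tracked through motor current, which is why such designs can do without a separate force sensor.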
18.
刘昊辰
(2025-01-24 14:04):
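#paper Proof Number Based Monte-Carlo Tree Search. This paper proposes the PN-MCTS algorithm, which combines Monte-Carlo Tree Search (MCTS) with Proof-Number Search (PNS). Experiments across several game domains show that it has an advantage over conventional MCTS on some of the games, offering a new direction for improving game-search algorithms. Download: https://arxiv.org/pdf/2303.09449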
arXiv,
2023-03-16T16:27:07Z.
DOI: 10.48550/arXiv.2303.09449
Abstract:
This paper proposes a new game-search algorithm, PN-MCTS, which combines Monte-Carlo Tree Search (MCTS) and Proof-Number Search (PNS). These two algorithms have been successfully applied for decision making in a range of domains. We define three areas where the additional knowledge provided by the proof and disproof numbers gathered in MCTS trees might be used: final move selection, solving subtrees, and the UCB1 selection mechanism. We test all possible combinations on different time settings, playing against vanilla UCT on several games: Lines of Action ($7 \times 7$ and $8 \times 8$ board sizes), MiniShogi, Knightthrough, and Awari. Furthermore, we extend this new algorithm to properly address games with draws, like Awari, by adding an additional layer of PNS on top of the MCTS tree. The experiments show that PN-MCTS is able to outperform MCTS in all tested game domains, achieving win rates up to 96.2% for Lines of Action.
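The proof and disproof numbers that the abstract refers to follow the standard Proof-Number Search backup rule, sketched below. This shows only how the numbers are computed in an AND/OR tree; how PN-MCTS feeds them into move selection or UCB1 is not described here and is not modeled. The `Node` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    is_or: bool                      # True if the prover is to move at this node
    children: List["Node"] = field(default_factory=list)
    proof: float = 1.0               # unexpanded leaves start at (1, 1)
    disproof: float = 1.0


def proven_leaf(is_or: bool) -> Node:
    """Terminal node that is a win for the prover."""
    return Node(is_or, proof=0.0, disproof=math.inf)


def update(node: Node) -> None:
    """Recompute proof/disproof numbers bottom-up for the subtree at node."""
    if not node.children:
        return  # leaf values are set when the leaf is created
    for child in node.children:
        update(child)
    proofs = [c.proof for c in node.children]
    disproofs = [c.disproof for c in node.children]
    if node.is_or:   # the prover needs only one child to succeed
        node.proof, node.disproof = min(proofs), sum(disproofs)
    else:            # the prover needs every child to succeed
        node.proof, node.disproof = sum(proofs), min(disproofs)


# Root OR node with an AND child (two unknown leaves) and one unknown leaf.
and_child = Node(is_or=False, children=[Node(True), Node(True)])
root = Node(is_or=True, children=[and_child, Node(False)])
update(root)
print(root.proof, root.disproof)
```

A small proof number marks a subtree that is cheap to prove as a win, which is the extra knowledge the abstract proposes to exploit inside the MCTS tree.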
19.
林海onrush
(2025-01-01 00:27):
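#paper, doi: https://doi.org/10.48550/arXiv.2305.19229, FedDisco: Federated Learning with Discrepancy-Aware Collaboration. A federated learning paper from ICML, a top AI conference. It proposes a new federated learning (FL) method, FedDisco, to address data heterogeneity, in particular differences in category distributions. Conventional federated learning usually assigns model aggregation weights according to the size of each client's dataset, but this does not adequately reflect differences in the clients' category distributions and leaves the global model under-optimized. FedDisco introduces a "discrepancy-aware" way of computing aggregation weights that combines each client's dataset size with the degree of discrepancy between its local category distribution and the global one, and optimizes the global model by adjusting the weights accordingly. While preserving privacy, the method improves communication and computation efficiency, and theoretical analysis shows that it tightens the upper bound on the optimization error, which in turn improves global model performance.
Experiments show that FedDisco significantly outperforms existing federated learning methods across a range of heterogeneity scenarios and datasets, and that its modular design can be easily integrated into existing methods to improve their performance further. The method also works well when only a subset of clients participate and on text classification tasks. FedDisco's key strength is its novel aggregation-weight allocation strategy, which improves the robustness and generalization of federated learning algorithms at low computation and communication cost.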
arXiv,
2023-05-30T17:20:51Z.
DOI: 10.48550/arXiv.2305.19229
Abstract:
This work considers the category distribution heterogeneity in federated learning. This issue is due to biased labeling preferences at multiple clients and is a typical setting of data heterogeneity. To alleviate this issue, most previous works consider either regularizing local models or fine-tuning the global model, while they ignore the adjustment of aggregation weights and simply assign weights based on the dataset size. However, based on our empirical observations and theoretical analysis, we find that the dataset size is not optimal and the discrepancy between local and global category distributions could be a beneficial and complementary indicator for determining aggregation weights. We thus propose a novel aggregation method, Federated Learning with Discrepancy-aware Collaboration (FedDisco), whose aggregation weights not only involve both the dataset size and the discrepancy value, but also contribute to a tighter theoretical upper bound of the optimization error. FedDisco also promotes privacy-preservation, communication and computation efficiency, as well as modularity. Extensive experiments show that our FedDisco outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance the performance. Our code will be available at https://github.com/MediaBrain-SJTU/FedDisco.
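A minimal sketch of what a discrepancy-aware aggregation weight could look like. The text above does not give the exact formula, so this assumes a score of the form max(0, relative size - a * discrepancy + b) with the L2 distance as the discrepancy measure; the function names and the hyperparameters `a` and `b` are hypothetical.

```python
import math


def discrepancy(local_dist, global_dist):
    """L2 distance between a client's and the global category distribution
    (an assumed choice of discrepancy measure)."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(local_dist, global_dist)))


def aggregation_weights(sizes, local_dists, global_dist, a=0.5, b=0.1):
    """Weights that grow with dataset size and shrink with discrepancy.

    a and b are hypothetical hyperparameters for this sketch.
    """
    total = sum(sizes)
    scores = []
    for n, dist in zip(sizes, local_dists):
        relative_size = n / total
        score = relative_size - a * discrepancy(dist, global_dist) + b
        scores.append(max(0.0, score))  # clip negative scores to zero
    norm = sum(scores)
    if norm == 0.0:  # fall back to plain size-based weights
        return [n / total for n in sizes]
    return [s / norm for s in scores]


# Two equally sized clients: one matches the global distribution, one is skewed.
weights = aggregation_weights(
    sizes=[100, 100],
    local_dists=[[0.5, 0.5], [1.0, 0.0]],
    global_dist=[0.5, 0.5],
)
print(weights)
```

Size-only weighting would give both clients 0.5; here the client whose label distribution matches the global one receives the larger share.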
20.
前进
(2024-12-31 20:09):
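#paper DOI 10.48550/arXiv.2111.06377 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. B. (2021). Masked Autoencoders Are Scalable Vision Learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). This paper proposes a novel self-supervised learning framework, the masked autoencoder (MAE). Its core idea is a random masking strategy: only the 25% of image patches left unmasked are used to reconstruct the whole image, which forces the model to learn more effective visual features. MAE also adopts an asymmetric encoder-decoder architecture, with an encoder that processes only the unmasked parts of the image and a lightweight decoder that reconstructs the original image from the encoder output and the positional information of the masked parts. This greatly reduces computational cost and improves training efficiency. Experiments show that MAE generalizes very well as a self-supervised pre-training method, transfers to a variety of downstream tasks, and scales well.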
arXiv,
2021-11-11T18:46:40Z.
DOI: 10.48550/arXiv.2111.06377
Abstract:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
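The random masking step described in the abstract can be sketched as follows. This covers only the choice of which patches the encoder sees, not the encoder or decoder; the function name is hypothetical, and the 196-patch example assumes a 14 x 14 patch grid.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set (fed to the encoder) and a
    masked set (to be reconstructed by the decoder)."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                 # uniform random permutation of patches
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Assumed example: a 14 x 14 grid of patches, 75% of them masked.
visible, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Because the encoder runs on only a quarter of the patches, most of the compute saving the abstract reports comes from this step.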