林海onrush (2025-06-07 13:27):
#paper, Token-Importance Guided Direct Preference Optimization, DOI: https://arxiv.org/abs/2505.19653. Sharing my latest work on fine-tuning algorithms for large language models: TI-DPO, a new method for better aligning LLMs with human preferences. The widely used DPO (Direct Preference Optimization) drops the explicit reward model and optimizes directly on human preference data, but it ignores the differing importance of individual tokens in the generated content, so the model can make mistakes on critical tokens and produce outputs that conflict with human values. TI-DPO addresses this with two key innovations:
1. Token-level importance weights based on gradient attribution, which dynamically identify and prioritize the tokens most critical to human preference;
2. A contrastive-learning-based triplet loss that not only separates "good" from "bad" samples but also introduces "intermediate" outputs, making the optimization finer-grained and steering the model toward responses closer to human expectations and away from undesirable ones.
Experiments show that TI-DPO performs strongly across benchmarks such as TruthfulQA and IFEval, surpassing DPO and other alignment methods in both accuracy and generative diversity. Ablation studies further confirm that the token-importance mechanism and the triplet loss are both necessary and effective. A theoretical analysis also shows that TI-DPO enjoys a tighter lower bound on its loss, which makes training more stable. By focusing precisely on critical tokens and combining this with a triplet alignment structure, TI-DPO improves alignment ability and output quality, offering a new approach to AI safety and helpfulness in human-machine interaction.
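A minimal sketch of the gradient-attribution idea, not the paper's exact formulation: score each input token by the gradient norm of the sequence log-likelihood with respect to its embedding, then normalize the scores into weights. The function name `token_importance_weights`, the softmax normalization, and the `temperature` parameter are illustrative assumptions; the code assumes a Hugging Face causal LM.

```python
# Hedged sketch of gradient-based token-importance weights (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_importance_weights(model, tokenizer, text, temperature=1.0):
    """Score each token by the gradient norm of the sequence log-likelihood
    w.r.t. its input embedding, then normalize the scores into weights."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]                                   # (1, T)

    # Detach the embeddings so they become a leaf tensor we can take grads for.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

    # Log-likelihood of each next token under the model.
    logits = out.logits[:, :-1, :]                                 # predicts tokens 1..T-1
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    token_logp.sum().backward()

    # Importance of each input token = L2 norm of d(log-likelihood)/d(embedding).
    grad_norm = embeds.grad.norm(dim=-1).squeeze(0)                # (T,)
    return torch.softmax(grad_norm / temperature, dim=-1)


# Usage (illustrative):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# weights = token_importance_weights(lm, tok, "The capital of France is Paris.")
```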
arXiv, 2025-05-26T08:11:24Z. DOI: 10.48550/arXiv.2505.19653
Token-Importance Guided Direct Preference Optimization
Abstract:
Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. While Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the importance weights of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to biases and still cannot fully address these issues. To solve this problem, we propose Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: gradient-based token-importance weights that dynamically prioritize critical tokens, and a triple loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
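Following the abstract, here is a hedged sketch of how a triplet-style term over (preferred, intermediate, dispreferred) responses could be combined with a token-importance-weighted DPO objective. The function name, the length-normalized sequence scores, and the `margin` / `triplet_coef` hyperparameters are illustrative assumptions under this reading of the method, not the paper's exact loss.

```python
# Hedged sketch: token-importance-weighted DPO term plus a triplet margin term.
import torch
import torch.nn.functional as F


def ti_dpo_style_loss(
    policy_logps_w,   # (B, Tw) per-token log-probs of the preferred response
    policy_logps_m,   # (B, Tm) per-token log-probs of the intermediate response
    policy_logps_l,   # (B, Tl) per-token log-probs of the dispreferred response
    ref_logps_w,      # same shapes under the frozen reference model
    ref_logps_l,
    weights_w,        # (B, Tw) token-importance weights (e.g. from gradient attribution)
    weights_l,        # (B, Tl)
    beta=0.1,
    margin=1.0,
    triplet_coef=0.5,
):
    # Token-importance-weighted implicit rewards, DPO-style but with per-token weights.
    r_w = beta * (weights_w * (policy_logps_w - ref_logps_w)).sum(dim=-1)
    r_l = beta * (weights_l * (policy_logps_l - ref_logps_l)).sum(dim=-1)
    dpo_term = -F.logsigmoid(r_w - r_l).mean()

    # Triplet term on length-normalized sequence log-likelihoods: keep the
    # intermediate response between the preferred and dispreferred ones.
    s_w = policy_logps_w.mean(dim=-1)
    s_m = policy_logps_m.mean(dim=-1)
    s_l = policy_logps_l.mean(dim=-1)
    triplet_term = (
        torch.clamp(margin - (s_w - s_m), min=0.0)
        + torch.clamp(margin - (s_m - s_l), min=0.0)
    ).mean()

    return dpo_term + triplet_coef * triplet_term
```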