林海onrush
(2025-06-07 13:27):
#paper, Token-Importance Guided Direct Preference Optimization, arXiv: https://arxiv.org/abs/2505.19653. Sharing my latest work on fine-tuning algorithms for large models: we propose TI-DPO, a new method for better aligning large language models (LLMs) with human preferences. The widely used DPO (Direct Preference Optimization) dispenses with an explicit reward model and optimizes directly on human preference data, but it ignores that different tokens carry different importance in the generated content, so the model can err on critical tokens and produce outputs that conflict with human values.
TI-DPO addresses this with two key innovations (a minimal code sketch of both follows this list):
1. Token-level importance weights based on gradient attribution, which dynamically identify and prioritize the tokens most critical to human preference;
2. A contrastive-learning triplet loss that not only separates "good" from "bad" samples but also introduces "intermediate" outputs, making the optimization finer-grained and steering generations toward human-preferred responses and away from undesirable ones.
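To make the two components concrete, here is a minimal PyTorch sketch of one plausible implementation. The exact formulation lives in the paper; the function names (`token_importance`, `ti_dpo_loss`), the L2-norm gradient attribution, the margin-based triplet term, and the `beta`/`margin` values are all illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of TI-DPO's two components (assumptions, not the paper's code):
# (1) gradient-attribution token-importance weights;
# (2) a triplet preference loss over preferred/intermediate/dispreferred responses.
import torch
import torch.nn.functional as F

def token_importance(model, input_ids, attention_mask):
    """Per-token importance as the L2 norm of the gradient of the sequence
    log-likelihood w.r.t. each input token embedding, normalized per sequence.
    One plausible reading of 'gradient-based attribution'."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=attention_mask).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)          # position t predicts token t+1
    tok_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    tok_logp.sum().backward()
    w = embeds.grad.norm(dim=-1)[:, 1:]                   # gradient norm per token, [B, T-1]
    return (w / (w.sum(-1, keepdim=True) + 1e-8)).detach()

def ti_dpo_loss(pol_w, pol_m, pol_l, ref_w, ref_m, ref_l,
                imp_w, imp_m, imp_l, beta=0.1, margin=0.1):
    """pol_*/ref_*: per-token log-probs [B, T] under the policy / frozen reference
    for the preferred (w), intermediate (m), and dispreferred (l) responses;
    imp_*: matching token-importance weights. beta and margin are assumed values."""
    # Importance-weighted implicit rewards (DPO-style log-ratios).
    r_w = (imp_w * (pol_w - ref_w)).sum(-1)
    r_m = (imp_m * (pol_m - ref_m)).sum(-1)
    r_l = (imp_l * (pol_l - ref_l)).sum(-1)
    # Standard DPO term on the (preferred, dispreferred) pair ...
    dpo_term = -F.logsigmoid(beta * (r_w - r_l))
    # ... plus a triplet term enforcing the ordering r_w > r_m > r_l by a margin.
    triplet_term = F.relu(margin - (r_w - r_m)) + F.relu(margin - (r_m - r_l))
    return (dpo_term + triplet_term).mean()
```

In a training loop, `token_importance` would be run on the concatenated prompt+response token ids, and the per-token log-probs fed to `ti_dpo_loss` would come from the policy and a frozen reference model, as in a standard DPO pipeline.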
Experiments show that TI-DPO performs strongly across multiple tasks (e.g., TruthfulQA, IFEval), surpassing DPO and other alignment methods in both accuracy and generation diversity. Ablations further confirm that the token-importance mechanism and the triplet loss are each necessary and effective. Theoretical analysis also shows that TI-DPO enjoys a tighter lower bound on the loss, which makes training more stable. By attending to critical tokens in a fine-grained way and combining that with the triplet alignment structure, TI-DPO effectively improves alignment and output quality, offering a new approach to AI safety and helpfulness in human-AI interaction.
arXiv, 2025-05-26T08:11:24Z. DOI: 10.48550/arXiv.2505.19653
Token-Importance Guided Direct Preference Optimization
Ning Yang,
Hai Lin,
Yibo Liu,
Baoliang Tian,
Guoqing Liu,
Haijun Zhang
Abstract:
Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. While Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the importance weights of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to bias and still cannot fully address these issues. To solve this problem, we propose Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: gradient-based token-importance weights that dynamically prioritize critical tokens, and a triplet loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.