
LLMs (Pretrain): Translation and Commentary on "Reinforcement Pre-Training"

news 2025/7/5 8:48:21

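Overview: Reinforcement Pre-Training (RPT) is a new pre-training paradigm for LLMs. It reframes next-token prediction as a verifiable reasoning task and applies reinforcement learning with correctness-based rewards. RPT leverages large amounts of unannotated text, promotes deeper understanding and generalization, significantly improves next-token prediction accuracy, and provides a stronger pre-trained foundation for subsequent reinforcement fine-tuning. The experiments show that RPT performs well on mathematical and general reasoning benchmarks and scales favorably, offering a promising new direction for building more capable and more general LLMs.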

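>> Background and Pain Points

● How LLM capability scales today: large language models (LLMs) show remarkable capabilities across a wide range of tasks, but those gains rest largely on the scalability of the next-token prediction objective over vast text corpora.

● Limits of current RL use: reinforcement learning (RL) can fine-tune LLMs to align them with human preferences or to strengthen specific skills, but its current application in LLM training faces scalability and generality challenges.

● Shortcomings of RL from human feedback: reinforcement learning from human feedback is effective, but it depends on costly human preference data, and its learned reward models are susceptible to reward hacking, which limits scalability.

● Constraints of RL with verifiable rewards: reinforcement learning with verifiable rewards (RLVR) mitigates reward hacking, but it is typically constrained by the scarcity of annotated data with verifiable answers. This restricts it to domain-specific fine-tuning rather than general-purpose pre-training.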

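>> Solution

● Reinforcement Pre-Training (RPT): a new paradigm that bridges the gap between scalable self-supervised pre-training and the power of reinforcement learning.

● Next-token reasoning task: the basic next-token prediction task is reframed as a next-token reasoning process. For any given context in the pre-training corpus, the model is incentivized to reason about the subsequent token before predicting it.

● Verifiable intrinsic reward: the model receives a verifiable, intrinsic reward based on whether its prediction matches the ground-truth next token from the corpus itself.

● Large-scale general-purpose RL dataset: the vast unannotated text normally used for next-token prediction is converted into a large-scale, general-purpose RL dataset, with no external annotation and no domain-specific reward function.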
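As a rough illustration of how unannotated text can become RL training examples, the sketch below pairs every context prefix with the ground-truth continuation that follows it. The function name `make_rl_examples` and the whitespace "tokenizer" are assumptions made for illustration only; the paper's actual tokenization and data pipeline are not described in this excerpt.

```python
from typing import Iterator, List, Tuple


def make_rl_examples(
    tokens: List[str], min_context: int = 1
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (context, ground_truth_continuation) pairs, one per position.

    Every position in the text supplies its own label (the tokens that
    actually follow), so no external annotation is needed.
    """
    for t in range(min_context, len(tokens)):
        yield tokens[:t], tokens[t:]


# Hypothetical whitespace tokenization, used only to keep the sketch small.
corpus_tokens = "the cat sat on the mat".split()
examples = list(make_rl_examples(corpus_tokens))

for context, continuation in examples:
    print(context, "->", continuation[0])
```

Each pair can then be scored by a rule-based check against the continuation, which is what makes the reward verifiable without a learned reward model.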

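>> Core Idea and Steps

● Next-token reasoning: before predicting the next token, the model generates a chain-of-thought reasoning sequence.

● On-policy reinforcement learning: the LLM is trained to perform next-token reasoning with on-policy RL.

● Reward mechanism: a prefix-matching reward verifies the correctness of the prediction. The reward is 1 if the predicted byte sequence is an exact prefix of the ground-truth completion sequence and its length matches a valid token boundary; otherwise it is 0.

● Data filtering: Deepseek-R1-Distill-Qwen-1.5B serves as a small proxy model. For each token position, the proxy model's entropy over the top-16 next tokens is computed, and an entropy threshold filters out low-entropy positions, so that training prioritizes challenging tokens that require more computational effort to predict.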
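The reward and filtering steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The function names, the choice to renormalize the top-k probabilities before computing entropy, and the threshold value are all assumptions, and the proxy model's probabilities are passed in as plain lists instead of being computed by a real model.

```python
import math
from typing import List, Sequence


def prefix_matching_reward(
    prediction: bytes, continuation_tokens: Sequence[bytes]
) -> int:
    """Return 1 if `prediction` is an exact byte prefix of the ground-truth
    continuation and its length lands on a token boundary, else 0."""
    if not prediction:
        return 0
    continuation = b"".join(continuation_tokens)
    if not continuation.startswith(prediction):
        return 0
    boundary = 0  # cumulative byte length marks each valid token boundary
    for tok in continuation_tokens:
        boundary += len(tok)
        if boundary == len(prediction):
            return 1
        if boundary > len(prediction):
            break
    return 0


def top_k_entropy(probs: Sequence[float], k: int = 16) -> float:
    """Entropy in nats over the top-k probabilities (renormalized; assumption)."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def keep_hard_positions(
    per_position_probs: Sequence[Sequence[float]],
    threshold: float,
    k: int = 16,
) -> List[int]:
    """Indices of positions whose proxy entropy reaches the threshold."""
    return [
        i
        for i, probs in enumerate(per_position_probs)
        if top_k_entropy(probs, k) >= threshold
    ]


# Hypothetical ground-truth continuation, already split into token bytes.
truth = [b" on", b" the", b" mat"]
print(prefix_matching_reward(b" on the", truth))  # ends on a token boundary
print(prefix_matching_reward(b" on t", truth))    # prefix, but mid-token
```

The token-boundary check is what separates this reward from a plain string-prefix test: a prediction that stops in the middle of a ground-truth token earns nothing, even though its bytes match.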

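>> Advantages

● Scalability and generality: RPT uses the same vast unannotated text as standard next-token prediction and converts it into a large-scale, general-purpose RL dataset without external annotation.

● Minimal reward-hacking risk: a direct, rule-based reward signal (the correctness of the predicted next token) inherently minimizes the reward-hacking risk associated with complex, learned reward models.

● Deeper understanding and generalization: by explicitly encouraging next-token reasoning patterns, RPT promotes deeper understanding and generalization instead of mere memorization of next tokens.

● Higher next-token prediction accuracy: the internal reasoning process lets the model allocate more "thought", or computational effort, to each prediction step. This resembles a form of inference-time scaling applied at training time for each token, and it directly improves next-token prediction accuracy.

● A better pre-trained foundation: RPT provides a stronger starting point for subsequent reinforcement fine-tuning, which leads to better final task performance.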

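>> Conclusions and Takeaways

● RPT significantly improves next-token prediction accuracy.

● RPT provides a stronger pre-trained foundation for subsequent reinforcement fine-tuning, which leads to better final task performance.

● Increasing training compute consistently improves next-token prediction accuracy, indicating RPT's potential as a sustainable scaling strategy.

● After further RLVR training, RPT models reach a higher performance ceiling.

● Continued training on the same data with the standard next-token prediction objective significantly degrades the model's reasoning ability.

● RPT-14B consistently outperforms R1-Distill-Qwen-14B on all benchmarks.

● RPT-14B's next-token reasoning process differs markedly from R1-Distill-Qwen-14B's problem-solving process: the former makes heavier use of hypothesis and deduction patterns.

● Clear prompts significantly improve initial correctness.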

Translation and Commentary on "Reinforcement Pre-Training"

Abstract

1. Introduction

Conclusion

Translation and Commentary on "Reinforcement Pre-Training"

Link	[2506.08007] Reinforcement Pre-Training
Date	June 9, 2025
Authors	Microsoft, Peking University, Tsinghua University

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.

Figure 1: Reinforcement pre-training (RPT) reframes next-token prediction as a reasoning task, where the language model is incentivized via reinforcement learning (RL) to reason about and correctly predict the next token. The proposed approach allows RL to be scaled to the web-text corpus. The image of the cherry-on-top cake is taken from LeCun’s slides LeC (16).

1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, largely driven by the scalability of the next-token prediction objective on vast text corpora. This self-supervised paradigm has proven to be an effective general-purpose pre-training approach. Concurrently, reinforcement learning (RL) has emerged as a powerful technique for fine-tuning LLMs, aligning them with human preferences or enhancing specific skills such as complex reasoning OWJ+ (22); JKL+ (24); GYZ+ (25).

However, current applications of RL in LLM training face scalability and generality challenges. Reinforcement learning from human feedback OWJ+ (22), while effective for alignment, relies on costly human preference data, and its learned reward models can be susceptible to reward hacking, limiting scalability. Alternatively, reinforcement learning with verifiable rewards (RLVR) LMP+ (25) utilizes objective, rule-based rewards, often from question-answer pairs. While this mitigates reward hacking, RLVR is typically constrained by the scarcity of annotated data with verifiable answers, restricting its application to domain-specific fine-tuning rather than general-purpose pre-training.

In this work, we introduce reinforcement pre-training (RPT), a novel paradigm that bridges the gap between scalable self-supervised pre-training and the power of reinforcement learning. RPT reframes the fundamental next-token prediction task as a next-token reasoning process. For any given context in a pre-training corpus, the model is incentivized to reason about the subsequent token before predicting it. It receives a verifiable, intrinsic reward based on the correctness of its prediction against the ground-truth next token from the corpus itself. This approach transforms the vast, unannotated text data typically used for next-token prediction into a massive dataset for general-purpose RL, without requiring external annotations or domain-specific reward functions.

This approach offers several crucial advantages. First, RPT is inherently scalable and general-purpose: it leverages the same vast, unannotated text data used for standard next-token prediction, transforming it into a massive dataset for general-purpose RL without requiring external annotations. Second, the use of direct, rule-based reward signals (i.e., the correctness of the predicted next token) inherently minimizes the risk of reward hacking often associated with complex, learned reward models. Third, by explicitly encouraging next-token reasoning patterns, RPT promotes deeper understanding and generalization instead of merely memorizing next tokens. The model learns to explore and validate hypotheses about why a certain token should follow, fostering more robust representations. Finally, the internal reasoning process during pre-training effectively allows the model to allocate more “thought” or computational effort to each prediction step, akin to a form of inference-time scaling applied at training time for each token, which directly contributes to improved next-token prediction accuracy.
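To make the training signal concrete, the sketch below scores a group of sampled predictions for one context with the 0/1 correctness reward and centers the rewards on the group mean. The excerpt says only that on-policy RL is used and does not name the algorithm, so the group-mean baseline, the function names, and the sample predictions are all assumptions chosen for brevity.

```python
from typing import List


def correctness_rewards(predictions: List[str], ground_truth: str) -> List[float]:
    """Rule-based reward: 1.0 for a correct next-token prediction, else 0.0."""
    return [1.0 if p == ground_truth else 0.0 for p in predictions]


def group_centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean so correct rollouts get a positive signal
    and incorrect ones a negative signal (assumed baseline, see above)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical final predictions from four reasoning rollouts on one context.
predictions = ["mat", "rug", "mat", "floor"]
rewards = correctness_rewards(predictions, ground_truth="mat")
advantages = group_centered_advantages(rewards)
print(rewards)
print(advantages)
```

Because the reward depends only on whether the final prediction matches the corpus, the model gets no credit for reasoning that merely sounds plausible, which is the property the paragraph above credits with limiting reward hacking.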

Our experiments demonstrate that RPT significantly improves the accuracy of predicting next tokens. RPT also provides a more robust pre-trained foundation for subsequent reinforcement fine-tuning, leading to better final task performance. The scaling curves reveal that increased training compute under the RPT framework consistently improves next-token prediction accuracy, indicating its potential as a sustainable scaling strategy. These results position reinforcement pre-training as an effective and promising new paradigm to advance the pre-training of large language models.

Our contributions are summarized as follows:

• We introduce reinforcement pre-training (RPT), a new scaling paradigm that reframes next-token prediction as a reasoning task trained with reinforcement learning, utilizing intrinsic verifiable rewards derived directly from the pre-training corpus.

• RPT offers a scalable and general-purpose approach to RL pre-training, minimizing reward hacking through rule-based rewards and promoting generalization by encouraging next-token reasoning patterns over rote memorization.

• RPT significantly improves next-token prediction accuracy and exhibits favorable scaling properties, where performance consistently improves with increased training compute.

• RPT yields a stronger pre-trained foundation for subsequent reinforcement fine-tuning and enhances zero-shot performance on various downstream tasks.

Conclusion

We introduce reinforcement pre-training (RPT), a novel paradigm for pre-training large language models. By framing next-token prediction as a verifiable reasoning task and applying reinforcement learning with correctness-based rewards, RPT allows LLMs to leverage extended computation during pre-training to build stronger foundational reasoning capabilities. Our experiments demonstrate that RPT improves next-token prediction, enhances performance on mathematical and general reasoning benchmarks in zero-shot settings, and provides a better starting point for further RL fine-tuning. RPT offers a promising new direction for developing more capable and generally intelligent LLMs by fundamentally rethinking the pre-training objective itself.

While promising, this initial exploration of RPT has certain limitations. Our experiments are primarily conducted using a 14B parameter model. Although the RPT methodology is designed to be general, the current pre-training corpus predominantly consists of mathematical documents; future work will explore its efficacy on broader, general-domain text. Furthermore, RPT training is initialized from a reasoning model; investigating RPT training from a standard base language model would provide further insights into its foundational impact.

The work can be advanced from the following perspectives. We would like to scale up the training corpus, including data size, and domain coverage. Large-scale general Internet data can be utilized during reinforcement pre-training. We will also scale up training compute to push the frontier. Moreover, we can establish scaling laws for reinforcement pre-training to guide the scaling of large language models. Additionally, we are interested in integrating hybrid thinking JWH+ (25) with RPT to enable fine-grained adaptive thinking by adaptively triggering next-token reasoning.
