
NVIDIA Researchers Propose Reinforcement Learning Pretraining (RLP): Reinforcement as a Pretraining Objective for Building Reasoning During Pretraining



Why this matters technically: unlike prior “reinforcement pretraining” variants that rely on sparse, binary correctness signals or proxy filters, RLP’s dense, verifier-free reward assigns position-wise credit wherever thinking improves prediction, enabling updates at every token position across ordinary web-scale corpora without external verifiers or curated answer keys.
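To make the reward concrete, here is a minimal PyTorch sketch of a per-position information-gain reward, assuming the policy scores the next token conditioned on the context plus a sampled chain-of-thought while an EMA “no-think” teacher scores it from the context alone. Tensor shapes and function names are illustrative assumptions, not the paper’s reference implementation.

```python
import torch
import torch.nn.functional as F

def information_gain_reward(policy_logits, ema_logits, next_tokens):
    """Per-position reward: how much a sampled chain-of-thought improves
    next-token log-likelihood over a no-think EMA baseline.

    policy_logits: [T, V] logits for the next tokens, conditioned on context + CoT
    ema_logits:    [T, V] logits from the EMA teacher, conditioned on context only
    next_tokens:   [T]    ground-truth next tokens from the pretraining stream
    """
    logp_think = F.log_softmax(policy_logits, dim=-1).gather(
        -1, next_tokens.unsqueeze(-1)).squeeze(-1)   # log p(x_t | ctx, CoT)
    logp_nothink = F.log_softmax(ema_logits, dim=-1).gather(
        -1, next_tokens.unsqueeze(-1)).squeeze(-1)   # log p_EMA(x_t | ctx)
    # Dense, position-wise credit: positive wherever thinking helps prediction.
    return logp_think - logp_nothink
```

Because the reward is a difference of log-likelihoods on observed text, no external grader or answer key is needed; every token position can contribute a learning signal.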

Understanding the Results

Qwen3-1.7B-Base: Pretraining with RLP improved the overall math+science average by ~19% vs the base model and ~17% vs compute-matched continuous pretraining (CPT). After identical post-training (SFT + RLVR) across all variants, the RLP-initialized model retained a ~7–8% relative advantage, with the largest gains on reasoning-heavy benchmarks (AIME25, MMLU-Pro).

Nemotron-Nano-12B v2: Applying RLP to a 12B hybrid Mamba-Transformer checkpoint raised the overall average from 42.81% to 61.32% and delivered an absolute +23% gain on scientific reasoning, even though the RLP run used ~200B fewer tokens (training for 19.8T vs 20T tokens; RLP applied for 250M tokens). This highlights data efficiency and architecture-agnostic behavior.

https://github.com/NVlabs/RLP/blob/main/pdf/RLP_Reinforcement_as_a_Pretraining_Objective.pdf

RPT comparison: Under matched data and compute with Omni-MATH-style settings, RLP outperformed RPT on math, science, and overall averages, a result attributed to RLP’s continuous information-gain reward versus RPT’s sparse binary signal and entropy-filtered tokens.


Positioning vs. Post-Training RL and Data Curation

Reinforcement Learning Pretraining (RLP) is orthogonal to post-training pipelines (SFT, RLVR) and shows compounding improvements after standard alignment. Because the reward is computed from model log-evidence rather than external verifiers, it scales to domain-agnostic corpora (web crawl, academic text, textbooks) and SFT-style reasoning corpora, avoiding the brittleness of narrow curated datasets. In compute-matched comparisons (including CPT with 35× more tokens to match FLOPs), RLP still led on overall averages, suggesting the improvements derive from objective design, not budget.
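A common way to maintain the no-think baseline referenced above is a parameter-space exponential moving average of the policy itself. The sketch below shows that standard EMA update; the decay value and function name are illustrative assumptions, and the paper may use a different baseline schedule.

```python
import torch

@torch.no_grad()
def update_ema_baseline(policy, ema_model, decay=0.999):
    """Slowly track the policy with an EMA copy that serves as the
    no-think baseline for the information-gain reward."""
    for p, p_ema in zip(policy.parameters(), ema_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```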

Key Takeaways

  • RLP makes reasoning a pretraining objective: sample a chain-of-thought before next-token prediction and reward it by information gain over a no-think EMA baseline.
  • Verifier-free, dense, position-wise signal: it works on ordinary text streams without external graders, enabling scalable pretraining updates on every token.
  • Qwen3-1.7B results: +19% vs Base and +17% vs compute-matched CPT during pretraining; with identical SFT+RLVR, RLP retains ~7–8% gains (largest on AIME25, MMLU-Pro).
  • Nemotron-Nano-12B v2: the overall average rises 42.81% → 61.32% (+18.51 pp; ~35–43% rel.) and +23 points on scientific reasoning, using ~200B fewer NTP tokens.
  • Training details that matter: update gradients only on thought tokens with a clipped surrogate and group-relative advantages; more rollouts (≈16) and longer thought lengths (≈2048) help; token-level KL anchoring adds no benefit (see the sketch after this list).
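As a rough illustration of those training details, the sketch below combines group-relative advantages, a PPO-style clipped surrogate, and a mask that restricts updates to thought tokens. Names, shapes, the reward aggregation, and the clip range are assumptions rather than the paper’s exact recipe.

```python
import torch

def rlp_thought_loss(logp_new, logp_old, rewards, thought_mask, clip_eps=0.2):
    """
    logp_new:     [G, T] log-probs of sampled thought tokens under the current policy
    logp_old:     [G, T] log-probs under the behavior (rollout) policy
    rewards:      [G]    per-rollout scalar reward (e.g. summed information gain)
    thought_mask: [G, T] 1 for thought tokens, 0 elsewhere (no updates on other tokens)
    G is the rollout group size (the article notes ~16 rollouts help).
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # [G]
    adv = adv.unsqueeze(-1)                                        # broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)                         # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)          # clipped surrogate

    # Restrict credit to thought tokens only, averaged over their count.
    loss = -(surrogate * thought_mask).sum() / thought_mask.sum().clamp_min(1.0)
    return loss
```

Note there is no KL penalty term here, consistent with the observation that token-level KL anchoring added no benefit.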

Conclusion

RLP reframes pretraining to directly reward “think-before-predict” behavior using a verifier-free, information-gain signal, yielding robust reasoning gains that persist through identical SFT+RLVR and extend across architectures (Qwen3-1.7B, Nemotron-Nano-12B v2). The method’s objective, which contrasts CoT-conditioned likelihood against a no-think EMA baseline, integrates cleanly into large-scale pipelines without curated verifiers, making it a practical upgrade to next-token pretraining rather than a post-training add-on.


Check out the Paper, Code, and Project Page. Feel free to visit our GitHub Page for Tutorials, Codes, and Notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. On Telegram? You can join us there as well.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.

