Nitrobrew: Fast, Lossless Distillation for Free
TL;DR
- Distillation, especially on-policy distillation, has become a crucial component in reasoning model post-training workflows.
- Logit distillation at modern vocabulary sizes is bottlenecked by communication and memory, not compute.
- Nitrobrew exploits the fact that teacher logits are generated from a much lower-dimensional hidden state through the unembedding matrix.
- Sends hidden states instead of logits (up to 60× less communication)
- Computes divergence online without materializing the full logit tensor (37× less memory).
- Nitrobrew is fully lossless and exact, unlike top-k style approximations.
- For NemoRL and VeRL, Nitrobrew achieves 1.5–3× faster end-to-end throughput for on-policy distillation.
- We open-source Nitrobrew and a VeRL PR for support.
- We also investigate spectral compression approaches for even stronger savings.

Note: Shortly before this post went live, DeepSeek-V4 independently reported the same core idea for full-vocabulary OPD: caching teacher hidden states and reconstructing logits on the fly. Nitrobrew was developed independently and differs in several respects: open-source framework integrations (NeMo RL, VeRL), explicit online divergence algorithms for forward/reverse KL and JSD, detailed profiling across model scales, and spectral compression experiments (SVD-Nitrobrew) for further reducing communication below $d$ floats per position.
Introduction
In distillation, a student model is trained to reproduce the behaviour of a more capable teacher model. The teacher learns a compressed representation of the data distribution, which is then transferred to the student through supervision on soft labels, logits, or hidden states [1] [2]. Soft teacher outputs often provide richer supervision than hard labels because they expose the teacher's learned approximation to the noisy data-generating process, including its uncertainty, calibration, and relative preferences among alternatives [1].
Different distillation methods can broadly be classified along two axes: data source (off-policy, or on-policy) and training objective (token-level or sequence-level [3]). We will mostly focus on the first axis, which usually has a larger impact on the resulting implementation complexity.
Off-policy distillation trains the student on a fixed dataset of teacher outputs. Logit distillation typically minimizes a KL divergence between the teacher and student distributions, while sequence-level distillation uses cross-entropy on teacher-generated target sequences. Implementing off-policy distillation usually requires only small modifications to standard training pipelines, but full-logit losses can be prohibitively expensive for modern vocabulary sizes and model scales [4] [5].
Off-policy vs on-policy distillation. On-policy distillation leverages student-generated rollouts.
On-policy distillation (OPD) instead trains the student on its own rollouts [6] [7]. Because the rollout distribution depends on the current student, fresh data must be generated at every step (see footnote 1) rather than precomputed, and student generation and teacher scoring must alternate throughout training.
In distributed on-policy setups, the student and teacher often run on different devices because of memory constraints or parallelism strategy. This adds an additional communication overhead: the teacher's logits must be sent to the same device as the student logits to compute the loss each step. The size of the logit vectors scales with the vocabulary size, which is often quite large (usually 128k to 200k tokens). To reduce this overhead, traditional approaches subsample logits (e.g. take the top-k logits) but this results in its own set of pathologies, which we discuss further in The Pathology of Top-k Truncation.
We introduce Nitrobrew, which addresses communication costs without introducing any of the pathologies of top-$k$, and avoids full-logit materialization in both the on- and off-policy settings. In the on-policy setting, Nitrobrew communicates the teacher's hidden state (concurrent with DeepSeek-V4, which independently adopts the same principle), which is much smaller than the $V$-dimensional logit vector. For loss computation, Nitrobrew computes teacher and student logits tile-by-tile inside a fused online divergence kernel, avoiding materialization of the full logit tensor. In the off-policy and self-distillation settings [8] [9], Nitrobrew can store cached teacher hidden states instead of full logits, reconstructing teacher logits on the fly during the divergence computation. We find that Nitrobrew improves OPD throughput by 3x for a fixed communication budget while being fully lossless. We open-source our code and have opened a PR to VeRL for Nitrobrew support.
The Pathology of Top-k Truncation
Top-$k$ is the standard lossy approach for further reducing the cost of (on-policy) distillation.
Overview of existing approaches to OPD. Existing approaches are forced to make approximation errors in order to enjoy favorable compute/communication/memory tradeoffs.
The student and teacher logit tensors are compressed along the vocabulary dimension by discarding all but the $k$ largest floats. We highlight two significant problems with this approach:
The distortion is input-dependent and discontinuous. The rank-$k$ and rank-$(k+1)$ tokens may have nearly identical probabilities, but top-$k$ keeps one and completely discards the other. This discontinuity means that a token may go from providing no signal (masked by top-$k$) to providing a large signal because of a small perturbation in the logits. Furthermore, renormalization of the kept tokens will inflate gradients when the original teacher distribution had high entropy. The student will thus receive stronger updates when the teacher is uncertain, which is undesirable.
Most of the teacher's calibration lives in the tail of the distribution, which is discarded. The long tail of low-probability tokens encodes the teacher's uncertainty and calibration, including its implicit ranking of alternatives, and soft knowledge about what not to predict [1]. The student will only learn from those tokens about which the teacher is most confident, which are probably easiest to predict.
Top-k truncation + renormalization results in a distorted distribution. It is not an unbiased estimator of the true distribution.
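The discontinuity described above can be seen on a toy example. The sketch below is illustrative only: the four-token logit vectors and the helper names `softmax` and `topk_renormalize` are assumptions made for this example, not part of Nitrobrew. A perturbation of 0.02 on a single logit barely moves the full distribution, yet it swaps which token survives top-2 truncation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_renormalize(probs, k):
    """Keep the k largest probabilities, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical teacher logits: tokens 1 and 2 are almost tied.
base = [2.0, 1.0, 0.99, -1.0]
nudged = [2.0, 1.0, 1.01, -1.0]   # token 2 moves by 0.02 and overtakes token 1

full_base, full_nudged = softmax(base), softmax(nudged)
trunc_base = topk_renormalize(full_base, k=2)
trunc_nudged = topk_renormalize(full_nudged, k=2)

# Largest change in any token's probability, before and after truncation.
full_shift = max(abs(a - b) for a, b in zip(full_base, full_nudged))
trunc_shift = max(abs(a - b) for a, b in zip(trunc_base, trunc_nudged))
print(f"full-distribution shift: {full_shift:.4f}")
print(f"top-2 target shift:      {trunc_shift:.4f}")
```

The full distributions differ by well under one percentage point per token, while the truncated targets move more than a quarter of the probability mass from one token to another.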
Previous work has identified several shortcomings of top-$k$ truncation for distillation. In particular, top-$k$ truncation results in miscalibrated student distributions, catastrophic forgetting and a lack of generalization [10][11]. NVIDIA Minitron directly found that small $k$ resulted in a large accuracy drop, and that larger $k$ still never outperformed full logits [5].
We seek an approach that addresses these issues, while still achieving the communication/memory benefits of top-$k$ truncation.
Introducing Nitrobrew: Efficient, Lossless Distillation
For a vocabulary of size $V$, batch size $B$ and sequence length $T$, naive on-policy distillation requires materializing, communicating, and computing divergences over tensors of shape $B \times T \times V$. This adds three distinct overheads on top of the student and teacher forward passes: (1) communicating the teacher's logits, (2) HBM allocation for the teacher and student logit tensors and (3) an element-wise divergence computation between the logits. Nitrobrew addresses all three of these issues: (1) by communicating hidden states instead of logits, and (2) & (3) with a fused online divergence kernel. Our approach is illustrated below.
Unlike previous approaches, Nitrobrew is both lossless and memory/compute efficient.
One of the core observations motivating Nitrobrew is that the unembedding matrix is low-rank relative to the vocabulary size. Recall that logits are computed by taking the final layer's hidden state $h \in \mathbb{R}^{d}$ and applying the unembedding matrix $W_U \in \mathbb{R}^{V \times d}$:

$$z = W_U h.$$

The logit vector $z$ lives in the column space of $W_U$, which has rank at most $d$. In modern transformers, it is usually the case that $d \ll V$. For example, Qwen3.5-397B-A17B has $d = 4096$ and a vocabulary roughly $60\times$ larger. In this case, the rank of the unembedding matrix is at most 4096, so the logits lie in a $d$-dimensional subspace of $\mathbb{R}^{V}$. Luckily, we already have access to a compressed version of $z$: the teacher's final hidden states.
Nitrobrew exploits this fact by communicating $h$ and reconstructing $z = W_U h$ locally using a copy of the teacher's unembedding, $W_U$. It reconstructs logits directly on the device where the loss is computed, adding no asymptotic computation over the standard full-logit loss, while avoiding communication of the full $V$-dimensional teacher logit vector. For Qwen3.5-397B-A17B, this yields a 60x reduction in communication overhead.
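As a sanity check on the low-rank argument, here is a minimal NumPy sketch with toy sizes. The dimensions and random matrices are assumptions for illustration; a real teacher would supply its own unembedding and hidden states. It shows that the stacked logits have rank at most the hidden dimension, and that shipping hidden states and re-applying the unembedding at the receiver reproduces the logits.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_pos = 512, 32, 100            # toy sizes, far smaller than a real model
W_U = rng.standard_normal((V, d))     # stand-in for the teacher unembedding
H = rng.standard_normal((n_pos, d))   # teacher final hidden states, one row per position

# Naive path: the teacher materializes logits and ships V floats per position.
Z_sent = H @ W_U.T

# Hidden-state path: ship d floats per position and rebuild the logits at the
# receiver, which holds its own copy of W_U.
payload = H.copy()
Z_rebuilt = payload @ W_U.T

logit_rank = np.linalg.matrix_rank(Z_sent)   # cannot exceed d
floats_ratio = V / d                         # per-position communication saving
print(logit_rank, floats_ratio, np.allclose(Z_sent, Z_rebuilt))
```

The saving is exactly the ratio of vocabulary size to hidden dimension, which is why larger vocabularies make the hidden-state path more attractive.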
This approach also provides a straightforward way to reduce the memory bottleneck introduced by the divergence computation. We fuse the teacher and student unembedding matmuls into the divergence kernel: rather than constructing the full logit tensors and then computing their divergence, we tile over the last (vocabulary) dimension and process one VBLK-sized chunk at a time. Within each tile, teacher and student logits are computed on the fly, consumed by the running divergence accumulators, and then discarded. This reduces peak memory cost from $O(BTV)$ to $O(BT \cdot \text{VBLK})$ because it avoids ever materializing the full logit tensors. The full Nitrobrew algorithm is given below.
Cost Anatomy of On-Policy Distillation
Before presenting the online kernel, it is worth understanding where the time and memory go in a single on-policy distillation step. As a concrete example, consider distilling a Qwen3-32B teacher into a Qwen3-8B student, with batch size $B$, sequence length $T$, vocabulary size $V \approx 152\text{k}$ and hidden dimension $d = 4096$.
The table below breaks down the three dominant costs — communication, memory, and compute — for each method. We assume bf16 for communication and fp32 for the divergence working set.
| Aspect | Naive | Top-k (k=128) | Nitrobrew | SVD-Nitrobrew (k=512) |
|---|---|---|---|---|
| Floats sent per position | V (152k) | 2k (256)¹ | d (4096) | k (512) |
| Bytes per step | 9.4 GB | 15.8 MB | 256 MB | 32 MB |
| Divergence working set | 2×BTV (37.6 GB) | 2×BTk (32 MB)² | 2×BT·VBLK (1 GB) | 2×BT·VBLK (1 GB) |
| Full-vocabulary signal | Yes | No | Yes | Approximate |
| Offline precomputation | None | None | Copy teacher W_U to student | SVD of teacher W_U |
Communication is the most variable and severe cost. The naive approach sends 9.4 GB of logits per step. At 50 GB/s inter-node bandwidth, that is 190 ms of pure transfer time. Nitrobrew sends only 256 MB, which is more than an order of magnitude less communication. Of note, the distinction matters most in the inter-node large-scale regime, where interconnects are the bottleneck. Intranode, NVLink bandwidth is more than high enough to tolerate the naive approach.
Memory is where Nitrobrew's fused kernel has the largest impact. The naive divergence computation requires both the teacher and student logit tensors to be resident simultaneously. At 37.6 GB, this alone can exceed the free HBM on an 80 GB GPU after model weights and activations are loaded. Nitrobrew reduces peak memory usage to only 1 GB.
While theoretical compute is roughly equivalent across methods, we will see that kernel compilation enables tiled, coupled matmuls to get strong throughput gains at long context.
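The transfer-time figures quoted above follow directly from the "Bytes per step" row of the table. The short sketch below redoes that arithmetic, assuming decimal units (1 GB = 1e9 bytes) and the 50 GB/s inter-node bandwidth stated in the text; the helper name `transfer_ms` is introduced here for illustration.

```python
def transfer_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Pure transfer time in milliseconds, ignoring latency and overlap."""
    return 1e3 * num_bytes / bandwidth_bytes_per_s

BANDWIDTH = 50e9  # 50 GB/s inter-node, as assumed in the text

# Values copied from the "Bytes per step" row of the table.
bytes_per_step = {
    "naive": 9.4e9,
    "top-k (k=128)": 15.8e6,
    "nitrobrew": 256e6,
    "svd-nitrobrew (k=512)": 32e6,
}

times = {name: transfer_ms(b, BANDWIDTH) for name, b in bytes_per_step.items()}
comm_ratio = bytes_per_step["naive"] / bytes_per_step["nitrobrew"]
mem_ratio = 37.6e9 / 1e9  # naive vs. tiled divergence working set, from the table

for name, ms in times.items():
    print(f"{name:>22}: {ms:8.2f} ms")
print(f"communication ratio naive/nitrobrew: {comm_ratio:.1f}x")
print(f"working-set ratio naive/nitrobrew:   {mem_ratio:.1f}x")
```

Both ratios come out near 37x for this configuration, since each is essentially the vocabulary size divided by the hidden dimension (or tile width).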
Online Divergence
We describe our online divergence kernel for the forward KL case, but the algorithm is analogous for other divergences like reverse KL or Jensen-Shannon Divergence (JSD), which just require some additional bookkeeping. Recall that for $p = \mathrm{softmax}(z^{s})$ and $q = \mathrm{softmax}(z^{t})$, forward KL decomposes as

$$\mathrm{KL}(p \,\|\, q) = \sum_{v} p_v \left(z^{s}_v - z^{t}_v\right) - \log Z_s + \log Z_t,$$

where $Z_s = \sum_v e^{z^{s}_v}$ and $Z_t = \sum_v e^{z^{t}_v}$ are the partition functions of $p$ and $q$ respectively. Every term here is either a log-sum-exp or an expectation under the softmax, and can be computed in a single streaming pass over the vocabulary, using the online softmax/log-sum-exp trick from FlashAttention [12], analogous to fused large-vocabulary cross-entropy kernels that avoid full-logit materialization [13].
We maintain six running accumulators as we tile over vocabulary chunks of size VBLOCK. The entire computation touches each logit exactly once and stores $O(1)$ state per position. When fused with the unembedding matmuls from the Nitrobrew framework, the logits are computed tile-by-tile as $x_s W_s[v_0{:}v_1]^\top$ and $x_t W_t[v_0{:}v_1]^\top$ and immediately consumed by the accumulator updates. The only tensors that survive are the six scalars per position. The algorithms for online reverse KL and JSD can be found in Appendix B.
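To make the six accumulators concrete, here is a minimal pure-Python sketch for a single position, checked against a direct computation. It is a scalar illustration under assumptions: the function names `online_kl` and `direct_kl` are introduced here, and the accumulator names `ms`, `mt`, `ss`, `st`, `ts`, `us` are chosen to mirror the torch listing in this post. It computes KL(softmax(zs) || softmax(zt)), which is the quantity that listing returns.

```python
import math
import random

def online_kl(zs, zt, block):
    """Stream over the vocabulary in blocks; returns KL(softmax(zs) || softmax(zt))."""
    ms = mt = -math.inf   # running maxima of zs and zt
    ss = st = 0.0         # running sums of exp(z - running max)
    ts = us = 0.0         # running sums of exp(zs - ms) * zs and exp(zs - ms) * zt
    for v0 in range(0, len(zs), block):
        bs, bt = zs[v0:v0 + block], zt[v0:v0 + block]
        # First-distribution update: rescale old sums to the new maximum.
        new_ms = max(ms, max(bs))
        a = math.exp(ms - new_ms)
        w = [math.exp(z - new_ms) for z in bs]
        ss = ss * a + sum(w)
        ts = ts * a + sum(wi * z for wi, z in zip(w, bs))
        us = us * a + sum(wi * z for wi, z in zip(w, bt))
        ms = new_ms
        # Second-distribution update: only its log-partition is needed.
        new_mt = max(mt, max(bt))
        st = st * math.exp(mt - new_mt) + sum(math.exp(z - new_mt) for z in bt)
        mt = new_mt
    log_zs = ms + math.log(ss)
    log_zt = mt + math.log(st)
    return (ts - us) / ss - log_zs + log_zt

def direct_kl(zs, zt):
    """Reference: materialize both log-softmax vectors, then sum."""
    def log_softmax(z):
        m = max(z)
        lse = m + math.log(sum(math.exp(x - m) for x in z))
        return [x - lse for x in z]
    ls, lt = log_softmax(zs), log_softmax(zt)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))

gen = random.Random(0)
zs = [gen.gauss(0.0, 3.0) for _ in range(1000)]
zt = [gen.gauss(0.0, 3.0) for _ in range(1000)]
kl_stream = online_kl(zs, zt, block=64)   # 1000 is not a multiple of 64
kl_direct = direct_kl(zs, zt)
print(kl_stream, kl_direct)
```

The block size only changes how much of the vocabulary is live at once; the result agrees with the direct computation up to floating-point rounding.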
Kernel Implementation
We experimented with optimized Triton and TileLang kernels for Nitrobrew. Similar to flash attention, we can fuse large matrix multiplies (unembeds) with their subsequent nonlinearities to avoid excessive roundtrips to HBM.
We ultimately found, however, that for larger models/longer context lengths, an optimized torch implementation was superior. The savings from full fusion are dwarfed by the efficiency of cuBLAS matmuls for the unembedding projections. A hybrid approach, where only the online KL itself is fused, is a low-hanging direction for future work. For reference, a similar approach has found success in the fused_linear_crossentropy kernel from Quack written in the CuTE DSL.
Below is the example forward pass for the performant torch implementation used.
```python
from typing import Tuple

import torch


@torch.compile
def _nitrobrew_fwd_chunked(
    xs: torch.Tensor,
    xt: torch.Tensor,
    ws: torch.Tensor,
    wt: torch.Tensor,
    temperature: float,
    chunk_V: int = 4096,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Chunked forward online-softmax KL.

    xs: [N, D_s] student hidden states
    xt: [N, D_t] teacher hidden states
    ws: [V, D_s] student unembed
    wt: [V, D_t] teacher unembed
    Returns kl, logZs, logZt as flat [N] fp32 tensors.
    """
    N = xs.shape[0]
    V = ws.shape[0]
    inv_temp = 1.0 / temperature

    # Six fp32 accumulators per position:
    #   ms, mt: running max of student / teacher logits
    #   ss, st: running sum of exp(z - running max)
    #   ts, us: running sum of exp(zs - ms) * zs and exp(zs - ms) * zt
    ms = torch.full((N,), float("-inf"), dtype=torch.float32, device=xs.device)
    mt = torch.full((N,), float("-inf"), dtype=torch.float32, device=xs.device)
    ss = torch.zeros(N, dtype=torch.float32, device=xs.device)
    st = torch.zeros(N, dtype=torch.float32, device=xs.device)
    ts = torch.zeros(N, dtype=torch.float32, device=xs.device)
    us = torch.zeros(N, dtype=torch.float32, device=xs.device)

    for v0 in range(0, V, chunk_V):
        v1 = min(v0 + chunk_V, V)
        # Logits for this vocabulary tile only: [N, v1 - v0], temperature-scaled.
        zs_tile = torch.mm(xs, ws[v0:v1].T).float().mul_(inv_temp)
        zt_tile = torch.mm(xt, wt[v0:v1].T).float().mul_(inv_temp)

        # Student online softmax update: rescale old sums to the new max.
        tile_ms = zs_tile.max(dim=1).values
        new_ms = torch.maximum(ms, tile_ms)
        alpha_s = (ms - new_ms).exp_()
        ss.mul_(alpha_s)
        ts.mul_(alpha_s)
        us.mul_(alpha_s)
        p_tile = (zs_tile - new_ms.unsqueeze(1)).exp_()
        ss.add_(p_tile.sum(dim=1))
        ts.add_((p_tile * zs_tile).sum(dim=1))
        us.add_((p_tile * zt_tile).sum(dim=1))
        ms = new_ms

        # Teacher online softmax update: only the log-partition is needed.
        tile_mt = zt_tile.max(dim=1).values
        new_mt = torch.maximum(mt, tile_mt)
        alpha_t = (mt - new_mt).exp_()
        st.mul_(alpha_t)
        st.add_((zt_tile - new_mt.unsqueeze(1)).exp_().sum(dim=1))
        mt = new_mt

    logZs = ms + ss.log()
    logZt = mt + st.log()
    # (ts - us) / ss is the expectation of (zs - zt) under softmax(zs).
    kl = (ts - us) / ss - logZs + logZt
    return kl, logZs, logZt
```
Isolated Profiling Results
Below are the profiling results for the optimized chunked torch implementation on Hopper. The tested setting had a fixed vocab size , and for the model sizes.

Nitrobrew is significantly faster than the naive torch implementation with much lower peak memory demand. While shorter sequences do not benefit from tiling over the vocabulary dimension, longer sequence lengths are compute-bound thanks to the lack of full logit tensor materialization. At a practical sequence length of 16k, Nitrobrew accelerates distillation loss calculation for a 70B model by 100x with 50% less memory!
In addition to isolated profiling, we also implemented and tested Nitrobrew inside two popular RL frameworks: NemoRL [14] & VeRL [15].
On-policy Distillation Results
We first profiled the end-to-end step time in the NemoRL [14] framework. Details on the profiling setting can be found in Appendix C.
Profiling results under various distillation settings with NemoRL + Nitrobrew.
For d_model floats per token, which is the standard Nitrobrew setup, our approach is 1.5–4.5x faster in end-to-end wall clock time compared to top-k. Notably, even at very few floats per token, our approach outperforms top-k on speed. For a fixed step time, full Nitrobrew achieves equivalent throughput to top-k while sending an order of magnitude more floats — thereby preserving information and remaining fully lossless.
We can also analyze the breakdown of where time is spent within each step.
Breakdown of savings by computation category within step for sample distillation setting. Savings are concentrated in teacher inference and policy training.
As seen above, the fused, chunked KL implementation is doing most of the work at this scale. Distillation becomes communication bound at larger scales, but at smaller scales the primary savings come from eliminating the costly unembed + top-k from teacher inference, and unembed + unfused kl from student policy training. Similar breakdowns for other distillation setups can be found in Appendix D.
We also implemented Nitrobrew in VeRL, and found even stronger speedups. We are releasing a PR to VeRL for Nitrobrew support which can be found in Appendix A.
Profiling results under various distillation settings with VeRL + Nitrobrew.
For our direct OPD experiments, we trained the student Qwen 3 1.7B Base with the teacher Qwen 3 8B on the MATH dataset. We timed one iteration through the entire dataset, on the order of 50 steps. Further training details can be found in Appendix E.

The model achieves 3x the step throughput of a top-k approach, even with 4x the float communication. Nitrobrew is lossless and performs competitively with the top-k baseline. For further notes on convergence, refer to Appendix F.
Towards Spectral Logit Compression
Nitrobrew reduces the number of communicated floats from $V$ to $d$ completely losslessly, but in some practical settings, such as multi-node distillation with tight interconnect budgets, even $d$ floats per position may be too many. For this setting, we propose SVD-Nitrobrew, which uses an SVD of the teacher's unembedding to identify low-energy directions to discard.
We report here some preliminary investigations into spectral compression, but this direction warrants further research.
SVD-Nitrobrew
Consider the thin SVD of the teacher's unembedding matrix:

$$W_U = \mathbf{U}\,\mathbf{\Sigma}\,\mathbf{V}^\top, \qquad \mathbf{U} \in \mathbb{R}^{V \times d},\quad \mathbf{\Sigma} = \mathrm{diag}(\sigma_1 \ge \dots \ge \sigma_d),\quad \mathbf{V} \in \mathbb{R}^{d \times d}.$$

Empirically, the spectral energy of $W_U$ tends to be concentrated in the top few singular values, i.e. its effective rank tends to be much smaller than $d$. SVD-Nitrobrew exploits this fact by projecting the teacher's hidden state onto the top-$k$ right singular vectors of $W_U$ before it is communicated.
SVD-Nitrobrew
- Precompute (once, offline): the rank-$k$ truncated SVD $W_U \approx \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^\top$, where the columns of $\mathbf{V}_k \in \mathbb{R}^{d \times k}$ are the top-$k$ right singular vectors.
- Communicate: $c = \mathbf{V}_k^\top h \in \mathbb{R}^{k}$.
- Reconstruct: construct approximate logits $\hat z = \mathbf{U}_k \mathbf{\Sigma}_k\, c$ online in the divergence kernel as before.
Unlike top-k, this method produces a smooth, full-vocabulary approximation of the logits and does not require any renormalization. The reconstruction error is bounded by the energy of the spectral tail of $W_U$, which is small for $k$ sufficiently large. The per-token communication cost is exactly $k$ floats, which is the same as the top-k approach, and the SVD is done offline so it doesn't add any overhead.
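The three steps above can be sketched in NumPy as follows. This is a toy illustration under assumptions: a random matrix stands in for the teacher unembedding, and the helper names `encode` and `decode` are hypothetical. It also checks the standard operator-norm bound for truncated SVD, where the logit error is at most the first discarded singular value times the norm of the hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 512, 32, 8                 # toy sizes, chosen only for illustration
W_U = rng.standard_normal((V, d))    # stand-in for the teacher unembedding
h = rng.standard_normal(d)           # one teacher hidden state

# Offline: thin SVD, W_U = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(W_U, full_matrices=False)

def encode(h, k):
    """Teacher side: project onto the top-k right singular vectors (k floats)."""
    return Vt[:k] @ h

def decode(c, k):
    """Receiver side: rebuild approximate logits from the k coefficients."""
    return (U[:, :k] * S[:k]) @ c

z = W_U @ h
z_hat = decode(encode(h, k), k)
err = np.linalg.norm(z - z_hat)
bound = S[k] * np.linalg.norm(h)     # sigma_{k+1} * ||h||_2 (S is zero-indexed)
z_full = decode(encode(h, d), d)     # k = d recovers the logits exactly

print(f"payload floats: {encode(h, k).size}, error {err:.3f} <= bound {bound:.3f}")
```

A random Gaussian matrix has a flat spectrum, so the error here is large; the approach only pays off when the real unembedding's spectrum decays quickly.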

Comparison to Top-$k$ Compression
To compare SVD-Nitrobrew with top-$k$ we need a notion of approximation quality. We will start by considering L2 error on logits. For any hidden state $h$, truncated SVD gives the worst-case reconstruction bound

$$\left\| W_U h - \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^\top h \right\|_2 \le \sigma_{k+1}\, \|h\|_2 .$$

By Eckart–Young, the truncated SVD used by SVD-Nitrobrew is the optimal rank-$k$ approximation in operator norm, so no other rank-$k$ linear map can improve this worst-case bound. However, since top-$k$ truncation is nonlinear and input-dependent, this optimality result does not imply that SVD is always better than top-$k$.
Furthermore, the relevant quantity for distillation is the divergence between the teacher distribution induced by the true logits and the approximate distribution induced by the reconstructed logits. This divergence defines the student's training signal, so compression should preserve it even when the logits are only approximately reconstructed. We can derive a simple illustrative bound by Lipschitzness of softmax. For let for all tokens, then
Where is the true teacher distribution and is the softmax temperature. Combining this with the SVD reconstruction bound gives,
This bound is loose in large-vocabulary settings because the minimum token probability can be very small. We therefore use it only as a qualitative guide: reducing spectral reconstruction error should reduce distributional distortion, but KL is much more sensitive to errors on high-probability tokens than logit MSE alone suggests.
Comparison of Compression Strategies
We then sought to improve upon the spectral compression strategy employed. For each compression strategy, we tested its KL against the true teacher distribution and MSE against the true teacher logit vector. For details on compression experiments, refer to Appendix G.
Naive SVD: As outlined above in SVD-Nitrobrew, the default approach simply uses the truncated right singular matrix for projection.

We find that naive SVD truncation is a poor compressor. For most compression budgets, it has significantly higher KL divergence from the ground truth relative to top-k.
Importance SVD:
Analyzing the spectral properties of the unembedding in isolation ignores the geometry of the input activations to the unembedding — which are also highly structured and have dominant directions of variance. We can instead ask for dominant directions shared between the post-RMS final activation and the unembedding.
We measure this through the effective logit importance of each singular direction $i$, defined as the product of a singular-value term and an activation-energy term.
Here $\sigma_i$ is the $i$-th singular value of $W_U$, $v_i$ is the corresponding right singular vector, and $h$ is the post-RMSNorm hidden state at the final layer. The quantity $\mathbb{E}_h\!\left[(v_i^\top h)^2\right]$ is the average activation energy projected onto direction $i$, measured over a held-out corpus of model completions. The product captures the actual contribution of direction $i$ to the logits.

We observe a U-shaped phenomenon, where both the largest and the smallest singular values carry the most logit importance. This motivates an approach that orders singular values (and directions) by their effective importance and truncates accordingly.
The importance-ordered truncation has a nontrivial crossover point in KL, allowing for compression.
When we investigate the source of the remaining discrepancy, we find that most tokens in generation are low-entropy, and SVD has a hard time matching these. Furthermore, the optimal linear map in the MSE sense is not necessarily the optimal linear map in the KL sense. The latter, for example, cares far more about sharpness around the top candidates than accuracy on the tail.
Probability-reweighted SVD:
Motivated by this observation, we attempt a final approach which reweights the SVD by the square root of the average token probability.
$\bar p_v$ is the marginal probability of each token $v$ in the vocabulary, averaged over our held-out corpus of completions. Tokens with high $\bar p_v$ are the ones to which the model frequently assigns significant probability mass: these are precisely the tokens whose logit reconstruction errors dominate KL divergence.
The reweighted logit map is:

$$\tilde W = \mathrm{diag}\!\left(\sqrt{\bar p}\right) W_U\, C^{1/2},$$

where $C = \mathbb{E}\!\left[h h^\top\right]$ is the empirical activation covariance matrix.
The left factor $\mathrm{diag}(\sqrt{\bar p})$ scales each row of $W_U$ by the square root of that token's average probability. The right factor $C^{1/2}$ weights input directions by their empirical activation scale and covariance. Performing SVD on $\tilde W$ finds directions that maximize the probability-weighted logit variance under the empirical activation distribution.
The SVD of this reweighted matrix yields:

$$\tilde W = \tilde{\mathbf{U}}\, \tilde{\mathbf{\Sigma}}\, \tilde{\mathbf{V}}^\top, \qquad c = \tilde{\mathbf{V}}_k^\top C^{-1/2} h, \qquad \hat z = \mathrm{diag}\!\left(\sqrt{\bar p}\right)^{-1} \tilde{\mathbf{U}}_k \tilde{\mathbf{\Sigma}}_k\, c .$$

The decoder includes the inverse weighting to undo the scaling, ensuring that at full rank ($k = d$), we recover $W_U h$ exactly. At reduced rank, the approximation error is concentrated on low-probability tokens, which are least important for KL. The results are shown below.
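A minimal NumPy sketch of this reweighting is given below. It rests on assumptions: synthetic activations and a random unembedding stand in for real calibration data, the activation weighting is taken to be the symmetric square root of the uncentered second-moment matrix, and `encode` and `decode` are hypothetical helper names. The check at the end confirms the full-rank exactness property described above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 256, 16, 2000
W_U = rng.standard_normal((V, d))                          # stand-in unembedding
# Synthetic calibration activations with anisotropic per-direction scale.
H = rng.standard_normal((n, d)) * np.linspace(0.2, 1.0, d)

# Calibration statistics: marginal token probability and activation second moment.
logits = H @ W_U.T
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)
p_bar = probs.mean(axis=0)
C = H.T @ H / n

# Symmetric square root of C and its inverse via an eigendecomposition.
evals, evecs = np.linalg.eigh(C)
C_half = (evecs * np.sqrt(evals)) @ evecs.T
C_inv_half = (evecs / np.sqrt(evals)) @ evecs.T

# Reweighted map: rows scaled by sqrt(p_bar), inputs weighted by C^{1/2}.
W_tilde = (np.sqrt(p_bar)[:, None] * W_U) @ C_half
U, S, Vt = np.linalg.svd(W_tilde, full_matrices=False)

def encode(h, k):
    """Teacher side: whiten with C^{-1/2}, then keep k coefficients."""
    return Vt[:k] @ (C_inv_half @ h)

def decode(c, k):
    """Receiver side: rebuild logits and undo the sqrt(p_bar) row scaling."""
    return ((U[:, :k] * S[:k]) @ c) / np.sqrt(p_bar)

h = H[0]
z = W_U @ h
z_full = decode(encode(h, d), d)       # full rank: exact up to rounding
z_low = decode(encode(h, 4), 4)        # reduced rank: approximate
print(np.allclose(z_full, z), float(np.mean((z - z_low) ** 2)))
```

Whether this beats plain truncation at a given rank depends on the real spectrum and token statistics, which this synthetic setup does not reproduce.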

We finally have a nontrivial truncation point at . We can compress our Nitrobrew state spectrally and still expect to have more information than top-k.
There are significantly more effective ways to compress the final hidden state, particularly nonlinear approaches such as autoencoders, that we did not investigate here but remain a low-hanging direction for future gains.
Takeaways + Looking Forward
Nitrobrew makes full-vocabulary logit distillation cheap enough to be the default — no top-k truncation or biased, lossy approximation.
We would like to stress two guiding principles:
- Simplicity: Nitrobrew requires minimal changes to existing training pipelines, and even the kernel admits a very efficient compiled torch implementation.
- Generality: Nitrobrew is effective for both on- and off-policy distillation. It can dramatically reduce the storage costs for caching teacher predictions in off-policy distillation, which was the original motivation for the tool internally.
In practice, Nitrobrew yields 100× faster loss computation at long context, 37× less peak memory, and 2.5–3× faster training steps.
Spectral compression of logits for distillation is an important avenue of future work that we are quite excited about. Instead of post-hoc strategies, this line of work directly leverages a deeper understanding of model architecture & dynamics to improve practical performance. At Tilde, that's precisely the type of approach we look for.
Code is available at Nitrobrew-Release. We have submitted a PR to VeRL for direct integration as well.
Appendix
A. Code
A simple torch implementation is available at Nitrobrew-Release. The VeRL PR can be found here.
B. Online Divergence Algorithms
Reverse KL
Backward for Forward & Reverse KL
C. Profiling Setup
For within-trainer profiling, we followed a standard baseline setup. We tested on three student–teacher pairs:
| Pair | Student | Teacher |
|---|---|---|
| Pair 1 | Qwen3 0.6B | Qwen3 8B |
| Pair 2 | Qwen3 1.7B | Qwen3 32B |
| Pair 3 | Qwen3 4B | Qwen3 32B |
We use the DAPO-Math-17k [16] dataset. We test a max generation length of 8192, 256 prompts per step, and 1 generation per prompt. We start timing after 5 steps of training have already completed and average step time over the subsequent 10 steps.
For floats-per-token budgets below $d$, we adopt SVD-Nitrobrew as described in the SVD-Nitrobrew section.
D. Extended Profiling Results
Breakdown of within-step savings for 1.7B→32B distillation setup.
Breakdown of within-step savings for 4B→32B distillation setup.
E. Training Setup
We mostly follow the setup of Jin et al. The teacher is Qwen3-8B and the student is Qwen3-1.7B-Base, trained on the MATH dataset (~7,500 problems) for 2 epochs (~116 steps) using the verl-rl framework.
We use supervised forward KL distillation (use_policy_gradient=False) with a GRPO advantage estimator, cosine learning rate schedule peaking at 3×10⁻⁶, and per-token loss clamping at 10.0 nats. Rollout generation uses temperature 0.6.
For the top-k baselines, we evaluate transmitting $2k$ floats per token (index–logprob pairs). For Nitrobrew, we transmit PCA-compressed teacher hidden states at d_comp = D_model = 4096 (i.e., full rank, lossless compression). Validation accuracy is measured on MATH-lighteval every 10 steps.
F. OPD Convergence
For training convergence, we tested on-policy distillation with Qwen3-1.7B/8B on MATH and found Nitrobrew competitive with top-k baselines on downstream accuracy. However, this is a small-scale regime — the benefits of full-vocabulary distillation (calibration, tail signal) manifest more clearly at larger model and data scales, where the information discarded by top-k truncation becomes a binding constraint and students are more capable of learning from the tail.
We observed that full-vocabulary KL and top-k KL are qualitatively different loss functions. They require independent hyperparameter tuning and practitioners adopting Nitrobrew should expect to perform adjustment.
G. Compression Experiment Details
Model and data: We use Qwen3-8B [17] as the teacher. For compression strategies that require calibration data (importance SVD, probability-reweighted SVD), we collect teacher activations by running Qwen3-8B on a subset of OpenThoughts3-1.2M [18]. We record the post-RMSNorm final-layer hidden states $h$, empirical activation covariance $C$, and marginal token probabilities $\bar p$.
Evaluation setting. Compression quality is measured in a realistic on-policy distillation setting rather than on the calibration data. We generate rollouts from the student (Qwen3-1.7B-Base [17]) on the MATH dataset [19], then run the teacher (Qwen3-8B) on these student trajectories to produce ground-truth hidden states $h$ and logits $z = W_U h$.
For each compression strategy at rank $k$, we compute the approximate hidden state $\hat h$ and reconstruct logits $\hat z = W_U \hat h$.
Notably, for top-k, we do not count the cost of transmitting indices in the analysis. In reality, top-k transmits $2k$ floats per position once indices are included.
Metrics. We report two metrics, averaged over all token positions in the evaluation set:
KL divergence between the true and approximate teacher distributions:

$$\mathrm{KL}\!\left(\mathrm{softmax}(z) \,\|\, \mathrm{softmax}(\hat z)\right)$$

This measures how much the compression distorts the distributional signal that the student receives during training.
Logit MSE between the true and approximate teacher logits:

$$\frac{1}{V} \sum_{v} \left(z_v - \hat z_v\right)^2$$

This measures raw reconstruction fidelity independent of the softmax nonlinearity.
H. Note on Throughput Variation
The 3× headline figure is drawn from the mid-scale NemoRL configuration (Qwen3-1.7B student, Qwen3-32B teacher) and is consistent with end-to-end training time in VeRL for the OPD runs. In practice, the speedup varies substantially, from 1.5× to over 14×, depending on sequence length, teacher model size, and framework-specific overhead. We report 3× as a representative figure; for many practical configurations, it is conservative.
References
1. Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff (2015).
2. Romero, Adriana (2014).
3. Kim, Yoon and Rush, Alexander M (2016).
4. Character AI.
5. Muralidharan, Saurav and Turuvekere Sreenivas, Sharath and Joshi, Raviraj and Chochowski, Marcin and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Kautz, Jan and Molchanov, Pavlo (2024).
6. Agarwal, Rishabh and Vieillard, Nino and Zhou, Yongchao and Stanczyk, Piotr and Garea, Sabela Ramos and Geist, Matthieu and Bachem, Olivier (2024).
7. Thinking Machines.
8. Furlanello, Tommaso and Lipton, Zachary and Tschannen, Michael and Itti, Laurent and Anandkumar, Anima (2018).
9. Shenfeld, Idan and Damani, Mehul and Hübotter, Jonas and Agrawal, Pulkit (2026).
10. Anshumann, Anshumann and Zaidi, Mohd Abbas and Kedia, Akhil and Ahn, Jinwoo and Kwon, Taehwak and Lee, Kangwook and Lee, Haejun and Lee, Joohyung (2025).
11. Dasgupta, Sayantan and Cohn, Trevor and Baldwin, Timothy (2026).
12. Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and Ré, Christopher (2022).
13. Wijmans, Erik and Huval, Brody and Hertzberg, Alexander and Koltun, Vladlen and Krähenbühl, Philipp (2024).
14. NVIDIA (2025).
15. Sheng, Guangming and Zhang, Chi and Ye, Ziling and Wu, Xibin and Zhang, Wang and Zhang, Ru and Peng, Yanghua and Lin, Haibin and Wu, Chuan (2024).
16. BytedTsinghua-SIA (2025).
17. Yang, An and others (2025).
18. OpenThoughts (2025).
19. Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob (2021).
Footnotes
1. We may allow some policy drift by doing a few rollouts before updating the student in a relaxed version.