Wall Attention: Length Generalization With Diagonal Gates
TL;DR
- We introduce Wall Attention, which generalizes diagonal forget gates from linear RNNs to softmax attention, yielding a data-dependent positional encoding that is a full replacement for RoPE.
- Wall Attention achieves strong pretraining gains and exceptional length extrapolation, outperforming RoPE and Forgetting Attention (FoX).
- Wall is production-ready: it retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel matches FA3 decode.
- We develop a general induced action framework for translating RNN structures to softmax attention, unifying FoX, PaTH, and Wall as special cases.
Outline
- Introduction — Overview of positional embeddings and recent advances in linear RNNs.
- Deriving Wall Attention — From linear RNN gates to Wall Attention via the induced action framework
- Kernel Implementation — Development of an optimized training and decode kernel
- Empirical Results — Pretraining, length extrapolation, & mechanistic results
- Inflection — The future of Wall
1: Introduction
Why Position Matters
Standard attention is permutation-equivariant: the operation does not recognize the order in which tokens were introduced. Every effective transformer architecture therefore injects some form of positional signal to break this symmetry and encode a recency bias.
Positional embeddings (PE), such as RoPE [1] and ALiBi [2], form the backbone of continual learning via in-context learning and of long-context generalization, because they let the model adapt to the stream of tokens by selectively forgetting old memories [3][4][5][6]. Advances in PE are therefore critical.
Data-Independent vs. Data-Dependent
The landscape of positional methods can be partitioned into methods that depend on context and those that do not.
Data-independent methods bake position into the architecture as a fixed function of token index. RoPE rotates query-key pairs by an angle proportional to their relative distance; ALiBi subtracts a linear penalty from attention logits. Both are static: the same two positions always receive the same bias regardless of content.
Data-dependent methods let the model decide, based on the per-token latent state, how to modulate its positional prior. Two recent examples are FoX [7] and PaTH [8], which we discuss in detail below.
Data-dependent PE is, in principle, superior because natural language has variable-rate information density. The model can learn end-to-end what to retain and what to forget as context grows.
From Linear to Softmax Attention
Modern linear RNNs compress the sequence into a fixed-size recurrent state that is updated at each step. Without gating, the state can only accumulate token features, and the model has no way to prioritize recent, relevant information over stale context. The forget gate‡ is what gives these architectures selective memory, by decaying the state before writing new information in.
Mamba [9] and Gated DeltaNet [10] use scalar gates, applying the same decay uniformly across all features. GLA [11], RWKV-7 [12] and KDA [13] use diagonal gates, allowing different features to forget at different rates.
The diagonal structure is one of the core innovations behind the latest generation of linear RNNs, and it dramatically improves empirical performance.
Softmax attention has no finite-dimensional recurrent state, so it is not obvious how gating should translate. FoX [7], to our knowledge the first work to attempt this, takes the scalar gate from linear attention and formulates gating on the softmax state as an additive bias. The result is a data-dependent ALiBi, which is a content-dependent distance penalty.
A scalar forget gate produces a cumulative decay factor:
This factors out of the kernel and becomes an additive logit bias:
Analogous to ALiBi, but with a learned, data-dependent distance penalty instead of a fixed linear one.
PaTH [8] instead translates the generalized Householder products from DeltaNet's evolution rule [14], which are fundamental in state tracking. The result is loosely analogous to a data-dependent RoPE, as a content-dependent bilinear modulation of the query-key score, though generalized Householder products are not strictly orthogonal.
A generalized Householder transform produces a cumulative linear transform:
This acts as a multiplicative query-key transform:
Analogous to RoPE, but with learned, data-dependent transforms instead of fixed sinusoidal ones.
Both are instances of lifting linear RNN structures to softmax attention, and both produce data-dependent positional embeddings. The landscape of positional methods forms a natural taxonomy along two axes: whether the bias is additive or multiplicative, and whether it is data-independent or data-dependent. Each quadrant allows a very specific temporal geometry.
| | Additive† | Multiplicative |
|---|---|---|
| Data-independent | ALiBi | RoPE |
| Data-dependent | FoX | PaTH, Wall |
†Additive PE is a degenerate case of multiplicative PE: a logit bias is equivalent to a multiplicative gate on an extra dimension appended to the key. The distinction is one of parameterization, not expressivity.
‡We use "forget gate" and "retention" interchangeably throughout. A retention value $r$ means the gate retains fraction $r$ of the state per step; $r = 1$ is no forgetting.
From the Gate to the Wall
We introduce Wall Attention, which generalizes diagonal gates to softmax attention as a full replacement for RoPE. Wall is plug-and-play with existing training and inference frameworks, and we open-source reference Triton implementations for both. Along the way, we develop a general framework for translating RNN structures to softmax attention, unifying FoX, PaTH, and Wall as special cases.
A diagonal forget gate produces a per-channel cumulative decay:
Via the induced action, this becomes a multiplicative per-channel rescaling of the dot product:
Each channel forgets at its own rate. Some decay quickly (recency), others persist (long-range memory).
2: Deriving Wall Attention
In Section 1, we described the gated linear RNN
where is some transition matrix. When we unroll this recurrence, the output at position is a weighted sum over past values. Define the unnormalized attention weight , where is the cumulative transition product from position to .
To go from linear to softmax attention, we need . Following convention, we omit the softmax normalization term throughout this derivation and work with unnormalized weights. The standard way to make this connection is to replace the dot product with a general kernel for some embedding . If is the identity, and we recover linear attention. If , which is the exponential kernel, we recover softmax attention.
We are specifically interested in gating transition matrices that take the form . When we put a gate inside the recurrence, it interacts nontrivially with the feature map.
Primitives
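Fix a feature map $\phi$ from the input space to a (possibly infinite-dimensional) Hilbert space, which induces a corresponding positive-definite kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$. We consider two kernels:
- Linear kernel: $k(x, y) = x^\top y$, $\phi(x) = x$
- Exponential kernel: $k(x, y) = \exp(x^\top y)$, $\phi(x) = \big(x^{\otimes n} / \sqrt{n!}\big)_{n \ge 0}$ *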
To offer some brief intuition on the Taylor feature map for the exponential kernel, the resulting infinite-dimensional vector has as elements all of the monomials of elements of (up to scaling).
*We use the Taylor expansion feature map. Other infinite-dimensional feature maps, e.g. Gaussian random sketches [15][16], also correspond to the exponential kernel. Our derivations hold for any valid feature map.
A generalized gated linear RNN that maintains a state with a gate takes the following form:
Unrolling, we obtain unnormalized attention weights
For a scalar gate where is a single number, the cumulative product is also scalar. Since is a scalar, it factors out of the inner product:
For the exponential kernel specifically
This is exactly the Forgetting Attention (FoX) formulation, whereby the scalar decay is absorbed into an additive logit bias.
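The absorption step can be checked numerically. The following is a minimal NumPy sketch, assuming per-step scalar gates `f` in (0, 1] and random queries and keys (all names and sizes here are illustrative, not taken from the released kernels). It shows that multiplying the exponential kernel by the cumulative decay gives the same unnormalized weight as adding the summed log-gates to the logit.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
f = rng.uniform(0.5, 1.0, size=T)        # hypothetical scalar forget gates

c = np.cumsum(np.log(f))                 # c[t] = sum of log f_l for l <= t
bias = c[:, None] - c[None, :]           # bias[i, j] = sum of log f_l for l in (j, i]

# Multiplicative view: cumulative decay times the exponential kernel.
mult = np.exp(bias) * np.exp(q @ k.T)
# Additive view: the same decay folded into the logit as a bias.
add = np.exp(q @ k.T + bias)

# Only the causal part (j <= i) is meaningful for attention.
causal = np.tril(np.ones((T, T), dtype=bool))
max_err = np.max(np.abs(mult[causal] - add[causal]) / add[causal])
```

Because the bias is a difference of prefix sums, it can be built with one cumulative sum per head rather than a product per query-key pair.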
Why Diagonal Gates Break for Exp
The gate must match the dimension of the kernel's feature space. For the linear kernel (), a -dimensional diagonal gate can serve as directly. Unrolling gives the linear kernel score , a per-channel decay-weighted dot product and precisely the score used in GLA, RWKV-7 & KDA.
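To make the linear-kernel case concrete, here is a small NumPy sketch under assumed notation: a state `S` updated as `S <- diag(g_t) S + k_t v_t^T` and read out as `S^T q_t`. The variable names and the exact update convention are our assumptions for illustration. The sketch checks that the recurrence and the unrolled per-channel decay-weighted dot product give the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, dv = 5, 3, 2
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, dv))
g = rng.uniform(0.5, 1.0, size=(T, d))   # hypothetical diagonal gates, one value per channel

# Recurrent form: decay the state per channel, then write the new key-value pair.
S = np.zeros((d, dv))
out_rnn = np.zeros((T, dv))
for t in range(T):
    S = g[t][:, None] * S + np.outer(k[t], v[t])
    out_rnn[t] = S.T @ q[t]

# Unrolled form: per-channel decay-weighted dot product between q_i and k_j.
out_attn = np.zeros((T, dv))
for i in range(T):
    for j in range(i + 1):
        decay = np.prod(g[j + 1 : i + 1], axis=0)   # empty product (j == i) is all ones
        out_attn[i] += np.sum(q[i] * decay * k[j]) * v[j]
```

Setting every channel of `g[t]` to the same value recovers the scalar-gate case.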
For the exponential kernel, is infinite-dimensional, so a -dimensional diagonal gate cannot serve as directly. We need a lift that extends the -dimensional gate to the full feature space.
One might try to define as a diagonal operator on , assigning each monomial basis element a gate value. However, the monomials in the Taylor expansion correspond to combinations of multiple input coordinates at once, e.g. $x_1 x_3$, which involves both channels 1 and 3 simultaneously.
There is no canonical way to assign a single gate value from to a multi-channel monomial, so the scalar absorption trick that enables FoX has no analogue for diagonal gates in the exponential kernel.
The Induced Action
Instead of lifting to the feature space directly, we may let it act on the input space and induce the action through . For any input-space operator , define its induced action by
The induced action lifts any linear RNN gate structure to softmax attention by gating in input space before embedding.
Intuitively, this definition says "gating the input before embedding equals gating the embedded state". It is well-defined for any feature map and any input-space linear operator, beyond just diagonal gates. Note that this implicitly requires to be a linear operator on , i.e. that is linear in . This holds for the Taylor feature map since is a linear combination of monomials in .
This resolves the obstruction from the previous section. The monomial had no single gate to inherit from , but the induced action assigns it the product of its constituent gates , since . In fact, for the Taylor feature map, the induced action coincides with the symmetric power lift , the unique functorial extension of a linear operator to the symmetric algebra (see Appendix F for the full treatment).
Crucially, the induced action is a group homomorphism [18]. For any linear operators on , the induced action respects composition.
This means the cumulative induced transition equals the induced action of the cumulative product . When combined with the kernel trick, we obtain the following simplification for the exponential kernel
where and is its induced action on .
This holds for any sequence of linear operators . If as a generalized Householder product, for example, we recover PaTH attention. Notably, if we restrict we recover a different object than the FoX formulation discussed above, which we discuss in Appendix C.
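The identity can be sanity-checked with an explicit, truncated Taylor feature map. The sketch below is illustrative only: `taylor_features` is a hypothetical helper, the truncation degree is arbitrary, and we apply a diagonal gate `g` to the key side. It checks two things: a degree-2 monomial of the gated input picks up the product of its constituent gates, and the feature-space inner product matches the closed-form exponential of the gated dot product.

```python
import math
import numpy as np

def taylor_features(x, degree):
    """Truncated Taylor feature map: blocks x^{(tensor) n} / sqrt(n!) for n = 0..degree."""
    blocks, power = [], np.ones(1)
    for n in range(degree + 1):
        blocks.append(power / math.sqrt(math.factorial(n)))
        power = np.kron(power, x)        # next tensor power, flattened
    return np.concatenate(blocks)

q = np.array([0.3, -0.2, 0.5])
k = np.array([0.4, 0.1, -0.6])
g = np.array([0.9, 0.5, 0.7])            # hypothetical diagonal gate

# (1) The monomial x_1 * x_3 of the gated input inherits the gate product g_1 * g_3.
idx = 0 * 3 + 2                          # position of x_1 * x_3 in the flattened degree-2 block
plain = np.kron(k, k)[idx]
gated = np.kron(g * k, g * k)[idx]

# (2) Gating in input space, then embedding, reproduces exp(q . (g * k)).
score_features = taylor_features(q, 10) @ taylor_features(g * k, 10)
score_closed = math.exp(q @ (g * k))
```

The truncation is only there to make the feature vector finite; with small inputs the degree-10 remainder is far below floating-point noise.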
Specializing to Diagonal Gates & Wall Attention
For diagonal gates where , the cumulative product is also diagonal. From our induced action formulation, the attention score is then as follows
We can thus define Wall Attention.
Wall Attention can selectively forget and remember on a per-channel basis.
As the sequence grows, each channel builds its own attention pattern. The fast channel retains only recent tokens. The slow channel preserves signal from anchor tokens (0, 4, 8) that had strong keys, visible as persistent bright columns.
Define as the prefix sum in log-space of gate vectors. An equivalent, important formulation of Wall Attention is vanilla attention with modified queries and keys.
This decoupled formulation works because the cumulative diagonal product is trivially invertible, allowing a clean factorization into independent Q and K transforms. This factorization is what makes Wall fast in practice (Section 3).
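A minimal NumPy sketch of this equivalence, assuming natural-log prefix sums `P` and random inputs (names and sizes are illustrative). It compares the direct per-channel decayed score with the dot product of rescaled queries and keys.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
g = rng.uniform(0.5, 1.0, size=(T, d))   # hypothetical per-channel gates in (0, 1]
P = np.cumsum(np.log(g), axis=0)         # prefix sum of log-gates, one row per position

# Direct form: score[i, j] = sum_c q[i, c] * k[j, c] * exp(P[i, c] - P[j, c]).
decay = np.exp(P[:, None, :] - P[None, :, :])
direct = np.einsum("ic,jc,ijc->ij", q, k, decay)

# Factorized form: rescale queries and keys independently, then take a plain dot product.
q_tilde = q * np.exp(P)
k_tilde = k * np.exp(-P)
factored = q_tilde @ k_tilde.T
```

The factorized form never materializes the `(T, T, d)` decay tensor, which is why a standard attention kernel can consume `q_tilde` and `k_tilde` directly.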
Perspectives on Wall Attention
There are many ways to understand Wall beyond diagonal forget gates in attention. We present five further framings to build intuition.
Beyond simply FoX with more channels
Wall is a fundamentally different creature from FoX with more channels. Even when the gate is a constant vector ( for all ), Wall does not reduce to FoX.
FoX absorbs the scalar decay as an additive logit bias, placing it in the ALiBi family:
Wall instead applies a per-channel multiplicative rescaling of the dot product, placing it in the RoPE family:
One shifts the logit; the other rescales the inner product per channel. These are algebraically and geometrically distinct (see Appendix C).
3: Kernel Implementation & Practical Considerations
Recall that, in the factorized form from Section 2, the attention score is a plain dot product of rescaled queries and keys, `Qt_i · Kt_j`, where `Qt = Q * exp2(P)` and `Kt = K * exp2(-P)`. This factorization is what makes Wall practical. The per-channel gate reduces to a one-pass rescaling of `Q` and `K`, after which the attention kernel is algorithmically identical to FlashAttention [19][20].
This section describes how to make this work at scale: the design choices that control overhead, the numerical problems that arise in finite precision, and the kernel that overcomes them.
Design Choices
Two (orthogonal) decisions control the cost of Wall's gate before we even touch the kernel.
Illustration of the two choices. Subdim gating: the gate applies to only a leading subset of the head dimensions, and the rest are ungated (vanilla attention). Gate tying: one gate vector per KV head is shared across all query heads in the group, so with four query heads per group there are 4× fewer gate parameters.
Gate Tying. Wall produces a gate per query head, resulting in $h_q \cdot d$ scalars per token (where $h_q$ is the number of query heads and $d$ is the head dimension). In GQA, we can instead produce one gate per KV head and share it across the query group. This reduces the gate projection and its overhead by a factor of the GQA ratio (typically 4–8x) and creates a shared decay geometry across query heads within a KV group. We find empirically that KV-head gating matches Q-head gating (Section 4).
Subdim Gating. Wall can be applied to only the first of head dimensions, leaving the rest ungated.
Unlike RoPE, where partial application maintains or improves performance, reducing for Wall degrades performance uniformly (Section 4).
Training Kernel
We now describe the development of our Triton training kernel, from inception to final version. Let be the tile size along the query dimension and the tile size along the key dimension.
Approach 1: Offline Rescale
The simplest approach is to compute the rescaled `Qt` and `Kt` offline, then run vanilla FlashAttention.

    Qt <- Q * exp2(P)
    Kt <- K * exp2(-P)
    O, L <- FlashAttention(Qt, Kt, V)

Unfortunately, this approach results in overflow and catastrophic cancellation. `P` is a cumulative sum whose magnitude grows monotonically with sequence length. For example, at , , exceeds 160, which passes bf16 exponent range and is uncomfortably close to fp32 saturation. The factors `exp2(P)` and `exp2(-P)` underflow and overflow respectively when decomposed, even though the score they reconstruct is bounded by the per-tile difference .
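The failure mode is easy to reproduce outside the kernel. The following NumPy sketch uses hypothetical values (a constant per-step log2-gate of -0.05 over 4096 positions, fp32 throughout) chosen only to push the prefix past the fp32 exponent range. It contrasts the offline rescale with a rescale taken relative to a nearby anchor.

```python
import numpy as np

N = 4096
log2_gate = np.full((N, 1), -0.05, dtype=np.float32)  # hypothetical constant log2-gate
P = np.cumsum(log2_gate, axis=0)                      # roughly -204.8 at the last position

with np.errstate(over="ignore", under="ignore", invalid="ignore"):
    q_scale = np.exp2(P)                 # underflows to 0 at late positions
    k_scale = np.exp2(-P)                # overflows to inf at late positions
    naive = q_scale[-1] * k_scale[-2]    # 0 * inf = nan; the true factor is 2 ** -0.05

# Same factor, but both exponents are taken relative to an anchor 64 positions back.
R = P[-64]
anchored = np.exp2(P[-1] - R) * np.exp2(R - P[-2])
```

The product being reconstructed is identical in both cases; only the intermediate magnitudes differ.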
Approach 2: Per-Tile Anchor
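To mitigate this instability, we introduce a per-tile anchor `R` and split the exponent into two bounded halves, `P_i - R` and `R - P_j`.
We set `R = P[i_start]`, the prefix at the first row of the query tile, so that each factor is bounded by the accumulated gate within a single tile, which is at most .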
    Input: Q, K, V, gate prefix P in HBM (all N x d)
    Output: O (N x d), L (N) in HBM
    parallel for i = 1 .. T_r do              ▷ one thread block per Q tile
        Load Q_i, P_i to SRAM
        R <- P[i_start]                       ▷ per-tile anchor
        Qt <- Q_i * exp2(P_i - R)             ▷ rescaled queries (bounded)
        O_i <- 0, l_i <- 0, m_i <- -inf
        for j = 1 .. T_c do                   ▷ sequential over K/V tiles
            Load K_j, V_j, P_j to SRAM
            Kt <- K_j * exp2(R - P_j)         ▷ rescaled keys (bounded)
            S <- Qt @ Kt^T * scale
            m' <- max(m_i, rowmax(S))
            P_e <- exp(S - m')
            a <- exp(m_i - m')
            l_i <- a * l_i + rowsum(P_e)
            O_i <- a * O_i + P_e @ V_j
            m_i <- m'
        O_i <- O_i / l_i
        Write O_i, m_i + log(l_i) to HBM

This is numerically stable, and the forward overhead versus FlashAttention is under 5%. On top of FA, Wall only requires one elementwise multiply per tile, which is fully hidden by the matmul.
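For readers who want to step through the loop, here is a NumPy reference of the per-tile anchored forward pass. It is a sketch under our own assumptions rather than the released Triton kernel: it uses equal query and key tile sizes, applies a causal mask, runs in float64, and the function names are hypothetical. A direct, unfactorized implementation is included for comparison.

```python
import numpy as np

def wall_forward_tiled(Q, K, V, P, block=4):
    """Causal forward pass with a per-query-tile anchor; P holds log2-domain gate prefix sums."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V)
    for i0 in range(0, N, block):
        i1 = min(i0 + block, N)
        R = P[i0]                                   # per-tile anchor
        Qt = Q[i0:i1] * np.exp2(P[i0:i1] - R)       # rescaled queries
        m = np.full(i1 - i0, -np.inf)               # running row max
        l = np.zeros(i1 - i0)                       # running softmax denominator
        acc = np.zeros((i1 - i0, V.shape[1]))
        for j0 in range(0, i1, block):              # key tiles up to and including the diagonal
            j1 = min(j0 + block, N)
            Kt = K[j0:j1] * np.exp2(R - P[j0:j1])   # rescaled keys
            S = (Qt @ Kt.T) * scale
            visible = np.arange(i0, i1)[:, None] >= np.arange(j0, j1)[None, :]
            S = np.where(visible, S, -np.inf)       # causal mask
            m_new = np.maximum(m, S.max(axis=1))
            p = np.exp(S - m_new[:, None])
            a = np.exp(m - m_new)                   # rescale previously accumulated terms
            l = a * l + p.sum(axis=1)
            acc = a[:, None] * acc + p @ V[j0:j1]
            m = m_new
        O[i0:i1] = acc / l[:, None]
    return O

def wall_forward_reference(Q, K, V, P):
    """Direct causal softmax over per-channel decayed scores, with no tiling or anchors."""
    N, d = Q.shape
    decay = np.exp2(P[:, None, :] - P[None, :, :])
    S = np.einsum("ic,jc,ijc->ij", Q, K, decay) / np.sqrt(d)
    S = np.where(np.tril(np.ones((N, N), dtype=bool)), S, -np.inf)
    W = np.exp(S - S.max(axis=1, keepdims=True))
    return (W / W.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
N, d = 10, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
P = np.cumsum(np.log2(rng.uniform(0.5, 1.0, size=(N, d))), axis=0)
O_tiled = wall_forward_tiled(Q, K, V, P)
O_ref = wall_forward_reference(Q, K, V, P)
```

The anchor cancels exactly in every score, so the two functions agree up to floating-point error; the anchor only changes which intermediate magnitudes appear.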
Gate Clamping. For numerical stability, we soft-clamp the log-gate via , bounding each step's decay to with a minimum per-step retention of . Empirically, looser bounds improve performance. See Appendix G for the full justification.
Approach 3: Fused Gate Gradient
The forward kernel from Approach 2 is correct and fast, but the corresponding backward is slow. Profiling localizes the bottleneck to backward register pressure. The naive backward maintains three accumulators (`dK_j`, `dV_j`, `dP_j`) alongside the usual softmax statistics. This exceeds the register budget for our preferred block size, and the Triton compiler silently shrinks the tile from 64 to 32, collapsing SM occupancy.
    parallel for j = 1 .. T_c do
        dK_j <- 0, dV_j <- 0
        dP_j <- 0                                   ▷ extra accumulator (3rd register)
        for i = j .. T_r do
            < recompute Pe, compute dS, accumulate dV_j >
            dKt <- dS^T @ Qt
            dK_j <- dK_j + dKt * exp2(R - P_j)
            dP_j <- dP_j - ln2 * Kt * dKt           ▷ per-iteration update
            dQt <- dS @ Kt
            dQ_i <- dQ_i + dQt * exp2(P_i - R)
            dP_i <- dP_i + ln2 * Qt * dQt           ▷ query gate grad
            Scatter dQ_i, dP_i
        Write dK_j, dV_j, dP_j

The critical insight is that the key gate gradient satisfies `dP_j = -ln2 * K_j * dK_j`, a product of two tensors (`K_j` and `dK_j`) that are both already resident when the loop terminates. The gate gradient can be derived once, post-loop, from tensors already present.
    parallel for j = 1 .. T_c do
        dK_j <- 0, dV_j <- 0                        ▷ no gate accumulator
        for i = j .. T_r do
            < recompute Pe, compute dS, accumulate dV_j >
            dK_j <- dK_j + (dS^T @ Qt) * exp2(R - P_j)
            dQt <- dS @ Kt
            dQ_i <- dQ_i + dQt * exp2(P_i - R)
            dP_i <- dP_i + ln2 * Qt * dQt           ▷ query gate grad
            Scatter dQ_i, dP_i
        dP_j <- -ln2 * K_j * dK_j                   ▷ key gate grad (fused, post-loop)
        Write dK_j, dV_j, dP_j

Eliminating one live accumulator frees enough registers for the autotuner to select the preferred block size again, and the matmuls saturate tensor cores again.
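The post-loop identity for the key-side gate gradient can be verified against finite differences. This NumPy sketch is a check under our own simplifying assumptions: it works on raw scores with an arbitrary upstream gradient `dS`, treats the query-side and key-side prefixes as separate inputs (as the kernel's separate `dP_i` and `dP_j` updates do), and omits tiling and anchors, which cancel in the score.

```python
import numpy as np

rng = np.random.default_rng(0)
Nq, Nk, d = 5, 4, 3
Q = rng.normal(size=(Nq, d))
K = rng.normal(size=(Nk, d))
Pq = -rng.uniform(0.0, 2.0, size=(Nq, d))   # hypothetical query-side log2 prefixes
Pk = -rng.uniform(0.0, 2.0, size=(Nk, d))   # hypothetical key-side log2 prefixes
dS = rng.normal(size=(Nq, Nk))              # upstream gradient with respect to the scores

def scores(Pk_):
    """S[i, j] = sum_c Q[i, c] * K[j, c] * 2 ** (Pq[i, c] - Pk_[j, c])."""
    decay = np.exp2(Pq[:, None, :] - Pk_[None, :, :])
    return np.einsum("ic,jc,ijc->ij", Q, K, decay)

# Analytic key gradient, then the fused gate gradient derived from it.
decay = np.exp2(Pq[:, None, :] - Pk[None, :, :])
dK = np.einsum("ij,ic,ijc->jc", dS, Q, decay)
dPk_fused = -np.log(2.0) * K * dK

# Central finite differences on each key-side prefix entry.
eps = 1e-6
dPk_fd = np.zeros_like(Pk)
for j in range(Nk):
    for c in range(d):
        bump = np.zeros_like(Pk)
        bump[j, c] = eps
        dPk_fd[j, c] = np.sum(dS * (scores(Pk + bump) - scores(Pk - bump))) / (2 * eps)
```

The identity holds because the score depends on each key channel only through the product of `K[j, c]` and its decay factor, so the two partial derivatives differ by the elementwise factor `-ln2 * K`.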
The Final Kernel
The fused gradient yields a 12% wall-clock reduction at . We further implement several standard kernel-level optimizations, detailed in Appendix D, including splitting the inner loop into separate off-diagonal and diagonal passes, casting the diagonal matmul inputs to bf16 for WGMMA tensor cores, and autotuning block shapes. Together, these bring the cumulative backward speedup to 41%, closing the gap to FoX from 2.2x to 1.4x.
The remaining gap is structural: per-channel decay requires more HBM traffic (loading ) and more gate-gradient output than FoX's scalar approach. Finer control over the memory hierarchy (e.g., a native CUDA implementation) can close this gap further.
Our Triton implementation is roughly 2x the cost of FlashAttention-2 in isolation. In practice, this overhead is amortized because attention FLOPs are a small fraction of end-to-end training cost (dominated by MLP and embedding layers), so the wall-clock impact on full training is small.
Decode
A naive approach would require storing for all , increasing the KV cache by up to 50%. We describe three approaches that avoid extra storage while maintaining arithmetic intensity.
Approach 1: Absorb + Store References
Rather than storing all in the cache, we can absorb the gate into the cached keys themselves and retain only a per-chunk anchor. We partition the sequence into chunks of size , freeze an anchor at each chunk boundary and cache in place of .
At the decode step , we have:
The query-side rescaling is computed once per chunk, and the inner loop is byte-for-byte a standard FlashDecoding [21] inner loop, with no per-key gate arithmetic and no read from HBM.
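A small NumPy sketch of the absorb-and-anchor scheme, assuming the anchor for each chunk is the gate prefix at the chunk's first position (our assumption; the text only says the anchor is frozen at a chunk boundary). Names and sizes are illustrative. It checks that scores computed from the absorbed cache equal the directly decayed scores.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, C = 12, 4, 4                           # sequence length, head dim, chunk size
K = rng.normal(size=(T, d))
q = rng.normal(size=d)                       # query at the current decode step
P = np.cumsum(np.log2(rng.uniform(0.7, 1.0, size=(T, d))), axis=0)

# Cache time: absorb the gate into each key relative to its chunk's anchor.
anchors = P[::C]                             # (num_chunks, d), one anchor per chunk
K_cache = K * np.exp2(np.repeat(anchors, C, axis=0)[:T] - P)

# Decode time: rescale the query once per chunk, then use plain dot products.
P_t = P[-1]
q_per_chunk = q * np.exp2(P_t - anchors)     # (num_chunks, d)
scores = np.concatenate(
    [K_cache[c * C : (c + 1) * C] @ q_per_chunk[c] for c in range(len(anchors))]
)

# Reference: per-channel decay applied directly from stored prefixes.
direct = np.einsum("c,jc,jc->j", q, K, np.exp2(P_t - P))
```

Only `K_cache` and the per-chunk anchors need to be kept; the per-position prefixes `P` are not read at decode time in this sketch.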
Wall Decode is comparable to FlashDecode (FA3) across all context lengths, and significantly faster than a naive Wall decode that recomputes gates from stored prefixes.
Approach 2: Key-Projected Gates
Instead of projecting gate vectors from the hidden state, we can project head-wise from the key vectors as . Since is already in the KV cache, the gate can be materialized on the fly during attention with zero additional storage. Inside the kernel, each KV block computes the gate from the loaded keys in registers.
b_log_f = tl.sum(b_k * b_W_g[:, None], axis=0) + b_bias_g
b_log_f = log_sigmoid(b_log_f)
b_c = c_running + tl.cumsum(b_log_f, axis=0)
c_running = b_c[-1]
b_k_til = b_k * exp2((c_ref - b_c)[:, None] * RCP_LN2)
The overhead is one small matmul and one cumsum per block, both compute-bound on data already in SRAM. The only extra state is a running scalar per head.
Approach 3: WallMLA
In MLA [22], keys and values are materialized from a compressed latent as . WallMLA derives the gate from the same latent: , requiring one extra small matmul in MLA's existing pipeline. The decode kernel materializes , and the gate correction entirely from the cached latent in a single fused loop.
4: Empirical Results
Pretraining
We pretrained 400M and 1B transformers with Wall Attention on Nemotron CC v2 [23][28]. Training details are in the Appendix and broadly mirror the Aurora release [24].
At 400M, Wall (NoPE) outperforms the RoPE baseline under both Muon and Aurora optimizers. The consistency across optimizers confirms that the gains from diagonal gating are orthogonal to optimizer choice.
At 1B, Wall achieves a 0.01 nats gain over the RoPE baseline and outperforms both FoX and Wall (RoPE), validating it as a standalone PE strategy.
We evaluate the 1B models on standard downstream benchmarks for small model pretraining.
Dashed = older training generation (different arch & optimizer tuning)
Wall (NoPE) achieves the strongest performance across both pretraining and (short-context) downstream evals. Wall 1B is also a new SOTA on the pretraining tokens vs performance plot from Aurora, beating our old generation Aurora runs despite using only Muon.
The magnitude of pretraining gains is surprising. Many positional embedding variants do not show strong gains over RoPE in pretraining, since their focus is length extrapolation. Wall achieves significant improvements in pretraining convergence over RoPE.
Ablations
Gate Biases
Wall's per-channel retention is , where is the soft clamp that bounds each step's log-gate to . The bias sets the gate's operating point at initialization. Large positive starts retention near 1 (fully open, vanilla-attention-like); starts at before clamping, corresponding to retention after the soft clamp.
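As a rough illustration of how the bias sets the operating point, the sketch below evaluates a plain sigmoid at zero pre-activation. This ignores the soft clamp and assumes the retention is a sigmoid of the gate pre-activation plus bias, which is our reading of the description above rather than a formula taken from the release; `initial_retention` is a hypothetical helper.

```python
import math

def initial_retention(bias: float) -> float:
    """Retention sigmoid(bias) when the learned pre-activation is zero, before any clamping."""
    return 1.0 / (1.0 + math.exp(-bias))

# Biases discussed in the ablation: 0 (half open) versus 6 to 8 (almost fully open).
retentions = {b: initial_retention(b) for b in (0.0, 6.0, 8.0)}
for b, r in retentions.items():
    print(f"bias={b:>4}: retention={r:.6f}")
```

Under this reading, a bias of 6 to 8 leaves less than 0.3% of each channel forgotten per step at initialization, which is why the model starts close to vanilla attention.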
FoX prefers a gate bias of 0, matching the authors' findings. Wall prefers conservative biases of 6-8, starting fully open and gradually learning to close so that behavior matches vanilla attention at init.
Subdim & GQA
We ablated both efficiency strategies from Section 3: subdim gating (applying Wall to only head dims) and KV-head gate tying (one gate per KV head instead of per Q head under GQA).
Reducing the subdim degrades performance monotonically, but 2-4x subdim still substantially outperforms vanilla attention. KV-head gating matches the more expensive Q-head gating, confirming that gate tying is a free efficiency win.
Long-Context Extrapolation
We tested long-context extrapolation on CodeParrot, NarrativeQA, and PG-19 for the 1B models. Notably, these models were trained with a 4k maximum sequence length on a dataset where most sequences are significantly shorter than 4k tokens.
Wall (NoPE) generalizes to 160k+ sequence lengths, despite only having seen a maximum sequence length of 4k during training. At no point does Wall diverge in loss as context increases.
We further analyzed Wall on the Needle-in-a-Haystack (NIAH) retrieval task. We found Wall was able to extrapolate to 4x its training context length without difficulty, where all other methods failed.
Finally, on LongBench v1, Wall outperformed RoPE and FoX across categories including coding, summarization, and few-shot learning.
All of these tests measure zero-shot extreme length generalization, which is an exceedingly difficult test of PE methods. With simple long-context midtraining, Wall is well-poised to generalize to enormous context lengths.
Mechanistic Analysis
We analyzed the learned gate distributions on LongCrawl64 (65K-token documents, 100 documents, 256-token bins).
Wall (NoPE) achieves lower NLL than RoPE at every position, with the gap widening as context grows. Wall (RoPE) falls between the two, suggesting the diagonal gate alone captures most of the positional structure that RoPE provides. We collected and analyzed per-channel gate values (retention scores) for LongCrawl64 document samples.
Wall learns a multi-timescale memory hierarchy across layers, with some layers acting as tight local windows and others maintaining signal over much longer ranges. The NoPE model shows wider variation than RoPE, consistent with the gates taking over more of the positional encoding role.
These layer-level aggregates, however, mask a more interesting finding at the channel level. Per-channel retention reveals two distinct populations - "always-on" channels with retention identically 1.0 (zero variance), and highly dynamic channels whose retention swings between the clamp floor and near-1 on a per-token basis. The dynamic channels respond to content, closing hard at semantic boundaries and opening elsewhere. The static channels provide unconditional long-range memory.
This bimodal structure is learned end-to-end. All channels start with effectively identical initialization due to strong gate biases, and the model discovers that some dimensions should serve as permanent memory while others should gate based on content. The cumulative product (Panel C) demonstrates how dynamic channels create sharp memory cutoffs whereas constant retention (dashed lines) gives only smooth exponential decay.
5: Inflection
Four things excite us about this work.
1. Sufficiency. Wall outperforms every positional variant we tested and does not benefit from stacking RoPE or FoX on top. This suggests Wall is sufficient for strong positional and temporal understanding.
2. Empirical success. Wall's pretraining gains are coupled with extreme length extrapolation. It both improves upon the convergence benefits normally afforded by RoPE and allows models trained at a 4k sequence length to extrapolate to 200k+ tokens.
3. Cross-pollination. Wall started as an attempt to bring diagonal gating from linear RNNs into softmax attention. The investigation forced us through a series of questions that, to our knowledge, have not been formalized before: how does a finite-dimensional gate act on an infinite-dimensional feature space? What does the induced action look like?
The induced action framework connects modern linear RNNs and softmax attention. PaTH motivates the score from the linear RNN analogy, whereas our framework derives it as a consequence of the kernel's feature map.
4. Production-ready. Many proposed attention alternatives claim superior expressivity but are difficult to scale to production training and inference. They require extensive kernel design and force users to accept overhead.
Wall retains the embarrassingly parallel structure of vanilla attention, is directly compatible with GQA and MLA, and can be made faster and lighter-weight through the parametrization and kernel strategies we developed.
Future Work
We list here future directions that may be valuable to the community.
Continued pretraining. Models pretrained with RoPE can be upcycled into Wall models for little cost. We would be excited to see larger-scale open-source models with Wall.
Kernel optimizations. Our open-source Triton implementation is correct and reasonably fast, but leaves performance on the table. Internal profiling suggests a prototype CuTE implementation of Wall/WallMLA closes much of the remaining gap to FlashAttention. An open-source CuTE kernel for Wall training and decoding would be a high-impact contribution.
Broader PE comparisons. We did not compare to PaTH because PaTH focuses on state tracking, whereas our focus is forgetting and long-context. We would like to see Wall evaluated against a wider set of positional embedding methods across diverse tasks and scales.
Cite this work
@article{pai2026wall,
title = {Wall Attention: Diagonal Gates for Softmax Attention},
author = {Pai, Dhruv and Averbuch, Timor and Zhang, Ashley and Keigwin, Ben and Dewulf, Alec},
year = {2026},
url = {https://blog.tilderesearch.com/blog/wall-attn}
}
Appendix
A: Open Source Release
We release reference Triton kernels for Wall Attention training and decoding at github.com/tilde-research/wall-attention-release.
B: Training Details
We report exact configurations and training settings for our 400M and 1B runs. We used an internal tokenizer with 128k vocab size for all experiments. For reproducibility, we trained on fully open-source internet data from NVIDIA Nemotron CC v2 [23][28]. Note that the training settings broadly match those used in Aurora [24]. We use per-head Muon (MuonSplit) [27] as the optimizer for all runs.
Transformer 400M:
| Category | Details |
|---|---|
| Training Configuration | 800k tokens/batch, 8192 seq len, WSD schedule |
| Data | 10.5B tokens, NemotronCCv2 HQ split |
| Optimizer | Per-head Muon (MuonSplit) |
| Architecture | d=1024, L=24, MHA, QKNorm, ShortConv, Gated Attention |
Transformer 1B:
| Category | Details |
|---|---|
| Training Configuration | 4M tokens/batch, 2048 seq len, WSD + cosine decay |
| Data | 70B tokens, NemotronCCv2 HQ split |
| Optimizer | Per-head Muon (MuonSplit) |
| Architecture | d=2048, L=24, MHA, QKNorm, ShortConv, Gated Attention |
C: Induced Scalar Formulation
For a scalar gate , the general formula gives
This is very different from FoX. It's a multiplicative positional embedding, similar to the scalar case of Wall Attention. It is somewhat degenerate, as it amounts to merely a path-dependent global temperature.
PaTH [8] (footnote 3) observed that this formulation greatly underperforms FoX. This could be due to the bias initialization differences from our ablation study, whereby multiplicative gates require open-gate initialization.
D: Kernel
Off-diagonal / diagonal split. Each causal Wall output tile attends to many fully-causal key tiles plus one diagonal tile straddling the causal boundary. For Wall, this split coincides with a numerical split. Since (cumulative log-gate) is monotonically non-increasing, an off-diagonal tile anchored at has and , so both rescale factors lie in . The diagonal tile straddles : can grow to the within-tile budget, and masked entries can produce NaN. A single fused loop would force conservative handling (mask + clamp) onto all tiles. Splitting lets the off-diagonal pass run branch-free and mask-free, confining the clamp to one block per query.
bf16 diagonal matmuls. The off-diagonal matmuls ran in bf16 while the diagonal pass defaulted to fp32, an artifact of the reference FlashAttention implementation. This caused a throughput cliff: H100 WGMMA runs bf16 at ~2x TF32 and the backward is matmul-bound. We cast only the matmul inputs () to bf16 and keep fp32 accumulation. The diagonal cast leaves the error indistinguishable from the fp32 path.
Together with the optimizations discussed in Section 3, and block-shape autotuning, these reduce backward wall-clock by ~20% over the stable per-tile kernel.
The naive implementation is often faster than our Triton kernel, but it is numerically unstable and causes early model divergence during training.
E: KV Cache Optimization Improvements
Wall Attention offers a few unique opportunities for KV cache optimization.
Eviction. Per-channel gate values provide a principled heuristic for KV cache eviction. When all channels in an old key have decayed below a threshold, the key has been effectively forgotten and can be evicted at no cost.
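One way this heuristic could look in code is sketched below. It is a hypothetical illustration, not part of the released kernels: the function name, the log2-domain prefix `P`, and the threshold value are all our assumptions. A key is marked evictable once its cumulative retention has fallen below the threshold in every channel.

```python
import numpy as np

def evictable_keys(P, t, threshold=1e-3):
    """Mask over keys 0..t: True where every channel's retention up to step t is below threshold."""
    retention = np.exp2(P[t] - P[: t + 1])   # (t + 1, d), product of gates after each key
    return retention.max(axis=1) < threshold

T = 40
# Two channels that both lose half a bit (log2-gate of -0.5) per step.
decaying = np.cumsum(np.full((T, 2), -0.5), axis=0)
mask = evictable_keys(decaying, t=T - 1)

# Same setup, but the second channel never forgets (log2-gate of 0).
with_always_on = decaying.copy()
with_always_on[:, 1] = 0.0
mask_always_on = evictable_keys(with_always_on, t=T - 1)
```

In this sketch a single channel with retention 1 keeps every key alive, so a practical rule would need to decide how to treat the "always-on" channels described in Section 4.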
Chunk Safety & MXFP8 Alignment. Within a chunk of size , the maximum rescaling factor applied to any cached key is . For typical learned gate magnitudes (), we have
| Chunk Size C | Max Exponent 0.02C | Max Rescaling Factor $2^{0.02C}$ |
|---|---|---|
| 32 | 0.64 | 1.56 |
| 64 | 1.28 | 2.43 |
| 128 | 2.56 | 5.90 |
Even at chunk size 128 with aggressive gates, the rescaling is under 6x, and well within the representable range of fp8 E4M3 with negligible quantization error relative to the key magnitudes themselves.
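The table entries can be recomputed in a few lines. This sketch reads the rescaling factor as a base-2 exponential of the maximum exponent, an assumption on our part that matches the first two rows to two decimals; the per-step magnitude 0.02 is the value quoted in the table header.

```python
per_step = 0.02                               # gate magnitude per step, from the table header

factors = {}
for chunk_size in (32, 64, 128):
    max_exponent = per_step * chunk_size      # worst case: every step in the chunk decays fully
    factors[chunk_size] = 2.0 ** max_exponent
    print(chunk_size, round(max_exponent, 2), round(factors[chunk_size], 2))
```

Doubling the chunk size squares the worst-case factor, so the chunk size trades anchor storage against the dynamic range the cached keys must carry.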
The rescaled values within a chunk are monotonically ordered by distance from the anchor: keys near the chunk boundary have rescaling close to 1, keys far from it have larger rescaling. This smooth, predictable dynamic range aligns naturally with NVIDIA's MXFP8 format [25][26], which assigns a shared block-wise scaling factor to a small group of values. A single MXFP8 scale factor per block captures the local dynamic range with minimal quantization loss.
Reduced dynamic range for key-projected gates. Key-projected gates offer a distinct advantage for low-precision KV caches: since the cached keys are stored unmodified, they have the same dynamic range as vanilla attention and existing KV quantization techniques transfer directly. This contrasts with block-wise anchor decoding, where absorbed keys carry the gate rescaling in their magnitudes.
F: Symmetric Algebra Derivation
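The feature space spanned by the monomials $x^\alpha$ is (a completion of) the symmetric algebra $\mathrm{Sym}(\mathbb{R}^d)$, and the Taylor feature map $\phi$ is the canonical exponential map sending an input vector to its coherent state.
Any linear map $A$ on $\mathbb{R}^d$ extends functorially to this algebra, acting on each graded piece by
$$\mathrm{Sym}^n(A)\,(v_1 \odot \cdots \odot v_n) = (A v_1) \odot \cdots \odot (A v_n),$$
and $\mathrm{Sym}(A) = \bigoplus_n \mathrm{Sym}^n(A)$ is the symmetric power lift of $A$. The induced action is exactly this lift, $\mathrm{Sym}(A)$, since $\mathrm{Sym}(A)\,\phi(x) = \phi(Ax)$ holds by construction. This makes the induced action the canonical object rather than an ad hoc definition, as it is the unique functorial extension of $A$ to the feature space.
On the monomial basis, $\mathrm{Sym}(\mathrm{diag}(a))$ is diagonal with eigenvalue $a^\alpha = \prod_i a_i^{\alpha_i}$, since $(\mathrm{diag}(a)\,x)^\alpha = a^\alpha x^\alpha$. This resolves the obstruction from the main text: each monomial receives the product of its constituent gates.
Because $\mathrm{Sym}$ is a functor, the induced action respects composition:
$$\mathrm{Sym}(AB) = \mathrm{Sym}(A)\,\mathrm{Sym}(B).$$
In particular, the cumulative induced transition equals the induced action of the cumulative product, $\prod_t \mathrm{Sym}(A_t) = \mathrm{Sym}\big(\prod_t A_t\big)$. Combined with the kernel identity, this collapses the feature-space inner product back to a finite-dimensional one:
$$\big\langle \phi(q),\ \mathrm{Sym}\big(\textstyle\prod_t A_t\big)\,\phi(k) \big\rangle = \exp\big(q^\top \big(\textstyle\prod_t A_t\big)\, k\big).$$
This holds for any sequence of linear operators $A_t$, recovering PaTH when the $A_t$ are Householder-like transformations and Wall when the $A_t$ are diagonal.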
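The diagonal case can be checked numerically. The sketch below assumes Taylor features normalized so that the plain kernel is exp(q·k), that is, monomial coordinates x^α / sqrt(α!). It then weights each monomial by the product of its constituent gates and compares a truncated series against the closed form exp(Σ a_i q_i k_i). The function name, the truncation degree, and the test vectors are illustrative choices.

```python
import math
from itertools import product

def gated_taylor_kernel(q, k, a, max_degree=20):
    """Truncated sum over multi-indices alpha of a^alpha * q^alpha * k^alpha / alpha!."""
    total = 0.0
    for alpha in product(range(max_degree + 1), repeat=len(q)):
        if sum(alpha) > max_degree:
            continue  # keep only monomials up to the truncation degree
        term = 1.0
        for a_i, q_i, k_i, n in zip(a, q, k, alpha):
            term *= (a_i * q_i * k_i) ** n / math.factorial(n)
        total += term
    return total

q = [0.3, -0.5, 0.8]
k = [1.1, 0.4, -0.7]
a = [0.9, 0.5, 0.99]   # one step of per-channel gates
b = [0.8, 0.95, 0.6]   # a second step, to check composition

closed_form = math.exp(sum(a_i * q_i * k_i for a_i, q_i, k_i in zip(a, q, k)))
series = gated_taylor_kernel(q, k, a)

# Two steps compose as the elementwise product of the gates.
ab = [a_i * b_i for a_i, b_i in zip(a, b)]
closed_form_two = math.exp(sum(g * q_i * k_i for g, q_i, k_i in zip(ab, q, k)))
series_two = gated_taylor_kernel(q, k, ab)

print(abs(series - closed_form), abs(series_two - closed_form_two))
```

Both differences come out at the level of floating-point noise, which is the finite-dimensional collapse in action: the infinite feature-space sum never has to be materialized.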
G: Gate Clamping Derivation
Even with per-tile anchoring, aggressive gates can overflow within a tile. The kernel accumulates prefix sums where is the per-step log-gate in natural-log domain. The maximum exponent of any exp2 call in the backward is ; staying within fp32 range requires for our block sizes ().
We enforce this with a soft clamp applied after logsigmoid:
This maps the log-gate into a bounded interval, with a steep barrier near the bound but no gradient discontinuity. In retention terms, the clamp imposes a floor of about 0.42: each channel forgets at most 58% per step.
The soft clamp is approximately the identity near zero (where most gates operate after training with high bias) and smoothly saturates toward the bound for aggressive decays. A hard clamp would achieve the same bound but introduces a gradient discontinuity at the transition, which destabilizes training when channels hover near the boundary. Empirically, looser bounds yield better performance, so any kernel improvement that raises the overflow budget directly improves Wall.
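The sketch below contrasts a soft and a hard clamp on the log-gate. Two things here are assumptions rather than the kernel's actual choices: the tanh-shaped `soft_clamp` is just one smooth function with the stated properties (near-identity at zero, smooth saturation), and the bound `G_MAX = -ln(0.42)` is inferred from the "at most 58% per step" figure.

```python
import math

G_MAX = -math.log(0.42)  # assumed bound: retention floor of 0.42 per step

def logsigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)); always <= 0."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def soft_clamp(g: float, bound: float = G_MAX) -> float:
    """Smoothly squash g <= 0 toward -bound (assumed tanh form, for illustration)."""
    return bound * math.tanh(g / bound)

def hard_clamp(g: float, bound: float = G_MAX) -> float:
    """Same bound, but with a kink in the gradient at g = -bound."""
    return max(g, -bound)

for raw in (6.0, 0.0, -2.0, -50.0):  # pre-activation values
    g = logsigmoid(raw)
    print(f"raw={raw:6.1f}  log-gate={g:9.4f}  "
          f"soft retention={math.exp(soft_clamp(g)):.4f}  "
          f"hard retention={math.exp(hard_clamp(g)):.4f}")
```

For small log-gates the two clamps agree with the unclamped value. They differ only near the bound, where the soft version approaches the floor gradually instead of switching its gradient off abruptly.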
H: Optimizer Ablation (Muon vs Aurora)
At 400M scale, we compared Muon and Aurora optimizers for both the RoPE baseline and Wall (NoPE). Both optimizers achieve comparable final loss on the baseline, and Wall (NoPE) outperforms the baseline under both optimizers. Aurora provides an edge over Muon, consistent with the findings in the Aurora post [24]. Since the gains from Wall are orthogonal to optimizer choice, all results in the main text use Muon except Figure 3, which shows both optimizers at 400M scale.
References
- Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y. (2024).
- Press, O., Smith, N. A., Lewis, M. (2021).
- Kazemnejad, A., Padhi, I., Natesan, K., Das, P., Reddy, S. (2023).
- Zhao, L., Feng, X., Feng, X., et al. (2024).
- Peng, B., Quesnelle, J., Fan, H., Shippole, E. (2024).
- Ding, Y., Zhang, L. L., Zhang, C., et al. (2024).
- Lin, Z., Nikishin, E., He, X., Courville, A. (2025).
- Yang, S., Shen, Y., Wen, K., Tan, S., Mishra, M., Ren, L., Panda, R., Kim, Y. (2025).
- Gu, A., Dao, T. (2023).
- Yang, S., Kautz, J., Hatamizadeh, A. (2025).
- Yang, S., Wang, B., Shen, Y., Panda, R., Kim, Y. (2024).
- Peng, B., Zhang, R., Goldstein, D., et al. (2025).
- Kimi Team, Zhang, Y., Lin, Z., et al. (2025).
- Yang, S., Wang, B., Zhang, Y., Shen, Y., Kim, Y. (2024).
- Kar, P., Karnick, H. (2012).
- Wacker, J., Kanagawa, M., Filippone, M. (2024).
- Clift, J., Doryn, D., Murfet, D., Wallbridge, J. (2020).
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., Ré, C. (2022).
- Dao, T., Haziza, D., Massa, F., Sizov, G. (2023).
- Su, D., Kong, K., Lin, Y., et al. (2025).
- Dewulf, A., Pai, D., Yang, L., Zhang, A., Keigwin, B. (2026).
- Open Compute Project (2023).
- GLM-5 Team, Zeng, A., Lv, X., et al. (2026).
- Roy, A., Chou, T., Duvvuri, S. S., et al. (2025).