
Wall Attention: Length Generalization With Diagonal Gates

6.02.2026
Dhruv Pai*,  Timor Averbuch*,  Ashley Zhang*,  Ben Keigwin*,  Alec Dewulf
* Core Contributor; Correspondence to dhruv@tilderesearch.com

TL;DR

  • We introduce Wall Attention, which generalizes diagonal forget gates from linear RNNs to softmax attention, yielding a data-dependent positional encoding that is a full replacement for RoPE.
  • Wall Attention achieves strong pretraining gains and exceptional length extrapolation, outperforming RoPE and Forgetting Attention (FoX).
  • Wall is production-ready - it retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel matches FA3 decode.
  • We develop a general induced action framework for translating RNN structures to softmax attention, unifying FoX, PaTH, and Wall as special cases.

Outline

  1. Introduction — Overview of positional embeddings and recent advances in linear RNNs.
  2. Deriving Wall Attention — From linear RNN gates to Wall Attention via the induced action framework
  3. Kernel Implementation — Development of an optimized training and decode kernel
  4. Empirical Results — Pretraining, length extrapolation, & mechanistic results
  5. Inflection — The future of Wall

1: Introduction

Why Position Matters

Standard attention is permutation-equivariant, and the operation does not recognize the temporal ordering in which tokens were introduced. Every effective transformer architecture injects some form of positional signal to break this symmetry and encode a recency bias.

[Figure: shuffling the input tokens (a, b, c → b, c, a) yields the same attention output y.]

Positional embeddings (PE), such as RoPE [1] and ALiBi [2], form the backbone of continual learning via in-context learning and long-context generalization: they enable the model to adapt to the stream of tokens by selectively forgetting old memories [3][4][5][6]. Advances in PE are therefore critical.

Data-Independent vs. Data-Dependent

The landscape of positional methods can be partitioned into methods that depend on context and those that do not.

Data-independent methods bake position into the architecture as a fixed function of token index. RoPE rotates query-key pairs by an angle proportional to their relative distance; ALiBi subtracts a linear penalty from attention logits. Both are static: the same two positions always receive the same bias regardless of content.

Data-dependent methods let the model decide, based on the per-token latent state, how to modulate its positional prior. Two recent examples are FoX [7] and PaTH [8], which we discuss in detail below.

Data-dependent PE is, in principle, superior because natural language has variable-rate information density. The model can learn end-to-end what to retain and what to forget as context grows.

From Linear to Softmax Attention

Modern linear RNNs compress the sequence into a fixed-size recurrent state $S_t$ that is updated at each step. Without gating, $S_t$ can only accumulate token features, and the model has no way to prioritize recent, relevant information over stale context. The forget gate is what gives these architectures selective memory, by decaying $S_t$ before writing new information in.

Mamba [9] and Gated DeltaNet [10] use scalar gates, applying the same decay uniformly across all features. GLA [11], RWKV-7 [12] and KDA [13] use diagonal gates $G_t = \mathrm{diag}(g_{t,1}, \ldots, g_{t,d})$, allowing different features to forget at different rates.

| Method | Recurrent form |
| --- | --- |
| Softmax Attention | $\sum_{j=1}^{t} \exp\!\bigl(q_t^\top k_j\bigr)\, v_j$ |
| SA + RoPE | $\sum_{j=1}^{t} \exp\!\Bigl(q_t^\top \prod_{s=j+1}^{t} R_s\, k_j\Bigr)\, v_j$ |
| Linear Attention | $\sum_{j=1}^{t} \bigl(q_t^\top k_j\bigr)\, v_j$ |
| Mamba2 (scalar gate) | $\sum_{j=1}^{t} q_t^\top \!\Bigl(\prod_{s=j+1}^{t} \alpha_s\Bigr) k_j\, v_j$ |
| GLA | $\sum_{j=1}^{t} q_t^\top \!\Bigl(\prod_{s=j+1}^{t} \colorbox{#ede9fe}{\(\mathrm{Diag}(\alpha_s)\)}\Bigr) k_j\, v_j$ |
| Gated DeltaNet (scalar gate) | $\sum_{j=1}^{t} q_t^\top \!\Bigl(\prod_{s=j+1}^{t} \alpha_s (I - k_s k_s^\top)\Bigr) k_j\, v_j$ |
| FoX | $\sum_{j=1}^{t} \exp\!\bigl(q_t^\top k_j\bigr) \Bigl(\prod_{s=j+1}^{t} \alpha_s\Bigr) v_j$ |
| RWKV-7 | $\sum_{j=1}^{t} q_t^\top \!\Bigl(\prod_{s=j+1}^{t} \colorbox{#ede9fe}{\(\mathrm{Diag}(\alpha_s)\)} (I - \hat{k}_s \hat{k}_s^\top)\Bigr) k_j\, v_j$ |
| KDA | $\sum_{j=1}^{t} q_t^\top \!\Bigl(\prod_{s=j+1}^{t} \colorbox{#ede9fe}{\(\mathrm{Diag}(\alpha_s)\)} (I - k_s k_s^\top)\Bigr) k_j\, v_j$ |
| Wall (ours) | $\sum_{j=1}^{t} \exp\!\Bigl(q_t^\top \!\Bigl(\prod_{s=j+1}^{t}\colorbox{#ede9fe}{\(\mathrm{Diag}(\alpha_s)\)}\Bigr) k_j\Bigr)\, v_j$ |

Table adapted from Kimi Linear (KDA). Diagonal gate terms highlighted in purple.

The diagonal structure is one of the core innovations behind the latest generation of linear RNNs, and it dramatically improves empirical performance.

Softmax attention has no finite-dimensional recurrent state, so it is not obvious how gating should translate. FoX [7], to our knowledge the first work to attempt this, takes the scalar gate from linear attention and formulates gating on the softmax state as an additive bias. The result is a data-dependent ALiBi, which is a content-dependent distance penalty.

Forgetting Attention (FoX)

A scalar forget gate $g_t \in (0,1)$ produces a cumulative decay factor:

$$F_{i,j} = \prod_{s=j+1}^{i} g_s$$

This factors out of the kernel and becomes an additive logit bias:

$$\text{score}_{ij} = q_i^\top k_j + \log F_{i,j}$$

Analogous to ALiBi, but with a learned, data-dependent distance penalty instead of a fixed linear one.

PaTH [8] instead translates the generalized Householder products from DeltaNet's evolution rule [14], which are fundamental in state tracking. The result is loosely analogous to a data-dependent RoPE, as a content-dependent bilinear modulation of the query-key score, though generalized Householder products are not strictly orthogonal.

PaTH Attention

A generalized Householder transform $H_s = I - \beta_s w_s w_s^\top$ produces a cumulative linear transform:

$$M_{i,j} = \prod_{s=j+1}^{i} (I - \beta_s\, w_s w_s^\top)$$

This acts as a multiplicative query-key transform:

$$\text{score}_{ij} = q_i^\top M_{i,j}\, k_j$$

Analogous to RoPE, but with learned, data-dependent transforms instead of fixed sinusoidal ones.

Both are instances of lifting linear RNN structures to softmax attention and producing data-dependent positional embeddings. The landscape of positional methods forms a natural taxonomy along two axes: whether the bias is additive or multiplicative, and whether it is data-independent or data-dependent. Each quadrant allows a very specific temporal geometry.

| | Additive bias: $s = q^\top k + \text{bias}$ | Multiplicative: $s = q^\top M k$ |
| --- | --- | --- |
| Data-independent | ALiBi (fixed linear penalty) | RoPE (fixed rotation by distance) |
| Data-dependent | FoX (learned scalar decay) | Wall (learned per-channel decay), PaTH |

Additive PE is a degenerate case of multiplicative PE: a logit bias $b_{ij}$ is equivalent to a multiplicative gate $e^{b_{ij}}$ on an extra dimension appended to the key. The distinction is one of parameterization, not expressivity.

We use "forget gate" and "retention" interchangeably throughout. A retention value $r \in (0,1]$ means the gate retains fraction $r$ of the state per step; $r = 1$ is no forgetting.

From the Gate to the Wall

We introduce Wall Attention, which generalizes diagonal gates to softmax attention as a full replacement for RoPE. Wall is plug-and-play with existing training and inference frameworks, and we open-source reference Triton implementations for both. Along the way, we develop a general framework for translating RNN structures to softmax attention, unifying FoX, PaTH, and Wall as special cases.

Wall Attention

A diagonal forget gate $D_t = \mathrm{diag}(g_{t,1}, \ldots, g_{t,d})$ produces a per-channel cumulative decay:

$$F_{ij,n} = \prod_{s=j+1}^{i} g_{s,n}$$

Via the induced action, this becomes a multiplicative per-channel rescaling of the dot product:

$$\text{score}_{ij} = \sum_n F_{ij,n}\, q_{i,n}\, k_{j,n}$$

Each channel forgets at its own rate. Some decay quickly (recency), others persist (long-range memory).

2: Deriving Wall Attention

In Section 1, we described the gated linear RNN

$$S_t = A_t S_{t-1} + k_t v_t^\top$$

where $A_t$ is some transition matrix. When we unroll this recurrence, the output at position $i$ is a weighted sum over past values. Define the unnormalized attention weight $w_{ij} = q_i^\top B_{ij}\, k_j$, where $B_{ij} = \prod_{r=j+1}^i A_r$ is the cumulative transition product from position $j$ to $i$.

To go from linear to softmax attention, we need $w_{ij} = \exp(q_i^\top B_{ij}\, k_j)$. Following convention, we omit the softmax normalization term throughout this derivation and work with unnormalized weights. The standard way to make this connection is to replace the dot product $q^\top k$ with a general kernel $\kappa(q, k) = \langle \phi(q), \phi(k) \rangle$ for some embedding $\phi$. If $\phi$ is the identity, $\kappa = q^\top k$ and we recover linear attention. If $\kappa = \exp(q^\top k)$, which is the exponential kernel, we recover softmax attention.

We are specifically interested in gating transition matrices that take the form $A_t = \mathrm{diag}(g)$. When we put a gate inside the recurrence, it interacts nontrivially with the feature map.

Primitives

Fix a feature map $\phi : \mathbb{R}^d \to \mathcal{H}$ from the input space to a (possibly infinite-dimensional) Hilbert space, which induces a corresponding positive-definite kernel $\kappa(x,y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. We consider two kernels:

  • Linear kernel: $\phi(x) = x$ (identity), $\kappa(x,y) = x^\top y$
  • Exponential kernel: $\phi(x) = \left(x^\alpha/\sqrt{\alpha !}\right)_{\alpha \in \mathbb{N}^d}$, $\kappa(x,y) = \exp(x^\top y)$*

To offer some brief intuition on the Taylor feature map for the exponential kernel, the resulting infinite-dimensional vector has as elements all of the monomials of elements of $x$ (up to scaling).

Taylor feature map $\phi : \mathbb{R}^2 \to \mathcal{H}$ for $\kappa = \exp(x^\top y)$:

  • deg 0: $1$
  • deg 1: $x_1$, $x_2$
  • deg 2: $x_1^2 / \sqrt{2}$, $x_1 x_2$, $x_2^2 / \sqrt{2}$
  • deg 3: $x_1^3 / \sqrt{6}$, $x_1^2 x_2 / \sqrt{2}$, $x_1 x_2^2 / \sqrt{2}$, $x_2^3 / \sqrt{6}$
  • ...

$$\langle \phi(x),\, \phi(y) \rangle_{\mathcal{H}} \;=\; \exp(x^\top y)$$

*We use the Taylor expansion feature map. Other infinite-dimensional feature maps, e.g. Gaussian random sketches [15][16], also correspond to the exponential kernel. Our derivations hold for any valid feature map.
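The identity $\langle \phi(x), \phi(y) \rangle = \exp(x^\top y)$ can be checked numerically by truncating the feature map at a finite degree. The sketch below is illustrative only: the helper name `taylor_features`, the cutoff of 12, and the test vectors are assumptions, not part of the original derivation.

```python
import itertools
import math

import numpy as np


def taylor_features(x, max_degree):
    """Truncated Taylor feature map: x^alpha / sqrt(alpha!) for all |alpha| <= max_degree."""
    feats = []
    for alpha in itertools.product(range(max_degree + 1), repeat=len(x)):
        if sum(alpha) > max_degree:
            continue  # keep only multi-indices up to the cutoff degree
        monomial = math.prod(float(xi) ** a for xi, a in zip(x, alpha))
        norm = math.sqrt(math.prod(math.factorial(a) for a in alpha))
        feats.append(monomial / norm)
    return np.array(feats)


x = np.array([0.3, -0.5])
y = np.array([0.7, 0.2])

# Inner product in the (truncated) feature space vs. the closed-form kernel.
approx = float(taylor_features(x, 12) @ taylor_features(y, 12))
exact = math.exp(float(x @ y))
print(approx, exact)
```

For small inputs the truncation error is negligible, so the two printed values should agree to many digits; larger inputs need a higher cutoff.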

A generalized gated linear RNN maintains a state $S_t \in \mathcal{H} \otimes \mathbb{R}^{d_v}$ with a gate $A_t: \mathcal{H} \to \mathcal{H}$ and takes the following form

$$S_t = A_t S_{t-1} + \phi(k_t) v_t^\top$$

Unrolling, we obtain unnormalized attention weights

$$w_{i,j} = \left\langle \phi(q_i), \left( \prod_{r=j+1}^i A_r\right) \phi(k_j) \right\rangle$$

For a scalar gate $A_t = g_t \cdot I_\mathcal{H}$, where $g_t \in (0,1)$ is a single number, the cumulative product $\prod_{r=j+1}^i A_r = F_{i,j} \cdot I_\mathcal{H}$ is also scalar. Since $F_{i,j}$ is a scalar, it factors out of the inner product:

$$w_{i,j} = F_{i,j} \cdot \langle \phi(q_i), \phi(k_j) \rangle, \qquad F_{i,j} = \prod_{r=j+1}^i g_r$$

For the exponential kernel specifically

$$F_{i,j} \cdot \langle \phi(q_i), \phi(k_j) \rangle = F_{i,j} \cdot \exp(q_i^\top k_j) = \exp(q_i^\top k_j + \log F_{i,j})$$

This is exactly the Forgetting Attention (FoX) formulation, whereby the scalar decay is absorbed into an additive logit bias.

Why Diagonal Gates Break for Exp

The gate $A_t: \mathcal{H} \to \mathcal{H}$ must match the dimension of the kernel's feature space. For the linear kernel ($\mathcal{H} = \mathbb{R}^d$), a $d$-dimensional diagonal gate $D_t = \mathrm{diag}(g_{t,1}, \dots, g_{t,d})$ can serve as $A_t$ directly. Unrolling gives the linear kernel score $w_{i,j} = \sum_n F_{ij,n}\, q_{i,n}\, k_{j,n}$, a per-channel decay-weighted dot product and precisely the score used in GLA, RWKV-7 & KDA.

For the exponential kernel, $\mathcal{H}$ is infinite-dimensional, so a $d$-dimensional diagonal gate cannot serve as $A_t$ directly. We need a lift $\tilde{D}_t: \mathcal{H} \to \mathcal{H}$ that extends the $d$-dimensional gate to the full feature space.

One might try to define $\tilde D_t$ as a diagonal operator on $\mathcal{H}$, assigning each monomial basis element a gate value. However, the monomials in the Taylor expansion correspond to combinations of multiple input coordinates at once, e.g. $x_1^2 x_3$, which involves both channels 1 and 3 simultaneously.

[Figure: the same Taylor feature map $\phi : \mathbb{R}^2 \to \mathcal{H}$, with each monomial marked as involving $x_1$ only, $x_2$ only, or both (mixed).]

There is no canonical way to assign a single gate value from $D_t$ to a multi-channel monomial, so the scalar absorption trick that enables FoX has no analogue for diagonal gates in the exponential kernel.

The Induced Action

Instead of lifting $D_t$ to the feature space directly, we may let it act on the input space and induce the action through $\phi$. For any input-space operator $A$, define its induced action $\tilde{A}: \mathcal{H} \to \mathcal{H}$ by

$$\tilde{A}\, \phi(x) \;:=\; \phi(A\, x)$$

[Diagram: commutative square. $\phi$ maps $x \in \mathbb{R}^d$ to $\phi(x) \in \mathcal{H}$ and $A\,x$ to $\phi(A\,x)$; $A$ maps $x$ to $A\,x$, and $\tilde{A}$ maps $\phi(x)$ to $\phi(A\,x)$.]

The induced action lifts any linear RNN gate structure to softmax attention by gating in input space before embedding.

Intuitively, this definition says "gating the input before embedding equals gating the embedded state". It is well-defined for any feature map and any input-space linear operator, beyond just diagonal gates. Note that this implicitly requires $\tilde{A}$ to be a linear operator on $\mathcal{H}$, i.e. that $\phi(Ax)$ is linear in $\phi(x)$. This holds for the Taylor feature map since $(Ax)^\alpha$ is a linear combination of monomials in $x$.

This resolves the obstruction from the previous section. The monomial $x_1^2 x_3$ had no single gate to inherit from $D = \mathrm{diag}(g)$, but the induced action $\tilde{D}$ assigns it the product of its constituent gates $g_1^2 g_3$, since $(Dx)^\alpha = g^\alpha x^\alpha$. In fact, for the Taylor feature map, the induced action coincides with the symmetric power lift $\mathrm{Sym}(A)$, the unique functorial extension of a linear operator to the symmetric algebra (see Appendix F for the full treatment).
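As a concrete check of $(Dx)^\alpha = g^\alpha x^\alpha$, the following sketch builds a truncated Taylor feature map for $d = 3$ and compares gating before embedding with gating after embedding. The helper names, the degree cutoff, and the numeric values are illustrative assumptions.

```python
import itertools
import math

import numpy as np


def multi_indices(d, max_degree):
    """All exponent tuples alpha in N^d with |alpha| <= max_degree."""
    return [a for a in itertools.product(range(max_degree + 1), repeat=d)
            if sum(a) <= max_degree]


def taylor_features(x, alphas):
    """phi(x)_alpha = x^alpha / sqrt(alpha!), restricted to the given multi-indices."""
    return np.array([
        np.prod(x ** np.array(a)) / math.sqrt(math.prod(math.factorial(e) for e in a))
        for a in alphas
    ])


alphas = multi_indices(3, 10)
g = np.array([0.9, 0.5, 0.7])   # diagonal gate D = diag(g)
k = np.array([0.4, -0.2, 0.6])
q = np.array([0.5, 0.3, -0.1])

# Induced action of D on the monomial basis: entry alpha is scaled by g^alpha.
lifted_gate = np.array([np.prod(g ** np.array(a)) for a in alphas])

lhs = taylor_features(g * k, alphas)             # phi(D k): gate, then embed
rhs = lifted_gate * taylor_features(k, alphas)   # D~ phi(k): embed, then gate

# The monomial x_1^2 x_3 (alpha = (2, 0, 1)) inherits the gate g_1^2 g_3.
gate_for_x1sq_x3 = lifted_gate[alphas.index((2, 0, 1))]

# Kernel trick: the feature-space score matches exp(q^T D k) up to truncation.
score_feature_space = float(taylor_features(q, alphas) @ lhs)
score_kernel_trick = math.exp(float(q @ (g * k)))
```

The two feature vectors agree entry by entry, which is the sense in which the lifted gate is diagonal in the monomial basis.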

Crucially, the induced action is a group homomorphism [18]. For any linear operators $A, B$ on $\mathbb{R}^d$, the induced action respects composition.

$$\widetilde{(AB)}\phi(x) = \phi(ABx) = \tilde{A}\phi(Bx) = \tilde{A}\tilde{B}\phi(x)$$

This means the cumulative induced transition equals the induced action of the cumulative product, $\prod \tilde{A}_r = \widetilde{\prod A_r}$. When combined with the kernel trick, we obtain the following simplification for the exponential kernel

$$\begin{aligned} w_{ij} &= \phi(q_i)^\top\; \tilde{A}_{j \to i}\; \phi(k_j) & &\text{(feature-space score)} \\ &= \phi(q_i)^\top\; \phi\!\left(A_{j \to i}\, k_j\right) & &\text{(induced action)} \\ &= \exp\!\left(q_i^\top A_{j \to i}\, k_j\right) & &\text{(kernel trick)} \end{aligned}$$

where $A_{j \to i} = \prod_{r=j+1}^i A_r$ and $\tilde{A}_{j \to i}$ is its induced action on $\mathcal{H}$.

This holds for any sequence of linear operators $A_r$. If each $A_r = I - \beta_r w_r w_r^\top$ is a generalized Householder transform, for example, we recover PaTH attention. Notably, if we restrict to $A_r = g_r I$, we recover a different object than the FoX formulation discussed above; we discuss this in Appendix C.

Specializing to Diagonal Gates & Wall Attention

For diagonal gates where $A_r = D_r$, the cumulative product is also diagonal. From our induced action formulation, the attention score $s_{ij}$ (with $w_{ij} = \exp(s_{ij})$) is then as follows

$$s_{ij} = q_i^\top \mathrm{diag}(F_{ij})\, k_j = \sum_n F_{ij,n}\, q_{i,n}\, k_{j,n}, \qquad F_{ij} = \prod_{r=j+1}^i g_r$$

We can thus define Wall Attention.

Definition: Wall Attention

$$\begin{aligned} o_i &= \sum_j \mathrm{softmax}_j \!\left( \sum_n F_{ij,n}\, q_{i,n}\, k_{j,n} \right) v_j \\[4pt] &= \sum_j \mathrm{softmax}_j \!\left( \sum_n \left( \prod_{r=j+1}^i g_{r,n} \right) q_{i,n}\, k_{j,n} \right) v_j \end{aligned}$$

Wall Attention can selectively forget and remember on a per-channel basis.
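The definition can be transcribed directly into a slow reference implementation, which is useful as a correctness oracle for faster kernels. This is a minimal sketch, assuming a single head, causal masking, gates in $(0, 1]$, and no $1/\sqrt{d}$ scale; the function name `wall_attention_reference` is hypothetical and is not the released kernel API.

```python
import numpy as np


def wall_attention_reference(q, k, v, g):
    """Causal Wall Attention, O(T^2 d) reference.

    q, k: (T, d) queries and keys; v: (T, d_v) values;
    g: (T, d) per-channel gates in (0, 1].
    """
    T = q.shape[0]
    P = np.cumsum(np.log(g), axis=0)        # P[t] = sum_{u <= t} log g_u
    out = np.zeros((T, v.shape[1]))
    for i in range(T):
        # F[j, n] = prod_{r=j+1..i} g[r, n] = exp(P[i, n] - P[j, n]) for j <= i
        F = np.exp(P[i] - P[: i + 1])
        scores = (F * q[i] * k[: i + 1]).sum(axis=1)
        w = np.exp(scores - scores.max())   # numerically stable softmax
        out[i] = (w / w.sum()) @ v[: i + 1]
    return out
```

With all gates equal to 1, every $F_{ij,n}$ is 1 and the function reduces to vanilla causal softmax attention, which gives a simple sanity check.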

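[Interactive figure: per-channel forgetting. Panels show the attention patterns of a fast, a medium, and a slow channel, plus their combination, as the sequence length grows. Example $q_0$ channel weights: 0.8 (channel 1), 0.5 (channel 2), 0.2 (channel 3), so token 0's query emphasizes the fast channel (recency).]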

As the sequence grows, each channel builds its own attention pattern. The fast channel retains only recent tokens. The slow channel preserves signal from anchor tokens (0, 4, 8) that had strong keys, visible as persistent bright columns.

Define $P_t = \sum_{u \leq t} \log g_u$ as the prefix sum in log-space of gate vectors. An equivalent, important formulation of Wall Attention is vanilla attention with modified queries and keys.

Factorized form

$$P_t = \sum_{u \leq t} \log g_u$$

$$\tilde{q}_i = \exp(P_i) \odot q_i, \qquad \tilde{k}_j = \exp(-P_j) \odot k_j$$

$$o_t = \mathrm{Attn}(\tilde{q},\, \tilde{k},\, v)$$

This decoupled formulation works because the cumulative diagonal product is trivially invertible, allowing a clean factorization into independent Q and K transforms. This factorization is what makes Wall fast in practice (Section 3).
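A short numerical check makes the factorization concrete: the scores from rescaled queries and keys match the scores built from explicit gate products. The shapes, seed, and gate range below are arbitrary assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
g = rng.uniform(0.6, 1.0, size=(T, d))   # per-channel gates

# Factorized form: rescale Q and K once, then take a plain dot product.
P = np.cumsum(np.log(g), axis=0)
q_tilde = np.exp(P) * q
k_tilde = np.exp(-P) * k
scores_factorized = q_tilde @ k_tilde.T

# Direct form: explicit cumulative gate product for every causal pair (i, j).
scores_direct = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1):
        F = np.prod(g[j + 1 : i + 1], axis=0)   # empty product is 1 when j == i
        scores_direct[i, j] = np.sum(F * q[i] * k[j])

causal = np.tril(np.ones((T, T), dtype=bool))
max_abs_diff = np.abs(scores_factorized - scores_direct)[causal].max()
```

Only the causal entries are compared, since the direct form is defined for $j \leq i$.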

Perspectives on Wall Attention

There are many ways to understand Wall beyond diagonal forget gates in attention. We present five further framings to build intuition.

Beyond simply FoX with more channels

Wall is a fundamentally different creature from FoX with more channels. Even when every channel shares the same gate value ($F_{ij,n} = F_{ij}$ for all $n$), Wall does not reduce to FoX.

FoX absorbs the scalar decay as an additive logit bias, placing it in the ALiBi family:

$$\text{FoX:} \quad \text{softmax}\!\left(q_i^\top k_j + \log F_{ij}\right)$$

Wall instead applies a per-channel multiplicative rescaling of the dot product, placing it in the RoPE family:

$$\text{Wall:} \quad \text{softmax}\!\left(\sum_n F_{ij,n}\, q_{i,n}\, k_{j,n}\right)$$

One shifts the logit; the other rescales the inner product per channel. These are algebraically and geometrically distinct (see Appendix C).

3: Kernel Implementation & Practical Considerations

Recall, in the factorized form from Section 2, the attention score $s_{ij} = \tilde q_i^\top \tilde k_j$ where $\tilde{q}_i = \exp(P_i) \odot q_i$ and $\tilde{k}_j = \exp(-P_j) \odot k_j$. This factorization is what makes Wall practical. The per-channel gate reduces to a one-pass rescaling of $Q$ and $K$, after which the attention kernel is algorithmically identical to FlashAttention [19][20].

This section describes how to make this work at scale: the design choices that control overhead, the numerical problems that arise in finite precision, and the kernel that overcomes them.

Design Choices

Two (orthogonal) decisions control the cost of Wall's gate before we even touch the kernel.

Subdim gating

The gate applies to the first $K_g$ of $d_k$ head dimensions. The rest are ungated (vanilla attention).

[Diagram: a head of dimension $d_k$ whose first $K_g = 5$ channels carry gates $F_1, \ldots, F_5$ (Wall) and whose remaining channels have gate $1$ (vanilla).]

Gate tying (GQA)

One gate vector per KV head, shared across all query heads in the group. Reduces gate overhead by the GQA ratio.

[Diagram: one KV head with gate $g \in \mathbb{R}^{d_k}$ shares it with query heads Q₁ to Q₄. The same $g$ is applied to all four query heads, giving 4× fewer gate parameters.]

Gate Tying. Wall produces a gate per query head, resulting in $H_Q \times d_k$ scalars per token (where $H_Q$ is the number of query heads and $d_k$ is the head dimension). In GQA, we can instead produce one gate per KV head and share it across the query group. This reduces the gate projection & overhead by a factor of the GQA ratio (typically 4–8x) and creates a shared decay geometry across query heads within a KV group. We find empirically that KV-head gating matches Q-head gating (Section 4).

Subdim Gating. Wall can be applied to only the first $K_g$ of the $K = d_k$ head dimensions, leaving the rest ungated.

$$s_{ij}=\underbrace{\sum_{n=1}^{K_g} F_{ij,n}q_{i,n} k_{j,n}}_{\mathrm{Wall}} + \underbrace{\sum_{n=K_g+1}^K q_{i,n} k_{j,n}}_{\mathrm{Vanilla}}$$

Unlike RoPE, where partial application maintains or improves performance, reducing $K_g$ for Wall degrades performance uniformly (Section 4).
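The subdim score is the full Wall score with the gate fixed to 1 on the ungated channels. A minimal sketch for a single query-key pair follows; the function name `subdim_score` and the example values are assumptions for illustration.

```python
import numpy as np


def subdim_score(q_i, k_j, F_ij, K_g):
    """Score with Wall gating on the first K_g channels and vanilla attention on the rest.

    q_i, k_j: (K,) query and key; F_ij: (K_g,) cumulative per-channel decay.
    """
    gated = np.sum(F_ij * q_i[:K_g] * k_j[:K_g])   # Wall part
    vanilla = np.sum(q_i[K_g:] * k_j[K_g:])        # ungated part
    return gated + vanilla


q_i = np.array([1.0, -2.0, 0.5, 3.0])
k_j = np.array([0.5, 1.0, -1.0, 2.0])
F_ij = np.array([0.9, 0.1])   # decay on the first K_g = 2 channels

score = subdim_score(q_i, k_j, F_ij, K_g=2)

# Equivalent view: pad the decay vector with ones and use the full Wall score.
F_padded = np.concatenate([F_ij, np.ones(2)])
score_full = np.sum(F_padded * q_i * k_j)
```

Setting `K_g` to the full head dimension recovers Wall, and setting it to 0 recovers vanilla attention.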

Training Kernel

We now describe the development of our Triton training kernel, from inception to final version. Let $B_T = 128$ be the tile size along the query dimension and $B_S = 64$ the tile size along the key dimension.

[Animation: work per Q tile block, stepping through the Setup, Load, Gate, Accum, and Write phases for query tiles Q0 and Q1 over key tiles K0 to K3.]

Approach 1: Offline Rescale

The simplest approach is to compute $\tilde{Q} = Q \odot \exp_2(P)$ and $\tilde{K} = K \odot \exp_2(-P)$, then run vanilla FlashAttention.

Naive Wall Attention (offline rescale)
    Qt <- Q * exp2(P)
    Kt <- K * exp2(-P)
    O, L <- FlashAttention(Qt, Kt, V)

Unfortunately, this approach results in overflow and catastrophic cancellation. $P_t = \sum_{u \leq t} \log_2 g_u$ is a cumulative sum whose magnitude grows monotonically with sequence length. For example, at $T = 8192$ with $|\log_2 g| \approx 0.02$, $|P_T|$ exceeds 160, which is beyond the largest exponent that bf16 or fp32 can represent (both saturate near $2^{128}$). Taken separately, the factors $\exp_2(P_i)$ and $\exp_2(-P_j)$ underflow and overflow respectively, even though the product they reconstruct, $\exp_2(P_i - P_j)$, never exceeds 1.
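The failure is easy to reproduce in float32 with the numbers from the example above. This sketch assumes a constant gate with $\log_2 g = -0.02$ on a single channel, which is a simplification; real gates vary per token and per channel.

```python
import numpy as np

T = 8192
log2_g = np.full(T, -0.02, dtype=np.float32)   # assumed constant log2-gate
P = np.cumsum(log2_g)                          # float32 prefix sum, P[-1] is about -163.84

with np.errstate(over="ignore", under="ignore"):
    q_scale = np.exp2(P[-1])    # query-side factor: underflows to 0
    k_scale = np.exp2(-P[-1])   # key-side factor: overflows to inf

# The quantity the score actually needs depends only on a difference of prefixes.
pair_decay = np.exp2(P[-1] - P[-2])   # one step of decay, close to 2 ** -0.02

print(q_scale, k_scale, pair_decay)
```

The separate factors are unusable, while the difference of prefixes is perfectly representable, which is what motivates anchoring the exponent.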

Approach 2: Per-Tile Anchor

To mitigate this instability, we introduce $R \in \mathbb{R}^d$ and split the exponent into two bounded halves.

$$\exp_2(P_i - P_j) = \exp_2(P_i - R) \cdot \exp_2(R - P_j)$$

We set $R = P[\mathrm{tile\_start}]$ so that each factor is bounded by the accumulated gate within a single tile, which is at most $B_S \cdot |\log_2 g|$.

Wall Attention forward (per-tile anchor)
Input: Q, K, V, gate prefix P in HBM (all N x d)
Output: O (N x d), L (N) in HBM
parallel for i = 1 .. T_r do            # one thread block per Q tile
    Load Q_i, P_i to SRAM
    R <- P[i_start]                     # per-tile anchor
    Qt <- Q_i * exp2(P_i - R)           # rescaled queries (bounded)
    O_i <- 0, l_i <- 0, m_i <- -inf
    for j = 1 .. T_c do                 # sequential over K/V tiles
        Load K_j, V_j, P_j to SRAM
        Kt <- K_j * exp2(R - P_j)       # rescaled keys (bounded)
        S <- Qt @ Kt^T * scale
        m' <- max(m_i, rowmax(S))
        P_e <- exp(S - m')
        a <- exp(m_i - m')
        l_i <- a * l_i + rowsum(P_e)
        O_i <- a * O_i + P_e @ V_j
        m_i <- m'
    O_i <- O_i / l_i
    Write O_i, m_i + log(l_i) to HBM

This is numerically stable, and the forward overhead versus FlashAttention is under 5%. On top of FA, Wall only requires one elementwise multiply per tile, which is fully hidden by the matmul.
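For readers who want to experiment outside Triton, the per-tile anchor forward pass can be mirrored in NumPy. This is a sketch under simplifying assumptions (single head, causal mask, no softmax scale, tiny tiles, float64), with hypothetical names; it illustrates the algorithm and is not the released kernel.

```python
import numpy as np


def wall_forward_tiled(q, k, v, P2, block_q=4, block_k=4):
    """Causal Wall Attention with a per-tile anchor and online softmax.

    q, k: (T, d); v: (T, d_v); P2: (T, d) prefix sums of log2 gates.
    """
    T = q.shape[0]
    out = np.zeros((T, v.shape[1]))
    for i0 in range(0, T, block_q):
        i1 = min(i0 + block_q, T)
        R = P2[i0]                                   # per-tile anchor
        q_t = q[i0:i1] * np.exp2(P2[i0:i1] - R)      # rescaled queries
        m = np.full(i1 - i0, -np.inf)                # running row max
        l = np.zeros(i1 - i0)                        # running normalizer
        acc = np.zeros((i1 - i0, v.shape[1]))
        for j0 in range(0, i1, block_k):             # only tiles that can be causal
            j1 = min(j0 + block_k, T)
            k_t = k[j0:j1] * np.exp2(R - P2[j0:j1])  # rescaled keys
            s = q_t @ k_t.T
            causal = np.arange(i0, i1)[:, None] >= np.arange(j0, j1)[None, :]
            s = np.where(causal, s, -np.inf)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            a = np.exp(m - m_new)                    # rescale previous partial sums
            l = a * l + p.sum(axis=1)
            acc = a[:, None] * acc + p @ v[j0:j1]
            m = m_new
        out[i0:i1] = acc / l[:, None]
    return out
```

Because only differences of `P2` against the anchor are ever exponentiated, adding a large constant offset to `P2` leaves the output unchanged, whereas the offline rescale would overflow.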

Gate Clamping. For numerical stability, we soft-clamp the log-gate $\hat{g} = \ln\sigma(l)$ via $f(\hat{g}) = -g_{\max}(1 - e^{\hat{g}/g_{\max}})$, bounding each step's decay to $(-g_{\max}, 0]$ with a minimum per-step retention of $e^{-g_{\max}} \approx 0.42$. Empirically, looser bounds improve performance. See Appendix G for the full justification.
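The soft clamp is a one-line function, and evaluating it at a few logits shows its range. In this sketch the value of `G_MAX` is an assumption: it is back-solved from the quoted minimum retention of about 0.42, since the exact constant is not stated here. The function names are likewise illustrative.

```python
import math

# Assumption: g_max is chosen so that exp(-g_max) = 0.42 (the quoted minimum retention).
G_MAX = -math.log(0.42)


def soft_clamp(log_gate, g_max=G_MAX):
    """f(g) = -g_max * (1 - exp(g / g_max)); maps (-inf, 0] into (-g_max, 0]."""
    return -g_max * (1.0 - math.exp(log_gate / g_max))


def retention(logit, g_max=G_MAX):
    """Per-step retention exp(f(ln sigmoid(logit))) for a gate pre-activation."""
    log_sigmoid = -math.log1p(math.exp(-logit))
    return math.exp(soft_clamp(log_sigmoid, g_max))


for logit in (-50.0, 0.0, 8.0):
    print(logit, retention(logit))
```

Very negative logits saturate at the minimum retention instead of driving the gate to zero, a logit of 0 lands near 0.62, and large positive logits leave the gate almost fully open.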

Approach 3: Fused Gate Gradient

The forward kernel from Approach 2 is correct and fast, but the corresponding backward is slow. Profiling localizes the bottleneck to backward register pressure. The naive backward maintains three $[B_S, d]$ accumulators ($d\tilde{K}_j$, $dV_j$, $\bar{P}_j$) alongside the usual softmax statistics. This exceeds the register budget for our preferred block size, and the Triton compiler silently shrinks the tile from 64 to 32, collapsing SM occupancy.

Backward - naive gate gradient (register-bound)
parallel for j = 1 .. T_c do
    dK_j <- 0, dV_j <- 0
    dP_j <- 0                                   # extra accumulator (3rd register)
    for i = j .. T_r do
        < recompute Pe, compute dS, accumulate dV_j >
        dKt <- dS^T @ Qt
        dK_j <- dK_j + dKt * exp2(R - P_j)
        dP_j <- dP_j - ln2 * Kt * dKt           # per-iteration update
        dQt <- dS @ Kt
        dQ_i <- dQ_i + dQt * exp2(P_i - R)
        dP_i <- dP_i + ln2 * Qt * dQt           # query gate grad
        Scatter dQ_i, dP_i
    Write dK_j, dV_j, dP_j

The critical insight is that $dP_j = -\ln{2} \cdot K_j \odot dK_j$, and both $K_j$ and $dK_j$ are already resident when the loop terminates. The gate gradient can therefore be computed once, post-loop, from tensors already present.
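The identity can be sanity-checked with finite differences. The sketch below treats the key-side prefix `Pk` as a separate variable from the query-side prefix `Pq` (so only the key gate gradient is tested) and uses a random matrix `G` as a stand-in for the upstream gradient $dS$; all names, shapes, and the toy loss are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 3
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
Pq = -np.cumsum(rng.uniform(0.0, 0.1, size=(T, d)), axis=0)   # log2 gate prefix
Pk = Pq.copy()               # key-side copy, differentiated independently
G = rng.normal(size=(T, T))  # stand-in for dL/dS


def loss(K, Pk):
    """Toy scalar loss sum(G * S) with S = (Q * 2^Pq) @ (K * 2^-Pk)^T."""
    S = (Q * np.exp2(Pq)) @ (K * np.exp2(-Pk)).T
    return np.sum(G * S)


# Analytic gradients: dK first, then the key gate gradient derived from it.
dK_tilde = G.T @ (Q * np.exp2(Pq))    # gradient w.r.t. the rescaled keys
dK = dK_tilde * np.exp2(-Pk)          # gradient w.r.t. the raw keys
dPk_fused = -np.log(2.0) * K * dK     # post-loop formula from the text

# Central finite differences on Pk as an independent check.
eps = 1e-6
dPk_fd = np.zeros_like(Pk)
for j in range(T):
    for n in range(d):
        up, down = Pk.copy(), Pk.copy()
        up[j, n] += eps
        down[j, n] -= eps
        dPk_fd[j, n] = (loss(K, up) - loss(K, down)) / (2 * eps)

max_err = np.abs(dPk_fused - dPk_fd).max()
```

The anchor $R$ does not appear because it cancels between the query and key factors; the check therefore applies to the anchored kernel as well.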

Backward - fused gate gradient
parallel for j = 1 .. T_c do
    dK_j <- 0, dV_j <- 0                        # no gate accumulator
    for i = j .. T_r do
        < recompute Pe, compute dS, accumulate dV_j >
        dK_j <- dK_j + (dS^T @ Qt) * exp2(R - P_j)
        dQt <- dS @ Kt
        dQ_i <- dQ_i + dQt * exp2(P_i - R)
        dP_i <- dP_i + ln2 * Qt * dQt           # query gate grad
        Scatter dQ_i, dP_i
    dP_j <- -ln2 * K_j * dK_j                   # key gate grad (fused, post-loop)
    Write dK_j, dV_j, dP_j

Eliminating one live accumulator frees enough registers for the autotuner to select $B_T = 128$, $B_S = 64$, and the matmuls saturate tensor cores again.

The Final Kernel

The fused gradient yields a 12% wall-clock reduction at $T = 8192$. We further implement several standard kernel-level optimizations, detailed in Appendix D, including splitting the inner loop into separate off-diagonal and diagonal passes, casting the diagonal matmul inputs to bf16 for WGMMA tensor cores, and autotuning block shapes. Together, these bring the cumulative backward speedup to 41%, narrowing the gap to FoX from 2.2x to 1.4x.

The remaining gap is structural: per-channel decay requires more HBM traffic (loading $P$) and $K\times$ more gate-gradient output than FoX's scalar approach. Finer control over the memory hierarchy (e.g., a native CUDA implementation) can close this gap further.

Our Triton implementation is roughly 2x the cost of FlashAttention-2 in isolation. In practice, this overhead is amortized because attention FLOPs are a small fraction of end-to-end training cost (dominated by MLP and embedding layers), so the wall-clock impact on full training is small.

Figure 1. Training kernel profiling. Our fused Triton kernel narrows the gap to FlashAttention-2 through successive optimizations, reaching roughly 2x overhead at large sequence lengths.

Decode

A naive approach would require storing $P_t$ for all $t$, increasing the KV cache by up to 50%. We describe three approaches that avoid extra storage while maintaining arithmetic intensity.

[Animation: decode pipeline, stepping through Prefill, Absorb, Decode query, FlashDecode, and Output over a KV cache split into chunks 0 to 3 ($K_0, \ldots, K_3$). First frame: prefill stores K, V and chunk anchors.]

Approach 1: Absorb + Store References

Rather than storing all $P_t$ in the cache, we can absorb the gate into the cached keys themselves and retain only a per-chunk anchor. We partition the sequence into chunks of size $C$, freeze an anchor $R_c = P[c\cdot C]$ at each chunk boundary, and cache $\tilde K_j = K_j \odot \exp_2\!\big(R_{c(j)} - P_j\big)$ in place of $K_j$.

At the decode step $t$, we have:

$$\big(Q_t \odot \exp_2(P_t - P_j)\big) \cdot K_j^\top \;=\; \big(Q_t \odot \exp_2(P_t - R_{c(j)})\big) \cdot \tilde K_j^\top$$

The query-side rescaling is computed once per chunk, and the inner loop is byte-for-byte a standard FlashDecoding [21] inner loop, with no per-key gate arithmetic and no $P_j$ read from HBM.
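The absorb-and-anchor identity can be verified on a toy cache. The sketch below assumes a single head, a chunk size of 4, and randomly drawn gates; variable names such as `K_cached` and `anchors` are illustrative and do not come from the released decode kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, C = 12, 4, 4                       # cached tokens, head dim, chunk size
K = rng.normal(size=(T, d))
P2 = np.cumsum(np.log2(rng.uniform(0.7, 1.0, size=(T, d))), axis=0)

# Prefill: freeze one anchor per chunk and absorb the gate into the cached keys.
anchors = P2[::C]                        # R_c = P[c * C]
chunk_of = np.arange(T) // C             # c(j) for every cached position j
K_cached = K * np.exp2(anchors[chunk_of] - P2)

# Decode at the last position: direct scores need every P_j ...
t = T - 1
q_t = rng.normal(size=d)
scores_direct = ((q_t * np.exp2(P2[t] - P2)) * K).sum(axis=1)

# ... while the cached form needs only one query rescale per chunk.
scores_cached = np.empty(T)
for c in range(len(anchors)):
    chunk = slice(c * C, min((c + 1) * C, T))
    q_c = q_t * np.exp2(P2[t] - anchors[c])
    scores_cached[chunk] = K_cached[chunk] @ q_c

max_abs_diff = np.abs(scores_direct - scores_cached).max()
```

Only the anchors and the current prefix $P_t$ are needed at decode time, so the per-token prefixes never have to be stored alongside the keys.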

Wall Decode is comparable to FlashDecode (FA3) across all context lengths, and significantly faster than a naive Wall decode that recomputes gates from stored prefixes.

Figure 2. Decode kernel profiling. Wall Decode matches FlashDecode (FA3) in both latency and throughput, while naive Wall decode incurs substantial overhead from redundant gate recomputation.

Approach 2: Key-Projected Gates

Instead of projecting gate vectors from the hidden state, we can project head-wise from the key vectors as $g_t = \sigma(W_g k_t + b_g)$. Since $k_t$ is already in the KV cache, the gate can be materialized on the fly during attention with zero additional storage. Inside the kernel, each KV block computes the gate from the loaded keys in registers.

b_log_f = tl.sum(b_k * b_W_g[:, None], axis=0) + b_bias_g
b_log_f = log_sigmoid(b_log_f)
b_c = c_running + tl.cumsum(b_log_f, axis=0)
c_running = b_c[-1]
b_k_til = b_k * exp2((c_ref - b_c)[:, None] * RCP_LN2)

The overhead is one small matmul and one cumsum per block, both compute-bound on data already in SRAM. The only extra state is a running scalar $c_t$ per head.

Approach 3: WallMLA

In MLA [22], keys and values are materialized from a compressed latent $c_t$ as $k_t = W_{K, \mathrm{up}}c_t$, $v_t = W_{V, \mathrm{up}}c_t$. WallMLA derives the gate from the same latent: $g_t = \sigma(W_g c_t + b_g)$, requiring one extra small matmul in MLA's existing pipeline. The decode kernel materializes $k$, $v$ and the gate correction entirely from the cached latent in a single fused loop.

4: Empirical Results

Pretraining

We pretrained 400M and 1B transformers with Wall Attention on Nemotron CC v2 [23][28]. Training details are in the Appendix and broadly mirror the Aurora release [24].

At 400M, Wall (NoPE) outperforms the RoPE baseline under both Muon and Aurora optimizers. The consistency across optimizers confirms that the gains from diagonal gating are orthogonal to optimizer choice.

Figure 3. 400M pretraining convergence (10B tokens). Wall (NoPE) outperforms the baseline under both Muon and Aurora optimizers.

At 1B, Wall achieves a 0.01 nats gain over the RoPE baseline and defeats both FoX and Wall (RoPE), validating it as a standalone PE strategy.

Figure 4. 1B pretraining convergence (70B tokens). Wall (NoPE) and Wall+RoPE both outperform the baseline and FoX.

We evaluate the 1B models on standard downstream benchmarks for small model pretraining.

1B Downstream Evaluation (70B tokens), accuracy (%)

| Benchmark | RoPE baseline | FoX | Wall + RoPE | Wall |
| --- | --- | --- | --- | --- |
| HellaSwag (10-shot) | 68.0 | 66.9 | 67.1 | 68.9 |
| ARC-C (25-shot) | 46.5 | 45.0 | 46.6 | 47.1 |
| MMLU (5-shot) | 31.1 | 28.8 | 33.1 | 39.8 |
| Winogrande (5-shot) | 62.4 | 61.2 | 63.1 | 63.0 |
| Avg | 52.0 | 50.4 | 52.5 | 54.7 |

[Plot: HellaSwag (%) against pretraining tokens (log scale, 100B to 100T), annotated "~30–500x fewer tokens". Series: Muon (old gen), Aurora (old gen), RoPE baseline, Wall, Gemma3-1B, Qwen2-1.5B, Llama3.2-1B, Qwen2.5-1.5B, Qwen3-1.7B. Dashed = older training generation (different arch & optimizer tuning).]

Figure 5. Top: downstream evaluation across PE strategies, all trained on 70B tokens. Wall (NoPE) achieves the strongest performance. Bottom: token efficiency vs. publicly available models trained on 30-500x more data and previous generation Aurora models.

Wall (NoPE) achieves the strongest performance across both pretraining and (short-context) downstream evals. Wall 1B is also a new SOTA on the pretraining tokens vs performance plot from Aurora, beating our old generation Aurora runs despite using only Muon.

The magnitude of pretraining gains is surprising. Many positional embedding variants do not show strong gains over RoPE in pretraining, since their focus is length extrapolation. Wall achieves significant improvements in pretraining convergence over RoPE.

Ablations

Gate Biases

Wall's per-channel retention is $r_t = \exp\!\bigl(f\bigl(\ln\sigma(W_g x_t + b)\bigr)\bigr)$, where $f$ is the soft clamp that bounds each step's log-gate to $(-g_{\max}, 0]$. The bias $b$ sets the gate's operating point at initialization. Large positive $b$ starts retention near 1 (fully open, vanilla-attention-like); $b = 0$ starts at $\sigma(0) = 0.5$ before clamping, corresponding to retention $\approx 0.62$ after the soft clamp.

FoX prefers a gate bias of 0, matching the authors' findings. Wall prefers conservative biases of 6-8: gates start fully open, so behavior matches vanilla attention at init, and gradually learn to close.
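To make the effect of the bias concrete, the sketch below evaluates the retention formula at initialization for a few bias values. It assumes $W_g x_t = 0$ so that only the bias matters, and it assumes $g_{\max} = 0.87$, the bound quoted in Appendix G; the function names are illustrative.

```python
import math

G_MAX = 0.87  # assumed clamp bound, taken from the fp32 budget in Appendix G

def log_sigmoid(z):
    """Numerically stable ln(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def soft_clamp(g_hat, g_max=G_MAX):
    """f(g_hat) = -g_max * (1 - exp(g_hat / g_max)), mapping (-inf, 0] to (-g_max, 0]."""
    return -g_max * (1.0 - math.exp(g_hat / g_max))

def init_retention(bias, g_max=G_MAX):
    """Per-step retention when the gate pre-activation equals the bias alone."""
    return math.exp(soft_clamp(log_sigmoid(bias), g_max))

for b in (0.0, 2.0, 6.0, 8.0):
    print(f"b={b:>4}: retention={init_retention(b):.4f}")
```

Under these assumptions, $b = 0$ gives retention of about 0.62, while biases of 6 to 8 leave the gate within a fraction of a percent of fully open.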

Figure 6. Gate bias ablation for FoX (left) and Wall (right). FoX prefers low biases; Wall prefers high biases that start with fully open gates.

Subdim & GQA

We ablated both efficiency strategies from Section 3: subdim gating (applying Wall to only $K_g < d_k$ head dims) and KV-head gate tying (one gate per KV head instead of per Q head under GQA).

Reducing the subdim degrades performance monotonically, but 2-4x subdim still substantially outperforms vanilla attention. KV-head gating matches the more expensive Q-head gating, confirming that gate tying is a free efficiency win.

Figure 7. Validation loss for checkpoint at step 5K. Left: Per KV-head gating matches Q-head gating while being inference-aligned. Right: Performance degrades monotonically with fewer gated dimensions.
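The two strategies ablated above can be sketched in a few lines. This is only an illustration: which dimensions are gated (here, the first $K_g$) and how query heads map to KV heads (contiguous groups) are assumptions of the example, and both helper names are hypothetical.

```python
def expand_subdim(log_gate_sub, d_k):
    """Pad a K_g-dim log-gate to d_k dims; ungated dims get log-gate 0 (retention 1)."""
    return list(log_gate_sub) + [0.0] * (d_k - len(log_gate_sub))

def tie_to_query_heads(kv_head_gates, n_q_heads):
    """Share each KV head's gate across the query heads in its GQA group."""
    group = n_q_heads // len(kv_head_gates)
    return [kv_head_gates[h // group] for h in range(n_q_heads)]

# 2x subdim on a 4-dim head: only the first two channels can forget.
print(expand_subdim([-0.1, -0.2], d_k=4))

# 4 query heads sharing 2 KV heads: one gate per KV head.
print(tie_to_query_heads(["kv0", "kv1"], n_q_heads=4))
```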

Long-Context Extrapolation

We tested long-context extrapolation on CodeParrot, NarrativeQA, and PG-19 for the 1B models. Notably, these models were trained with a 4k maximum sequence length on a dataset where most sequences are significantly shorter than 4k tokens.

Figure 8. Long-context extrapolation across CodeParrot, NarrativeQA, and PG-19. Wall (NoPE) achieves the strongest extrapolation results across all benchmarks and extrapolates to 128k+ context length.

Wall (NoPE) generalizes to 160k+ sequence lengths, despite only having seen a maximum sequence length of 4k during training. At no point does Wall diverge in loss as context increases.

We further analyzed Wall on the Needle-in-a-Haystack (NIAH) retrieval task. We found Wall was able to extrapolate to 4x its training context length without difficulty, where all other methods failed.

Figure 9. Needle-in-a-Haystack retrieval accuracy (magic-number substring match). Wall (NoPE) maintains near-perfect retrieval well beyond the 4K training context, while all RoPE or FoX-based models collapse past 8K.

Finally, on LongBench v1, Wall outperformed RoPE and FoX across categories including coding, summarization, and few-shot learning.

Figure 10. LongBench v1 evaluation (0-shot, 32K context). Wall (NoPE) achieves the highest overall average across 14 tasks spanning QA, summarization, few-shot, and code.

Every test in this section measures zero-shot extreme length generalization, an exceedingly difficult setting for PE methods. With simple long-context midtraining, Wall is well-poised to generalize to enormous context lengths.

Mechanistic Analysis

We analyzed the learned gate distributions on LongCrawl64 (65K-token documents, 100 documents, 256-token bins).

Figure 11. NLL vs. token position across 65K-token documents. Wall (NoPE) achieves the lowest NLL in extrapolation, with the gap widening as context grows.

Wall (NoPE) achieves lower NLL than RoPE at every position, with the gap widening as context grows. Wall (RoPE) falls between the two, suggesting the diagonal gate alone captures most of the positional structure that RoPE provides. We collected and analyzed per-channel gate values (retention scores) for LongCrawl64 document samples.

Figure 12. Per-layer gate retention across token positions, averaged across heads and channels.

Wall learns a multi-timescale memory hierarchy across layers, with some layers acting as tight local windows and others maintaining signal over much longer ranges. The NoPE model shows wider variation than RoPE, consistent with the gates taking over more of the positional encoding role.

These layer-level aggregates, however, mask a more interesting finding at the channel level. Per-channel retention reveals two distinct populations - "always-on" channels with retention identically 1.0 (zero variance), and highly dynamic channels whose retention swings between the clamp floor and near-1 on a per-token basis. The dynamic channels respond to content, closing hard at semantic boundaries and opening elsewhere. The static channels provide unconditional long-range memory.

Figure 13. Per-channel gate dynamics at layer 12. Top: each dot is one channel, plotted by its mean and standard deviation of retention. Middle: per-token retention traces for selected channels from each population. Bottom: cumulative product of retention for dynamic channels, showing how small per-step fluctuations compound into rapid signal decay.

This bimodal structure is learned end-to-end. All channels start with effectively identical initialization due to strong gate biases, and the model discovers that some dimensions should serve as permanent memory while others should gate based on content. The cumulative product (Figure 13, bottom) demonstrates how dynamic channels create sharp memory cutoffs whereas constant retention (dashed lines) gives only smooth exponential decay.
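The contrast between sharp cutoffs and smooth decay can be reproduced with a toy cumulative product. The retention trace below is synthetic, not measured: a hypothetical dynamic channel stays fully open except for four tokens where it closes to the clamp floor (assumed to be $e^{-0.87}$), and it is compared with a constant-retention channel that reaches the same total decay.

```python
import math

FLOOR = math.exp(-0.87)  # assumed per-step retention floor from the soft clamp

def cumulative_retention(per_step):
    """Running product of per-step retention values."""
    out, acc = [], 1.0
    for r in per_step:
        acc *= r
        out.append(acc)
    return out

T = 64
# Synthetic dynamic channel: open everywhere except a hard close at tokens 20..23.
dynamic = [FLOOR if 20 <= t <= 23 else 1.0 for t in range(T)]
dyn_cum = cumulative_retention(dynamic)

# Constant channel with the same total decay spread evenly over all T steps.
constant = [dyn_cum[-1] ** (1.0 / T)] * T
const_cum = cumulative_retention(constant)

for t in (19, 23, 63):
    print(f"t={t:>2}  dynamic={dyn_cum[t]:.4f}  constant={const_cum[t]:.4f}")
```

The dynamic channel keeps everything up to token 19 and then drops almost all signal within four steps, while the constant channel has already leaked a large share of its memory by the same point.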

5: Inflection

Four things excite us about this work.

1. Sufficiency. Wall outperforms every positional variant we tested and does not benefit from stacking RoPE or FoX on top. This suggests Wall is sufficient for strong positional and temporal understanding.

2. Empirical success. Wall's pretraining gains are coupled with extreme length extrapolation. It improves upon the convergence benefits normally afforded by RoPE and allows models trained at 4k sequence length to extrapolate to 200k+ tokens.

3. Cross-pollination. Wall started as an attempt to bring diagonal gating from linear RNNs into softmax attention. The investigation forced us through a series of questions that, to our knowledge, have not been formalized before: how does a finite-dimensional gate act on an infinite-dimensional feature space? What does the induced action look like?

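[Diagram: top row $x \in \mathbb{R}^d \xrightarrow{\;\phi\;} \phi(x) \in \mathcal{H}$; bottom row $A\,x \xrightarrow{\;\phi\;} \phi(A\,x)$; $A$ acts on the left column and $\tilde{A}$ on the right column, so that $\tilde{A}\,\phi(x) = \phi(A\,x)$.]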

The induced action lifts any linear RNN gate structure to softmax attention by gating in input space before embedding.

The induced action framework connects modern linear RNNs and softmax attention in a way that, to our knowledge, has not been formalized before. PaTH motivates the $q^\top Mk$ score from the linear RNN analogy; our framework derives it as a consequence of the kernel's feature map.

4. Production-ready. Many proposed attention alternatives claim superior expressivity but are difficult to scale to production training and inference. They require extensive kernel design and force users to accept overhead.

Wall retains the embarrassingly parallel structure of vanilla attention, is directly compatible with GQA and MLA, and can be made faster and lighter-weight through the parametrization and kernel strategies we developed.

[Table: Linear RNNs, PaTH, FoX, and Wall compared on parallel training, FA compatibility, GQA / MLA support, no chunking, standard KV cache, and per-channel gating.]

Future Work

We list here future directions that may be valuable to the community.

Continued pretraining. Models pretrained with RoPE can be upcycled into Wall models for little cost. We would be excited to see larger-scale open-source models with Wall.

Kernel optimizations. Our open-source Triton implementation is correct and reasonably fast, but leaves performance on the table. Internal profiling suggests a prototype CuTE implementation of Wall/WallMLA closes much of the remaining gap to FlashAttention. An open-source CuTE kernel for Wall training and decoding would be a high-impact contribution.

Broader PE comparisons. We did not compare to PaTH because PaTH focuses on state tracking, whereas our focus is forgetting and long-context. We would like to see Wall evaluated against a wider set of positional embedding methods across diverse tasks and scales.

Cite this work

@article{pai2026wall,
  title   = {Wall Attention: Diagonal Gates for Softmax Attention},
  author  = {Pai, Dhruv and Averbuch, Timor and Zhang, Ashley and Keigwin, Ben and Dewulf, Alec},
  year    = {2026},
  url     = {https://blog.tilderesearch.com/blog/wall-attn}
}

Appendix

A: Open Source Release

We release reference Triton kernels for Wall Attention training and decoding at github.com/tilde-research/wall-attention-release.

B: Training Details

We report exact configurations and training settings for our 400M and 1B runs. We used an internal tokenizer with 128k vocab size for all experiments. For reproducibility, we trained on fully open-source internet data from NVIDIA Nemotron CC v2 [23][28]. The training settings broadly match those used in Aurora [24]. We use per-head Muon (MuonSplit) [27] as the optimizer for all runs.

Transformer 400M:

| Category | Details |
| --- | --- |
| Training Configuration | 800k tokens/batch, 8192 seq len, WSD schedule |
| Data | 10.5B tokens, NemotronCCv2 HQ split |
| Optimizer | Per-head Muon (MuonSplit) |
| Architecture | d=1024, L=24, MHA, QKNorm, ShortConv, Gated Attention |

Transformer 1B:

| Category | Details |
| --- | --- |
| Training Configuration | 4M tokens/batch, 2048 seq len, WSD + cosine decay |
| Data | 70B tokens, NemotronCCv2 HQ split |
| Optimizer | Per-head Muon (MuonSplit) |
| Architecture | d=2048, L=24, MHA, QKNorm, ShortConv, Gated Attention |

C: Induced Scalar Formulation

For a scalar gate $D_t = g_t I_d$, the general formula gives

$$w_{ij} = F_{ij}\cdot q_i^\top k_j \qquad o_i = \mathrm{softmax}(F_{ij} \cdot q_i^\top k_j)$$

This is very different from FoX. It's a multiplicative positional embedding, similar to the scalar case of Wall Attention. It is somewhat degenerate, as it amounts to merely a path-dependent global temperature.

PaTH [8] (footnote 3) observed that this formulation greatly underperforms FoX. This could be due to the bias initialization differences from our ablation study, whereby multiplicative gates require open-gate initialization.

D: Kernel

Off-diagonal / diagonal split. Each causal Wall output tile attends to many fully-causal key tiles plus one diagonal tile straddling the causal boundary. For Wall, this split coincides with a numerical split. Since $P_t=\sum_{n\le t} g_n$ (cumulative log-gate) is monotonically non-increasing, an off-diagonal tile anchored at $R$ has $P_i-R\le 0$ and $R-P_j\le 0$, so both rescale factors lie in $(0,1]$. The diagonal tile straddles $R$: $\exp_2(R-P_j)$ can grow to the within-tile budget, and masked entries can produce NaN. A single fused loop would force conservative handling (mask + clamp) onto all $O(T)$ tiles. Splitting lets the off-diagonal pass run branch-free and mask-free, confining the clamp to one $B_T\times B_S$ block per query.
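The numerical argument can be checked on a toy sequence. The sketch below uses one channel, hypothetical log2-gates, a tile size of 4, and assumes the anchor $R$ is the prefix sum at the last key before the query tile; the real kernel's anchoring choice may differ.

```python
# Hypothetical per-step log2-gates (all <= 0) for a single channel.
log2_gates = [-0.02, -0.5, -0.01, -0.9, -0.03, -0.2, -0.01, -0.4]

# Prefix sums P_t = sum_{n <= t} g_n, monotonically non-increasing.
P, acc = [], 0.0
for g in log2_gates:
    acc += g
    P.append(acc)

# Assumed anchor: P at the last key before the query tile starts.
R = P[3]
query_tile = range(4, 8)
off_diag_keys = range(0, 4)
diag_keys = range(4, 8)

def rescale_factors(i, j):
    """Split 2^(P_i - P_j) into a query-side and a key-side factor around R."""
    return 2.0 ** (P[i] - R), 2.0 ** (R - P[j])

off_diag = [rescale_factors(i, j) for i in query_tile for j in off_diag_keys]
diag_key_side = [2.0 ** (R - P[j]) for j in diag_keys]

print(max(max(a, b) for a, b in off_diag))  # never exceeds 1
print(max(diag_key_side))                   # exceeds 1 inside the diagonal tile
```

The product of the two factors always recovers $2^{P_i - P_j}$. Only the diagonal tile sees key-side factors above 1, which is why the clamp can be confined to it.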

bf16 diagonal matmuls. The off-diagonal matmuls ran in bf16 while the diagonal pass defaulted to fp32, an artifact of the reference FlashAttention implementation. This caused a throughput cliff: H100 WGMMA runs bf16 at ~2x TF32 and the backward is matmul-bound. We cast only the matmul inputs ($\tilde Q,\tilde K,dS$) to bf16 and keep fp32 accumulation. The diagonal cast leaves the error indistinguishable from the fp32 path.

Together with the optimizations discussed in Section 3, and block-shape autotuning, these reduce backward wall-clock by ~20% over the stable per-tile kernel.

The naive implementation is often faster than our Triton kernel, but it is numerically unstable and causes early model divergence during training.

E: KV Cache Optimization Improvements

Wall Attention offers a few unique opportunities for KV cache optimization.

Eviction. Per-channel gate values provide a principled heuristic for KV cache eviction. When all channels in an old key have decayed below a threshold, the key has been effectively forgotten and can be evicted at no cost.
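A minimal sketch of this eviction rule follows, assuming per-channel cumulative log-gates in the natural-log domain are available for both the current step and the step at which the key was written. The threshold value and the function name are hypothetical.

```python
import math

def evictable(P_now, P_key, threshold=1e-3):
    """True when every channel's cumulative retention since the key was written
    has fallen below the threshold.

    P_now, P_key: per-channel cumulative natural-log gates at the current step
    and at the key's step, so exp(P_now - P_key) is the surviving fraction.
    """
    return all(math.exp(pn - pk) < threshold for pn, pk in zip(P_now, P_key))

P_key = [0.0, 0.0]
print(evictable([-10.0, -8.0], P_key))  # both channels decayed: evict
print(evictable([-10.0, -0.1], P_key))  # one channel still open: keep
```

Under this all-channels rule, a single channel with retention near 1 keeps a key alive indefinitely, so a practical policy might restrict the test to a subset of channels.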

Chunk Safety & MXFP8 Alignment. Within a chunk of size $C$, the maximum rescaling factor applied to any cached key is $\exp_2(|P_{j} - R_c|) \leq \exp_2(C \cdot |\log_2 g|_{\max})$. For typical learned gate magnitudes ($|\log_2 g| \approx 0.01\text{–}0.02$), we have

Chunk size vs. maximum rescaling factor for typical gate magnitudes.

| Chunk Size $C$ | Max Exponent $0.02C$ | Max Rescaling Factor |
| --- | --- | --- |
| 32 | 0.64 | 1.56 |
| 64 | 1.28 | 2.43 |
| 128 | 2.56 | 5.90 |

Even at chunk size 128 with aggressive gates, the rescaling is under 6x, and well within the representable range of fp8 E4M3 with negligible quantization error relative to the key magnitudes themselves.
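The bound is simple enough to recompute directly. The sketch below evaluates $2^{C \cdot |\log_2 g|_{\max}}$ using the upper end of the typical range (0.02) as the per-step magnitude; the function name is illustrative.

```python
def max_rescaling(chunk_size, max_abs_log2_gate=0.02):
    """Worst-case exponent and rescaling factor for one chunk."""
    exponent = chunk_size * max_abs_log2_gate
    return exponent, 2.0 ** exponent

for C in (32, 64, 128):
    exponent, factor = max_rescaling(C)
    print(f"C={C:>3}  exponent={exponent:.2f}  factor={factor:.2f}")
```

Because the exponent grows linearly with chunk size, doubling $C$ squares the worst-case factor.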

The rescaled values within a chunk are monotonically ordered by distance from the anchor: keys near the chunk boundary have rescaling close to 1, keys far from it have larger rescaling. This smooth, predictable dynamic range aligns naturally with NVIDIA's MXFP8 format [25][26], which assigns a shared block-wise scaling factor to a small group of values. A single MXFP8 scale factor per block captures the local dynamic range with minimal quantization loss.

Reduced dynamic range for key-projected gates. Key-projected gates offer a distinct advantage for low-precision KV caches: since the cached keys are stored unmodified, they have the same dynamic range as vanilla attention and existing KV quantization techniques transfer directly. This contrasts with block-wise anchor decoding, where absorbed keys carry the gate rescaling in their magnitudes.

F: Symmetric Algebra Derivation

The feature space $\mathcal{H}$ spanned by the monomials $x^\alpha / \sqrt{\alpha!}$ is (a completion of) the symmetric algebra $\mathrm{Sym}(\mathbb{R}^d) = \bigoplus_{k \ge 0} \mathrm{Sym}^k(\mathbb{R}^d)$, and the Taylor feature map $\phi$ is the canonical exponential map $x \mapsto \sum_k x^{\odot k} / \sqrt{k!}$ sending an input vector to its coherent state.

Any linear map $A$ on $\mathbb{R}^d$ extends functorially to this algebra, acting on each graded piece by

$$\mathrm{Sym}^k(A)\,(v_1 \odot \cdots \odot v_k) = Av_1 \odot \cdots \odot Av_k$$

and $\mathrm{Sym}(A) := \bigoplus_k \mathrm{Sym}^k(A)$ is the symmetric power lift of $A$. The induced action is exactly this lift: $\bar{A} = \mathrm{Sym}(A)$, since $\phi(Ax) = \mathrm{Sym}(A)\,\phi(x)$ holds by construction. This makes the induced action the canonical object rather than an ad hoc definition, as it is the unique functorial extension of $A$ to the feature space.

On the monomial basis, $\mathrm{Sym}(D)$ is diagonal with eigenvalue $g^\alpha = \prod_n g_n^{\alpha_n}$, since $(Dx)^\alpha = \prod_n (g_n x_n)^{\alpha_n} = g^\alpha x^\alpha$. This resolves the obstruction from the main text: each monomial receives the product of its constituent gates.

Because $\mathrm{Sym}(\cdot)$ is a functor, the induced action respects composition.

$$\overline{AB} = \mathrm{Sym}(AB) = \mathrm{Sym}(A)\,\mathrm{Sym}(B) = \bar{A}\,\bar{B}$$

In particular, the cumulative induced transition equals the induced action of the cumulative product, $\prod_r \bar{A}_r = \overline{\prod_r A_r}$. Combined with the kernel identity, this collapses the feature-space inner product back to a finite-dimensional one.

$$w_{ij} = \left\langle \phi(q_i),\, \left(\prod_{r=j+1}^i \bar{A}_r\right)\phi(k_j) \right\rangle = \exp\!\left(q_i^\top \left(\prod_{r=j+1}^i A_r\right) k_j\right)$$

This holds for any sequence of linear operators $A_r$, recovering PaTH when $A_r = (I - \beta_r w_r w_r^\top)$ and Wall when $A_r = D_r = \mathrm{diag}(g_r)$.
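For the diagonal (Wall) case, the finite-dimensional side of this identity can be computed two ways: by multiplying the gates explicitly, or from prefix sums of log-gates. The sketch below checks that both agree on toy data. The gate values, vectors, and function names are made up for illustration, and this is a naive reference rather than the kernel's algorithm.

```python
import math

def wall_logit_direct(q, k, gates, i, j):
    """q_i^T (prod_{r=j+1..i} diag(g_r)) k_j using explicit gate products."""
    cum = [1.0] * len(q)
    for r in range(j + 1, i + 1):
        cum = [c * g for c, g in zip(cum, gates[r])]
    return sum(qc * c * kc for qc, c, kc in zip(q, cum, k))

def wall_logit_prefix(q, k, P, i, j):
    """Same logit from per-channel prefix sums P_t = sum_{n <= t} ln g_n."""
    return sum(qc * math.exp(pi - pj) * kc
               for qc, pi, pj, kc in zip(q, P[i], P[j], k))

# Toy per-step, per-channel gates in (0, 1]; the second channel is nearly always open.
gates = [[0.9, 1.0], [0.5, 1.0], [0.8, 0.99], [0.7, 1.0]]
P, acc = [], [0.0, 0.0]
for g in gates:
    acc = [a + math.log(x) for a, x in zip(acc, g)]
    P.append(acc)

q, k = [1.0, 2.0], [0.5, -1.0]
direct = wall_logit_direct(q, k, gates, i=3, j=0)
prefix = wall_logit_prefix(q, k, P, i=3, j=0)
print(direct, prefix, math.exp(direct))  # the unnormalized weight is exp(logit)
```

The prefix-sum form is what makes the computation parallel: each query and key is rescaled once by its own cumulative gate, and no per-pair product over intermediate steps is needed.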

G: Gate Clamping Derivation

Even with per-tile anchoring, aggressive gates can overflow within a tile. The kernel accumulates prefix sums $P_t = \sum_{n \le t} \hat{g}_n$ where $\hat{g}_n = \ln\sigma(l_n)$ is the per-step log-gate in natural-log domain. The maximum exponent of any exp2 call in the backward is $(B_T - B_S) \cdot |g_{\max}| / \ln 2$; staying within fp32 range requires $|g_{\max}| < 0.87$ for our block sizes ($B_T = 128, B_S = 64$).

We enforce this with a soft clamp applied after logsigmoid:

$$f(\hat{g}) = -g_{\max}\!\left(1 - e^{\hat{g}/g_{\max}}\right)$$

This maps $\hat{g} \in (-\infty, 0]$ to $(-g_{\max}, 0]$ with a steep barrier near the bound but no gradient discontinuity. In retention terms, $r_t = e^{f(\hat{g}_t)}$, so the clamp imposes a floor of $e^{-g_{\max}} \approx 0.42$: each channel forgets at most 58% per step.

[Figure: output vs. input (log-gate) for the soft clamp, hard clamp, and identity; the clamped curves saturate at $-g_{\max}$.]

The soft clamp is asymptotically identity near zero (where most gates operate after training with high bias) and smoothly saturates toward $-g_{\max}$ for aggressive decays. A hard clamp would achieve the same bound but introduces a gradient discontinuity at the transition, which destabilizes training when channels hover near the boundary. Empirically, looser bounds yield better performance, so any kernel improvement that raises the overflow budget directly improves Wall.

H: Optimizer Ablation (Muon vs Aurora)

At 400M scale, we compared Muon and Aurora optimizers for both the RoPE baseline and Wall (NoPE). Both optimizers achieve comparable final loss on the baseline, and Wall (NoPE) outperforms the baseline under both optimizers. Aurora provides an edge over Muon, consistent with the findings in the Aurora post [24]. Since the gains from Wall are orthogonal to optimizer choice, all results in the main text use Muon except Figure 3, which shows both optimizers at 400M scale.

References

  1. Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y. (2024).
  2. Kazemnejad, A., Padhi, I., Natesan, K., Das, P., Reddy, S. (2023).
  3. Peng, B., Quesnelle, J., Fan, H., Shippole, E. (2024).
  4. Ding, Y., Zhang, L. L., Zhang, C., et al. (2024).
  5. Lin, Z., Nikishin, E., He, X., Courville, A. (2025).
  6. Yang, S., Shen, Y., Wen, K., Tan, S., Mishra, M., Ren, L., Panda, R., Kim, Y. (2025).
  7. Yang, S., Kautz, J., Hatamizadeh, A. (2025).
  8. Yang, S., Wang, B., Shen, Y., Panda, R., Kim, Y. (2024).
  9. Peng, B., Zhang, R., Goldstein, D., et al. (2025).
  10. Kimi Team, Zhang, Y., Lin, Z., et al. (2025).
  11. Yang, S., Wang, B., Zhang, Y., Shen, Y., Kim, Y. (2024).
  12. Wacker, J., Kanagawa, M., Filippone, M. (2024).
  13. Clift, J., Doryn, D., Murfet, D., Wallbridge, J. (2020).
  14. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., Ré, C. (2022).
  15. Dao, T., Haziza, D., Massa, F., Sizov, G. (2023).
  16. Dewulf, A., Pai, D., Yang, L., Zhang, A., Keigwin, B. (2026).
  17. GLM-5 Team, Zeng, A., Lv, X., et al. (2026).
  18. Roy, A., Chou, T., Duvvuri, S. S., et al. (2025).