(Figure: Post-LN vs. Pre-LN, warmup steps = 4000)
\(\mathcal{L}_i()\) be the loss for a single position
\(\mathcal{L}()\) be the total loss
\(LN(x)\) be the layer normalization of \(x\) with \(\beta=0, \gamma=1\)
\(\mathbf{J}_{LN(x)}=\frac{\partial LN(x)}{\partial x}\) be the Jacobian of \(LN(x)\)
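To make these definitions concrete, here is a minimal numerical sketch (assuming PyTorch, which the text does not prescribe, and a hypothetical hidden size \(d=8\)) that evaluates \(LN(x)\) with \(\beta=0, \gamma=1\) and obtains \(\mathbf{J}_{LN(x)}\) by automatic differentiation:

```python
import torch

def layer_norm(x, eps=1e-5):
    # LN(x) with beta = 0, gamma = 1: subtract the mean and divide by the
    # standard deviation of the vector (biased variance, as in LayerNorm).
    mu = x.mean()
    var = x.var(unbiased=False)
    return (x - mu) / torch.sqrt(var + eps)

# Hidden vector at one position (hypothetical size d = 8).
x = torch.randn(8)

# J_{LN(x)} = dLN(x)/dx, computed by autodiff; shape (d, d).
J = torch.autograd.functional.jacobian(layer_norm, x)
print(J.shape)       # torch.Size([8, 8])
print(J.sum(dim=1))  # each row sums to ~0
```

The last line checks that \(\mathbf{J}_{LN(x)}\mathbf{1} \approx 0\): adding the same constant to every coordinate of \(x\) does not change \(LN(x)\), so the Jacobian annihilates the all-ones direction.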