Common conception, shattered again. Normalization isn’t fundamental to Transformers, and this approach keeps getting better This paper introduces Dynamic erf (Derf), which further improves normalization-free Transformers With strong empirical results while generalizing better
Dynamic erf improves normalization-free Transformers architecture
By
–
