Derf: Replacing Transformer Normalization with a Simpler Function
What if you could replace a core part of a Transformer with something simpler and stronger? Researchers from Princeton, NYU, and CMU present Derf: they replaced the standard normalization layer with a simple element-wise function based on the Gaussian error function.
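To make the idea of an element-wise replacement concrete, here is a minimal sketch in plain Python. The text above does not spell out the exact formula, so the form used here, `gamma * erf(alpha * x + shift) + beta`, is an assumption for illustration, and the names `derf`, `alpha`, `shift`, `gamma`, and `beta` are hypothetical. In a real model these parameters would presumably be learned. A plain layer normalization is included only for contrast.

```python
import math


def derf(xs, alpha=0.5, shift=0.0, gamma=1.0, beta=0.0):
    """Hypothetical erf-based layer: each element is transformed on its own.

    Assumed form (not confirmed by the article): gamma * erf(alpha * x + shift) + beta.
    No statistics over the vector are computed.
    """
    return [gamma * math.erf(alpha * x + shift) + beta for x in xs]


def layer_norm(xs, eps=1e-5):
    """Standard layer normalization without affine parameters, for comparison.

    Every output depends on the mean and variance of the whole vector.
    """
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


activations = [-4.0, -0.5, 0.0, 0.5, 4.0]
derf_out = derf(activations)        # bounded, computed per element
norm_out = layer_norm(activations)  # zero mean, needs the whole vector

print(derf_out)
print(norm_out)
```

The contrast to notice is that `layer_norm` has to reduce over the whole vector before it can produce any output, while the erf-based function touches each value independently and squashes large activations into a bounded range.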