Thanks! The relative position of normalization is one of the few things that has changed since the original transformer architecture. I don't think it's entirely clear where it should be placed. Most transformers for text use post-norm (the one above), whereas vision transformers tend to
Normalization Placement in Transformer Architectures: Text vs Vision
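To make the two placements concrete, here is a minimal sketch in plain Python. It assumes a layer norm without learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or an MLP; none of these names come from a specific library. Post-norm normalizes after the residual add, as in the original transformer, while pre-norm normalizes only the input to the sublayer and leaves the residual path untouched.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """Post-norm: normalize after the residual add, y = LN(x + f(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Pre-norm: normalize only the sublayer input, y = x + f(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP: halves each element."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)  # output is re-normalized
pre = pre_norm_block(x, toy_sublayer)    # residual stream keeps its scale
```

In this sketch the post-norm output always has zero mean, while the pre-norm output keeps the mean of the input because the residual path is never normalized. That difference in how the residual stream is treated is what the placement choice comes down to.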