Advanced Llama Architecture: Rotary Embeddings and ReLU² MLP

> llama-like architecture
> dense transformer
> rotary only (no positional embeddings)
> qk norm
> untied embedding/unembedding
> norm after token embedding
> relu² mlp
> no biases in linears
> no learnable rmsnorm params
> mqa
> logit softcap
> optimizer =
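Taken together, these choices are straightforward to wire up in code. Below is a minimal sketch, assuming PyTorch; every module name, dimension, and the softcap value are illustrative choices of mine, not from the quoted note, and the truncated `optimizer =` line is left out. It combines rotary-only position encoding, QK norm, parameter-free RMSNorm, a ReLU² MLP, bias-free linears, multi-query attention, untied embedding/unembedding with a norm right after the token embedding, and a tanh logit softcap.

```python
# Minimal sketch of the quoted architecture (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable scale ("no learnable rmsnorm params").
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


def rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Rotary embedding on q/k; no absolute position table anywhere ("rotary only").
    # x: (batch, heads, seq, head_dim)
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, device=x.device).float() / dim)
    theta = torch.arange(seq, device=x.device).float()[:, None] * inv_freq
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class MQAttention(nn.Module):
    # Multi-query attention: all query heads share one key/value head.
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)           # no biases in linears
        self.wkv = nn.Linear(d_model, 2 * self.head_dim, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.wkv(x).split(self.head_dim, dim=-1)
        k = k.view(b, t, 1, self.head_dim).transpose(1, 2)
        v = v.view(b, t, 1, self.head_dim).transpose(1, 2)
        q, k = rms_norm(q), rms_norm(k)                             # qk norm
        q, k = rotary(q), rotary(k)
        k = k.expand(-1, self.n_heads, -1, -1)                      # broadcast shared kv head
        v = v.expand(-1, self.n_heads, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = MQAttention(d_model, n_heads)
        self.up = nn.Linear(d_model, 4 * d_model, bias=False)
        self.down = nn.Linear(4 * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(rms_norm(x))
        x = x + self.down(F.relu(self.up(rms_norm(x))).square())    # relu² mlp
        return x


class TinyLM(nn.Module):
    def __init__(self, vocab: int = 256, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 2, softcap: float = 30.0):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)                   # untied from the head
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab, bias=False)           # separate unembedding
        self.softcap = softcap

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = rms_norm(self.embed(idx))                               # norm after token embedding
        for blk in self.blocks:
            x = blk(x)
        logits = self.head(rms_norm(x))
        return self.softcap * torch.tanh(logits / self.softcap)    # logit softcap


tokens = torch.randint(0, 256, (1, 16))
print(TinyLM()(tokens).shape)  # torch.Size([1, 16, 256])
```

The softcap on the final logits bounds them to (-30, 30) via tanh, which keeps the loss well behaved early in training; the exact cap value is a tunable assumption here.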