A random detail that I found cool is that AdamW seems to "obfuscate" weights a bit better than Adam or vanilla SGD. I'm guessing this is because weight decay does something nonlinear during optimization, which decreases the amount of information available in the weights.
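For context, here is a rough sketch of where AdamW and Adam with L2-style weight decay actually differ in the update rule. It is a scalar toy version written for illustration (the function name `adam_step` and its `decoupled` flag are made up here, not a library API), and it does not test the obfuscation guess itself; it only shows that the decay term enters the two updates differently.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One scalar Adam/AdamW update (illustrative). Returns (new_w, new_m, new_v)."""
    if not decoupled:
        # Adam + L2: the decay term is folded into the gradient, so it gets
        # rescaled by the adaptive denominator along with the loss gradient.
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: shrink the weight directly, outside the adaptive scaling.
        new_w -= lr * weight_decay * w
    return new_w, m, v

# Toy comparison: zero loss gradient, so only the decay term moves the weight.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1,
                          decoupled=True)
print(w_l2, w_adamw)  # roughly 0.9 vs 0.99
```

In this toy case the L2 variant takes a step of about `lr` regardless of how small the decay coefficient is, because the decay passes through Adam's normalization, while the decoupled variant shrinks the weight by `lr * weight_decay * w`. Whether that difference is what drives the effect above is still just a guess.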
AdamW Weight Decay Obfuscation Effects on Neural Networks
