Investigating Model Divergence: Adam Optimizer as Primary Cause

In this research note, we examine several hypotheses for this divergence. We investigated layer normalization and floating-point rounding as potential causes and ruled both out, leaving the Adam optimizer as the most likely culprit.
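Since the note points to Adam, it helps to have the update rule at hand. Below is a minimal sketch of a single Adam step following the standard formulation of Kingma & Ba (2015); the note itself contains no code, so the function and parameter names here are illustrative, not the authors' implementation. The comment flags one detail where implementations commonly differ and where small numerical discrepancies can compound across steps.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (standard formulation; illustrative names, t is 1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)               # bias correction for the first moment
    v_hat = v / (1 - beta2**t)               # bias correction for the second moment
    # Implementations differ in where eps is applied (inside vs. outside the
    # sqrt); the resulting small update differences can compound over training.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```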