You're right. But yeah, the point is that these methods are built on assumptions and produce approximations. You'd hurt modeling performance, and people don't want that, which is why the "efficient alternatives that do approximations" haven't caught on for pretraining.
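To make the trade-off concrete, here's a minimal toy sketch (not from the original comment) comparing exact softmax attention with one well-known approximation, linear attention with an elu+1 feature map. The sequence length, head dimension, and random inputs are all hypothetical; the point is just that the cheap O(n) variant does not reproduce the exact O(n²) weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8  # hypothetical sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def exact_attention(Q, K, V):
    # Standard softmax attention: cost is O(n^2) in sequence length.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Kernelized approximation using the elu(x)+1 feature map:
    # cost is O(n) in sequence length, but it only approximates softmax.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    numerator = Qp @ (Kp.T @ V)
    denominator = Qp @ Kp.sum(axis=0, keepdims=True).T
    return numerator / denominator

# The two outputs differ: that gap is the approximation error the
# comment says practitioners are unwilling to pay for in pretraining.
err = np.abs(exact_attention(Q, K, V) - linear_attention(Q, K, V)).max()
```

The `err` value is nonzero on any non-degenerate input, which is exactly the kind of modeling-quality gap being traded for the efficiency win.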
Trade-offs Between Efficiency and Model Performance in Pretraining