FP8 Training: Understanding Gradient and Weight Data Types
Why does it matter for fp8? Are grads and weights different data types in that case? (Sorry if it's a dumb question – I've never done any fp8 training)