I’m sure my thrice-replaced DGX station is an outlier, but the reliability reports I hear from big GPU cluster people are still grim. They have over an order of magnitude more failures than conventional systems. ML training is tolerant, but it is at a point that gets noticed.
GPU cluster reliability issues exceed conventional systems by orders of magnitude
By
–
Leave a Reply