Everyone assumed semantic encoders couldn’t reconstruct images.
Turns out, they can and better than VAEs. RAE + DINOv2-B achieves 0.49 rFID vs 0.62 for SD-VAE, with 6× less compute. That’s fewer FLOPs and sharper reconstructions.
RAE + DINOv2-B beats VAEs with sharper images
By
–
