They ablated the embeddings in the paper and found that T5 and CLIP together are superior to either on its own. It's not clear just how much difference that makes on text, though… Imagen is bigger than SD, so it's hard to disentangle the effect of model size from the effect of the embedding on text. Good hypothesis though!
T5 and CLIP Embeddings: Superior Combined Performance Analysis