Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint directly at inference time without any modifications?" So naive repetition pushes hidden states outside the distribution later layers expect, so performance
Looped Transformers: Frozen Checkpoint Inference Optimization
By
–
