What confuses me is that more people don't seem to be analysing how conditioning improves (or doesn't) as the number of attention heads and layers increases. I don't know if anyone is formally quantifying the amount of conditioning/ICL as an explicit loss function or training goal. In other
Conditioning and In-Context Learning: Architecture’s Impact Analysis