4). Moral Self-Correction in Large Language Models – finds strong evidence that language models trained with RLHF have the capacity for moral self-correction. The capability emerges at 22B parameters and typically improves with model scale.