I haven’t gone back to the paper, but looking at this equation alone: When the heads are independent, W_o introduces correlations among them, allowing for information sharing and a denser representation.
@nandodf
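A minimal sketch of the point about W_o, assuming the usual multi-head layout in which head outputs are concatenated and then multiplied by W_o. The head values, the 4x2 shape, and the function names are made up for illustration and are not taken from the paper under discussion.

```python
def concat_heads(head_outputs):
    """Concatenate per-head output vectors into one long vector."""
    return [value for head in head_outputs for value in head]


def apply_output_projection(concat, w_o):
    """Compute concat @ w_o, where w_o has shape (len(concat), d_model)."""
    d_model = len(w_o[0])
    return [
        sum(concat[i] * w_o[i][j] for i in range(len(concat)))
        for j in range(d_model)
    ]


# Two hypothetical heads, each producing a 2-dimensional output for one token.
head_outputs = [[1.0, 0.0], [0.0, 2.0]]

# Hypothetical W_o of shape (4, 2): rows 0-1 read head 0, rows 2-3 read head 1.
w_o = [
    [1.0, 0.5],
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 1.0],
]

mixed = apply_output_projection(concat_heads(head_outputs), w_o)
# mixed[1] = 0.5 * (head 0, dim 0) + 1.0 * (head 1, dim 1), so it depends on both heads.
```

Each output coordinate is a weighted sum over the coordinates of every head, which is where the sharing across otherwise independent heads comes from.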
-
Data Selection and Distribution Functions in Large Language Models
The expectation is replaced by an average over tokens (a few trillion for the largest LMs), so F can be very general. If a human selects the y’s among other y’s to create a dataset, then the human is F, hopefully a sensible one.
-
Learning from Preferences: RLHF, Policy Gradients, and Dagger
Finally, when learning from preferences, one learns an F(x,y) that enables one to rank and select or do policy gradients (e.g. PPO) as in most RLHF. When the interface allows for corrections (e.g. rewriting the response in a chat agent), then we are in the domain of Dagger.
-
Dagger Imitation Learning: Human Feedback for Agent Training
Imitation with Dagger: In counterfactual learning F is typically the identity. The agent acting with policy p(y|x) determines the x’s as in RL, but humans (or other agents) provide corrections in the form of y’s. The new data is used for retraining.
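A toy sketch of that loop, under loud assumptions: states are integers 0 to 4, the "expert" is a hypothetical rule that steps toward state 4, and "retraining" is just a table lookup. The three ingredients from the text are what matter here: the agent's own policy picks the states, the expert supplies the y's, and the aggregated data is used to retrain.

```python
def rollout(policy, start, steps):
    """Visit states by following the agent's own policy (not the expert's)."""
    state, visited = start, []
    for _ in range(steps):
        visited.append(state)
        # Unseen states fall back to action -1 (a deliberately bad default).
        state = max(0, min(4, state + policy.get(state, -1)))
    return visited


def expert(state, goal=4):
    """Hypothetical expert correction: step toward the goal, then stay."""
    return 1 if state < goal else 0


def dagger(iterations=5, start=2, steps=6):
    dataset = {}  # aggregated state -> expert action pairs
    policy = {}   # untrained agent
    for _ in range(iterations):
        for state in rollout(policy, start, steps):
            dataset[state] = expert(state)  # label states the agent itself reached
        policy = dict(dataset)  # "retraining" is a table lookup in this toy
    return policy


policy = dagger()
```

In this toy, the first rollout drifts to state 0, so the expert labels states the expert alone would never have visited from the start state; later rounds cover states 3 and 4 as the retrained agent reaches them.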
-
Self-Training: Filtering Functions and Model Ranking Systems
Self-training: F(x,y) is a filtering/ranking function, e.g. what we call a reward/return. The input x may be chosen by humans, but the model generates the y’s, and F ranks and selects them for further rounds of self-training. F can be explicit or implicit (human in the loop, as in RLHF).
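One round of that generate, rank, select step might look like the sketch below. Everything in it is a stand-in: `toy_generate` plays the model, `toy_reward` plays F, and the prompts are plain integers so the ranking is easy to follow.

```python
import itertools


def self_training_round(prompts, generate, F, samples_per_prompt=4, keep=1):
    """Generate candidates per prompt, rank them with F, keep the best ones."""
    selected = []
    for x in prompts:
        candidates = [generate(x) for _ in range(samples_per_prompt)]
        ranked = sorted(candidates, key=lambda y: F(x, y), reverse=True)
        selected.extend((x, y) for y in ranked[:keep])
    return selected


# Hypothetical "model": cycles through fixed guesses so the example is deterministic.
_guesses = itertools.cycle([0, 2, 5, 8])


def toy_generate(x):
    return next(_guesses)


def toy_reward(x, y):
    """Hypothetical F: higher is better when y is close to x."""
    return -abs(x - y)


dataset = self_training_round([3, 7], toy_generate, toy_reward)
```

The selected (x, y) pairs would then be the training data for the next round; swapping `toy_reward` for a human choosing among candidates gives the implicit-F case.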
-
Policy Gradients: Q-Functions and State-Action Value Learning
Policy gradients: F = Q(x,y) (the state-action value function), and x and y are generated by the model acting on an environment with policy p(y|x). The x’s are drawn from the invariant state distribution, as in the policy gradient theorem.
-
Supervised Learning: Function Identity and Human-Labeled Data
Supervised learning: F = I (identity), and x and y are produced by humans. For example, x is images taken by humans and y is the corresponding labels. As a second example, x is text and y is the next text token.
-
Learning Methods Unified Through Gradient Optimization Framework
Funny @sirbayes Learning methods (supervised, RLHF, policy gradients, Dagger, self-training) can be seen as optimisation with the following gradient: grad = Expectation_{x,y} [ F(x,y) grad log p(y|x) ]. Choices of F and of how x and y are produced determine the learning type. 1/n
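To make the formula concrete, here is a small sketch that assumes a tabular softmax policy (one row of logits per x) and takes the gradient with respect to those logits. The table `theta`, the samples, and both choices of F are illustrative assumptions; only the form F(x,y) * grad log p(y|x), averaged over samples, comes from the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == y else 0.0) - probs[k] for k in range(len(logits))]


def estimate_gradient(theta, samples, F):
    """Sample average of F(x, y) * grad log p(y|x) over (x, y) pairs."""
    grad = {x: [0.0] * len(row) for x, row in theta.items()}
    for x, y in samples:
        weight = F(x, y)
        for k, g in enumerate(grad_log_prob(theta[x], y)):
            grad[x][k] += weight * g / len(samples)
    return grad


theta = {"x0": [0.0, 0.0]}  # uniform policy over two possible y's
samples = [("x0", 0), ("x0", 0), ("x0", 1)]

# F = identity (constant weight 1): every sampled y is pushed up equally.
supervised = estimate_gradient(theta, samples, lambda x, y: 1.0)

# F = a hypothetical reward that only likes y == 1.
rewarded = estimate_gradient(theta, samples, lambda x, y: 1.0 if y == 1 else 0.0)
```

With the same samples, the constant F raises the logit of the more frequent y (index 0), while the reward-style F raises the logit of y = 1 instead; only F changed, not the estimator.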
-
Chain-of-Thought as External Memory for AI Models
Chain-of-thought can also be interpreted as a simple external memory tool: the model writes to a scratchpad, then reads it, and answers the question. The scratchpads that the model writes text to and reads text from are powerful tools that need further exploration.
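The write, read, answer cycle can be sketched with a toy stand-in for the model. `toy_model` below is hypothetical and only adds up a list of numbers one step at a time; the point is that its sole memory between calls is the scratchpad it wrote to earlier.

```python
def toy_model(question, scratchpad):
    """Hypothetical stand-in for a language model call.

    Adds one number per call, reading its running total from the scratchpad.
    """
    steps_done = len(scratchpad)
    running_total = scratchpad[-1] if scratchpad else 0
    if steps_done < len(question):
        return ("write", running_total + question[steps_done])
    return ("answer", running_total)


def answer_with_scratchpad(question, model, max_steps=20):
    """Loop until the model answers, feeding its own notes back in each call."""
    scratchpad = []
    for _ in range(max_steps):
        kind, value = model(question, scratchpad)  # the model reads the scratchpad
        if kind == "answer":
            return value, scratchpad
        scratchpad.append(value)  # the model writes to the scratchpad
    raise RuntimeError("no answer within max_steps")


answer, notes = answer_with_scratchpad([2, 3, 5], toy_model)
```

Here `notes` ends up holding the intermediate totals, which is the role the chain-of-thought text plays for a real model.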
-
Tools for AI: APIs, Memory, Autonomous Systems and Agents
Tools in this context can mean many things. They could be APIs of software products, external memory, and even self-driving cars! They could even be other agents, e.g. we rely on people to remind us of facts, or to teach us about a new topic.