Policy gradients: F = Q(x,y), the state-action value function, where the state x and the action y are generated by the model acting on an environment with policy p(y|x). The states x are drawn from the invariant (stationary) state distribution, as in the policy gradient theorem.
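To make this concrete, here is a minimal NumPy sketch of the resulting estimator, which averages Q(x,y) times the gradient of log p(y|x) over sampled state-action pairs. Everything specific in it is an assumption made for illustration: the policy is a tabular softmax over logits `theta`, and the Q table `Q` and the invariant state distribution `d` are handed in as fixed arrays rather than learned or obtained by running an environment. The helper names (`softmax`, `exact_gradient`, `sampled_gradient`) are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def exact_gradient(theta, Q, d):
    # Closed form of sum_x d(x) sum_y Q(x,y) * grad_theta p(y|x)
    # for a tabular softmax policy p(y|x) = softmax(theta[x])[y].
    p = softmax(theta)
    expected_q = (p * Q).sum(axis=1, keepdims=True)  # sum_b p(b|x) Q(x,b)
    return d[:, None] * p * (Q - expected_q)

def sampled_gradient(theta, Q, d, n_samples, rng):
    # Monte Carlo version: x ~ d, y ~ p(.|x), average Q(x,y) * grad log p(y|x).
    p = softmax(theta)
    n_states, n_actions = theta.shape
    grad = np.zeros_like(theta)
    xs = rng.choice(n_states, size=n_samples, p=d)
    for x in xs:
        y = rng.choice(n_actions, p=p[x])
        score = -p[x].copy()
        score[y] += 1.0  # grad of log p(y|x) w.r.t. theta[x, :] is onehot(y) - p(.|x)
        grad[x] += Q[x, y] * score
    return grad / n_samples

# Toy setup (assumed values): 3 states, 4 actions.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))          # policy logits
Q = rng.uniform(0.0, 1.0, size=(3, 4))   # stand-in for the state-action values
d = np.array([0.2, 0.5, 0.3])            # stand-in for the invariant state distribution

exact = exact_gradient(theta, Q, d)
estimate = sampled_gradient(theta, Q, d, n_samples=20000, rng=rng)
```

With enough samples `estimate` should approach `exact`. In a real environment Q and the state distribution both depend on the policy and would have to be estimated from rollouts, which this sketch deliberately leaves out.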
Policy Gradients: Q-Functions and State-Action Value Learning