You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am aware of this smart trick of implementing policy gradient (see his for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross entropy is defined H(p, q) = sum(p_i * log(q_i)). For the action taken, a, we can set p_a = advantage * [index of action a in 1-hot-vector representation). Meanwhile, q_a is the output of the policy network, which is the probability of taking the action a, i.e. policy(s, a).
However, when the classes of output is huge (e.g. as in machine translation or language modeling), I simply cannot convert the output into one hot vector in the first place, using to_categorical(output, num_classes=output_class) function in keras.
Because of this, I cannot apply the trick to compute p_a.
So how to implement policy gradient in this case?
I hope I make my question in a clear way!
Many thanks for your help!
Best,
Cuong
@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
The text was updated successfully, but these errors were encountered:
Hello,
I am aware of this smart trick of implementing policy gradient (see his for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross entropy is defined H(p, q) = sum(p_i * log(q_i)). For the action taken, a, we can set p_a = advantage * [index of action a in 1-hot-vector representation). Meanwhile, q_a is the output of the policy network, which is the probability of taking the action a, i.e. policy(s, a).
However, when the classes of output is huge (e.g. as in machine translation or language modeling), I simply cannot convert the output into one hot vector in the first place, using to_categorical(output, num_classes=output_class) function in keras.
Because of this, I cannot apply the trick to compute p_a.
So how to implement policy gradient in this case?
I hope I make my question in a clear way!
Many thanks for your help!
Best,
Cuong
@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
The text was updated successfully, but these errors were encountered: