Implementing policy gradient when number of output classes is large #87

hoangcuong2011 · 2019-02-10T06:31:23Z

Hello,

I am aware of this smart trick for implementing policy gradient (see this for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross entropy is defined as H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, we can set p = advantage * (one-hot vector for action a), so that p_a = advantage and every other entry of p is 0. Meanwhile, q_a is the output of the policy network, which is the probability of taking action a, i.e. policy(s, a).
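To make the trick concrete, here is a minimal NumPy sketch. It is an illustration only, not the code from the linked repository: the function names, the toy probabilities, and the advantage value are all assumptions, and it uses the usual sign convention H(p, q) = -sum(p_i * log(q_i)).

```python
import numpy as np

def one_hot_pg_loss(probs, action, advantage):
    """Cross entropy against an advantage-scaled one-hot target (hypothetical helper)."""
    target = np.zeros_like(probs)
    target[action] = advantage  # p_a = advantage, every other p_i = 0
    return -np.sum(target * np.log(probs))

def indexed_pg_loss(probs, action, advantage):
    """Same quantity, using only the probability of the action taken (hypothetical helper)."""
    return -advantage * np.log(probs[action])

# Toy policy output over 3 actions (assumed values, for illustration only).
probs = np.array([0.1, 0.7, 0.2])
loss_a = one_hot_pg_loss(probs, action=1, advantage=2.0)
loss_b = indexed_pg_loss(probs, action=1, advantage=2.0)
```

Since every entry of the target except p_a is zero, the sum collapses to the single term -advantage * log(q_a), which is why the two helpers return the same value in this sketch.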

However, when the number of output classes is huge (e.g. in machine translation or language modeling), I simply cannot convert the output into a one-hot vector in the first place using the to_categorical(output, num_classes=output_class) function in Keras.

Because of this, I cannot apply the trick to compute p_a.

So how can I implement policy gradient in this case?

I hope my question is clear!

Many thanks for your help!

Best,

Cuong

@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
