Implementing policy gradient when number of output classes is large #87

@hoangcuong2011

Hello,

I am aware of this smart trick for implementing policy gradient (see this for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross entropy is defined as H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, we can set p = advantage * (one-hot vector for action a), so that p_a = advantage and every other entry of p is zero. Meanwhile, q_a is the output of the policy network, which is the probability of taking action a, i.e. policy(s, a). The loss then reduces to -advantage * log(policy(s, a)).
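A minimal NumPy sketch of the trick described above. The helper name `one_hot_pg_loss` and all numeric values are illustrative assumptions, not taken from the linked script. It shows that cross entropy against an advantage-scaled one-hot target equals the advantage times the negative log-probability of the chosen action.

```python
import numpy as np

def one_hot_pg_loss(probs, action, advantage):
    """Cross entropy against an advantage-scaled one-hot target (hypothetical helper)."""
    target = np.zeros_like(probs)
    target[action] = advantage              # p_a = advantage, all other p_i = 0
    return -np.sum(target * np.log(probs))  # H(p, q) = -sum(p_i * log(q_i))

# Illustrative values only.
probs = np.array([0.2, 0.5, 0.3])  # policy output q for one state
action = 1                         # index of the action taken
advantage = 2.0

loss = one_hot_pg_loss(probs, action, advantage)
direct = -advantage * np.log(probs[action])  # the same quantity without a one-hot vector
```

Only the entry for the chosen action contributes to the sum, which is why `loss` and `direct` agree.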

However, when the number of output classes is huge (e.g. in machine translation or language modeling), I simply cannot convert the output into one-hot vectors in the first place using the to_categorical(output, num_classes=output_class) function in Keras.

Because of this, I cannot apply the trick to compute p_a.

So how can I implement policy gradient in this case?
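One possible direction, sketched here in NumPy under stated assumptions rather than as a definitive answer: because only the chosen action's entry of the one-hot target is non-zero, the same loss can be computed by indexing the policy output with integer action ids, so the one-hot matrix is never built. The function names, batch size, and vocabulary size below are hypothetical.

```python
import numpy as np

def sparse_pg_loss(probs, actions, advantages):
    """Mean of -advantage * log q[action], using integer indexing (no one-hot)."""
    rows = np.arange(len(actions))
    picked = probs[rows, actions]  # probability of each chosen action, shape (batch,)
    return -np.mean(advantages * np.log(picked))

def dense_pg_loss(probs, actions, advantages):
    """Same loss via advantage-scaled one-hot targets, for comparison only."""
    targets = np.zeros_like(probs)
    targets[np.arange(len(actions)), actions] = advantages
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

# Illustrative batch: 4 samples, 1000 output classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)  # softmax over classes
actions = rng.integers(0, 1000, size=4)
advantages = rng.normal(size=4)

sparse = sparse_pg_loss(probs, actions, advantages)
dense = dense_pg_loss(probs, actions, advantages)
```

In Keras, the `sparse_categorical_crossentropy` loss accepts integer labels directly. Whether weighting it by the advantage (for example through `sample_weight`) suits a given training setup is an assumption to verify.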

I hope my question is clear!

Many thanks for your help!

Best,

Cuong

@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
