
Continuous control #3

Open
muupan opened this issue May 8, 2016 · 6 comments
Comments


muupan commented May 8, 2016

No description provided.


igrekun commented May 14, 2016

I'm working on an LSTM implementation (neon-based) for the continuous case; sadly, I failed to get any response from the authors.

It is the variance and entropy that puzzle me. Any thoughts on how those are implemented code-wise?
Currently it shows no signs of convergence on the MuJoCo domain for me, and most likely there are errors in the learnt variance of the Gaussian policy.


muupan commented May 15, 2016

Thanks for the information. I haven't tried it yet, but the paper provides some details, quoted below. Did you find them insufficient?

µ is modeled by a linear layer and σ² by a SoftPlus operation, log(1 + exp(x)), as the activation computed as a function of the output of a linear layer.

we used a cost on the differential entropy of the normal distribution defined by the output of the actor network, −1/2 (log(2πσ²)+1); we used a constant multiplier of 10⁻⁴ for this cost across all of the tasks examined.
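The head described in these two excerpts can be sketched in plain numpy (the weights and function names here are illustrative, not from the repo):

```python
import numpy as np

def softplus(x):
    # SoftPlus: log(1 + exp(x)), computed stably via logaddexp
    return np.logaddexp(0.0, x)

def gaussian_head(h, W_mu, b_mu, W_var, b_var):
    # mu from a plain linear layer; sigma^2 via SoftPlus of a linear layer
    mu = W_mu @ h + b_mu
    sigma2 = softplus(W_var @ h + b_var)
    return mu, sigma2

def entropy_cost(sigma2, beta=1e-4):
    # differential entropy of N(mu, sigma^2) is 0.5 * (log(2*pi*sigma^2) + 1);
    # the quoted cost is its negative, scaled by a constant beta = 1e-4
    entropy = 0.5 * (np.log(2 * np.pi * sigma2) + 1.0)
    return -beta * entropy.sum()

rng = np.random.default_rng(0)
h = rng.standard_normal(8)  # features from the network torso
W_mu, b_mu = rng.standard_normal((2, 8)), np.zeros(2)
W_var, b_var = rng.standard_normal((2, 8)), np.zeros(2)
mu, sigma2 = gaussian_head(h, W_mu, b_mu, W_var, b_var)
```

The SoftPlus guarantees σ² > 0 regardless of the linear layer's output, which is why the paper uses it for the variance head.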

@etienne87

It is a bit vague for me, so I will try to summarize in order to be corrected: we need a fully connected layer outputting 2 values, add a softplus operation on the second value (so that the variance is > 0, I suppose), sample according to this Gaussian (use numpy.random.randn() * sigma + mu?) in each dimension of the action space, and finally send −1/2 (log(2πσ²)+1) as the log prob instead of log(softmax)?
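One clarification that may help with the last step of the summary above: −1/2 (log(2πσ²)+1) is the entropy of the Gaussian, not the log prob of a sample; the log prob is the log density, which also depends on how far the sampled action lands from µ. A numpy-only sketch (function name is mine):

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma2):
    # log N(a; mu, sigma^2) with independent (diagonal) action dimensions:
    # sum over dims of -0.5 * (log(2*pi*sigma^2) + (a - mu)^2 / sigma^2)
    return np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + (a - mu) ** 2 / sigma2))

mu = np.array([0.1, -0.3])
sigma2 = np.array([0.5, 0.5])
a = np.random.randn(2) * np.sqrt(sigma2) + mu  # sample as described above
lp = gaussian_log_prob(a, mu, sigma2)
```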

@loofahcus

Hi @muupan, do you have a plan to implement continuous control? : )

@etienne87

Here is an example:

import numpy as np

import chainer
import chainer.functions as F


class GaussianPolicyOutput(PolicyOutput):
    def __init__(self, logits_mu, logits_var):
        self.logits_mu = logits_mu
        self.logits_var = logits_var

    @cached_property
    def action_indices(self):
        # Same name as in SoftmaxPolicyOutput so that a3c.py can call it
        # without changes; here, however, actions are sampled from Gaussians.
        mu, sigma2 = self.activation
        return np.random.normal(
            mu.data, np.sqrt(sigma2.data)).astype(np.float32)

    @cached_property
    def activation(self):
        mu = F.tanh(self.logits_mu)  # mean squashed into [-1, 1]
        sigma2 = F.softplus(self.logits_var)  # softplus keeps variance > 0
        return mu, sigma2

    @cached_property
    def sampled_actions_log_probs(self):
        # Returns a chainer variable with the log prob of the sampled action.
        mu, sigma2 = self.activation
        action = self.action_indices
        # F.gaussian_nll(x, mean, ln_var) is the negative log-likelihood,
        # so negate it to get the log prob.
        return -F.gaussian_nll(chainer.Variable(action), mu, F.log(sigma2))

    @cached_property
    def entropy(self):
        # Differential entropy of a Gaussian: 0.5 * (log(2*pi*sigma^2) + 1).
        # Use F.log on the variable (not .data) so gradients can flow.
        _, sigma2 = self.activation
        return F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))

Haven't tested it yet, so feel free to test/correct.
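One way to sanity-check the entropy term above: the closed-form Gaussian entropy equals the expectation of −log p, so a Monte Carlo average over samples should land close to it. A small numpy-only sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.5, 2.0
samples = rng.normal(mu, np.sqrt(sigma2), size=100_000)

# -log p(x) for each sample under N(mu, sigma^2)
neg_log_p = 0.5 * (np.log(2 * np.pi * sigma2) + (samples - mu) ** 2 / sigma2)

mc_entropy = neg_log_p.mean()                         # Monte Carlo E[-log p]
closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + 1)  # analytic entropy
```

If the learnt variance is broken, this identity is a cheap test to run against the network's σ² output.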

@loofahcus

Thanks! @etienne87
