Problem while using the code #1
I'm really glad to hear that you've succeeded in reproducing the continuous model. About reverse(): as you say, I'd forgotten to add reverse() for the actions and values
in addition to the states and rewards,
and I've pushed the fix just now. Let me explain why the lists are reversed.
This means that "i" decreases, walking backwards from the last collected step to the first.
This is why I put these reverse() calls on the lists collected in the loop. As far as I've tried with the Atari model, I still couldn't reproduce a good learning result. I was planning to implement LSTM after being able to reproduce a good Atari result, but since you say you've succeeded with the continuous model, I should try LSTM now. About the batch, let me think about whether the accumulated-gradient update can be replaced with a batch update or not. Thank you for the suggestions!! |
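For reference, a minimal sketch of the backward accumulation being discussed; the names are hypothetical, not the repository's exact code. The discounted return R is rolled from the last collected step back to the first, which is why every per-step list has to be reversed together.

```python
GAMMA = 0.99

def accumulate_batch(states, actions, rewards, values, R):
    # R starts as V(s_{t_max}) (or 0 at a terminal state) and is rolled backwards:
    # R_i = r_i + GAMMA * R_{i+1}, so "i" decreases from the last step to the first.
    batch_s, batch_a, batch_td, batch_R = [], [], [], []
    for s, a, r, v in zip(reversed(states), reversed(actions),
                          reversed(rewards), reversed(values)):
        R = r + GAMMA * R
        td = R - v                 # advantage used by the policy loss
        batch_s.append(s)
        batch_a.append(a)
        batch_td.append(td)
        batch_R.append(R)
    return batch_s, batch_a, batch_td, batch_R
```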
Many thanks for your reply, and glad to hear that you also plan to work on the LSTM model. I just uploaded the test code (based on this repo) for the "batch" update that I mentioned. I only tested it with the cart-pole balance domain, but somehow I found that it actually takes longer to reach the desired score than your implementation does. I will try to investigate this later; for now I will continue working with your implementation to study the LSTM model, which I am not familiar with. Also, instead of a constant learning rate:
I don't know whether the random initialization of the learning rate mentioned in the paper can help to improve the results:
|
Thanks! I'll try it. About the LSTM, I'm also new to LSTM and just started studying it recently, so please don't expect too much! However, I'm really interested in the 3D labyrinth model with LSTM, so I would like to try it. About randomizing the learning rate with log_uniform: I also used to randomize the initial learning rate for each thread with log_uniform. However, when I looked at the figure on page 23, I found that the learning rate varies from 10^-4 to 10^-2, uniformly distributed with log-scale sampling. So my understanding is that they use log_uniform to find the best hyperparameters when they run a grid search. (In the graphs on page 14 and page 22, they also use a log scale for grid-searching parameters.) However, I'm not sure my understanding is correct. |
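For concreteness, a small sketch of the log-scale sampling over the range quoted above (10^-4 to 10^-2); the function name is hypothetical.

```python
import numpy as np

def log_uniform(lo, hi, rate):
    # rate in [0, 1] is mapped to [lo, hi] uniformly on a log scale
    return float(np.exp(np.log(lo) * (1.0 - rate) + np.log(hi) * rate))

# e.g. sampling an initial learning rate for a random/grid search:
initial_lr = log_uniform(1e-4, 1e-2, np.random.uniform())
```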
Thanks for pointing that out. After re-thinking the random initialization, I think you are right: the initial learning rates sampled from the LogUniform range were used to demonstrate the sensitivity of their method. And it makes sense that a constant (or best-choice) learning rate is applied for RMSProp and decayed to zero over time. Sorry, my bad; I just got confused by the phrase they use in the paper.
|
No problem. Any suggestions and discussions are always welcome. Thanks! |
Hi, tuning in again.
I am thinking of calculating the loss separately, by making two copies of the above policy loss function, one for each output. Does this make sense to you? |
I have never tried a continuous model, so today I looked into another simple cart-pole actor-critic sample without a NN to learn about it. By the way, even with the discrete action model I'm implementing now, the policy loss function is the most difficult part for me, and I'm still not certain it is 100% correct. However, when I tried a simple 2D grid maze model (which I implemented in the debug_maze branch), this program succeeded in finding the shortest path with this policy loss function. So the loss function for discrete actions seems fine. Anyway, I'll report here if I find any result with the continuous model. |
Thanks for the reply. After doing some searching, maybe I can apply the same kind of loss function, e.g. the negative log-likelihood, but with a Gaussian (Normal) distribution used instead of the softmax, since the outputs have a mean and a variance. So I think the loss function is the negative log-likelihood of a Normal distribution with variance sigma2 and mean mu, as sketched below.
Sorry for the messy typing. I will try this out to see whether it works for the continuous cart-pole domain and let you know how it goes. |
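A rough sketch of the Gaussian negative log-likelihood policy loss described above, written against the old TensorFlow API used elsewhere in this thread; all placeholder names are hypothetical, the action is assumed to be 1-D, and the entropy bonus is omitted.

```python
import numpy as np
import tensorflow as tf

# Policy head outputs: mean mu and variance sigma2; a_t is the sampled action,
# td the advantage (R - V).
mu = tf.placeholder(tf.float32, [None, 1])
sigma2 = tf.placeholder(tf.float32, [None, 1])
a_t = tf.placeholder(tf.float32, [None, 1])
td = tf.placeholder(tf.float32, [None])

# -log N(a | mu, sigma2) = 0.5*log(2*pi*sigma2) + (a - mu)^2 / (2*sigma2)
nll = 0.5 * tf.log(2.0 * np.pi * sigma2) + tf.square(a_t - mu) / (2.0 * sigma2)
policy_loss = tf.reduce_sum(tf.reduce_sum(nll, reduction_indices=1) * td)
```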
Is this the explanation of that loss function? http://docs.chainer.org/en/stable/reference/functions.html#chainer.functions.gaussian_nll I really want to know the result. There is a lot for me to learn from this thread. Super thanks!! |
Yes, that's right, the negative log-likelihood of the normal distribution is from the Chainer site, but I also found another formulation called maximum log-likelihood; judging by the formula alone, I think they are the same thing. Same here, there are lots of methods out there waiting to be learned (and to get confused by). I tried the loss function based on your code, and it works moderately well with the cart-pole balance task in the continuous action domain; at least it is able to converge (that is, reach the desired score). But it probably needs some more examples before concluding that the loss function actually works for continuous actions, so keep working on it!! Thanks. However, when I turned back to try it with the "batch" method, it reached a score of around 2000 (the desired score was 3000) and then the network somehow diverged immediately (I am not entirely sure whether it diverged or exploded; the network just gave "NaN" for its output all the time). |
Thank you for reporting. |
@miyosuda: |
@aravindsrinivas I have been trying Pong with A3C using 8 CPU threads. (As far as I've tried with Pong, the network does not diverge to NaN.) To confirm whether my implementation has a problem, I tried an easier task, in the "debug_maze" branch. After confirming that this algorithm can solve an easy RL task, I'm changing the hyperparameters little by little to check whether the game score increases as the paper shows. I once heard that DQN is very sensitive to hyperparameters, and as far as I can see from the paper, the hyperparameters of this method also seem sensitive. Along with tuning hyperparameters, I'm also planning to try another task, one that doesn't use a CNN. By the way, the key idea of this method is to get stability by running multiple threads at the same time, so that the network doesn't diverge or oscillate. I have never tried Theano, but if you would like to run it with TensorFlow, I can help you. |
@miyosuda The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1. The maximum allowed gradient norm was 40. The best learning rates were around 7*10^-4. Backups of length 20 were used, which corresponds to setting the t_max parameter to 20. Also, I am not sure if you used frame skip in your implementation. From what I saw in game_state.py, you just have reward = ale.act(action)? Shouldn't it be in a for loop that repeats the action over the skipped frames (see the sketch below)? Also, are you clipping the reward to lie between -1 and 1? In DQN, rewards were clipped between -1 and 1. I am not sure what the rewards are for Pong from the ALE src. |
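A minimal sketch of the kind of frame-skip loop being suggested, with DQN-style reward clipping; the names are hypothetical and ale is assumed to be an ALE interface whose act() returns the per-frame reward.

```python
FRAME_SKIP = 4

def act_with_skip(ale, action):
    reward = 0
    for _ in range(FRAME_SKIP):
        reward += ale.act(action)   # repeat the chosen action over the skipped frames
    return max(-1, min(1, reward))  # clip the summed reward to [-1, 1] as in DQN
```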
Wowwowow! Those are the parameters I really wanted!!! Super thanks!! I was always using t_max = 5 and didn't use gradient norm clipping.
I've set frame skipping to every 4 frames in the "ale.cfg" file, but as you say it might be better to put a loop as you suggested. I'm not clipping the score, but ALE Pong gives a reward of 1 or -1, so it should be ok. Anyway, super thanks for giving me such valuable information!!! I'll try these parameters. |
@miyosuda |
I'm accumulating gradients t_max times in each thread, and after that I'm applying the accumulated gradients with shared RMSProp. When applying the accumulated gradients, the "rms" parameter is shared among threads. (The "rms" parameter in TensorFlow corresponds to "g" in the paper.) When applying the accumulated gradients with shared RMSProp, I'm not using any synchronization such as mutual exclusion among threads. (Is this what you are asking?) As far as I can see from the TensorFlow source code, it seems ok to apply gradients without a lock when running on the CPU. |
Shouldn't we lock other threads out of updating the parameters of the global network while one particular thread is already updating it with its accumulated gradient from t_max steps? My question was related to the RMSProp moving-average values. We keep a moving average of the RMS of the gradients, right? And that RMS is used to determine our parameter update. My question is: would the gradient values of different threads all be used together to update the moving average of the RMS? Or do we have separate moving averages for each thread, used when that thread updates the parameters with its accumulated gradient? In the paper, they consider both approaches, but say that having separate RMSProp statistics (mainly the moving average) is less robust than sharing the moving average. But they don't reveal how exactly they synchronize the moving average across threads. Could you explain what you are doing? |
@aravindsrinivas I've created an RMSPropApplier class in rmsprop_applier.py (the "rms" slot parameter is passed to native code around here). I created this class to share the "rms" parameter among threads, but I found that an RMSPropApplier instance is created in each thread in a3c_training_thread.py, so the moving average is currently calculated separately in each thread. I need to fix this. |
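To make the "shared g" idea concrete, here is a minimal NumPy sketch of shared RMSProp as described in the paper, using the constants quoted earlier in the thread; the class is hypothetical and is not the repository's RMSPropApplier.

```python
import numpy as np

class SharedRMSProp(object):
    """One instance shared by all threads; g is the shared moving average."""
    def __init__(self, shape, lr=7e-4, alpha=0.99, epsilon=0.1):
        self.g = np.zeros(shape)
        self.lr, self.alpha, self.epsilon = lr, alpha, epsilon

    def apply(self, theta, grad):
        # Hogwild-style: no lock around the shared statistics or parameters.
        self.g = self.alpha * self.g + (1.0 - self.alpha) * grad * grad
        theta -= self.lr * grad / np.sqrt(self.g + self.epsilon)
        return theta
```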
Hi, I also confirmed that
A question: could you tell me at what speed (steps per second, where a step is a decision taken by the network during gameplay) the code runs for the 8-thread version? DeepMind say they get 1000 steps/sec from 16 threads, so a single thread should be around 70. But I was never able to run at 70 with my single-thread code; it used to run at 30. |
Thank you for providing such valuable information again.
I got it. I'll set the LR for the actor starting from 7*10^-4, and 3.5*10^-4 for the critic.
I got it. I've already implemented LR annealing.
I see. I'm now testing sharing 'g' in the "shared_rmsprop" branch. I'll merge it into the master branch after confirming it works.
I wanted to ask about this too. I was storing frames for each state during one backup sequence (a sequence of 5 states) like this:
With this pattern, adjacent states share three perceived frames. Another way to store frames with 4-frame skipping is:
If we choose pattern B, one chosen action persists for 16 frames. About the running speed in steps per second, I'll check it on my environment; please wait a minute! |
@aravindsrinivas It was 106 steps per second with 8 threads, so about 13 steps per second per thread. Anyway, the speed in my environment is much slower than DeepMind's. |
That's quite slow, I guess... Maybe I got 30 steps per second for a single thread because of the GPU. I can't understand how DeepMind got 70 steps/sec for a single thread. That's almost as fast as running DQN on a GPU. |
@miyosuda What we should do is: say we are at frame 0, we take an action and repeat it 4 times, so we execute 0->1, 1->2, 2->3 and 3->4 using the action decided at 0. We then decide on an action at frame 4, execute 4->5, 5->6, 6->7, 7->8 (4 repetitions), decide on an action at frame 8, and so on. Our states would be (0,4,8,12); (4,8,12,16); (8,12,16,20)... Since they say t_max is equivalent to 20 perceived frames, we must stop at (64,68,72,76); i.e. you stop once you decide on an action at the 76th frame, and repeat it 4 times to get to the 80th frame. The 80th frame (with the past 3 perceived frames 68, 72, 76) would be our s_{t_max}, which is used to compute the target through V(s_{t_max}). We would have 17 tuples (0,4,8,12), (4,8,12,16), ..., (64,68,72,76) for s_t for t = 0 to t_max - 1, and s_{t_max} would be (68,72,76,80). I will actually try to implement a Theano version now that so many details are clear. Please keep us updated on whether you are able to implement it. |
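The index layout described above can be reproduced with a few lines of Python; this is only an illustration of the frame indices, not training code.

```python
FRAME_SKIP = 4
STACK = 4
PERCEIVED = 20  # t_max expressed in perceived frames, as above

perceived = [i * FRAME_SKIP for i in range(PERCEIVED)]   # 0, 4, ..., 76

# full 4-frame states s_t: (0,4,8,12), (4,8,12,16), ..., (64,68,72,76)
states = [tuple(perceived[i:i + STACK]) for i in range(PERCEIVED - STACK + 1)]

# bootstrap state s_{t_max}: (68, 72, 76, 80)
s_t_max = tuple(perceived[-(STACK - 1):] + [perceived[-1] + FRAME_SKIP])

print(len(states))  # 17
print(s_t_max)      # (68, 72, 76, 80)
```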
Hi, I also found this repo after trying to implement A3C from the DeepMind article; it's nice to see progress! However, when running the implementation, the agent seems to perform only three actions from the legal action set given by the ALE interface, and these correspond to idle, fire and right. Could this be caused by the provided Pong binary being problematic, or by ACTION_SIZE being set to 3? The reason I'm asking is that when displaying the results after training for a few hours, the paddle is stuck at the edge of the Pong playing field. |
@joabim |
@aravindsrinivas |
@aravindsrinivas @joabim I tried this code,
and I got the result
[0, 3, 4] means [idle, right, left] |
@joabim
and the result was
So I think getLegalActionSet() is just returning all the default actions. Which function are you using, getMinimalActionSet() or getLegalActionSet()? |
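A quick way to compare the two calls, assuming the Python 2 era ale_python_interface bindings and a local pong.bin ROM; the expected minimal set is the [0, 3, 4] quoted above.

```python
from ale_python_interface import ALEInterface

ale = ALEInterface()
ale.loadROM('pong.bin')

print(ale.getMinimalActionSet())  # expected [0, 3, 4]: idle, right, left
print(ale.getLegalActionSet())    # all 18 default ALE actions
```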
@miyosuda
so I'm wondering what has happened. I have tried rebuilding ALE, but it doesn't change anything. When just running the code, the score remains at -21. By setting (forcing)
I actually get some results from training, as expected. It's really weird that I can't get the real action set from ALE! |
We also don't need to output the thread score (even if it is only thread 0). This is just following the same convention as DQN in DeepMind's code. |
@miyosuda If we use the multiprocessing module, I think we can make the validation thread handle all the I/O scheduling to speed up training. I am currently working on the multiprocessing module, since I want to improve the training speed of the continuous domain as well. |
@aravindsrinivas @originholic
Great! If the performance increases, let me know! I think the current 80% CPU usage leaves room for improvement. |
@joabim After these 4 days of training on my machine, the step count was around 53 million. (The learning rate becomes zero after 60 million steps.) Let me try one more training run to check. |
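For reference, a tiny sketch of the linear annealing being described, with the learning rate reaching zero at 60 million global steps; names are hypothetical.

```python
MAX_TIME_STEP = 60 * 10**6

def anneal_learning_rate(initial_lr, global_step):
    # linear decay from initial_lr to 0 over MAX_TIME_STEP steps
    lr = initial_lr * (MAX_TIME_STEP - global_step) / MAX_TIME_STEP
    return max(lr, 0.0)
```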
I hadn't noticed it before, but I found that in muupan's project, https://github.com/muupan/async-rl, the learning rates of the actor and critic are the opposite of mine. I'll try his setting in this branch: https://github.com/miyosuda/async_deep_reinforce/tree/muupan_lr_setting |
I realized I forgot to install ALE in my new anaconda environment after building it (the ALE fork you made with correct Pong support and multithreading), so my previous run used the wrong version of ALE... I am redoing the test for 16 threads using the muupan_lr_setting branch now! I'll let you know how it goes. |
@joabim |
@miyosuda For everyone's information, I summarized their settings here: https://github.com/muupan/async-rl/wiki |
@muupan Thank you! |
@miyosuda Hi. I've got an LSTM working with your code. I've only tested it on a toy problem (a 4-state MDP) rather than an Atari game, but it seems to work properly and as well as the feedforward net does. The code is at https://github.com/cjratcliff/async_deep_reinforce. I've made quite a few changes for my own version, many of them outside the LSTM parts, so I'm happy to answer any questions. For using it on Atari, in addition to increasing the RNN size, I'd recommend changing the cell type from BasicRNNCell to BasicLSTMCell and removing the activation function argument to that function. |
@cjratcliff Thank you for sharing your LSTM version!!! Let me try it!!! |
@cjratcliff I've pushed my LSTM version.
With LSTM, the Pong score hit the maximum easily. Thanks. |
@miyosuda Great to see it working so well, thanks. |
@miyosuda BTW, during training I ran into some trouble. The reason is that "pi" can become 0.0 and your code does not handle that case:
entropy = -tf.reduce_sum(self.pi * tf.log(self.pi), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
I avoided this by clipping pi before taking the log:
entropy = -tf.reduce_sum(self.pi * tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta ) |
@Itsukara Sorry for the late reply (I didn't notice your post until now), and thank you for the suggestion! |
The code performs really well on some games, but on others it doesn't quite reach the scores reported in the paper, and I wonder why. For example, in Space Invaders the reported score is 23846.0; the model I trained comes nowhere near that. :( Did anyone else manage to get better than around 1500 for Space Invaders? |
Just saw some discussion about using multiprocessing in this thread; I wonder what the current status is? I opened a dedicated ticket on this:
Hi @miyosuda, thanks for sharing the code. I have a question about the A3C LSTM implementation. Thanks!
Hello @miyosuda,
Thanks for sharing the code, and please ignore the title. I tried out your code on the control problem of the cart-pole balance experiment instead of an Atari game, and it works well. But I have a few questions.
I am curious: in the asynchronous paper they also use another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using this model to see whether it improves the results. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?
I am also wondering about the accumulated states and rewards being reversed: do you need to reverse the actions and values as well? It did not make any difference when I tried it, I'm just wondering why.
Last, do you really need to accumulate the gradients and then apply the update, since TensorFlow can handle a 'batch' update?