
Problem while using the code #1

Open
originholic opened this issue Apr 15, 2016 · 78 comments

@originholic

originholic commented Apr 15, 2016

Hello @miyosuda,

Thanks for sharing the code. Please ignore the title: I tried your code on the cart-pole balancing control problem instead of an Atari game, and it works well. But I have a few questions.

I am curious: in the asynchronous paper they also used another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using this model to see whether it improves the result. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?

Also, I noticed that the accumulated states and rewards are reversed; do you need to reverse the actions and values as well? It did not make any difference when I tried it, I'm just wondering why.

states.reverse()
rewards.reverse()

Last, do you really need to accumulate the gradients and then apply the update, since TensorFlow can handle the 'batch' for the update?

@miyosuda
Owner

miyosuda commented Apr 15, 2016

I'm really glad to know that you've succeeded in reproducing the continuous model.

About reverse(): as you say, I had forgotten to add

actions.reverse()
values.reverse()

in addition to

states.reverse()
rewards.reverse()

and I've pushed the fix just now. Let me explain why the lists are reversed.
In the pseudo code of the A3C algorithm in the DeepMind paper, there is

for i in {t-1, ...., tstart} do

This means that "i" decreases like

t-1, t-2, t-3 ... tstart

This is why I call reverse() on the lists collected in the loop.
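
For illustration, here is a rough sketch (not the repo's exact code) of why the reversed order helps; it assumes rewards and values are the reversed lists and R_bootstrap is V(s_t) estimated from the last state:

GAMMA = 0.99                     # discount factor (assumed value)

R = R_bootstrap                  # V(s_t) of the last state, or 0.0 if terminal
batch_td, batch_R = [], []
for (ri, Vi) in zip(rewards, values):
    # the lists are reversed, so this walks i = t-1, t-2, ..., t_start
    R = ri + GAMMA * R           # discounted return accumulated backwards
    batch_td.append(R - Vi)      # advantage estimate for the policy loss
    batch_R.append(R)            # target for the value loss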

As far as I have tried with the Atari model, I still couldn't reproduce a good learning result. I was planning to implement the LSTM after reproducing a good Atari result, but since you've succeeded with the continuous model, I should try the LSTM now.
Please wait.

About the batch, let me think about whether it is possible to replace the gradient accumulation with a batch update or not.
Just a moment please.

Thank you for suggestions!!

@originholic
Author

originholic commented Apr 15, 2016

Many thanks for your reply, and glad to hear that you also plan to work on the LSTM model.

I just uploaded the testing code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

I only tested it with the cart-pole balance domain, but somehow I found it actually takes longer to reach the desired score than your implementation. I will try to investigate this later; for now I will continue working with your implementation to study the LSTM model, which I am not familiar with.

Also, instead of the constant learning rate:

math.exp( log_lo * ( 1-rate ) + log_hi * rate)

I don't know whether the random initialization of the learning rate mentioned in the paper can help to improve the results:

math.exp( random.uniform( log_lo, log_hi ) )
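
For reference, a small sketch of the two options (log_uniform here is a hypothetical helper, not code from this repo):

import math
import random

def log_uniform(lo, hi, rate=None):
    log_lo, log_hi = math.log(lo), math.log(hi)
    if rate is None:
        # random initial learning rate, log-uniformly distributed in [lo, hi]
        return math.exp(random.uniform(log_lo, log_hi))
    # deterministic interpolation on the log scale (rate in [0, 1])
    return math.exp(log_lo * (1.0 - rate) + log_hi * rate)

lr = log_uniform(1e-4, 1e-2)   # e.g. sampled once per training thread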

@miyosuda
Owner

miyosuda commented Apr 15, 2016

I just uploaded the testing code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

Thanks! I'll try.

About the LSTM, I'm also new to LSTM and just started studying it recently, so please don't expect too much! However, I'm really interested in the 3D labyrinth model with LSTM, so I would like to try it.

About randomizing the learning rate with log_uniform: I also used to randomize the initial learning rate for each thread with log_uniform. However, when I looked at the figure on page 23, I found that the learning rate varies from 10^-4 to 10^-2, uniformly distributed on a log scale.

So my understanding is that they use log_uniform sampling to find the best hyper parameter when doing a grid search.

(In the graphs on pages 14 and 22, they also use a log scale for grid-searching parameters.)

However, I'm not sure my understanding is correct.

@originholic
Author

Thanks for pointing out.

After re-thinking the random initialization, I think you are right about it: the initial learning rates sampled from the LogUniform range were used to demonstrate the sensitivity of their method. And it makes sense that a constant (or best-choice) learning rate is applied for RMSProp and decayed to zero over time.

Sorry, my bad, I just got confused by this phrase they use in the paper:

"each using a different random initialization and initial learning rate"

@miyosuda
Owner

No problem. Any suggestions and discussion are always welcome. Thanks!

@originholic
Author

Hi, tuning in again.
May I ask: in the continuous action domain of the asynchronous paper, they used two policy outputs, a linear layer and a Softplus activation + linear layer, to represent the mean and variance. I am wondering how the policy loss can be calculated with two outputs?

self.policy_loss = -( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ) ) * self.td + entropy * entropy_beta )

I am thinking of calculating the loss separately, by making two copies of the above policy loss function, one per output. Does this make sense to you?
Sorry, this might be outside the scope of your interest, since the 3D labyrinth doesn't require continuous actions, but any suggestion is highly appreciated. Many thanks!

@miyosuda
Owner

miyosuda commented Apr 18, 2016

I have never tried a continuous model, so today I looked into another simple cart-pole actor-critic sample without a NN to learn about it.
How to define the policy loss for continuous actions is still difficult for me, so I'll try the continuous model in a new branch.
(I'm also interested in the continuous model.)
Maybe it will be natural to make two different loss functions for mean and variance, but I'm not sure yet.
I'll try to figure it out.

By the way, even with the discrete action model I'm implementing now, the policy loss function is the most difficult part for me, and I'm still not certain it is 100% correct.

However, when I tried a simple 2D grid maze model (which I implemented in the debug_maze branch), the program succeeded in finding the shortest path with this policy loss function. So the loss function for discrete actions seems fine.

Anyway, I'll report here if I find any result with the continuous model.

@originholic
Author

originholic commented Apr 19, 2016

Thanks for the reply.
As far as I can tell from your code, the policy loss function for the discrete domain is calculated using the negative log-likelihood of the softmax function.

After doing some searching, maybe I can apply the same kind of loss, i.e. a negative log-likelihood, but with a Gaussian (normal) distribution instead of the softmax, since the outputs are a mean and a variance. Following the formula, where sigma2 is the variance and mu is the mean, I think the loss function looks like this:

# assumes numpy is imported as np, and that self.a, self.mu, self.sigma2 are
# the action placeholder, policy mean and policy variance respectively
D = tf.to_float(tf.size(self.a))           # dimensionality of the action vector
x_prec = tf.exp(-tf.log(self.sigma2))      # precision = 1 / sigma^2
x_diff = tf.sub(self.a, self.mu)
x_power = tf.square(x_diff) * x_prec * -0.5
# negative log-likelihood of a diagonal Gaussian
gaussian_nll = (tf.reduce_sum(tf.log(self.sigma2)) + D * tf.log(2 * np.pi)) / 2 - tf.reduce_sum(x_power)
self.policy_loss = gaussian_nll * self.td + entropy_beta * entropy

Sorry for the messy typing. I will try this out to see whether it works for the continuous cart-pole domain and let you know how it goes.
Thanks

@miyosuda
Owner

Is this the explanation of this loss function?

http://docs.chainer.org/en/stable/reference/functions.html#chainer.functions.gaussian_nll

I really want to know the result. There is a lot for me to learn from this thread. Super thanks!!

@originholic
Author

originholic commented Apr 20, 2016

Yes, that's right, the negative log-likelihood of the normal distribution is from the Chainer site, but I also found another one called maximum log-likelihood; judging by the formula alone, I think they are the same thing. Same here, there are lots of methods out there waiting to be learned, and it's easy to get confused.

I tried the loss function based on your code, and it works moderately well on the cart-pole balance task in the continuous action domain; at least it is able to converge (that is, reach the desired score). But I probably need some more examples to study before concluding that the loss function actually works for continuous actions. So I'll keep working on it!! Thanks.

However, when I went back and tried it with the "batch" method, it reached a score of around 2000 (the desired score was 3000) and then the network somehow diverged immediately (I'm not quite sure whether it diverged or exploded; the network just gave "NaN" for its output all the time).

@miyosuda
Owner

Thank you for reporting.
I was trying batching with my discrete action code in the "batch" and "debug_maze_batch" branches.
I'm checking whether gradient accumulation works correctly when batched.

@ghost

ghost commented Apr 25, 2016

@miyosuda:
Hey, I had been trying to implement the same thing in Theano. I implemented an A2C version (single thread), which obviously never converges despite training on a GPU for even a week or so... then I came across your git source. Could you please let me know what exactly the issues are that you are facing right now that keep your learning from being as good as required? Is it NaNs and stability, or no convergence of the network? We can try to catch up on this, as I am also in urgent need of an Actor-Critic learner on Pong.

@miyosuda
Owner

miyosuda commented Apr 25, 2016

@aravindsrinivas
Thank you for joining the discussion.
Let me explain what I tried, what I succeeded and what I have not succeeded yet.

I have been trying Pong with A3C using 8 CPU threads.
The problem is that the game score does not increase even after one or two days of learning.
The AI can hit the ball back three or four times in one game, but the score does not increase like the DeepMind paper shows.

(As far as I have tried with Pong, the network does not diverge to NaN.)

To confirm whether my implementation has a problem or not, I tried an easier task.
I implemented a 10x10 2D grid maze and let this A3C algorithm find the shortest path.
After running for two or three minutes, the AI converged to the optimal result. (It succeeded in finding the shortest path.)

I tried this in the "debug_maze" branch.

After confirming that this algorithm can solve an easy RL task, I'm changing hyper parameters little by little to check whether the game score will increase like the paper shows.
But the result is still the same.

I once heard that DQN is very sensitive to hyper parameters, and as far as I can tell from the paper, the hyper parameters of this method seem sensitive too.

Along with tuning hyper parameters, I'm also planning to try another task, one that doesn't use a CNN.

By the way, the key concept of this method is to obtain stability by running multiple threads at the same time, so that the network does not diverge or oscillate.
So if you have problems with a single thread, how about trying multiple threads?

I have never tried Theano, but if you would like to run it with TensorFlow, I can help you.

@ghost

ghost commented Apr 25, 2016

@miyosuda
I mailed the authors (from DeepMind). These are some hyper parameters that they explicitly told me in the mail:

The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1. The maximum allowed gradient norm was 40. The best learning rates were around 7*10^-4. Backups of length 20 were used which corresponds to setting the t_max parameter to 20.

Also, I am not sure if you used frame skip in your implementation. From what I saw in game_state.py, you just have reward = ale.act(action)? Shouldn't it be in a for loop, like

reward = 0
for _ in range(frame_skip):
    reward += ale.act(action)

Also, are you clipping the reward to lie between -1 and 1? In DQN, rewards were clipped between -1 and 1. I am not sure what the rewards from the ALE source are for Pong.
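
For what it's worth, here is a rough sketch of how the hyper parameters above could be wired up with TensorFlow 1.x-era ops; total_loss, local_vars and global_vars are assumed names, and this is not the shared RMSPropApplier actually used in this repo:

import tensorflow as tf

RMSP_ALPHA = 0.99      # RMSProp decay ("alpha" in the paper)
RMSP_EPSILON = 0.1     # regularization constant ("epsilon")
GRAD_NORM_CLIP = 40.0  # maximum allowed gradient norm
LEARNING_RATE = 7e-4   # around the best reported learning rate

optimizer = tf.train.RMSPropOptimizer(
    learning_rate=LEARNING_RATE,
    decay=RMSP_ALPHA,
    momentum=0.0,
    epsilon=RMSP_EPSILON)

# gradients of the accumulated loss w.r.t. the local copy of the weights,
# clipped by global norm before being applied to the shared parameters
grads = tf.gradients(total_loss, local_vars)
grads, _ = tf.clip_by_global_norm(grads, GRAD_NORM_CLIP)
apply_gradients = optimizer.apply_gradients(list(zip(grads, global_vars)))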

@miyosuda
Owner

miyosuda commented Apr 25, 2016

Wowwowow! They are the parameters that I really wanted!!! Super thanks!!

I was always using t_max = 5 and didn't use gradient norm clipping.
(In the paper there was only one line referring to gradient norm clipping, so I hadn't tried it.)

also, I am not sure if you used the frame skip in your implementation

I've set frame skipping to every 4 frames in the "ale.cfg" file, but as you say, it might be better to put in a loop as you suggested.

I'm not clipping the score, but ALE Pong gives a reward of 1 or -1, so it should be OK.

Anyway, super thanks for giving me such a valuable information!!! I'll try these parameters.

@ghost

ghost commented Apr 25, 2016

@miyosuda
Another question: How exactly are you synchronizing the RMSProp parameters?

@miyosuda
Owner

miyosuda commented Apr 25, 2016

I'm accumulating gradients for t_max steps in each thread, and after that I'm applying the accumulated gradients with shared RMSProp. When applying the accumulated gradients, the "rms" parameter is shared among threads. (The "rms" parameter in TensorFlow corresponds to "g" in the paper.)
The "momentum" parameter in RMSProp could also be shared, but I'm not using momentum because there was no mention of RMSProp momentum in the paper.
(I'm using 0.0 as the momentum constant in RMSProp.)

When applying accumulated gradients with shared RMSProp, I'm not using any synchronization like mutual exclusion among threads.

(Is this what you are asking?)

As far as I can see from the TensorFlow source code, it seems OK to apply gradients without a lock when running on the CPU.
(To run it on the GPU, I need to research more to check whether we can implement shared RMSProp on the GPU, because memory handling on the GPU might be different from the CPU.)
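
For clarity, the shared RMSProp update from the paper looks roughly like this (a NumPy sketch of the update rule only, not the repo's TensorFlow code; the hyper parameter values are assumptions):

import numpy as np

ALPHA = 0.99          # RMSProp decay
EPSILON = 0.1         # regularization constant
LEARNING_RATE = 7e-4

def shared_rmsprop_apply(theta, g, d_theta):
    # theta and g are shared across threads; d_theta is one thread's
    # accumulated gradient. Updates are applied in place, without locking.
    g[:] = ALPHA * g + (1.0 - ALPHA) * np.square(d_theta)
    theta[:] -= LEARNING_RATE * d_theta / np.sqrt(g + EPSILON)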

@ghost

ghost commented Apr 25, 2016

Shouldn't we lock another thread from updating the parameters of the global network, when one particular thread is already updating it with its accumulated gradient from t_max steps?

My question was related to RMSProp previous gradient values. We do a moving average of the RMS of the gradients right? And the RMS is used to determine our update of the parameters. My question is: Would the gradient values of different threads all be used together to update the moving average of the RMS? Or do we have separate moving averages for each thread, which is used when that corresponding thread is updating the parameters using its accumulated gradient?

In the paper, they consider both approaches, but say that having separate RMSProp parameters (mainly the moving average) is less robust than sharing the moving average. But they don't reveal how exactly they synchronize the moving average across threads.

Could you explain what you are doing?

@miyosuda
Owner

miyosuda commented Apr 25, 2016

@aravindsrinivas
Sorry, my mistake: while checking my code, I found that the moving average of RMSProp is not shared. So my current implementation is not shared RMSProp.

I've created an RMSPropApplier class in rmsprop_applier.py.
In this class, the slot named "rms" corresponds to the parameter "g" in the paper.

(The "rms" slot parameter will be passed to native code around here)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc#L143

I created this class to share the "rms" parameter among threads, but I found that an RMSPropApplier instance is created in each thread in a3c_training_thread.py.

So the moving average is calculated separately in each thread. I need to fix this.
Sorry about that.

@ghost

ghost commented Apr 26, 2016

@miyosuda

Hi, I also confirmed that

  1. the critic learning rate must be half the actor's.
  2. the LR must be linearly annealed to 0 over the course of training (see the sketch after this list).
  3. the parameters 'g' and 'theta' (the moving average of the RMS of the gradients and, of course, the parameters) are shared across the threads (unlike your earlier version with separate RMS moving averages). Also, there is no need to lock while updating.
  4. t_max = 20 means 20 perceived frames (80 with frame skip, depending on the game), not 20 states, i.e. not 20 84x84x4 tensors, but rather 20 84x84 frames.
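
A tiny sketch of point 2 (linear annealing to zero), with assumed constants; this is illustrative only, not code from either repo:

INITIAL_LEARNING_RATE = 7e-4
MAX_TIME_STEPS = 60 * 10**6   # total global steps over which to anneal

def annealed_learning_rate(global_t):
    # decreases linearly from INITIAL_LEARNING_RATE to 0 as global_t approaches MAX_TIME_STEPS
    lr = INITIAL_LEARNING_RATE * (MAX_TIME_STEPS - global_t) / MAX_TIME_STEPS
    return max(lr, 0.0)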

A question: could you tell me at what speed (steps per second, where a step is one decision taken by the network during gameplay) the code runs for the 8-thread version? DeepMind says they get 1000 steps/sec with 16 threads, so for a single thread it should be 70. But I was never able to run at 70 with my single-thread code; it used to run at 30.

@miyosuda
Owner

miyosuda commented Apr 26, 2016

@aravindsrinivas

Thank you for providing such a valuable information again.

  1. the critic learning rate must be half the actor's.

I got it. I'll set the LR for the actor starting from 7*10^-4, and 3.5*10^-4 for the critic.

  2. the LR must be linearly annealed to 0 over the course of training.

I got it. I've already implemented LR annealing.

  3. the parameters 'g' and 'theta' are shared across the threads.

I see. I'm now testing sharing 'g' in the "shared_rmsprop" branch. I'll merge it to the master branch later after confirming.
In my implementation, 'theta' corresponds to the variables in the global_network instance.

  4. t_max = 20 means 20 perceived frames

I wanted to ask about this too.
I used to implement frame skipping via the "ale.cfg" file with the "frame_skip=4" option.
With this option, every time we call ale.act(chosen_action), the game advances 4 frames.

So I was storing frames for each state during one backup sequence (a sequence of 5 states) like this.

(pattern A)
state[0] = { 0  4  8 12}     <- frames 0, 4, 8, 12
state[1] = { 4  8 12 16}
state[2] = { 8 12 16 20}
state[3] = {12 16 20 24}
state[4] = {16 20 24 28}

With this pattern, adjacent states share three perceived frames.

Another way to store frames with 4 frame skipping is

(pattern B)
state[0] = { 0  4  8 12}
state[1] = {16 20 24 28}
state[2] = {32 36 40 44}
state[3] = {48 52 56 60}
state[4] = {64 68 72 76}

If we choose pattern B, one chosen action will persist for 16 frames.
How should we implement frame skipping with t_max = 20?
If you have any idea about this, please let me know.

About the running speed in steps, I'll check it on my environment; please wait a moment!

@miyosuda
Owner

@aravindsrinivas
I've checked the running speed.
I'm outside now, so I checked it with my MacBookPro (Intel Core i7 2.5GHz).

It was 106 steps per second with 8 threads, so about 13 steps per second per thread.
I have another Core i7-6700 desktop machine, and I remember it was about 1.5x (or 2x?) faster than my MacBook Pro.
(I'll check with Core i7 machine later)

Anyway, speed on my environment is much slower than DeepMind's.

@ghost

ghost commented Apr 26, 2016

@miyosuda

That's quite slow, I guess... Maybe I got 30 steps per second for a single thread because of the GPU. I can't understand how DeepMind got it working at 70 steps/sec for a single thread. That's almost as fast as running DQN on a GPU.
So your code is about 5 times slower than DeepMind's, I guess... But we can still reproduce results with 1-2 days of running.

@ghost

ghost commented Apr 26, 2016

@miyosuda
When I implemented it, I had it the same way as pattern A: (0,4,8,12), (4,8,12,16), (8,12,16,20), .... Even in DQN, that's the way they do it.

What we should do is: say we are at frame 0; we take an action and repeat it 4 times. We execute 0->1, 1->2, 2->3 and 3->4 using the same action that was decided at 0. We then decide on an action at frame 4, execute 4->5, 5->6, 6->7, 7->8 (4 repetitions), decide on an action at frame 8, and so on.

Our states would be (0,4,8,12); (4,8,12,16); (8,12,16,20); .... Since they say t_max is equivalent to 20 perceived frames, we must stop at (64,68,72,76). That is, you stop once you decide on an action at the 76th frame, and repeat it 4 times to get to the 80th frame. The 80th frame (with the past 3 perceived frames 68, 72, 76) would be our s_{t_max}, which is used to calculate our target through V(s_{t_max}). We would have 17 tuples (0,4,8,12), (4,8,12,16), ..., (64,68,72,76) for s_t with t = 0 to t_max - 1, and s_{t_max} would be (68,72,76,80).
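
A tiny sketch of the indices described above (pure Python, illustrative only):

FRAME_SKIP = 4
T_MAX = 20   # perceived frames per backup segment

perceived = [i * FRAME_SKIP for i in range(T_MAX + 1)]         # 0, 4, ..., 80
# sliding window of 4 perceived frames for each state s_t
states = [tuple(perceived[i:i + 4]) for i in range(T_MAX - 3)]
# -> [(0, 4, 8, 12), (4, 8, 12, 16), ..., (64, 68, 72, 76)]    (17 tuples)
bootstrap_state = tuple(perceived[-4:])                        # (68, 72, 76, 80)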

I will actually try to implement a Theano version now that so many details are clear. Please keep updating on whether you are able to implement it.

@joabim

joabim commented Apr 26, 2016

Hi, I also found this repo after trying to implement A3C from the DeepMind article; it's nice to see progress! However, when running the implementation, the agents seem to perform only three actions from the legal action set given by the ALE interface, and these actions correspond to idle, fire and right. Could this be a result of the provided Pong binary being problematic, or of ACTION_SIZE being set to 3? The reason I'm asking is that when displaying the results after a few hours of training, the paddle is stuck at the edge of the Pong playing field.

@ghost

ghost commented Apr 26, 2016

@joabim
I think it is because the ACTION_SIZE is set to 3. He is using only the legal actions allowed for the Pong game, and Pong has only 3 actions (moving up/down/staying idle).

@joabim

joabim commented Apr 26, 2016

@aravindsrinivas
You're right! But for some reason, instead of up/down/idle, my runtime printouts seem to suggest that the agents perform the actions noop/idle (0), fire (1) and right (3) (which corresponds to up when testing pong.bin in the Stella emulator) according to the Arcade Learning Environment documentation, but maybe I'm misinterpreting the minimal action set. Do you get a moving paddle?

@miyosuda
Owner

@aravindsrinivas
Now I understand what you mean. I'll try that way too. Thanks!

@joabim
Thank you for joining the discussion.
It seems strange to get [0, 1, 3] from the Pong ROM.

I tried this code,

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
real_actions = ale.getMinimalActionSet()
print "minimal actions=", real_actions

and I got the result

minimal actions= [0 3 4]

[0, 3, 4] means [idle, right, left]
Could you try the code above?

@miyosuda
Owner

miyosuda commented Apr 26, 2016

@joabim
Ah, there is another function named getLegalActionSet() in ALE, and I tried it too.

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
legal_actions = ale.getLegalActionSet()
print "legal actions=", legal_actions

and the result was

 legal actions= [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]

So I think getLegalActionSet() is just returning all default actions. Which function are you using, getMinimalActionSet() or getLegalActionSet()?

@joabim

joabim commented Apr 26, 2016

@miyosuda
Exactly, when I invoke getMinimalActionSet() I get

[ 0 1 3 4 11 12]

so I'm wondering what has happened. I have tried rebuilding ALE but it doesn't change. When just running the code, the score remains at -21. By setting (forcing)

self.real_actions = [0, 3, 4]

I actually get some results from the training as to be expected. It's really weird that I can't get the real action set from ALE!

@ghost

ghost commented May 6, 2016

We also don't need to output the thread score (even if it is only thread 0) during the training phase.

This just follows the same convention as DQN in DeepMind's code or Nathan Sprague's Lasagne implementation.

On Fri, May 6, 2016 at 8:46 PM, Aravind Srinivas L <[email protected]> wrote:

I think we should follow the epoch convention for testing, and only with respect to the global network parameters, not the thread parameters.

That is, we must train using all threads, update the global parameters, and periodically test only the global network. That is what has been done in the paper. Every 4 million frames (1 million steps, the value of T), a testing epoch must be conducted that lasts 500000 frames (125000 steps). I think this will make it better... What do you think?

On Fri, May 6, 2016 at 8:26 PM, Kosuke Miyoshi [email protected] wrote:

@joabim https://github.com/joabim
Thanks! As far as I can see from your result, CPU usage seems OK. I'll stick with multithreading at this moment.

Regarding the score recording for tensorboard, couldn't we record the average of the 16 threads?

I was thinking the same thing.
I've added a modification to record scores from all threads and pushed it to the "all_scores" branch.
I'll test this branch, and if there is no problem, I'll merge it to master.

(I'm not averaging the score, but does this help?)


Aravind

@originholic
Author

originholic commented May 6, 2016

@miyosuda
Very glad to hear that learning is going on with the ALE environment!!
Also, I agree with the idea of @aravindsrinivas. How about we add one more thread dedicated to testing the global net, i.e. 16 training threads + 1 validation thread, so we just need to monitor the validation thread for the score according to the global T?

If we use the multiprocessing module, I think we can make the validation thread handle all the I/O scheduling to speed up the training. I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

@miyosuda
Owner

miyosuda commented May 6, 2016

@aravindsrinivas @originholic
Thank you for the suggestion. Let me think about how to do validation with global network efficiently.

I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

Great! If the performance increases, let me know! I think the current 80% CPU usage has room to improve.

@joabim

joabim commented May 9, 2016

Alright, now I'm back, and I understand that we need to change the implementation. The results from my run over the weekend follow (even though they don't matter anymore):

[image: a3c2]

@miyosuda
Owner

miyosuda commented May 9, 2016

@joabim
Thank you for testing.

The result on my machine over these 4 days is

[image: 4days]

and the step count was around 53 million. (The learning rate becomes zero after 60 million steps.)
Hmm what's the difference between your environment and mine...

Let me try one more training run to check.

@miyosuda
Owner

miyosuda commented May 9, 2016

I hadn't noticed before, but I found that in muupan's project,

https://github.com/muupan/async-rl

the learning rates of actor/critic are opposite to mine.
His learning rate for the actor is half the critic's, and his result is better than mine.

I'll try his setting in this branch.

https://github.com/miyosuda/async_deep_reinforce/tree/muupan_lr_setting

@miyosuda
Owner

miyosuda commented May 9, 2016

After 59 million steps of learning, I visualized the weights of the first convolution layer.

$ python a3c_visualize.py

The result was like this.

[image: weights_with_note]

I think the second column represents the upward movement of the paddle. (One column represents the 4 frames of the input.)

@joabim

joabim commented May 10, 2016

I realized that in my previous run I had forgotten to install ALE (the fork you made with correct Pong support and multithreading) into my new Anaconda environment after building it, which resulted in me using the incorrect version of ALE... I am redoing the test for 16 threads using the muupan_lr_setting branch now! I'll let you know how it goes.

@miyosuda
Owner

@joabim
I see.
BTW, I'm going to ask muupan about his settings in his issue thread. He said that he asked the DeepMind authors about tuning, and I hope I can apply his feedback to mine later.

@muupan

muupan commented May 10, 2016

@miyosuda For everyone's information, I summarized their settings here: https://github.com/muupan/async-rl/wiki

@miyosuda
Owner

@muupan Thank you!

@cjratcliff

@miyosuda Hi. I’ve got an LSTM working with your code. I’ve only tested it on a toy problem (4 state MDP) rather than an Atari game but it seems to be working properly and as well as the feedforward net does. The code is at https://github.com/cjratcliff/async_deep_reinforce. I’ve made quite a few changes for my own version, many of them outside the LSTM parts so I’m happy to answer any questions. For using it on Atari, in addition to increasing the RNN size, I’d recommend changing the cell type from BasicRNNCell to BasicLSTMCell and removing the activation function argument to that function.

@miyosuda
Owner

@cjratcliff Thank you for sharing your LSTM version!!! Let me try it!!!

@miyosuda
Owner

miyosuda commented Jul 3, 2016

@cjratcliff I've pushed my LSTM version.
To make pong work with LSTM, I added

  1. Unrolling the LSTM cell up to 5 time steps (LOCAL_T_MAX time steps), so the back-prop calculation is now batched through the unrolled LSTM (see the sketch below).
  2. Calling actions.reverse(), states.reverse(), etc. again to restore the normal input order.
    When calculating "R", I call reverse() to make the calculation easier (because, starting from the last state, R can be calculated recursively as written in the original paper). So I call reverse() again to fix the order.

With LSTM, the score of pong hit the maximum score easily. Thanks.
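
For anyone curious, the unrolling in point 1 can be sketched roughly as follows (TF 0.x/1.x-era API; module paths differ between versions, and this is a simplified illustration rather than the code in this repo):

import tensorflow as tf

LOCAL_T_MAX = 5     # number of time steps unrolled per update
FEATURE_SIZE = 256  # assumed size of the features fed into the LSTM

# one unrolled sequence: [batch=1, time, features]
features = tf.placeholder(tf.float32, [1, None, FEATURE_SIZE])
step_size = tf.placeholder(tf.int32, [1])   # actual sequence length (<= LOCAL_T_MAX)

cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
initial_state = cell.zero_state(batch_size=1, dtype=tf.float32)

# dynamic_rnn unrolls the cell over the time dimension; the returned state
# is carried over to the next update during training and play
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
    cell, features,
    initial_state=initial_state,
    sequence_length=step_size,
    time_major=False)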

@cjratcliff

@miyosuda Great to see it working so well, thanks.

@Itsukara
Contributor

@miyosuda
Your work is great!
I tried your program on the game "Breakout".
With one day of training, I got 833 points as the maximum score.

BTW, during training I encountered some trouble.
"pi" sometimes becomes NaN, and the saved data was not usable for demo play.

The reason is that "pi" can become 0.0 and your code does not handle that correctly.
I think you'd better change the following code in the file "game_ac_network.py".
I changed the code as follows and have had no problems so far.

  • Current code:
entropy = -tf.reduce_sum(self.pi * tf.log(self.pi), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
  • My proposal:
entropy = -tf.reduce_sum(self.pi * tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

@miyosuda
Owner

miyosuda commented Aug 1, 2016

@Itsukara Sorry for late in reply (I didn't notice your post until now), and thank you for suggestion!
As you suggest, my code can't treat zero pi value. I'll test it and apply your fix to my repo later.
Thanks!

@sahiliitm

The code performs really well on some games, but on others it doesn't quite reach the scores reported in the paper. I wonder why that is. For example, in Space Invaders the reported score is 23846.0, and the model I trained comes nowhere near that. :( Did anyone else manage to get better than around 1500 for Space Invaders?

@mw66

mw66 commented Jan 22, 2017

I just saw some discussion on using multiprocessing in this thread; I wonder what the current status is?

I opened a dedicated ticket on this:

#27

muupan added a commit to chainer/chainerrl that referenced this issue Jan 30, 2017
@dongleecsu

Hi @miyosuda , thanks for sharing the code. I have a question about A3C LSTM implementation.

In class GameACLSTMNetwork, line 217, why share the LSTM weights among threads? Maybe it makes sense to create "no reuse" LSTM weights for every worker and the global_network, and to synchronize all the variables from the global_network.

Thanks!
