
Problem while using the code #1

Open
originholic opened this issue Apr 15, 2016 · 78 comments

@originholic

originholic commented Apr 15, 2016

Hello @miyosuda,

Thanks for sharing the code. Please ignore the title: I tried your code on the cart-pole balancing control problem instead of an Atari game, and it works well. But I have a few questions.

I am curious: in the asynchronous paper they also used another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using this model to see whether it improves the result. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?

Also, I noticed that the accumulated states and rewards are reversed; do you need to reverse the actions and values as well? It did not make any difference when I tried it, I'm just wondering why.

states.reverse()
rewards.reverse()

Last, do you really need to accumulate the gradients and then apply the update, since TensorFlow can handle the 'batch' for the update?

@miyosuda
Owner

miyosuda commented Apr 15, 2016

I'm really glad to know that you've succeeded in reproducing the continuous model.

About reverse(): as you say, I had forgotten to add

actions.reverse()
values.reverse()

in addition to

states.reverse()
rewards.reverse()

and I've pushed the fix just now. Let me explain why the lists are reversed.
In the pseudo code of the A3C algorithm in the DeepMind paper, there is

for i in {t-1, ...., tstart} do

This means that "i" decreases like

t-1, t-2, t-3 ... tstart

This is why I call reverse() on the lists collected in the loop.
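
For illustration, here is a rough sketch (not the repo's exact code) of why the reversed order helps; it assumes rewards and values are the reversed lists and R_bootstrap is V(s_t) estimated from the last state:

GAMMA = 0.99                     # discount factor (assumed value)

R = R_bootstrap                  # V(s_t) of the last state, or 0.0 if terminal
batch_td, batch_R = [], []
for (ri, Vi) in zip(rewards, values):
    # the lists are reversed, so this walks i = t-1, t-2, ..., t_start
    R = ri + GAMMA * R           # discounted return accumulated backwards
    batch_td.append(R - Vi)      # advantage estimate for the policy loss
    batch_R.append(R)            # target for the value loss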

As far as I have tried with the Atari model, I still couldn't reproduce a good learning result. I was planning to implement the LSTM after reproducing a good Atari result, but since you've succeeded with the continuous model, I should try the LSTM now.
Please wait.

About the batch, let me think about whether it is possible to replace the gradient accumulation with a batch update or not.
Just a moment please.

Thank you for suggestions!!

@originholic
Author

originholic commented Apr 15, 2016

Many thanks for your reply, and glad to hear that you also plan to work on the LSTM model.

I just uploaded the testing code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

I only tested it with the cart-pole balance domain, but somehow I found it actually takes longer to reach the desired score than your implementation. I will try to investigate this later; for now I will continue working with your implementation to study the LSTM model, which I am not familiar with.

Also, instead of the constant learning rate:

math.exp( log_lo * ( 1-rate ) + log_hi * rate)

I don't know whether the random initialization of the learning rate mentioned in the paper can help to improve the results:

math.exp( random.uniform( log_lo, log_hi ) )
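
For reference, a small sketch of the two options (log_uniform here is a hypothetical helper, not code from this repo):

import math
import random

def log_uniform(lo, hi, rate=None):
    log_lo, log_hi = math.log(lo), math.log(hi)
    if rate is None:
        # random initial learning rate, log-uniformly distributed in [lo, hi]
        return math.exp(random.uniform(log_lo, log_hi))
    # deterministic interpolation on the log scale (rate in [0, 1])
    return math.exp(log_lo * (1.0 - rate) + log_hi * rate)

lr = log_uniform(1e-4, 1e-2)   # e.g. sampled once per training thread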

@miyosuda
Owner

miyosuda commented Apr 15, 2016

I just uploaded the testing code (based on this repo) for the "batch" update that I mentioned.
https://github.com/originholic/a3c_vrep.git

Thanks! I'll try.

About the LSTM, I'm also new to LSTM and just started studying it recently, so please don't expect too much! However, I'm really interested in the 3D labyrinth model with LSTM, so I would like to try it.

About randomizing the learning rate with log_uniform: I also used to randomize the initial learning rate for each thread with log_uniform. However, when I looked at the figure on page 23, I found that the learning rate varies from 10^-4 to 10^-2, uniformly distributed on a log scale.

So my understanding is that they use log_uniform sampling to find the best hyper parameter when doing a grid search.

(In the graphs on pages 14 and 22, they also use a log scale for grid-searching parameters.)

However, I'm not sure my understanding is correct.

@originholic
Author

Thanks for pointing out.

After re-thinking the random initialization, I think you are right about it: the initial learning rates sampled from the LogUniform range were used to demonstrate the sensitivity of their method. And it makes sense that a constant (or best-choice) learning rate is applied for RMSProp and decayed to zero over time.

Sorry, my bad, I just got confused by this phrase they use in the paper:

"each using a different random initialization and initial learning rate"

@miyosuda
Owner

No problem. Any suggestions and discussion are always welcome. Thanks!

@originholic
Author

Hi, tuning in again.
May I ask: in the continuous action domain of the asynchronous paper, they used two policy outputs, a linear layer and a Softplus activation + linear layer, to represent the mean and variance. I am wondering how the policy loss can be calculated with two outputs?

self.policy_loss = -( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ) ) * self.td + entropy * entropy_beta )

I am thinking of calculating the loss separately, by making two copies of the above policy loss function, one per output. Does this make sense to you?
Sorry, this might be outside the scope of your interest, since the 3D labyrinth doesn't require continuous actions, but any suggestion is highly appreciated. Many thanks!

@miyosuda
Owner

miyosuda commented Apr 18, 2016

I have never tried a continuous model, so today I looked into another simple cart-pole actor-critic sample without a NN to learn about it.
How to define the policy loss for continuous actions is still difficult for me, so I'll try the continuous model in a new branch.
(I'm also interested in the continuous model.)
Maybe it will be natural to make two different loss functions for mean and variance, but I'm not sure yet.
I'll try to figure it out.

By the way, even with the discrete action model I'm implementing now, the policy loss function is the most difficult part for me, and I'm still not certain it is 100% correct.

However, when I tried a simple 2D grid maze model (which I implemented in the debug_maze branch), the program succeeded in finding the shortest path with this policy loss function. So the loss function for discrete actions seems fine.

Anyway, I'll report here if I find any result with the continuous model.

@originholic
Author

originholic commented Apr 19, 2016

Thanks for the reply.
As far as I can tell from your code, the policy loss function for the discrete domain is calculated using the negative log-likelihood of the softmax function.

After doing some searching, maybe I can apply the same kind of loss, i.e. a negative log-likelihood, but with a Gaussian (normal) distribution instead of the softmax, since the outputs are a mean and a variance. Following the formula, where sigma2 is the variance and mu is the mean, I think the loss function looks like this:

# assumes numpy is imported as np, and that self.a, self.mu, self.sigma2 are
# the action placeholder, policy mean and policy variance respectively
D = tf.to_float(tf.size(self.a))           # dimensionality of the action vector
x_prec = tf.exp(-tf.log(self.sigma2))      # precision = 1 / sigma^2
x_diff = tf.sub(self.a, self.mu)
x_power = tf.square(x_diff) * x_prec * -0.5
# negative log-likelihood of a diagonal Gaussian
gaussian_nll = (tf.reduce_sum(tf.log(self.sigma2)) + D * tf.log(2 * np.pi)) / 2 - tf.reduce_sum(x_power)
self.policy_loss = gaussian_nll * self.td + entropy_beta * entropy

Sorry for the messy typing. I will try this out to see whether it works for the continuous cart-pole domain and let you know how it goes.
Thanks

@miyosuda
Owner

Is this the explanation of this loss function?

http://docs.chainer.org/en/stable/reference/functions.html#chainer.functions.gaussian_nll

I really want to know the result. There is a lot for me to learn from this thread. Super thanks!!

@originholic
Author

originholic commented Apr 20, 2016

Yes, that's right, the negative log-likelihood of the normal distribution is from the Chainer site, but I also found another one called maximum log-likelihood; judging by the formula alone, I think they are the same thing. Same here, there are lots of methods out there waiting to be learned, and it's easy to get confused.

I tried the loss function based on your code, and it works moderately well on the cart-pole balance task in the continuous action domain; at least it is able to converge (that is, reach the desired score). But I probably need some more examples to study before concluding that the loss function actually works for continuous actions. So I'll keep working on it!! Thanks.

However, when I went back and tried it with the "batch" method, it reached a score of around 2000 (the desired score was 3000) and then the network somehow diverged immediately (I'm not quite sure whether it diverged or exploded; the network just gave "NaN" for its output all the time).

@miyosuda
Owner

Thank you for reporting.
I was trying batching with my discrete action code in the "batch" and "debug_maze_batch" branches.
I'm checking whether gradient accumulation works correctly when batched.

@ghost

ghost commented Apr 25, 2016

@miyosuda:
Hey, I had been trying to implement the same thing in Theano. I implemented an A2C version (single thread), which obviously never converges despite training on a GPU for even a week or so... then I came across your git source. Could you please let me know what exactly the issues are that you are facing right now that keep your learning from being as good as required? Is it NaNs and stability, or no convergence of the network? We can try to catch up on this, as I am also in urgent need of an Actor-Critic learner on Pong.

@miyosuda
Owner

miyosuda commented Apr 25, 2016

@aravindsrinivas
Thank you for joining the discussion.
Let me explain what I tried, what I succeeded and what I have not succeeded yet.

I have been trying Pong with A3C using 8 CPU threads.
The problem is that the game score does not increase even after one or two days of learning.
The AI can hit the ball back three or four times in one game, but the score does not increase like the DeepMind paper shows.

(As far as I have tried with Pong, the network does not diverge to NaN.)

To confirm whether my implementation has a problem or not, I tried an easier task.
I implemented a 10x10 2D grid maze and let this A3C algorithm find the shortest path.
After running for two or three minutes, the AI converged to the optimal result. (It succeeded in finding the shortest path.)

I tried this in the "debug_maze" branch.

After confirming that this algorithm can solve an easy RL task, I'm changing hyper parameters little by little to check whether the game score will increase like the paper shows.
But the result is still the same.

I once heard that DQN is very sensitive to hyper parameters, and as far as I can tell from the paper, the hyper parameters of this method seem sensitive too.

Along with tuning hyper parameters, I'm also planning to try another task, one that doesn't use a CNN.

By the way, the key concept of this method is to obtain stability by running multiple threads at the same time, so that the network does not diverge or oscillate.
So if you have problems with a single thread, how about trying multiple threads?

I have never tried Theano, but if you would like to run it with TensorFlow, I can help you.

@ghost

ghost commented Apr 25, 2016

@miyosuda
I mailed the authors (from DeepMind). These are some hyper parameters that they explicitly told me in the mail:

The decay parameter (called alpha in the paper) for RMSProp was 0.99 and the regularization constant (called epsilon in the paper) was 0.1. The maximum allowed gradient norm was 40. The best learning rates were around 7*10^-4. Backups of length 20 were used which corresponds to setting the t_max parameter to 20.

Also, I am not sure if you used frame skip in your implementation. From what I saw in game_state.py, you just have reward = ale.act(action)? Shouldn't it be in a for loop, like

reward = 0
for _ in range(frame_skip):
    reward += ale.act(action)

Also, are you clipping the reward to lie between -1 and 1? In DQN, rewards were clipped between -1 and 1. I am not sure what the rewards from the ALE source are for Pong.
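
For what it's worth, here is a rough sketch of how the hyper parameters above could be wired up with TensorFlow 1.x-era ops; total_loss, local_vars and global_vars are assumed names, and this is not the shared RMSPropApplier actually used in this repo:

import tensorflow as tf

RMSP_ALPHA = 0.99      # RMSProp decay ("alpha" in the paper)
RMSP_EPSILON = 0.1     # regularization constant ("epsilon")
GRAD_NORM_CLIP = 40.0  # maximum allowed gradient norm
LEARNING_RATE = 7e-4   # around the best reported learning rate

optimizer = tf.train.RMSPropOptimizer(
    learning_rate=LEARNING_RATE,
    decay=RMSP_ALPHA,
    momentum=0.0,
    epsilon=RMSP_EPSILON)

# gradients of the accumulated loss w.r.t. the local copy of the weights,
# clipped by global norm before being applied to the shared parameters
grads = tf.gradients(total_loss, local_vars)
grads, _ = tf.clip_by_global_norm(grads, GRAD_NORM_CLIP)
apply_gradients = optimizer.apply_gradients(list(zip(grads, global_vars)))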

@miyosuda
Owner

miyosuda commented Apr 25, 2016

Wowwowow! They are the parameters that I really wanted!!! Super thanks!!

I was always using t_max = 5 and didn't use gradient norm clipping.
(In the paper there was only one line referring to gradient norm clipping, so I hadn't tried it.)

also, I am not sure if you used the frame skip in your implementation

I've set frame skipping to every 4 frames in the "ale.cfg" file, but as you say, it might be better to put in a loop as you suggested.

I'm not clipping the score, but ALE Pong gives a reward of 1 or -1, so it should be OK.

Anyway, super thanks for giving me such a valuable information!!! I'll try these parameters.

@ghost

ghost commented Apr 25, 2016

@miyosuda
Another question: How exactly are you synchronizing the RMSProp parameters?

@miyosuda
Owner

miyosuda commented Apr 25, 2016

I'm accumulating gradients for t_max steps in each thread, and after that I'm applying the accumulated gradients with shared RMSProp. When applying the accumulated gradients, the "rms" parameter is shared among threads. (The "rms" parameter in TensorFlow corresponds to "g" in the paper.)
The "momentum" parameter in RMSProp could also be shared, but I'm not using momentum because there was no mention of RMSProp momentum in the paper.
(I'm using 0.0 as the momentum constant in RMSProp.)

When applying accumulated gradients with shared RMSProp, I'm not using any synchronization like mutual exclusion among threads.

(Is this what you are asking?)

As far as I can see from the TensorFlow source code, it seems OK to apply gradients without a lock when running on the CPU.
(To run it on the GPU, I need to research more to check whether we can implement shared RMSProp on the GPU, because memory handling on the GPU might be different from the CPU.)
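
For clarity, the shared RMSProp update from the paper looks roughly like this (a NumPy sketch of the update rule only, not the repo's TensorFlow code; the hyper parameter values are assumptions):

import numpy as np

ALPHA = 0.99          # RMSProp decay
EPSILON = 0.1         # regularization constant
LEARNING_RATE = 7e-4

def shared_rmsprop_apply(theta, g, d_theta):
    # theta and g are shared across threads; d_theta is one thread's
    # accumulated gradient. Updates are applied in place, without locking.
    g[:] = ALPHA * g + (1.0 - ALPHA) * np.square(d_theta)
    theta[:] -= LEARNING_RATE * d_theta / np.sqrt(g + EPSILON)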

@ghost

ghost commented Apr 25, 2016

Shouldn't we lock another thread from updating the parameters of the global network, when one particular thread is already updating it with its accumulated gradient from t_max steps?

My question was related to RMSProp previous gradient values. We do a moving average of the RMS of the gradients right? And the RMS is used to determine our update of the parameters. My question is: Would the gradient values of different threads all be used together to update the moving average of the RMS? Or do we have separate moving averages for each thread, which is used when that corresponding thread is updating the parameters using its accumulated gradient?

In the paper, they consider both approaches, but say that having separate RMSProp parameters (mainly the moving average) is less robust than sharing the moving average. But they don't reveal how exactly they synchronize the moving average across threads.

Could you explain what you are doing?

@miyosuda
Owner

miyosuda commented Apr 25, 2016

@aravindsrinivas
Sorry, my mistake: while checking my code, I found that the moving average of RMSProp is not shared. So my current implementation is not shared RMSProp.

I've created an RMSPropApplier class in rmsprop_applier.py.
In this class, the slot named "rms" corresponds to the parameter "g" in the paper.

(The "rms" slot parameter will be passed to native code around here)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc#L143

I created this class to share the "rms" parameter among threads, but I found that an RMSPropApplier instance is created in each thread in a3c_training_thread.py.

So the moving average is calculated separately in each thread. I need to fix this.
Sorry about that.

@ghost

ghost commented Apr 26, 2016

@miyosuda

Hi, I also confirmed that

  1. the critic learning rate must be half the actor's.
  2. the LR must be linearly annealed to 0 over the course of training (see the sketch after this list).
  3. the parameters 'g' and 'theta' (the moving average of the RMS of the gradients and, of course, the parameters) are shared across the threads (unlike your earlier version with separate RMS moving averages). Also, there is no need to lock while updating.
  4. t_max = 20 means 20 perceived frames (80 with frame skip, depending on the game), not 20 states, i.e. not 20 84x84x4 tensors, but rather 20 84x84 frames.
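
A tiny sketch of point 2 (linear annealing to zero), with assumed constants; this is illustrative only, not code from either repo:

INITIAL_LEARNING_RATE = 7e-4
MAX_TIME_STEPS = 60 * 10**6   # total global steps over which to anneal

def annealed_learning_rate(global_t):
    # decreases linearly from INITIAL_LEARNING_RATE to 0 as global_t approaches MAX_TIME_STEPS
    lr = INITIAL_LEARNING_RATE * (MAX_TIME_STEPS - global_t) / MAX_TIME_STEPS
    return max(lr, 0.0)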

A question: could you tell me at what speed (steps per second, where a step is one decision taken by the network during gameplay) the code runs for the 8-thread version? DeepMind says they get 1000 steps/sec with 16 threads, so for a single thread it should be 70. But I was never able to run at 70 with my single-thread code; it used to run at 30.

@miyosuda
Owner

miyosuda commented Apr 26, 2016

@aravindsrinivas

Thank you for providing such a valuable information again.

  1. the critic learning rate must be half the actor's.

I got it. I'll set the LR for the actor starting from 7*10^-4, and 3.5*10^-4 for the critic.

  2. the LR must be linearly annealed to 0 over the course of training.

I got it. I've already implemented LR annealing.

  3. the parameters 'g' and 'theta' are shared across the threads.

I see. I'm now testing sharing 'g' in the "shared_rmsprop" branch. I'll merge it to the master branch later after confirming.
In my implementation, 'theta' corresponds to the variables in the global_network instance.

  4. t_max = 20 means 20 perceived frames

I wanted to ask about this too.
I used to implement frame skipping via the "ale.cfg" file with the "frame_skip=4" option.
With this option, every time we call ale.act(chosen_action), the game advances 4 frames.

So I was storing frames for each state during one backup sequence (a sequence of 5 states) like this.

(pattern A)
state[0] = { 0  4  8 12}     <- frames 0, 4, 8, 12
state[1] = { 4  8 12 16}
state[2] = { 8 12 16 20}
state[3] = {12 16 20 24}
state[4] = {16 20 24 28}

With this pattern, adjacent states share three perceived frames.

Another way to store frames with 4 frame skipping is

(pattern B)
state[0] = { 0  4  8 12}
state[1] = {16 20 24 28}
state[2] = {32 36 40 44}
state[3] = {48 52 56 60}
state[4] = {64 68 72 76}

If we choose pattern B, one chosen action will persist for 16 frames.
How should we implement frame skipping with t_max = 20?
If you have any idea about this, please let me know.

About the running speed in steps, I'll check it on my environment; please wait a moment!

@miyosuda
Owner

@aravindsrinivas
I've checked the running speed.
I'm outside now, so I checked it with my MacBookPro (Intel Core i7 2.5GHz).

It was 106 steps per second with 8 threads, so about 13 steps per second per thread.
I have another Core i7-6700 desktop machine, and I remember it was about 1.5x (or 2x?) faster than my MacBook Pro.
(I'll check with Core i7 machine later)

Anyway, speed on my environment is much slower than DeepMind's.

@ghost

ghost commented Apr 26, 2016

@miyosuda

That's quite slow, I guess... Maybe I got 30 steps per second for a single thread because of the GPU. I can't understand how DeepMind got it working at 70 steps/sec for a single thread. That's almost as fast as running DQN on a GPU.
So your code is about 5 times slower than DeepMind's, I guess... But we can still reproduce results with 1-2 days of running.

@ghost

ghost commented Apr 26, 2016

@miyosuda
When I implemented it, I had it the same way as pattern A: (0,4,8,12), (4,8,12,16), (8,12,16,20), .... Even in DQN, that's the way they do it.

What we should do is: say we are at frame 0; we take an action and repeat it 4 times. We execute 0->1, 1->2, 2->3 and 3->4 using the same action that was decided at 0. We then decide on an action at frame 4, execute 4->5, 5->6, 6->7, 7->8 (4 repetitions), decide on an action at frame 8, and so on.

Our states would be (0,4,8,12); (4,8,12,16); (8,12,16,20); .... Since they say t_max is equivalent to 20 perceived frames, we must stop at (64,68,72,76). That is, you stop once you decide on an action at the 76th frame, and repeat it 4 times to get to the 80th frame. The 80th frame (with the past 3 perceived frames 68, 72, 76) would be our s_{t_max}, which is used to calculate our target through V(s_{t_max}). We would have 17 tuples (0,4,8,12), (4,8,12,16), ..., (64,68,72,76) for s_t with t = 0 to t_max - 1, and s_{t_max} would be (68,72,76,80).
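
A tiny sketch of the indices described above (pure Python, illustrative only):

FRAME_SKIP = 4
T_MAX = 20   # perceived frames per backup segment

perceived = [i * FRAME_SKIP for i in range(T_MAX + 1)]         # 0, 4, ..., 80
# sliding window of 4 perceived frames for each state s_t
states = [tuple(perceived[i:i + 4]) for i in range(T_MAX - 3)]
# -> [(0, 4, 8, 12), (4, 8, 12, 16), ..., (64, 68, 72, 76)]    (17 tuples)
bootstrap_state = tuple(perceived[-4:])                        # (68, 72, 76, 80)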

I will actually try to implement a Theano version now that so many details are clear. Please keep updating on whether you are able to implement it.

@joabim

joabim commented Apr 26, 2016

Hi, I also found this repo after trying to implement A3C from the DeepMind article; it's nice to see progress! However, when running the implementation, the agents seem to perform only three actions from the legal action set given by the ALE interface, and these actions correspond to idle, fire and right. Could this be a result of the provided Pong binary being problematic, or of ACTION_SIZE being set to 3? The reason I'm asking is that when displaying the results after a few hours of training, the paddle is stuck at the edge of the Pong playing field.

@ghost

ghost commented Apr 26, 2016

@joabim
I think it is because the ACTION_SIZE is set to 3. He is using only the legal actions allowed for the Pong game, and Pong has only 3 actions (moving up/down/staying idle).

@joabim

joabim commented Apr 26, 2016

@aravindsrinivas
You're right! But for some reason, instead of up/down/idle, my runtime printouts seem to suggest that the agents perform the actions noop/idle (0), fire (1) and right (3) (which corresponds to up when testing pong.bin in the Stella emulator) according to the Arcade Learning Environment documentation, but maybe I'm misinterpreting the minimal action set. Do you get a moving paddle?

@miyosuda
Owner

@aravindsrinivas
Now I understand what you mean. I'll try that way too. Thanks!

@joabim
Thank you for joining the discussion.
It seems strange to get [0, 1, 3] from the Pong ROM.

I tried this code,

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
real_actions = ale.getMinimalActionSet()
print "minimal actions=", real_actions

and I got the result

minimal actions= [0 3 4]

[0, 3, 4] means [idle, right, left]
Could you try the code above?

@miyosuda
Owner

miyosuda commented Apr 26, 2016

@joabim
Ah, there is another function named getLegalActionSet() in ALE, and I tried it too.

from ale_python_interface import ALEInterface
ale = ALEInterface()
ale.loadROM("pong.bin")
legal_actions = ale.getLegalActionSet()
print "legal actions=", legal_actions

and the result was

 legal actions= [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]

So I think getLegalActionSet() is just returning all default actions. Which function are you using, getMinimalActionSet() or getLegalActionSet()?

@joabim

joabim commented Apr 26, 2016

@miyosuda
Exactly, when I invoke getMinimalActionSet() I get

[ 0 1 3 4 11 12]

so I'm wondering what has happened. I have tried rebuilding ALE but it doesn't change. When just running the code, the score remains at -21. By setting (forcing)

self.real_actions = [0, 3, 4]

I actually get some results from the training as to be expected. It's really weird that I can't get the real action set from ALE!

@ghost

ghost commented May 6, 2016

We also don't need to output the thread score (even if it is only thread 0) during the training phase.

This just follows the same convention as DQN in DeepMind's code or Nathan Sprague's Lasagne implementation.

On Fri, May 6, 2016 at 8:46 PM, Aravind Srinivas L <[email protected]> wrote:

I think we should follow the epoch convention for testing, and only with respect to the global network parameters, not the thread parameters.

That is, we must train using all threads, update the global parameters, and periodically test only the global network. That is what has been done in the paper. Every 4 million frames (1 million steps, the value of T), a testing epoch must be conducted that lasts 500000 frames (125000 steps). I think this will make it better... What do you think?

On Fri, May 6, 2016 at 8:26 PM, Kosuke Miyoshi [email protected] wrote:

@joabim https://github.com/joabim
Thanks! As far as I can see from your result, CPU usage seems OK. I'll stick with multithreading at this moment.

Regarding the score recording for tensorboard, couldn't we record the average of the 16 threads?

I was thinking the same thing.
I've added a modification to record scores from all threads and pushed it to the "all_scores" branch.
I'll test this branch, and if there is no problem, I'll merge it to master.

(I'm not averaging the score, but does this help?)


Aravind

@originholic
Author

originholic commented May 6, 2016

@miyosuda
Very glad to hear that learning is going on with the ALE environment!!
Also, I agree with the idea of @aravindsrinivas. How about we add one more thread dedicated to testing the global net, i.e. 16 training threads + 1 validation thread, so we just need to monitor the validation thread for the score according to the global T?

If we use the multiprocessing module, I think we can make the validation thread handle all the I/O scheduling to speed up the training. I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

@miyosuda
Owner

miyosuda commented May 6, 2016

@aravindsrinivas @originholic
Thank you for the suggestion. Let me think about how to do validation with global network efficiently.

I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

Great! If the performance increases, let me know! I think the current 80% CPU usage has room to improve.

@joabim

joabim commented May 9, 2016

Alright, now I'm back, and I understand that we need to change the implementation. The results from my run over the weekend follow (even though they don't matter anymore):

[image: a3c2]

@miyosuda
Owner

miyosuda commented May 9, 2016

@joabim
Thank you for testing.

The result on my machine over these 4 days is

[image: 4days]

and the step count was around 53 million. (The learning rate becomes zero after 60 million steps.)
Hmm what's the difference between your environment and mine...

Let me try one more training run to check.

@miyosuda
Owner

miyosuda commented May 9, 2016

I hadn't noticed before, but I found that in muupan's project,

https://github.com/muupan/async-rl

the learning rates of actor/critic are opposite to mine.
His learning rate for the actor is half the critic's, and his result is better than mine.

I'll try his setting in this branch.

https://github.com/miyosuda/async_deep_reinforce/tree/muupan_lr_setting

@miyosuda
Owner

miyosuda commented May 9, 2016

After 59 million steps of learning, I visualized the weights of the first convolution layer.

$ python a3c_visualize.py

The result was like this.

[image: weights_with_note]

I think the second column represents the upward movement of the paddle. (One column represents the 4 frames of the input.)

@joabim

joabim commented May 10, 2016

I realized that in my previous run I had forgotten to install ALE (the fork you made with correct Pong support and multithreading) into my new Anaconda environment after building it, which resulted in me using the incorrect version of ALE... I am redoing the test for 16 threads using the muupan_lr_setting branch now! I'll let you know how it goes.

@miyosuda
Owner

@joabim
I see.
BTW, I'm going to ask muupan about his settings in his issue thread. He said that he asked the DeepMind authors about tuning, and I hope I can apply his feedback to mine later.

@muupan

muupan commented May 10, 2016

@miyosuda For everyone's information, I summarized their settings here: https://github.com/muupan/async-rl/wiki

@miyosuda
Owner

@muupan Thank you!

@cjratcliff

@miyosuda Hi. I’ve got an LSTM working with your code. I’ve only tested it on a toy problem (4 state MDP) rather than an Atari game but it seems to be working properly and as well as the feedforward net does. The code is at https://github.com/cjratcliff/async_deep_reinforce. I’ve made quite a few changes for my own version, many of them outside the LSTM parts so I’m happy to answer any questions. For using it on Atari, in addition to increasing the RNN size, I’d recommend changing the cell type from BasicRNNCell to BasicLSTMCell and removing the activation function argument to that function.

@miyosuda
Owner

@cjratcliff Thank you for sharing your LSTM version!!! Let me try it!!!

@miyosuda
Owner

miyosuda commented Jul 3, 2016

@cjratcliff I've pushed my LSTM version.
To make pong work with LSTM, I added

  1. Unrolling the LSTM cell up to 5 time steps (LOCAL_T_MAX time steps), so the back-prop calculation is now batched through the unrolled LSTM (see the sketch below).
  2. Calling actions.reverse(), states.reverse(), etc. again to restore the normal input order.
    When calculating "R", I call reverse() to make the calculation easier (because, starting from the last state, R can be calculated recursively as written in the original paper). So I call reverse() again to fix the order.

With LSTM, the score of pong hit the maximum score easily. Thanks.
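
For anyone curious, the unrolling in point 1 can be sketched roughly as follows (TF 0.x/1.x-era API; module paths differ between versions, and this is a simplified illustration rather than the code in this repo):

import tensorflow as tf

LOCAL_T_MAX = 5     # number of time steps unrolled per update
FEATURE_SIZE = 256  # assumed size of the features fed into the LSTM

# one unrolled sequence: [batch=1, time, features]
features = tf.placeholder(tf.float32, [1, None, FEATURE_SIZE])
step_size = tf.placeholder(tf.int32, [1])   # actual sequence length (<= LOCAL_T_MAX)

cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
initial_state = cell.zero_state(batch_size=1, dtype=tf.float32)

# dynamic_rnn unrolls the cell over the time dimension; the returned state
# is carried over to the next update during training and play
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
    cell, features,
    initial_state=initial_state,
    sequence_length=step_size,
    time_major=False)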

@cjratcliff

@miyosuda Great to see it working so well, thanks.

@Itsukara
Contributor

@miyosuda
Your work is great!
I tried your program on the game "Breakout".
With one day of training, I got 833 points as the maximum score.

BTW, during training I encountered some trouble.
"pi" sometimes becomes NaN, and the saved data was not usable for demo play.

The reason is that "pi" can become 0.0 and your code does not handle that correctly.
I think you'd better change the following code in the file "game_ac_network.py".
I changed the code as follows and have had no problems so far.

  • Current code:
entropy = -tf.reduce_sum(self.pi * tf.log(self.pi), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
  • My proposal:
entropy = -tf.reduce_sum(self.pi * tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

@miyosuda
Owner

miyosuda commented Aug 1, 2016

@Itsukara Sorry for late in reply (I didn't notice your post until now), and thank you for suggestion!
As you suggest, my code can't treat zero pi value. I'll test it and apply your fix to my repo later.
Thanks!

@sahiliitm

The code performs really well on some games, but on others it doesn't quite reach the scores reported in the paper. I wonder why that is. For example, in Space Invaders the reported score is 23846.0, and the model I trained comes nowhere near that. :( Did anyone else manage to get better than around 1500 for Space Invaders?

@mw66

mw66 commented Jan 22, 2017

I just saw some discussion on using multiprocessing in this thread; I wonder what the current status is?

I opened a dedicated ticket on this:

#27

muupan added a commit to chainer/chainerrl that referenced this issue Jan 30, 2017
@dongleecsu

Hi @miyosuda , thanks for sharing the code. I have a question about A3C LSTM implementation.

In class GameACLSTMNetwork, line 217, why share the LSTM weights among threads? Maybe it makes sense to create "no reuse" LSTM weights for every worker and the global_network, and to synchronize all the variables from the global_network.

Thanks!
