Policy util rework #314

kengz · 2019-04-29T03:20:44Z

Policy util rework for training speedup

This PR reworks how pdparam and action_pd are computed, producing a nearly 5x speedup (from 200 FPS to 1000 FPS for A2C). The key observation is that forward and backward propagation are the most costly operations.

Enforce no_grad during agent action to speed up forward propagation.
Compute gradients in batch during training, and reduce forward passes to a single pass per batch so that none are wasted.
Improve the overall pdparam -> action_pd -> log_probs/entropy logic.
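
To make the pdparam -> action_pd -> log_probs/entropy flow concrete, here is a minimal sketch of the batched computation at training time. It is not the code from this PR: it assumes a categorical (discrete-action) policy whose pdparam is a row of logits, and it uses plain Python lists instead of PyTorch tensors so it runs without dependencies. The helper names `log_softmax` and `batch_log_probs_and_entropy` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def batch_log_probs_and_entropy(pdparams, actions):
    """Hypothetical helper: turn a batch of pdparams (logit rows) and the
    actions taken into per-step log-probabilities and entropies (in nats)."""
    log_probs, entropies = [], []
    for logits, action in zip(pdparams, actions):
        # pdparam -> action_pd: the log-probabilities define the distribution
        log_p = log_softmax(logits)
        # action_pd -> log_prob of the action that was actually taken
        log_probs.append(log_p[action])
        # action_pd -> entropy: -sum(p * log p)
        entropies.append(-sum(math.exp(lp) * lp for lp in log_p))
    return log_probs, entropies


# Assumed toy batch: two steps, two discrete actions each.
pdparams = [[0.0, 0.0], [math.log(3.0), 0.0]]
actions = [1, 0]
log_probs, entropies = batch_log_probs_and_entropy(pdparams, actions)
```

Because everything is derived from the stored pdparams and actions in one pass, nothing has to be recorded step by step while the agent acts. In a PyTorch setting the acting step could additionally be wrapped in `torch.no_grad()`, as the first item above describes.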

Remove variable tracking

As a result of the above, the related variables no longer need to be tracked via the body. This allows us to simplify the API significantly.

Remove body attributes: action_tensor, action_pd, entropies, log_probs, mean_log_prob.
Remove body methods: action_pd_update, epi_reset, flush, space_fix_stats.
Propagate these removals throughout the code.

Generalization from A2C

Following #310, REINFORCE, PPO, and SIL are updated and generalized to work on vector environments, like A2C.

NOTE: PPO still needs minibatch sampling; delegating that to @lgraesser. Otherwise, it should work properly.
Work continues in "working A2C vec env Pong" #315.

kengz added 3 commits April 28, 2019 20:18

rework policy_util for efficient action_pd compute in training

e7584bc

remove unused calc_action_pd

c9f090d

update SIL

6670a53

kengz merged commit 9ee3dc0 into v4-dev Apr 29, 2019

kengz deleted the policy-rework branch April 29, 2019 06:14
