
[rllib] Refactor rllib to have a common sample collection pathway #2149

Merged
richardliaw merged 50 commits into ray-project:master from ericl:v2-refactor
Jun 9, 2018

Conversation

@ericl
Contributor

@ericl ericl commented May 28, 2018

What do these changes do?

Currently RLlib algorithms have disparate sample collection pathways. This makes supporting common functionality such as LSTMs, env vectorization, batch norm, and multi-agent hard to do in a generic way.

This PR adds a CommonPolicyEvaluator class which is responsible for routing observations to Policy and TFPolicy instances. In the multi-agent case, this may involve batching and routing observations to several local policies. It will also handle the vectorized env case.

  • A3C
  • PG
  • DQN
  • DDPG
  • punting on PPO, DDPG2 for now
  • validate learning performance did not regress
  • unit tests for CommonPolicyEvaluator
  • Rename policy loss to policy graph

Related issue number

#2053

@ericl ericl requested a review from richardliaw May 28, 2018 07:40
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5659/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5660/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5661/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5662/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5665/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5669/

observation_filter (str): Name of observation filter to use.
registry (tune.Registry): Tune object registry. Pass in the value
from tune.registry.get_registry() if you're having trouble
resolving objects registered in tune.
Contributor

If I were only using RLlib, I wouldn't know what Tune is, and this wouldn't be that informative. I think we should fix up this documentation so that it describes the functionality of the registry (and refer to Tune for more information)

Contributor Author

Updated

resolving objects registered in tune.
env_config (dict): Config to pass to the env creator.
model_config (dict): Config to use when creating the policy model.
policy_config (dict): Config to pass to the policy.
Contributor

Given that this is one of the core primitives that people would use, it makes sense to include examples for usage.

Contributor Author

Done

loss_inputs=self.loss_in, is_training=self.is_training,
state_inputs=self.state_in, state_outputs=self.state_out)

# TODO(ekl) move session creation and init to CommonPolicyEvaluator
Contributor

isn't session creation already in CommonPolicyEvaluator?

Contributor Author

Fixed


# TODO(rliaw): Can consider exposing these parameters
self.sess = tf.Session(graph=self.g, config=tf.ConfigProto(
intra_op_parallelism_threads=1, inter_op_parallelism_threads=2,
Contributor

Can you leave a TODO somewhere to make sure A3C creates a session with these parameters? It affects performance quite a bit.

Contributor Author

Fixed

if self.summarize:
bs = tf.to_float(tf.shape(self.x)[0])
tf.summary.scalar("model/policy_loss", self.pi_loss / bs)
tf.summary.scalar("model/policy_graph", self.pi_loss / bs)
Contributor

this doesn't make that much sense

Contributor Author

Oops fixed

def extra_apply_grad_fetches(self):
return {} # e.g., batch norm updates

def optimizer(self):
Contributor

this creates a new Optimizer every time?

Contributor Author

It's only called once.

Contributor

Would it make sense to make it private (or idempotent)?

It's not stateless, and if it's public, it will show up in autocomplete tools (IPython, Jupyter, etc.) and cause headaches. People are already using rllib in notebook settings, and presumably a lot more will after this refactor.

Contributor Author

This is just not an issue. We are calling this ourselves, not the user, so it's impossible for them to screw it up.
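For reference, the kind of private/memoized accessor being suggested could look roughly like this (a sketch of the reviewer's idea only, with an illustrative class name, not what the PR implements):

import tensorflow as tf

class ExamplePolicyGraph(object):
    def __init__(self):
        self._optimizer = None

    def _get_optimizer(self):
        # Memoize so repeated calls (e.g., triggered from a notebook's
        # autocomplete or from user code) never build a second TF optimizer.
        if self._optimizer is None:
            self._optimizer = self.optimizer()
        return self._optimizer

    def optimizer(self):
        # Subclasses override this to choose their optimizer.
        return tf.train.AdamOptimizer(1e-4)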

feed_dict = self.extra_compute_action_feed_dict()
feed_dict[self._obs_input] = obs_batch
feed_dict[self._is_training] = is_training
for ph, value in zip(self._state_inputs, state_batches):
Contributor

is this guaranteed to be ordered correctly?

Contributor Author

Yeah, it's a list

DQNEvaluator)
num_gpus=self.config["num_gpus_per_worker"])
self.remote_evaluators = [
remote_cls.remote(
Contributor
@richardliaw richardliaw May 31, 2018

side thought: it might even be cleaner if

remote_cls = CommonPolicyEvaluator.as_remote( ... )
remote_evaluators = [remote_cls(args) for i in range(num_workers)]

where remote_cls hides the ray cls.remote functionality
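A minimal sketch of what such a wrapper might look like (the as_remote signature and resource arguments here are assumptions, not a finalized API):

import ray

class CommonPolicyEvaluator(object):
    @classmethod
    def as_remote(cls, num_cpus=1, num_gpus=0):
        # Return a Ray actor class wrapping this evaluator, so callers only
        # deal with the returned handle rather than ray.remote(...) directly.
        return ray.remote(num_cpus=num_cpus, num_gpus=num_gpus)(cls)

# Usage (constructor args hypothetical):
# remote_cls = CommonPolicyEvaluator.as_remote(num_gpus=1)
# remote_evaluators = [remote_cls.remote(env_creator, policy_graph)
#                      for _ in range(num_workers)]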

Contributor
@richardliaw richardliaw left a comment

Left some questions. Do we have a list or issue that all the refactoring is centered around?

General list of later todos:

  • PyTorch Policy Graph
  • Moving PPO onto the common evaluator
  • consider a better way of managing exploration
  • managing filters?

Other nit:

  • Perhaps consider exposing something instead of LocalSyncReplay - something that puts the for loop of evaluation up front, and then think about the process of going from single thread to multi-process/multi-machine and making that process easy to do.


return func(self)

def for_policy(self, func):
Contributor

for_policy naming is a bit odd, but we can revisit this..

preprocessor_pref="rllib",
sample_async=False,
compress_observations=False,
consumer_buffer_size=0,
Contributor

someone somewhere is going to need to explain what a "consumer" is to the user

Contributor Author

Just removed for now.

episode_len_mean=mean_100ep_length,
episodes_total=num_episodes,
timesteps_this_iter=self.global_timestep - start_timestep,
exp_vals = [self.exploration0.value(self.global_timestep)]
Contributor

I wonder if it makes sense to have the evaluator manage exploration.

This is fine to do in a followup discussion...

Contributor Author

Hmm, if we expose some "global stats" object then it could.

from ray.rllib.utils.filter import get_filter, MeanStdFilter
from ray.rllib.utils.process_rollout import process_rollout
from ray.rllib.ppo.loss import ProximalPolicyLoss
from ray.rllib.ppo.loss import ProximalPolicyGraph
Contributor

so this will eventually be moved onto CommonPolicyEvaluator?

Contributor Author

Yeah, we should do that.

env: PongDeterministic-v4
run: A3C
config:
num_workers: 16
Contributor

this is for tuned examples right? i.e., examples where our configurations are supposed to be SOTA?

Contributor Author

Fixed

Box(0.0, 1.0, (5,), dtype=np.float32)]),
}

# (alg, action_space, obs_space)
Contributor

why remove?

Contributor Author

I'm just throwing unsupported now

a2 = get_mean_action(alg2, obs)
print("Checking computed actions", alg1, obs, a1, a2)
assert abs(a1 - a2) < .1, (a1, a2)
if abs(a1 - a2) > .1:
Contributor

np.allclose is probably the better thing to use
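For example, the assertion above could become (a sketch; the tolerance is illustrative):

import numpy as np

# Tolerance-based comparison; also works element-wise for array actions.
assert np.allclose(a1, a2, atol=0.1), (a1, a2)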

self.config = config

# Technically not needed when not remote
self.obs_filter = get_filter(
Contributor

are these functionalities completely supported in the refactoring (i.e., saving/restoring)? if not, we should probably leave a couple of notes/warnings

Contributor Author

They should be

@ericl
Contributor Author

ericl commented May 31, 2018

Perhaps consider exposing something instead of LocalSyncReplay - something that puts the for loop of evaluation up front, and then think about the process of going from single thread to multi-process/multi-machine and making that process easy to do.

You can always copy-paste the code and run it directly, right? I don't think policy optimizers are required unless you're actually putting your algorithm into rllib.

I kind of imagine the process as follows (see the sketch below):

  1. Start with a single process, using CommonPolicyEvaluator as a util to produce batches.
  2. Move to multi-process with CommonPolicyEvaluator.as_remote().
  3. Use a pre-built policy optimizer or roll your own.

You could also imagine an even lower level step where you use VectorEnv directly.
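A rough sketch of that progression (class and method names such as sample(), and the constructor arguments, are assumptions drawn from this PR's description rather than a finalized API):

import ray

# Step 1: single process -- use CommonPolicyEvaluator directly as a
# utility that produces batches of experience.
evaluator = CommonPolicyEvaluator(env_creator, policy_graph)
batch = evaluator.sample()

# Step 2: multi-process -- wrap the same class as Ray actors.
remote_cls = CommonPolicyEvaluator.as_remote()
remote_evaluators = [remote_cls.remote(env_creator, policy_graph)
                     for _ in range(num_workers)]
batches = ray.get([ev.sample.remote() for ev in remote_evaluators])

# Step 3: hand the local and remote evaluators to a pre-built policy
# optimizer, or roll your own loop over the collected batches.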

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5914/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5915/

@ericl
Contributor Author

ericl commented Jun 7, 2018

@richardliaw this is ready for review

self.local_evaluator.restore(extra_data["local_state"])

def compute_action(self, observation):
def compute_action(self, observation, state=[]):
Collaborator

better to avoid using mutable objects as default values; perhaps

state=None, and then state = [] if state is None else state
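A quick, self-contained illustration of the pitfall and the suggested fix (function names here are just for the example):

# The default list is created once at definition time and shared across calls.
def append_bad(x, acc=[]):
    acc.append(x)
    return acc

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2] -- state leaks between calls

# The suggested pattern: default to None, bind a fresh list per call.
def append_good(x, acc=None):
    acc = [] if acc is None else acc
    acc.append(x)
    return acc

print(append_good(1))  # [1]
print(append_good(2))  # [2]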

Contributor Author

Done

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5916/

def compute_action(self, observation):
return self.local_evaluator.dqn_graph.act(
self.local_evaluator.sess, np.array(observation)[None], 0.0)[0]
def compute_action(self, observation, state=[]):
Collaborator

same comment here about default arguments

Contributor Author

Fixed

remote_evaluators = [
remote_cls.remote(*evaluator_args)
for _ in range(num_workers)]
if type(evaluator_args) is list:
Collaborator

isinstance(evaluator_args, list)
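The difference in a nutshell:

class ArgList(list):
    pass

args = ArgList([1, 2])
print(type(args) is list)      # False -- exact type check rejects subclasses
print(isinstance(args, list))  # True  -- isinstance accepts them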

def compute_action(self, obs):
action, info = self.optimizer.local_evaluator.policy.compute(obs)
return action
def compute_action(self, observation, state=[]):
Collaborator

mutable default arg

Contributor Author

Fixed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5926/

Contributor
@richardliaw richardliaw left a comment

OK last comments - will merge by tonight after addressed.


import tensorflow as tf
import gym
from ray.rllib.utils.error import UnsupportedSpaceException
Contributor

nit: space between ray imports and non-ray

Contributor Author

Done

@@ -6,76 +6,7 @@
import threading
from collections import namedtuple
import numpy as np
Contributor

space between ray and non-ray imports?

Contributor Author

Done

rewards=reward,
dones=terminal,
features=last_features,
new_obs=observation,
Contributor

IDK where to leave this note, but we're actually doubling the number of states we need to send here (observation and last_observation). In a later optimization, we should consider addressing this (I can put this on TODOs)

Contributor Author
@ericl ericl Jun 9, 2018

We've been doing this all along -- but yeah, could optimize later.

actions (np.ndarray): batch of output actions, with shape like
[BATCH_SIZE, ACTION_SHAPE].
state_outs (list): list of RNN state output batches, if any, with
shape like [STATE_SIZE, BATCH_SIZE].
Contributor

why is BATCH after STATE here?

Contributor Author

This is so you have a small list of big lists and not a big list of small lists.
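A small illustration of that layout, assuming two RNN state tensors (e.g., LSTM h and c) and illustrative sizes:

import numpy as np

batch_size, cell_size = 64, 256

# "Small list of big arrays": state_outs has length STATE_SIZE (here 2),
# and each entry is a batched array of shape [batch_size, cell_size].
state_outs = [np.zeros((batch_size, cell_size)),   # h for the whole batch
              np.zeros((batch_size, cell_size))]   # c for the whole batch

# The alternative ("big list of small lists") would be one entry per sample,
# e.g. [[h_0, c_0], [h_1, c_1], ...], which is harder to feed back into TF.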

"""Restores all local state.

Arguments:
state (obj): Serialized local state."""
Contributor

nit: inconsistent quote placement

Contributor Author

Fixed

1 + config["clip_param"]) * advantages
self.surr = tf.minimum(self.surr1, self.surr2)
self.mean_policy_loss = tf.reduce_mean(-self.surr)
self.mean_policy_graph = tf.reduce_mean(-self.surr)
Contributor

This naming change doesn't make sense.

Contributor Author

Fixed

@ericl
Contributor Author

ericl commented Jun 9, 2018

Updated

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5956/

@richardliaw
Contributor

Test failures look unrelated.

@richardliaw richardliaw merged commit 71eb558 into ray-project:master Jun 9, 2018
@richardliaw richardliaw deleted the v2-refactor branch June 9, 2018 07:21
@alok alok mentioned this pull request Jun 9, 2018