
[rllib] Autoregressive action distributions #5304

Merged 41 commits into ray-project:master on Aug 10, 2019

Conversation

@ericl (Contributor) commented Jul 29, 2019

What do these changes do?

  • Pass the parent model reference into the action distribution class, so it can use model variables for autoregressive sampling (see the sketch after this list)
  • Switch IMPALA / PPO to use the prev actions logp instead of logits for importance weighting
  • Add a simple example and documentation
  • Debug RNN nans
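For the first bullet, here is a condensed, non-authoritative sketch of the idea, based on the two-subaction example added in this PR (the import path and helper-method names are assumptions, and required methods such as logp are omitted): the distribution holds a reference to its parent model and calls the model's action_model submodel during sampling, so a2 can be conditioned on the already-sampled a1.

import tensorflow as tf
# Import path is an assumption; it may differ across Ray versions.
from ray.rllib.models.tf.tf_action_dist import ActionDistribution, Categorical


class BinaryAutoregressiveOutput(ActionDistribution):
    """Sketch: P(a1, a2) = P(a1) * P(a2 | a1), using the parent model."""

    def sample(self):
        # Sample a1 from P(a1 | ctx), then a2 from P(a2 | ctx, a1).
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        return (a1, a2)

    def _a1_distribution(self):
        BATCH = tf.shape(self.inputs)[0]
        # a1 depends only on the context, so feed zeros in the a1 slot.
        a1_logits, _ = self.model.action_model(
            [self.inputs, tf.zeros((BATCH, 1))])
        return Categorical(a1_logits)

    def _a2_distribution(self, a1):
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        _, a2_logits = self.model.action_model([self.inputs, a1_vec])
        return Categorical(a2_logits)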

Right now the custom action distribution is injected via an override_action_dist hack; this will be removed once #5164 merges.

Related issue number

Closes #4939
Closes #5419

Linter

  • I've run scripts/format.sh to lint the changes in this PR.


@ericl (Contributor, Author) commented Aug 2, 2019

> In common cases it's a trivial thing. But if we use BinaryAutoregressiveOutput, we do not have access to the input_dict, only the hidden state (self.inputs). Is there any chance we can access the input_dict in the custom distribution?

This is a bit hacky, but if you are defining a custom model, you can save the input dict in your forward() as, e.g., model.last_input_dict. Then you can access model.last_input_dict from your action distribution.
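A rough sketch of that hack, with hypothetical class names and a minimal base model (the attribute name last_input_dict is just a convention; any name works):

import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class StashInputModel(TFModelV2):
    """Hypothetical custom model that keeps the last input dict around."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        super(StashInputModel, self).__init__(obs_space, action_space,
                                              num_outputs, model_config, name)
        obs_in = tf.keras.layers.Input(shape=obs_space.shape)
        ctx = tf.keras.layers.Dense(num_outputs, activation=tf.nn.tanh)(obs_in)
        value = tf.keras.layers.Dense(1)(ctx)
        self.base_model = tf.keras.Model(obs_in, [ctx, value])
        self.register_variables(self.base_model.variables)
        self.last_input_dict = None

    def forward(self, input_dict, state, seq_lens):
        # Stash the raw input dict so the action distribution can read it.
        self.last_input_dict = input_dict
        ctx, self._value_out = self.base_model(input_dict["obs"])
        return ctx, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


# Inside the custom action distribution, the stashed dict is then reachable
# through the model reference, e.g. self.model.last_input_dict["obs"]
# (or a sub-key such as an action mask, if the observation space is a Dict).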

> Another issue I want to point out is that the hidden_state in the model has to have the same dimension as the model action output to pass the shape check at line 162 of modelv2.py

This should be addressed once #5164 is merged, since that PR allows custom action distributions to specify their own output size via required_model_output_shape. Your custom distribution can then specify whatever size it needs.
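For example, a hedged sketch of that interface (the exact signature should be checked against #5164):

# Import path assumed; 128 is an arbitrary illustrative embedding size.
from ray.rllib.models.tf.tf_action_dist import ActionDistribution


class MyAutoregressiveOutput(ActionDistribution):
    @staticmethod
    def required_model_output_shape(action_space, model_config):
        # Width of the model output (the context embedding fed into this
        # distribution), chosen independently of the action space.
        return 128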

Let me know if these work.

@ericl changed the title from "[rllib] Autoregressive action distributions" to "[rllib] [WIP] Autoregressive action distributions" on Aug 2, 2019
@yangysc commented Aug 8, 2019

Hi @ericl, sorry to trouble you again. The autoregressive action distribution is working well, but I think there is still one small functional issue: how to fetch the action logits of each sub-action, instead of having the hidden state returned as behaviour_logits at test time. This would be very helpful when debugging.

Here is what I have tried.

1. Add the logits directly as attributes of the model

I noticed that the function vf_preds_and_logits_fetches in ppo_policy.py is in charge of fetching these extra outputs, so I tried adding a1_logits and a2_logits to the model and returning them there:

def vf_preds_and_logits_fetches(policy):
    """Adds value function and logits outputs to experience batches."""
    return {
        SampleBatch.VF_PREDS: policy.value_function,
        BEHAVIOUR_LOGITS: policy.model_out,
        'action_mask': policy.model.last_input_dict,
        # 'ac_1_logits': policy.model.a1_logits,
        # 'ac_2_logits': policy.model.a2_logits,
    }

But it turns out that the BinaryAutoregressiveOutput distribution expects the hidden state and a1_input as its inputs, so this fails because nothing is fed for them.

2. Call the action_model directly

After that failed, I tried feeding ctx_input and a1_input directly after line 204 of rollout.py:

# Line 204 of rollout.py
a_action, p_state, info = agent.compute_action(
    a_obs,
    state=agent_states[agent_id],
    prev_action=prev_actions[agent_id],
    prev_reward=prev_rewards[agent_id],
    policy_id=policy_id,
    full_fetch=True)

input_size = 256
agent.get_policy(policy_id).model.action_model(
    [info['behaviour_logits'][None],
     np.zeros((1, input_size), dtype=np.float32)])

But this raises the following error:

{ValueError}Tensor("default_policy/a2_hidden/kernel/Read/ReadVariableOp:0", shape=(384, 128), dtype=float32) must be from the same graph as Tensor("model_1_1/concatenate_1/concat:0", shape=(1, 384), dtype=float32).

Here model_1 is the ParametricActionsModel; the hidden state size is 128 and the a1_input size is 256, so 384 == 128 + 256.

Do you have any thoughts on fetching the action logits? Training is not going well, and I hope watching the logits will help with debugging.

Sorry for disturbing you again.

Best wishes,

Shanchao

@ericl (Contributor, Author) commented Aug 8, 2019

@yangysc I would probably add some tf.Print calls inside the action distribution object itself, since you want to capture the logit outputs during the sampling process. It might also be possible to assign to self.model inside the action distribution object to capture the right conditioned output.
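For example, a minimal sketch of the first suggestion (a fragment meant to be dropped into the distribution class; the method name mirrors the a1 snippet in this diff, and tf.Print is the TF1-era identity op that logs to stderr):

def _a1_distribution(self):
    BATCH = tf.shape(self.inputs)[0]
    a1_logits, _ = self.model.action_model(
        [self.inputs, tf.zeros((BATCH, 1))])
    # tf.Print returns the same tensor but dumps its values every time the
    # sampling op runs, so the per-step a1 logits show up in the worker logs.
    a1_logits = tf.Print(a1_logits, [a1_logits], message="a1_logits: ")
    return Categorical(a1_logits)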

@ericl assigned richardliaw and unassigned hartikainen on Aug 8, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16203/

@ericl added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) on Aug 10, 2019
ModelCatalog.register_custom_model("autoregressive_model",
                                   AutoregressiveActionsModel)
ModelCatalog.register_custom_action_dist("binary_autoreg_output",
                                         BinaryAutoregressiveOutput)
Contributor:

why make this explicit?

Contributor Author (@ericl):

We could auto-register it, but that would be more effort for sure.
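For reference, a hypothetical sketch of how the registered names would be referenced from the trainer config (custom_model is the standard key; custom_action_dist is the key used once #5164 lands, while this PR's temporary override hack may differ):

config = {
    "model": {
        "custom_model": "autoregressive_model",
        "custom_action_dist": "binary_autoreg_output",
    },
}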

BATCH = tf.shape(self.inputs)[0]
a1_logits, _ = self.model.action_model(
    [self.inputs, tf.zeros((BATCH, 1))])
a1_dist = Categorical(a1_logits)
Contributor:

would this have issues? Like somehow adding nodes to the tf graph over and over?

Contributor Author (@ericl):

It's fine since this entire thing is only called once in graph mode.

# and state of each episode (i.e., for multiagent). You can do
# whatever is needed here, e.g., MCTS rollouts.
return action_batch
class BinaryAutoregressiveOutput(ActionDistribution):
Contributor:

why not use literalinclude so that this doesn't go out of sync?

@richardliaw (Contributor) left a comment:

Looks good though we should probably add a check to make sure the TF graph doesn't change over time...

@ericl merged commit a1d2e17 into ray-project:master on Aug 10, 2019