
Sebula PPO (EnvPool's async API) #338


Closed · wants to merge 108 commits into from

Conversation

vwxyzjn (Owner) commented Jan 1, 2023

Description

Todo items:

More experiments

  • use EnvPool's XLA interface (it should be faster, but likely not worth it because it prevents applying the approach to non-C++ envs; see the first sketch after this list)
  • use larger batch size...?
  • allow using multiple actor GPUs
  • larger models? (e.g., the Muesli paper's model size)
  • potentially a data worker? (e.g., dm-haiku's impala)
  • study the balance between actor and learner?
  • automatically calculate the queue size
  • implement something like https://github.com/alex-petrenko/faster-fifo, which takes one lock to get many items (see the queue sketch after this list)
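
As a reference for the XLA-interface item above, here is a minimal sketch of what the actor rollout could look like with EnvPool's XLA step functions, following the pattern from EnvPool's docs. The env id, env count, and the random placeholder policy are illustrative only; a real actor would run the policy network inside `actor_step`.

```python
import envpool
import jax
import jax.numpy as jnp

NUM_ENVS = 8
NUM_STEPS = 128

# EnvPool env plus its XLA-compatible step functions.
envs = envpool.make("Pong-v5", env_type="gym", num_envs=NUM_ENVS)
handle, recv, send, step_env = envs.xla()
obs = envs.reset()  # depending on the EnvPool/gym version, this may instead return (obs, info)

def actor_step(i, carry):
    handle, obs, key = carry
    key, subkey = jax.random.split(key)
    # Placeholder policy: uniform random actions.
    actions = jax.random.randint(subkey, (NUM_ENVS,), 0, envs.action_space.n)
    handle, (obs, reward, done, info) = step_env(handle, actions)
    return handle, obs, key

@jax.jit
def rollout(handle, obs, key):
    # The entire env-stepping loop compiles into one XLA program,
    # so there are no Python round-trips between policy and environment.
    return jax.lax.fori_loop(0, NUM_STEPS, actor_step, (handle, obs, key))

handle, obs, key = rollout(handle, jnp.asarray(obs), jax.random.PRNGKey(0))
```

The trade-off noted in that item is that this ties the rollout loop to EnvPool's C++ envs.

And for the last item, the idea behind faster-fifo's `get_many()` is that the consumer drains a whole batch of items under a single lock acquisition instead of locking once per item. A minimal thread-based sketch (pure Python; the class and method names are made up for illustration) could look like:

```python
import threading
from collections import deque

class BatchQueue:
    """Queue whose consumer can take all pending items under one lock acquisition."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get_many(self, max_items=None, timeout=None):
        # Lock once, wait for at least one item, then drain up to max_items.
        with self._cond:
            while not self._items:
                if not self._cond.wait(timeout=timeout):
                    return []
            batch = []
            while self._items and (max_items is None or len(batch) < max_items):
                batch.append(self._items.popleft())
            return batch
```

Compared to calling `queue.Queue.get()` in a loop, the learner pays the locking overhead once per batch rather than once per rollout item.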

Hypothesis:

  • Suppose the update function runs on GPU 1; while update is executing, it seems to block a jax.device_put_sharded call made from a separate thread that is moving data from GPU 0 to GPU 1. Not sure whether the same holds on TPU (see the timing sketch below).
    • ✅ Related to this: giving the actor a dedicated GPU helps, compared to the actor using GPU 0 while the learners use GPUs 0, 1, 2, and 3.
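
One way to probe this hypothesis (a sketch only, assuming a machine with at least two local GPUs; matrix and batch sizes are arbitrary) is to time `jax.device_put_sharded` from a separate thread while the learner device is busy, and compare against the same transfer with the learner idle:

```python
import threading
import time

import jax
import jax.numpy as jnp
import numpy as np

gpus = jax.devices("gpu")
actor_device, learner_device = gpus[0], gpus[1]

@jax.jit
def heavy_update(x):
    # Stand-in for the learner's update step; keeps the learner GPU busy for a while.
    for _ in range(20):
        x = x @ x
    return x

def transfer_to_learner(shards):
    t0 = time.time()
    # Hypothesis: this call stalls while heavy_update is still executing on learner_device.
    jax.device_put_sharded(shards, [learner_device] * len(shards))
    print(f"device_put_sharded took {time.time() - t0:.3f}s")

x = jax.device_put(jnp.ones((4096, 4096)), learner_device)
shards = [np.ones((256, 84, 84, 4), dtype=np.uint8)]

heavy_update(x)                       # warm-up / compile
out = heavy_update(x)                 # dispatched asynchronously; learner GPU is now busy
t = threading.Thread(target=transfer_to_learner, args=(shards,))
t.start()
t.join()
out.block_until_ready()
```

If the measured transfer time grows roughly with the duration of `heavy_update`, the blocking behavior is confirmed for GPU; running the same script on a TPU host would answer the open question for TPU.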

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm variant.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format).
    • I have added links to the tracked experiments.
    • I have updated the overview sections in the docs and the repo.
  • I have updated the tests accordingly (if applicable).

vwxyzjn (Owner, Author) commented Feb 17, 2023

Got an error:

2023-02-17 06:08:24.084633: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:481] Failed to disconnect from coordination service with status: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}. Proceeding with agent shutdown anyway.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 168, in shutdown
    global_state.shutdown()
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 87, in shutdown
    self.client.shutdown()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
    2023-02-17 06:08:24.405270: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:1129] Shutdown barrier in coordination service has failed: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']. This suggests that at least one worker did not complete its job, or was too slow/hanging in its execution.
2023-02-17 06:08:24.405311: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:731] INTERNAL: Shutdown barrier has been passed with status: 'DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']', but this task is not at the barrier yet. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:24.405379: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:449] Stopping coordination service as shutdown barrier timed out and there is no service-to-client connection.
2023-02-17 06:08:53.225236: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.225277: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.226009: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.h:75] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
srun: error: ip-26-0-134-228: task 0: Aborted
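
For context, the traceback above comes from JAX's atexit-registered distributed shutdown: every process reports to the coordination service on exit, and the shutdown barrier times out (DEADLINE_EXCEEDED) when another task has already died or hangs before reaching the barrier. The multi-process setup that exercises this path looks roughly like the following sketch (the coordinator address and process counts are placeholders; in practice they would be derived from the SLURM environment):

```python
import jax

# One process per host/GPU group, launched e.g. via srun.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder; process 0 runs the coordinator
    num_processes=2,
    process_id=0,
)

# ... multi-host training loop ...

# Normally called automatically via atexit; it hits the same DEADLINE_EXCEEDED
# barrier timeout seen above if a peer crashed and never reaches shutdown.
jax.distributed.shutdown()
```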

vwxyzjn (Owner, Author) commented Jul 19, 2023

@vwxyzjn closed this Jul 19, 2023