
Sebula PPO (EnvPool's async API) #338


Closed · wants to merge 108 commits into from

Conversation

vwxyzjn (Owner) commented Jan 1, 2023

Description

Todo items:

More experiments

  • use EnvPool's XLA interface (it should be faster, but likely not worth it because it prevents applying the approach to non-C++ envs; see the first sketch after this list)
  • use larger batch size...?
  • allow using multiple actor GPUs
  • larger models? (e.g., the Muesli paper's model size)
  • potentially a data worker? (e.g., dm-haiku's impala)
  • study the balance between actor and learner?
  • automatically calculate the queue size
  • implement something like https://github.com/alex-petrenko/faster-fifo, which takes one lock to get many items (see the queue sketch after this list)
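
As a reference for the XLA-interface item above, here is a minimal sketch of what the actor rollout could look like with EnvPool's XLA step functions, following the pattern from EnvPool's docs. The env id, env count, and the random placeholder policy are illustrative only; a real actor would run the policy network inside `actor_step`.

```python
import envpool
import jax
import jax.numpy as jnp

NUM_ENVS = 8
NUM_STEPS = 128

# EnvPool env plus its XLA-compatible step functions.
envs = envpool.make("Pong-v5", env_type="gym", num_envs=NUM_ENVS)
handle, recv, send, step_env = envs.xla()
obs = envs.reset()  # depending on the EnvPool/gym version, this may instead return (obs, info)

def actor_step(i, carry):
    handle, obs, key = carry
    key, subkey = jax.random.split(key)
    # Placeholder policy: uniform random actions.
    actions = jax.random.randint(subkey, (NUM_ENVS,), 0, envs.action_space.n)
    handle, (obs, reward, done, info) = step_env(handle, actions)
    return handle, obs, key

@jax.jit
def rollout(handle, obs, key):
    # The entire env-stepping loop compiles into one XLA program,
    # so there are no Python round-trips between policy and environment.
    return jax.lax.fori_loop(0, NUM_STEPS, actor_step, (handle, obs, key))

handle, obs, key = rollout(handle, jnp.asarray(obs), jax.random.PRNGKey(0))
```

The trade-off noted in that item is that this ties the rollout loop to EnvPool's C++ envs.

And for the last item, the idea behind faster-fifo's `get_many()` is that the consumer drains a whole batch of items under a single lock acquisition instead of locking once per item. A minimal thread-based sketch (pure Python; the class and method names are made up for illustration) could look like:

```python
import threading
from collections import deque

class BatchQueue:
    """Queue whose consumer can take all pending items under one lock acquisition."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get_many(self, max_items=None, timeout=None):
        # Lock once, wait for at least one item, then drain up to max_items.
        with self._cond:
            while not self._items:
                if not self._cond.wait(timeout=timeout):
                    return []
            batch = []
            while self._items and (max_items is None or len(batch) < max_items):
                batch.append(self._items.popleft())
            return batch
```

Compared to calling `queue.Queue.get()` in a loop, the learner pays the locking overhead once per batch rather than once per rollout item.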

Hypothesis:

  • Suppose the update function runs on GPU 1; while update is executing, it seems to block a jax.device_put_sharded call made from a separate thread that is moving data from GPU 0 to GPU 1. Not sure whether the same holds on TPU (see the timing sketch below).
    • ✅ Related to this: giving the actor a dedicated GPU helps, compared to the actor using GPU 0 while the learners use GPUs 0, 1, 2, and 3.
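
One way to probe this hypothesis (a sketch only, assuming a machine with at least two local GPUs; matrix and batch sizes are arbitrary) is to time `jax.device_put_sharded` from a separate thread while the learner device is busy, and compare against the same transfer with the learner idle:

```python
import threading
import time

import jax
import jax.numpy as jnp
import numpy as np

gpus = jax.devices("gpu")
actor_device, learner_device = gpus[0], gpus[1]

@jax.jit
def heavy_update(x):
    # Stand-in for the learner's update step; keeps the learner GPU busy for a while.
    for _ in range(20):
        x = x @ x
    return x

def transfer_to_learner(shards):
    t0 = time.time()
    # Hypothesis: this call stalls while heavy_update is still executing on learner_device.
    jax.device_put_sharded(shards, [learner_device] * len(shards))
    print(f"device_put_sharded took {time.time() - t0:.3f}s")

x = jax.device_put(jnp.ones((4096, 4096)), learner_device)
shards = [np.ones((256, 84, 84, 4), dtype=np.uint8)]

heavy_update(x)                       # warm-up / compile
out = heavy_update(x)                 # dispatched asynchronously; learner GPU is now busy
t = threading.Thread(target=transfer_to_learner, args=(shards,))
t.start()
t.join()
out.block_until_ready()
```

If the measured transfer time grows roughly with the duration of `heavy_update`, the blocking behavior is confirmed for GPU; running the same script on a TPU host would answer the open question for TPU.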

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm variant.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format).
    • I have added links to the tracked experiments.
    • I have updated the overview sections in the docs and the repo.
  • I have updated the tests accordingly (if applicable).

vwxyzjn (Owner, Author) commented Feb 17, 2023

Got an error:

2023-02-17 06:08:24.084633: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:481] Failed to disconnect from coordination service with status: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}. Proceeding with agent shutdown anyway.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 168, in shutdown
    global_state.shutdown()
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 87, in shutdown
    self.client.shutdown()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
    2023-02-17 06:08:24.405270: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:1129] Shutdown barrier in coordination service has failed: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']. This suggests that at least one worker did not complete its job, or was too slow/hanging in its execution.
2023-02-17 06:08:24.405311: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:731] INTERNAL: Shutdown barrier has been passed with status: 'DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']', but this task is not at the barrier yet. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:24.405379: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:449] Stopping coordination service as shutdown barrier timed out and there is no service-to-client connection.
2023-02-17 06:08:53.225236: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.225277: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.226009: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.h:75] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
srun: error: ip-26-0-134-228: task 0: Aborted
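
For context, the traceback above comes from JAX's atexit-registered distributed shutdown: every process reports to the coordination service on exit, and the shutdown barrier times out (DEADLINE_EXCEEDED) when another task has already died or hangs before reaching the barrier. The multi-process setup that exercises this path looks roughly like the following sketch (the coordinator address and process counts are placeholders; in practice they would be derived from the SLURM environment):

```python
import jax

# One process per host/GPU group, launched e.g. via srun.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder; process 0 runs the coordinator
    num_processes=2,
    process_id=0,
)

# ... multi-host training loop ...

# Normally called automatically via atexit; it hits the same DEADLINE_EXCEEDED
# barrier timeout seen above if a peer crashed and never reaches shutdown.
jax.distributed.shutdown()
```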

vwxyzjn (Owner, Author) commented Jul 19, 2023

@vwxyzjn closed this Jul 19, 2023