feature(wgt): enable DI using torch-rpc to support GPU-p2p and RDMA-rpc #562
base: main
Conversation
1. Add a torchrpc message queue.
2. Implement a buffer based on CUDA shared tensors to optimize the torchrpc data path.
3. Add a 'bypass_eventloop' arg in Task() and Parallel().
4. Add a thread lock in distributer.py to prevent sender and receiver competition.
5. Add message queue perf tests for torchrpc, nccl, nng, and shm.
6. Add comm_perf_helper.py to make program timing more convenient.
7. Modify subscribe() of class MQ, adding 'fn' and 'is_once' parameters.
8. Add new DummyLock and ConditionLock types in lock_helper.py.
9. Add message queue perf tests.
10. Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
Codecov Report
@@            Coverage Diff             @@
##             main     #562      +/-   ##
==========================================
- Coverage   83.60%   82.41%   -1.20%
==========================================
  Files         565      571       +6
  Lines       46375    47198     +823
==========================================
+ Hits        38774    38900     +126
- Misses       7601     8298     +697
Branch force-pushed from 1da53e2 to 30b3a73.
@@ -1,4 +1,5 @@
[run]
concurrency = multiprocessing,thread
Why add this?
Adding `concurrency = multiprocessing` lets codecov count the coverage of subprocesses; the default concurrency only covers threads. However, there are some caveats to pay attention to when using it, see: https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html
codecov.yml
Outdated
# fix me
# The unittests of the torchrpc module are tested by different runners and cannot be included
# in the test_unittest's coverage report. To keep CI happy, we don't count torchrpc related coverage.
ignore:
Should these ignore items be added to .coveragerc?
ding/data/shm_buffer.py
Outdated
self.shape = shape
self.device = device
# We don't want the buffer to be involved in the computational graph
with torch.no_grad():
Tensor creation operations are not recorded in the computation graph, so we don't need `torch.no_grad()` here.
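A quick illustration of the point with plain PyTorch (not the PR code):

```python
import torch

# Newly created tensors are leaves: creation is not tracked by autograd,
# so no torch.no_grad() context is needed around it.
buf = torch.zeros(1024, 1024, dtype=torch.float32)
print(buf.requires_grad)  # False
print(buf.grad_fn)        # None, nothing was recorded in a graph
```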
ding/data/tests/test_shm_buffer.py
Outdated
event_run = ctx.Event()
shm_buf_np = ShmBufferCuda(np.dtype(np.float32), shape=(1024, 1024), copy_on_get=True)
shm_buf_torch = ShmBufferCuda(torch.float32, shape=(1024, 1024), copy_on_get=True)
We should add another unit test for the case `copy_on_get=False` to validate it.
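A minimal sketch of what such a test could look like; the import path and the fill()/get() method names mirror the existing CPU ShmBuffer interface and are assumptions here, not necessarily the final ShmBufferCuda API:

```python
import torch

from ding.data.shm_buffer import ShmBufferCuda  # assumed import path


def test_shm_buffer_cuda_no_copy_on_get():
    # Requires a CUDA-capable environment.
    # Hypothetical test: with copy_on_get=False, get() should expose the shared
    # storage directly instead of returning a private copy.
    shm_buf_torch = ShmBufferCuda(torch.float32, shape=(1024, 1024), copy_on_get=False)
    src = torch.rand(1024, 1024, dtype=torch.float32)
    shm_buf_torch.fill(src)
    out1 = shm_buf_torch.get()
    out2 = shm_buf_torch.get()
    # Two reads should alias the same underlying memory when no copy is made.
    assert out1.data_ptr() == out2.data_ptr()
    assert torch.allclose(out1.cpu(), src)
```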
    task.use(eps_greedy_handler(cfg))
    task.use(StepCollector(cfg, policy.collect_mode, collector_env))
    task.use(termination_checker(max_env_step=int(1e7)))
else:
    raise KeyError("invalid router labels: {}".format(task.router.labels))

task.run()
Why remove this?
ding/utils/lock_helper.py
Outdated
Overview:
    thread lock decorator.
Arguments:
    - func ([type]): A function that needs to be protected by a lock.
The type annotation here should be `Callable`.
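With that change, the docstring would read roughly as follows (a sketch of the suggestion, following the `:obj:` docstring style used elsewhere in this PR):

```python
# Sketch of the revised docstring only; the surrounding decorator is unchanged.
"""
Overview:
    Thread lock decorator.
Arguments:
    - func (:obj:`Callable`): A function that needs to be protected by a lock.
"""
```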
OUTPUT_DICT[func_name] = OUTPUT_DICT[func_name] + str(round(avg_tt, 4)) + ","


def print_timer_result_csv():
Maybe you can use the `pretty_print` function in `ding.utils`.
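A hedged sketch of that suggestion, assuming pretty_print accepts a plain dict as it does when printing configs; the dict name and values below are illustrative, not from this PR:

```python
from ding.utils import pretty_print

# Keep the per-function average timings in a plain dict and let pretty_print
# render them, instead of hand-assembling a CSV string.
timing_results = {
    "send_tensor": 0.0123,  # seconds, averaged over repetitions (illustrative)
    "recv_tensor": 0.0118,
}
pretty_print(timing_results)
```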
- args (:obj:`any`): Rest arguments for listeners.
"""
# Check if need to broadcast event to connected nodes, default is True
assert self._running, "Please make sure the task is running before calling this method, see task.start"
if only_local:
Why remove this?
@@ -71,8 +71,8 @@ def _train(ctx: Union["OnlineRLContext", "OfflineRLContext"]):

    if ctx.train_data is None:  # no enough data from data fetcher
        return
    data = ctx.train_data.to(policy._device)
    train_output = policy.forward(data)
    # data = ctx.train_data.to(policy._device)
Why comment this out?
# so all data on the cpu side is copied to "cuda:0" here. In fact this
# copy is unnecessary, because torchrpc can support both cpu side and gpu
# side data to communicate using RDMA, but mixing the two transfer types
# will cause a bug, see issue:
Where is the issue link?
Commit:
Add a torchrpc message queue.
Implement a buffer based on CUDA shared tensors to optimize the torchrpc data path.
Add a 'bypass_eventloop' arg in Task() and Parallel().
Add a thread lock in distributer.py to prevent sender and receiver competition.
Add message queue perf tests for torchrpc, nccl, nng, and shm.
Add comm_perf_helper.py to make program timing more convenient.
Modify subscribe() of class MQ, adding 'fn' and 'is_once' parameters.
Add new DummyLock and ConditionLock types in lock_helper.py.
Add message queue perf tests.
Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
Description
DI-engine now integrates the torch.distributed.rpc module.
cli-ditask introduces new command-line arguments:
- --mq_type: introduces the torchrpc:cuda and torchrpc:cpu options.
  - torchrpc:cuda: use torchrpc for communication and allow setting a device_map, which enables GPU Direct RDMA.
  - torchrpc:cpu: use torchrpc for communication, but setting a device_map is not allowed; all data on the GPU side is copied to the CPU side for transmission.
- --init-method: initialization entry for init_rpc (required if --mq_type is torchrpc).
- --local-cuda-devices: set the rank range of local GPUs that can be used (optional; defaults to all visible devices).
- --cuda-device-map: used to set the device_map (optional; the default is to map all visible GPUs to GPU-0 of the peer). See the torch.distributed.rpc sketch after this list for what a device_map expresses.
Dynamic GPU communication groups
We create device mappings between all possible devices in advance. This mapping is all-to-all, which covers every communication situation; the purpose is to avoid errors caused by incomplete device_map coverage, and setting redundant mappings has no side effects. The mappings are only used to check device validity during transport, and a channel is created from them only after a new process joins the communication group. A simplified sketch of this pre-registration follows.
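The sketch below pre-registers a device map for every peer before any transfer happens. Each local GPU is mapped one-to-one by index here, since TensorPipe device maps must stay invertible; the PR's actual mapping construction may differ, and peer_names and the GPU counts are illustrative parameters:

```python
import torch.distributed.rpc as rpc


def build_rpc_options(init_method, peer_names, n_local_gpus, n_remote_gpus):
    # Register a device map for every peer up front, so that any CUDA tensor
    # sent later already has a valid channel; unused entries are harmless.
    options = rpc.TensorPipeRpcBackendOptions(init_method=init_method)
    shared = min(n_local_gpus, n_remote_gpus)
    for peer in peer_names:
        options.set_device_map(peer, {i: i for i in range(shared)})
    return options
```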
At the same time, we still expose the --cuda-device-map interface so that users can configure the topology between devices; torchrpc will follow the user's input.
Related Issue
TODO
Load-balancing capability, so that in a time-heterogeneous RL task environment each worker can run at full capacity.
Check List