-
Hi JAX team, I'm currently building a distributed reinforcement learning platform. The rough idea is that an actor worker performs rollouts (it samples actions, accumulates training data, then transfers the training data to the learner), and the learner performs gradient updates on that data. The actor resides on GPU 0 and the learner on GPU 1. I did some profiling and found that the learner blocks the actor thread whenever it fetches data from the actor's device to the learner's device. Here is the reproduction code, and below is a shortened version.

```python
def prepare_data(obs, actions, logprobs):
    # Stack the per-step lists into single arrays; since this function is
    # jitted on the learner's device, the inputs are transferred there first.
    obs = jnp.asarray(obs)
    actions = jnp.asarray(actions)
    logprobs = jnp.asarray(logprobs)
    # Dummy operation to keep the learner's device busy.
    a = jnp.ones((1000, 1000))
    b = a @ a
    # Flatten the (time, env) leading dimensions into one batch dimension.
    b_obs = obs.reshape((-1,) + obs.shape[2:])
    b_actions = actions.reshape(-1)
    b_logprobs = logprobs.reshape(-1)
    return b_obs, b_actions, b_logprobs

def rollout(params, rollout_queue, key):
    num_envs = 20
    cpu_next_obs = np.zeros((num_envs, 4, 84, 84))
    for update in range(20):
        if update == 4:
            jax.profiler.start_trace('./profile')
        obs = []
        actions = []
        logprobs = []
        for _ in range(384):
            next_obs, action, logprob, key = sample(params, cpu_next_obs, key)
            cpu_action = jax.device_get(action)
            # env.send(cpu_action)
            obs.append(next_obs)
            actions.append(action)
            logprobs.append(logprob)
        # Blocks until the learner has consumed the previous batch (maxsize=1).
        rollout_queue.put((obs, actions, logprobs))
    jax.profiler.stop_trace()

if __name__ == "__main__":
    devices = jax.devices()
    assert len(devices) >= 2
    key = jax.random.PRNGKey(0)
    network = Network()
    params = network.init(key, np.zeros((1, 4, 84, 84)))
    rollout_queue = queue.Queue(maxsize=1)
    threading.Thread(
        target=rollout,
        args=(
            params,
            rollout_queue,
            key,
        )
    ).start()
    # Compile prepare_data for the learner's device (GPU 1).
    prepare_data = jax.jit(prepare_data, device=devices[1])
    for update in range(20):
        obs, actions, logprobs = rollout_queue.get()
        b_obs, b_actions, b_logprobs = prepare_data(obs, actions, logprobs)
        print(update)
```

The profiling shows that the actor thread's computation stalls while the learner fetches the rollout data from the actor's device. Is there any way to keep the device transfers from blocking the actor thread's computation? Thanks a lot!

----

Quick update: I realized that when switching to a different machine, the transfers appear to become P2P transfers. However, they still block the actor thread.
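To make the kind of variant I have in mind concrete, here is a minimal sketch (not part of the repro above; `enqueue_for_learner` is a hypothetical helper) where the actor thread explicitly pushes each batch to the learner's device with `jax.device_put` before enqueueing it. Whether this actually keeps the transfer from stalling the actor's compute stream is exactly what I'm unsure about.

```python
import jax
import jax.numpy as jnp

def enqueue_for_learner(rollout_queue, obs, actions, logprobs, learner_device):
    """Stack the per-step arrays and push them to the learner's device."""
    # jax.device_put dispatches asynchronously, but the copy may still
    # contend with compute on the source device.
    obs = jax.device_put(jnp.stack(obs), learner_device)
    actions = jax.device_put(jnp.stack(actions), learner_device)
    logprobs = jax.device_put(jnp.stack(logprobs), learner_device)
    rollout_queue.put((obs, actions, logprobs))
```

In the shortened code this would replace the plain `rollout_queue.put((obs, actions, logprobs))` call, so `prepare_data` would receive arrays already committed to `devices[1]`.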
-
Without digging deeply into the reproduction, my guess is that the issue is the GPU memory allocator JAX uses. Fundamentally memory allocation is synchronized to the compute stream, so it's common for transfers to block waiting for compute or vice versa if an allocated block is not known to be free.
I'll need to dig more to be sure that's it, though.
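One low-effort way to test that hypothesis, assuming the allocator is the culprit, is to switch away from the default allocator using the environment variables documented in JAX's GPU memory allocation guide and re-run the profile. This is a diagnostic sketch, not a fix: the platform allocator is slower, but it allocates and frees on demand instead of reusing preallocated blocks.

```python
import os

# Must be set before JAX initializes its GPU backend, i.e. before importing jax.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # don't grab a large chunk of GPU memory up front
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"  # allocate on demand instead of via the BFC allocator

import jax  # noqa: E402
```

If the actor's compute no longer stalls during the learner-side transfers under this configuration, that points at the allocator; if it still stalls, the blocking is more likely in the transfer path itself.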