@chenzhuofu chenzhuofu commented Feb 5, 2025

Description of changes:
Realm backend implementation (single-device version).
Passes the unit tests and the E2E test.

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #


@chenzhuofu
Collaborator Author

Multi-device:

  • One master proc traverses the PCG and spawns tasks on all GPUs (this needs rank info from the PCG)
  • Use Realm events to control dependencies between tasks
  • Execute a PCG within a Realm task
  • Add new tests
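The event-based dependency idea in the list above can be sketched without Realm at all. The block below is only an illustration under assumptions: `Event` is a `std::shared_future<void>` standing in for a Realm event, `Node` and `run_graph` are hypothetical names, and the "PCG" is reduced to a small DAG whose nodes are already listed in topological order. A single traversal launches every task and hands each one the completion events of its predecessors.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for a completion event (NOT Realm::Event): it becomes ready
// when the task that produced it has finished.
using Event = std::shared_future<void>;

// Hypothetical graph node: a name plus the indices of its predecessors.
// Nodes are assumed to be stored in topological order.
struct Node {
  std::string name;
  std::vector<std::size_t> deps;
};

// One "master" traversal: launch every node as an asynchronous task whose
// preconditions are the events of its predecessors. Returns the order in
// which the tasks actually ran.
std::vector<std::string> run_graph(const std::vector<Node>& graph) {
  std::vector<Event> done;         // done[i] is the completion event of node i
  std::vector<std::string> order;  // execution log, guarded by order_mutex
  std::mutex order_mutex;

  for (std::size_t i = 0; i < graph.size(); ++i) {
    std::vector<Event> preconditions;
    for (std::size_t d : graph[i].deps) {
      assert(d < i);  // predecessors must already have been launched
      preconditions.push_back(done[d]);
    }
    done.push_back(
        std::async(std::launch::async,
                   [&graph, &order, &order_mutex, i, preconditions]() {
                     // Block until every predecessor has finished.
                     for (const Event& e : preconditions) e.wait();
                     std::lock_guard<std::mutex> lock(order_mutex);
                     order.push_back(graph[i].name);
                   })
            .share());
  }

  // Wait for the whole graph before returning the log.
  for (const Event& e : done) e.wait();
  return order;
}
```

In a diamond-shaped graph (A feeds B and C, which both feed D), B and C may run in either order, but A always runs first and D always runs last. That ordering comes only from the events, not from the launch order.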

Multi-node:

  • One master proc per node traverses the PCG
  • Use a barrier for dependencies between nodes:
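    // Create a barrier and pass it to the shard tasks of node 0 and node 1.
    barrier = create_barrier();

    // Node 0: arrive at the barrier once Task0 has completed.
    event = launch_task(Task0);
    barrier.arrive(event);

    // Node 1: launch Task1 with the barrier as its precondition.
    launch_task(Task1, barrier);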
  • Manage data transactions using the Realm interface
  • Add new tests
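The cross-node barrier pattern from the list above can be mimicked in plain C++ to make the intended ordering concrete. This is a sketch under assumptions, not Realm code: `ArrivalBarrier` and `run_two_nodes` are hypothetical names, the two "nodes" are just threads in one process, and node 0 calls `arrive()` after its task finishes instead of passing a completion event to the barrier.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a barrier (NOT Realm::Barrier): it triggers
// once the expected number of arrivals has been recorded.
class ArrivalBarrier {
 public:
  explicit ArrivalBarrier(unsigned expected) : remaining_(expected) {}

  // Record one arrival; wake all waiters when the last one arrives.
  void arrive() {
    std::lock_guard<std::mutex> lock(m_);
    assert(remaining_ > 0);
    if (--remaining_ == 0) cv_.notify_all();
  }

  // Block until all expected arrivals have been recorded.
  void wait() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  unsigned remaining_;
};

// Two threads play the roles of node 0 and node 1. Returns the order in
// which the two tasks ran.
std::vector<std::string> run_two_nodes() {
  ArrivalBarrier barrier(1);  // one arrival expected, from node 0
  std::vector<std::string> log;
  std::mutex log_mutex;
  auto record = [&log, &log_mutex](const std::string& name) {
    std::lock_guard<std::mutex> lock(log_mutex);
    log.push_back(name);
  };

  // Node 1 is started first on purpose: Task1 still cannot run until the
  // barrier has triggered.
  std::thread node1([&barrier, &record] {
    barrier.wait();
    record("Task1");
  });
  // Node 0 runs Task0 and only then arrives at the barrier.
  std::thread node0([&barrier, &record] {
    record("Task0");
    barrier.arrive();
  });

  node0.join();
  node1.join();
  return log;
}
```

Because Task1 is recorded only after `wait()` returns, and `arrive()` is called only after Task0 is recorded, the log is always Task0 followed by Task1, whichever thread starts first.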

@chenzhuofu
Collaborator Author

Passes the E2E test!

@chenzhuofu
Collaborator Author

I refactored the code and removed the redundant parts, so it is now much simpler and clearer. The PR is ready for review. @lockshaw @jiazhihao

@chenzhuofu chenzhuofu changed the base branch from local-e2e-training to master August 6, 2025 03:47
@chenzhuofu chenzhuofu marked this pull request as ready for review August 6, 2025 03:47
4 participants