[core] Add comments explaining the usage of the ray_syncer_ channels in the Raylet
#58342
cluster_lease_manager_(cluster_lease_manager),
record_metrics_period_ms_(config.record_metrics_period_ms),
placement_group_resource_manager_(placement_group_resource_manager),
next_resource_seq_no_(0),
this was vestigial
what does vestigial mean lol
e755861 to 3edc6de
Code Review
This pull request improves the clarity of the ray_syncer usage in the NodeManager by adding detailed comments explaining the RESOURCE_VIEW and COMMANDS channels. Additionally, it refactors the versioning for COMMANDS channel messages, replacing a time-based version with a monotonic counter, which is a more robust approach. However, this change introduces a compilation error because a const method attempts to modify a member variable. I've left a comment with a suggested fix.
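To make the two points in this summary concrete, here is a minimal sketch, not the PR's actual code. `CommandReporter` and all of its members are hypothetical names. It shows a counter that only ever increases (unlike a non-monotonic clock, which can repeat or go backwards) and why a `const` method cannot increment a member unless that member is `mutable`. Dropping `const` from the method is the other obvious fix; the conversation does not say which one was applied.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a component that versions outgoing commands.
class CommandReporter {
 public:
  // Non-const: free to bump the counter for every new command.
  void OnNewCommand() {
    assert(version_ < UINT64_MAX);  // Wraparound would break monotonicity.
    ++version_;
  }

  // Const: may read version_, but writing `++version_;` here would fail to
  // compile because the object is read-only inside a const method.
  bool HasUpdateAfter(uint64_t after_version) const {
    ++poll_count_;  // Allowed only because poll_count_ is `mutable`.
    return version_ > after_version;
  }

  uint64_t version() const { return version_; }
  uint64_t poll_count() const { return poll_count_; }

 private:
  uint64_t version_ = 0;             // Monotonic: only ever incremented.
  mutable uint64_t poll_count_ = 0;  // Bookkeeping a const method may touch.
};
```

Two commands issued back to back always get distinct, increasing versions here, which a timestamp with coarse resolution would not guarantee.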
Signed-off-by: Edward Oakes <[email protected]>
3edc6de to 1e41918
std::move(message);
StartSend();
return true;
if (node_versions[message->message_type()] >= message->version()) {
no behavior change, just reversed the early return logic
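For readers unfamiliar with the check in this hunk, the sketch below shows the early-return shape under some assumptions: `Message` and `VersionFilter` are hypothetical simplifications (the real message is `syncer::RaySyncMessage`), versions are assumed to be non-negative integers, and the log line only stands in for the stale-version logging mentioned in the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Hypothetical, simplified message type for illustration only.
struct Message {
  int message_type;
  int64_t version;
};

// Tracks the newest version already handled per message type for one peer.
class VersionFilter {
 public:
  // Returns false early (and logs) when the message is not newer than what
  // was already seen; otherwise records the version and returns true.
  bool Accept(const Message &message) {
    assert(message.version >= 0);  // Assumption made by this sketch.
    auto it = seen_.find(message.message_type);
    if (it != seen_.end() && it->second >= message.version) {
      std::clog << "Dropping stale message: type=" << message.message_type
                << " version=" << message.version
                << " latest=" << it->second << "\n";
      return false;
    }
    seen_[message.message_type] = message.version;
    return true;
  }

 private:
  std::unordered_map<int, int64_t> seen_;
};
```

Putting the stale case first keeps the "accept and send" path unindented, which is presumably the readability gain of reversing the condition.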
// Register resource manager and scheduler
// RESOURCE_VIEW is used to synchronize available resources across Raylets.
//
// LocalResourceManager::CreateSyncMessage will be called periodically to collect
I think it's good to mention that it's both periodically called and also on-demand when local resources change.
Actually, it is not called on-demand! Inside of OnResourceOrStateChanged, we increment the version but we do not actually eagerly broadcast.
Oh whattt, thanks for the clarification 🤯
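The behavior clarified in this thread (a state change bumps the version, and only the periodic poll produces a message) can be sketched roughly as follows. This is a simplified model, not Ray's code: `Snapshot`, `ResourceReporter`, and `PeriodicPoller` are hypothetical, and the single `available_cpus` field stands in for the real resource view.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical payload standing in for a resource view message.
struct Snapshot {
  int64_t version;
  int available_cpus;
};

class ResourceReporter {
 public:
  // A local change only marks the state as newer; nothing is sent here.
  void OnResourceOrStateChanged(int available_cpus) {
    available_cpus_ = available_cpus;
    ++version_;
  }

  // Returns a snapshot only if the state is newer than `after_version`.
  std::optional<Snapshot> CreateSyncMessage(int64_t after_version) const {
    if (version_ <= after_version) return std::nullopt;
    return Snapshot{version_, available_cpus_};
  }

 private:
  int64_t version_ = 0;
  int available_cpus_ = 0;
};

class PeriodicPoller {
 public:
  // One timer tick. Returns the snapshot that would be sent, if any.
  std::optional<Snapshot> Tick(const ResourceReporter &reporter) {
    auto snapshot = reporter.CreateSyncMessage(last_sent_version_);
    if (snapshot) {
      assert(snapshot->version > last_sent_version_);
      last_sent_version_ = snapshot->version;
    }
    return snapshot;
  }

 private:
  int64_t last_sent_version_ = 0;
};
```

One consequence of this design: several changes between two ticks collapse into a single message carrying only the latest state.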
if (triggered_by_global_gc) {
  // Always increment the sync message version number so that all GC commands
  // are sent indiscriminately.
  gc_command_sync_version_++;
I think it's good to mention that even though we call OnDemandBroadcasting, it's only sent to the GCS and not to other raylets
I think it would also be good to mention in BroadcastMessage or the map of sync_reactors_ that for node managers, we only have one bidi reactor which is to the GCS. GCS has multiple bidi reactors, one for each node. Hence just to emphasize that it's NOT all to all on the raylet level, it's node to GCS to all nodes
I included this where we initialize the syncer and in the ray_syncer_ field comment. Putting it here specifically felt odd because it applies to all usage of the syncer.
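The topology described in this thread (each node holds a single connection to the GCS, and the GCS holds one connection per node) is a hub-and-spoke relay rather than all-to-all. A toy model of that shape is below. `Node` and `Hub` are hypothetical, strings stand in for sync messages, and skipping the sender when relaying is an assumption of this sketch rather than something stated in the conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical node: it never talks to other nodes directly.
struct Node {
  std::string id;
  std::vector<std::string> inbox;
};

// Hypothetical hub playing the GCS role: one connection per node.
class Hub {
 public:
  void Connect(Node *node) {
    assert(node != nullptr);
    connections_[node->id] = node;
  }

  // Relays a message received from one node to every other connected node.
  void OnMessageFrom(const std::string &sender_id, const std::string &payload) {
    for (auto &entry : connections_) {
      if (entry.first != sender_id) {
        entry.second->inbox.push_back(payload);
      }
    }
  }

  std::size_t num_connections() const { return connections_.size(); }

 private:
  std::unordered_map<std::string, Node *> connections_;
};
```

With N nodes this needs N connections in total instead of the N*(N-1)/2 that a full mesh would require, at the cost of an extra hop through the hub.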
Sparks0219
left a comment
Think the comments could be a bit more refined
}
}

std::optional<syncer::RaySyncMessage> NodeManager::CreateSyncMessage(
nit: Can we rename this to CreateSyncCommandsMessage?
we cannot because it's a virtual method to implement the sync broadcaster interface. I tried to do that already :'(
ohh... I see both LocalResourceManager and NodeManager inherit from the syncer class and override this RIP
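The renaming constraint discussed here follows from how virtual overrides work: the caller invokes the method through the base interface, so every implementer must use the interface's name. The sketch below illustrates that with hypothetical types (`ReporterInterface`, `ResourceViewSource`, `CommandSource`, `SyncMessage`); the signature is simplified and is not claimed to match Ray's actual interface.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

enum class MessageType { RESOURCE_VIEW, COMMANDS };

// Hypothetical, simplified message.
struct SyncMessage {
  MessageType type;
  int64_t version;
  std::string payload;
};

// Hypothetical interface: the method name is fixed here, once, for everyone.
class ReporterInterface {
 public:
  virtual ~ReporterInterface() = default;
  virtual std::optional<SyncMessage> CreateSyncMessage(
      int64_t after_version) const = 0;
};

class ResourceViewSource : public ReporterInterface {
 public:
  std::optional<SyncMessage> CreateSyncMessage(
      int64_t after_version) const override {
    if (version_ <= after_version) return std::nullopt;
    return SyncMessage{MessageType::RESOURCE_VIEW, version_, "cpus=4"};
  }

 private:
  int64_t version_ = 1;
};

class CommandSource : public ReporterInterface {
 public:
  // Renaming this to e.g. CreateSyncCommandsMessage would no longer override
  // anything, and `override` would turn that into a compile error.
  std::optional<SyncMessage> CreateSyncMessage(
      int64_t after_version) const override {
    if (version_ <= after_version) return std::nullopt;
    return SyncMessage{MessageType::COMMANDS, version_, "global_gc"};
  }

 private:
  int64_t version_ = 1;
};

// The caller only knows the base type, so it can only use the shared name.
std::optional<SyncMessage> Poll(const ReporterInterface &reporter,
                                int64_t after_version) {
  auto message = reporter.CreateSyncMessage(after_version);
  assert(!message || message->version > after_version);
  return message;
}
```

A comment on each override stating which channel it serves is a reasonable substitute when the name itself cannot carry that information.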
Sparks0219
left a comment
LGTM 🚢
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]>
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: YK <[email protected]>
Found it very hard to parse what was happening here, so helping future me (or you!).
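Also:
- Deleted vestigial `next_resource_seq_no_`.
- Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands.
- Added logs when we drop messages with stale versions.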