@edoakes
Collaborator

@edoakes edoakes commented Oct 31, 2025

Found it very hard to parse what was happening here, so helping future me (or you!).

Also:

  • Deleted the vestigial next_resource_seq_no_ member.
  • Replaced the non-monotonic clock used for the command version number with a monotonically incremented uint64_t.
  • Added logs for when messages with stale versions are dropped.
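
For readers skimming the description, here is a minimal sketch of the versioning idea: the sender stamps each command with a counter that only moves forward, and the receiver drops (and logs) anything that is not newer than what it has already seen. The class names `CommandVersioner` and `VersionFilter` are hypothetical and are not taken from the Ray code base; only the general scheme comes from the list above.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Hypothetical sender-side helper: a counter that can only increase,
// unlike a wall-clock timestamp, which may jump backwards.
class CommandVersioner {
 public:
  // Returns a strictly increasing version for each outgoing command.
  uint64_t NextVersion() { return ++version_; }

 private:
  uint64_t version_ = 0;
};

// Hypothetical receiver-side helper: remembers the newest version seen
// per message type and rejects anything that is not newer.
class VersionFilter {
 public:
  // Returns true if the message is accepted, false if it is stale.
  bool Accept(int message_type, uint64_t version) {
    auto it = latest_.find(message_type);
    if (it != latest_.end() && it->second >= version) {
      std::clog << "Dropping stale message: type=" << message_type
                << " version=" << version << " latest=" << it->second << "\n";
      return false;
    }
    latest_[message_type] = version;
    return true;
  }

 private:
  std::unordered_map<int, uint64_t> latest_;
};
```

With a clock-based version, two commands created in quick succession (or after a clock adjustment) could carry equal or decreasing versions and be dropped by a filter like this one; a counter avoids that by construction.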

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 31, 2025
@edoakes edoakes requested a review from a team as a code owner October 31, 2025 14:58
cluster_lease_manager_(cluster_lease_manager),
record_metrics_period_ms_(config.record_metrics_period_ms),
placement_group_resource_manager_(placement_group_resource_manager),
next_resource_seq_no_(0),
Collaborator Author

this was vestigial

Contributor

what does vestigial mean lol

@edoakes edoakes force-pushed the eoakes/cleanup-syncer branch 2 times, most recently from e755861 to 3edc6de Compare October 31, 2025 15:00
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request improves the clarity of the ray_syncer usage in the NodeManager by adding detailed comments explaining the RESOURCE_VIEW and COMMANDS channels. Additionally, it refactors the versioning for COMMANDS channel messages, replacing a time-based version with a monotonic counter, which is a more robust approach. However, this change introduces a compilation error because a const method attempts to modify a member variable. I've left a comment with a suggested fix.
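
The suggested fix itself is not shown in this thread. As a hedged illustration of the kind of error described, the sketch below shows a `const` method that increments a member, which only compiles because the member is declared `mutable`. The class `CommandSource` and its method are hypothetical, and `mutable` is just one possible remedy (dropping `const` from the method is another, when the interface allows it); this is not necessarily what the PR ended up doing.

```cpp
#include <cstdint>

// Hypothetical class; names are illustrative only.
class CommandSource {
 public:
  // Declared const, as it would be if an interface required it.
  uint64_t NextCommandVersion() const {
    // Without `mutable` on version_, this increment is a compile error:
    // a const member function may not modify non-mutable members.
    return ++version_;
  }

 private:
  // `mutable` permits modification from const member functions.
  mutable uint64_t version_ = 0;
};
```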

Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes force-pushed the eoakes/cleanup-syncer branch from 3edc6de to 1e41918 Compare October 31, 2025 15:03
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
std::move(message);
StartSend();
return true;
if (node_versions[message->message_type()] >= message->version()) {
Collaborator Author

no behavior change, just reversed the early return logic
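
To illustrate what "reversed the early return logic" can look like, here is a small sketch of the same version check written both ways. The `Msg` struct and both function names are hypothetical simplifications; only the comparison `node_versions[type] >= version` mirrors the diff line above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified message: a type index and a version.
struct Msg {
  std::size_t type;
  int64_t version;
};

// Shape before the change: the accept path is nested inside the condition.
bool AcceptNested(std::vector<int64_t>& node_versions, const Msg& m) {
  if (node_versions[m.type] < m.version) {
    node_versions[m.type] = m.version;
    return true;
  }
  return false;
}

// Shape after the change: the stale case returns early, which leaves an
// obvious place to log the drop, and the accept path is unindented.
bool AcceptEarlyReturn(std::vector<int64_t>& node_versions, const Msg& m) {
  if (node_versions[m.type] >= m.version) {
    return false;  // Stale message; real code could log here.
  }
  node_versions[m.type] = m.version;
  return true;
}
```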

@ray-gardener ray-gardener bot added docs An issue or change related to documentation core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Oct 31, 2025
@edoakes edoakes enabled auto-merge (squash) November 3, 2025 13:19
// Register resource manager and scheduler
// RESOURCE_VIEW is used to synchronize available resources across Raylets.
//
// LocalResourceManager::CreateSyncMessage will be called periodically to collect
Contributor

I think it would be good to mention that it's called both periodically and on demand when local resources change.

Collaborator Author

Actually, it is not called on-demand! Inside of OnResourceOrStateChanged, we increment the version but we do not actually eagerly broadcast.
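
The behavior described in this reply can be sketched as follows: a state change only bumps a version, and nothing is sent until the next periodic poll notices that the version moved. All names below other than `OnResourceOrStateChanged` and `CreateSyncMessage` are hypothetical, and the signatures are simplified assumptions rather than the actual Ray interfaces.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified sync message.
struct SyncMessage {
  int64_t version;
  std::string payload;
};

// Stand-in for the local resource state.
class LocalResources {
 public:
  // Records the change and bumps the version. Note: nothing is sent here.
  void OnResourceOrStateChanged(std::string new_state) {
    state_ = std::move(new_state);
    ++version_;
  }

  // Returns a message only if something changed since `after_version`.
  std::optional<SyncMessage> CreateSyncMessage(int64_t after_version) const {
    if (version_ <= after_version) {
      return std::nullopt;
    }
    return SyncMessage{version_, state_};
  }

 private:
  int64_t version_ = 0;
  std::string state_;
};

// Stand-in for the periodic timer that drives reporting.
class PeriodicPoller {
 public:
  explicit PeriodicPoller(const LocalResources& resources)
      : resources_(resources) {}

  // One timer tick: returns true if a message was produced (i.e. "sent").
  bool Tick() {
    auto msg = resources_.CreateSyncMessage(last_sent_version_);
    if (!msg.has_value()) {
      return false;
    }
    last_sent_version_ = msg->version;
    ++sent_count_;
    return true;
  }

  int sent_count() const { return sent_count_; }

 private:
  const LocalResources& resources_;
  int64_t last_sent_version_ = 0;
  int sent_count_ = 0;
};
```

In this sketch, several changes between two ticks collapse into a single message carrying the latest state, which is one consequence of not broadcasting eagerly.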

Contributor

Oh whattt, thanks for the clarification 🤯

if (triggered_by_global_gc) {
// Always increment the sync message version number so that all GC commands
// are sent indiscriminately.
gc_command_sync_version_++;
Contributor

I think it would be good to mention that even though we call OnDemandBroadcasting, the message is only sent to the GCS and not to other raylets.

Contributor

I think it would also be good to mention, in BroadcastMessage or on the sync_reactors_ map, that node managers have only one bidi reactor, which connects to the GCS, while the GCS has multiple bidi reactors, one for each node. This emphasizes that it's NOT all-to-all at the raylet level: it's node to GCS to all nodes.

Collaborator Author

I included this where we initialize the syncer and in the ray_syncer_ field comment. Putting it here specifically felt odd because it applies to all usage of the syncer.
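
The topology discussed in this thread (each node holds a single connection to the GCS, and the GCS holds one connection per node) can be sketched as a hub-and-spoke relay. The `Hub` and `Node` classes below are hypothetical stand-ins for the GCS and a raylet; they model only the routing, not the real reactors or gRPC streams.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

class Hub;

// Hypothetical stand-in for a raylet: its only connection is to the hub.
class Node {
 public:
  Node(std::string id, Hub& hub);

  // "Broadcasting" from a node means sending one message to the hub.
  void Broadcast(const std::string& payload);

  void Receive(const std::string& from, const std::string& payload) {
    inbox_.push_back(from + ":" + payload);
  }

  const std::string& id() const { return id_; }
  const std::vector<std::string>& inbox() const { return inbox_; }

 private:
  std::string id_;
  Hub& hub_;
  std::vector<std::string> inbox_;
};

// Hypothetical stand-in for the GCS: one connection per registered node.
class Hub {
 public:
  void Register(Node* node) { nodes_[node->id()] = node; }

  // Relays a message from one node to every other node.
  void Relay(const std::string& from, const std::string& payload) {
    for (auto& entry : nodes_) {
      if (entry.first != from) {
        entry.second->Receive(from, payload);
      }
    }
  }

 private:
  std::map<std::string, Node*> nodes_;
};

inline Node::Node(std::string id, Hub& hub) : id_(std::move(id)), hub_(hub) {
  hub_.Register(this);
}

inline void Node::Broadcast(const std::string& payload) {
  hub_.Relay(id_, payload);
}
```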

Contributor

@Sparks0219 Sparks0219 left a comment

Think the comments could be a bit more refined

@edoakes edoakes requested a review from Sparks0219 November 4, 2025 02:49
}
}

std::optional<syncer::RaySyncMessage> NodeManager::CreateSyncMessage(
Contributor

@Sparks0219 Sparks0219 Nov 7, 2025

nit: Can we rename this to CreateSyncCommandsMessage?

Collaborator Author

we cannot because it's a virtual method to implement the sync broadcaster interface. I tried to do that already :'(

Contributor

ohh... I see both LocalResourceManager and NodeManager inherit from the syncer class and override this RIP
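
The constraint in this exchange comes from how C++ virtual dispatch works: the method name is fixed by the interface, so an implementer cannot rename its override without changing the interface and every other implementer. A minimal sketch follows, with a hypothetical `ReporterInterface` and simplified signatures that are assumptions rather than the real syncer API.

```cpp
#include <optional>
#include <string>

enum class MessageType { RESOURCE_VIEW, COMMANDS };

// Hypothetical, simplified sync message.
struct SyncMessage {
  MessageType type;
  std::string payload;
};

// Hypothetical interface that the syncer calls into.
class ReporterInterface {
 public:
  virtual ~ReporterInterface() = default;
  virtual std::optional<SyncMessage> CreateSyncMessage(
      MessageType type) const = 0;
};

// Stand-in for the resource-view implementer.
class ResourceReporter : public ReporterInterface {
 public:
  std::optional<SyncMessage> CreateSyncMessage(
      MessageType type) const override {
    if (type != MessageType::RESOURCE_VIEW) return std::nullopt;
    return SyncMessage{type, "resources"};
  }
};

// Stand-in for the commands implementer. Renaming this method (for
// example to CreateSyncCommandsMessage) would make `override` fail to
// compile and leave the pure virtual unimplemented.
class CommandReporter : public ReporterInterface {
 public:
  std::optional<SyncMessage> CreateSyncMessage(
      MessageType type) const override {
    if (type != MessageType::COMMANDS) return std::nullopt;
    return SyncMessage{type, "commands"};
  }
};

// The caller only knows the interface name.
std::optional<SyncMessage> Poll(const ReporterInterface& reporter,
                                MessageType type) {
  return reporter.CreateSyncMessage(type);
}
```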

Contributor

@Sparks0219 Sparks0219 left a comment

LGTM 🚢

@edoakes edoakes merged commit 97de782 into ray-project:master Nov 7, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…s in the Raylet (ray-project#58342)

Found it very hard to parse what was happening here, so helping future
me (or you!).

Also:

- Deleted vestigial `next_resource_seq_no_`.
- Converted from non-monotonic clock to a monotonically incremented
`uint64_t` for the version number for commands.
- Added logs when we drop messages with stale versions.

---------

Signed-off-by: Edward Oakes <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Labels

core Issues that should be addressed in Ray Core docs An issue or change related to documentation go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
