-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Change client table from Log to Set #4598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test PASSed. |
|
Test FAILed. |
|
It looks like when starting 2 node in the test, the driver will connect to a random node now... Some tests will fail in this case... |
8a47230 to
c6fc966
Compare
|
Test PASSed. |
|
Test FAILed. |
|
I think we'd better add a |
python/ray/tests/test_basic.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be simpler.
node = cluster.add_node(num_cpus=3, resources={"CustomResource": 1})
available_plasma_socket_name = node.plasma_store_socket_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raulchen Thanks! That is much better!
|
Test PASSed. |
|
Test FAILed. |
|
Test FAILed. |
c3bca01 to
0a4b9ee
Compare
|
@hartikainen I have refined the temporary code which is used to fix #4140 . Could you take a look and test whether this fix is still working? |
|
Test PASSed. |
|
Test FAILed. |
|
@kfstorm Could you take a look? |
src/ray/gcs/tables.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if multiple nodes call replace at the same time? Will there be race conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the monitor and the raylet may both try to change the status. I will change the Log level from FATAL to ERROR.
src/ray/gcs/tables.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo ignore
kfstorm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why "we'd better keep information the clients that were connected before"? I think the code is still complicated with the is_connected flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider about the scenario that LRU has removed the whole key before replace happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have replace the FATAL log to ERROR, so this may not cause crash even if the LRU mechanism has removed this entry.
src/ray/gcs/tables.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have something to do with profile table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This is totally wrong to use CreateProfileTableData. However, because ProfileTableData and SetReplaceEntryData have similar structure, there is no error caught.
python/ray/experimental/state.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: exists
0a4b9ee to
0a6467e
Compare
|
Test PASSed. |
|
@kfstorm If we do not keep that information, there is no way to know whether a node was connected or not. It's unlike the object table, we can reconstruct a missing object, but for client, this information is important. |
|
Test PASSed. |
|
Test FAILed. |
|
Test FAILed. |
stephanie-wang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. This looks good, thanks!
python/ray/tests/conftest.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous version of this was okay (cluster.redis_address should not change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag is awkward. Can we remove it and just have this function always return REDISMODULE_OK? I would prefer if we just had the caller explicitly check the value and call RedisModule_ReplyWithSimpleString(ctx, "OK") if necessary. We can make the pattern a macro if it becomes cumbersome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remove the flag and change the function return value to Status. Then we can call the reply function at the top layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is safer to not allocate a new variable called new_argv and instead pass in argv_vector.data() here. Also, I would probably then rename argv_vector to new_argv or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good.
src/ray/gcs/tables.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The notification_mode should always be GcsTableNotificationMode::APPEND_OR_ADD, right? Can we make this a check instead of an if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot do that, this is a callback for RequestNotifications. There will be messages for both APPEND_OR_ADD and REMOVE.
src/ray/gcs/tables.cc
Outdated
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
src/ray/gcs/tables.cc
Outdated
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
src/ray/gcs/tables.h
Outdated
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
LGTM. It will be better if we have a |
|
Test FAILed. |
8efb62a to
65f96a0
Compare
|
@raulchen I have resolved the conflict. It looks like that we need to do some code change to make this work with dynamic resources. It looks like I need to move this code snippet into node manager. And every time the resource updating request is received, the backend needs to look up GCS first and then change the resource correspondingly and then call |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Since Hash Table is preferred and there are racing condition in dynamic resource while using Set, this PR will be closed. The hash table PR is #4911 . |
What do these changes do?
is_insertionis changed tois_connectedwhich is more meaningful.Related issue number
#4154
#4140
Linter
scripts/format.shto lint the changes in this PR.