Skip to content

Conversation

@guoyuhong
Copy link
Contributor

What do these changes do?

  1. Change client table from Log to Set.
  2. We can totally delete the dead client entries from client table. However, we'd better keep information the clients that were connected before. Thus, I will remove the first entry that is added during the connection time. The 2-entry problem for a dead client caused many problems, so this change will make it clear.
  3. Since we do not use insertion operation, is_insertion is changed to is_connected which is more meaningful.

Related issue number

#4154
#4140

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/415/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13696/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

It looks like when starting 2 node in the test, the driver will connect to a random node now... Some tests will fail in this case...

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/429/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13730/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

I think we'd better add a RAY.SET_REPLACE command to guarantee the atomic of the replace operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be simpler.

node = cluster.add_node(num_cpus=3, resources={"CustomResource": 1})
available_plasma_socket_name = node.plasma_store_socket_name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raulchen Thanks! That is much better!

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/457/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13770/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13771/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

@hartikainen I have refined the temporary code which is used to fix #4140 . Could you take a look and test whether this fix is still working?

@guoyuhong guoyuhong requested a review from hartikainen April 14, 2019 03:27
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/476/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13791/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

@kfstorm Could you take a look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if multiple nodes call replace at the same time? Will there be race conditions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the monitor and the raylet may both try to change the status. I will change the Log level from FATAL to ERROR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo ignore

@raulchen raulchen requested a review from stephanie-wang April 15, 2019 13:11
Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why "we'd better keep information the clients that were connected before"? I think the code is still complicated with the is_connected flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider about the scenario that LRU has removed the whole key before replace happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replace the FATAL log to ERROR, so this may not cause crash even if the LRU mechanism has removed this entry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have something to do with profile table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This is totally wrong to use CreateProfileTableData. However, because ProfileTableData and SetReplaceEntryData have similar structure, there is no error caught.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: exists

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/592/
Test PASSed.

@guoyuhong
Copy link
Contributor Author

@kfstorm If we do not keep that information, there is no way to know whether a node was connected or not. It's unlike the object table, we can reconstruct a missing object, but for client, this information is important.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/593/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13961/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13962/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

@kfstorm @raulchen @ste Do you have more comments?

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. This looks good, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous version of this was okay (cluster.redis_address should not change).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is awkward. Can we remove it and just have this function always return REDISMODULE_OK? I would prefer if we just had the caller explicitly check the value and call RedisModule_ReplyWithSimpleString(ctx, "OK") if necessary. We can make the pattern a macro if it becomes cumbersome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove the flag and change the function return value to Status. Then we can call the reply function at the top layer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safer to not allocate a new variable called new_argv and instead pass in argv_vector.data() here. Also, I would probably then rename argv_vector to new_argv or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notification_mode should always be GcsTableNotificationMode::APPEND_OR_ADD, right? Can we make this a check instead of an if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do that, this is a callback for RequestNotifications. There will be messages for both APPEND_OR_ADD and REMOVE.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

@kfstorm
Copy link
Member

kfstorm commented May 8, 2019

LGTM.

It will be better if we have a Map structure in future days. 😁

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14330/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

@raulchen I have resolved the conflict. It looks like that we need to do some code change to make this work with dynamic resources. It looks like I need to move this code snippet into node manager. And every time the resource updating request is received, the backend needs to look up GCS first and then change the resource correspondingly and then call Set::Replace.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14333/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/865/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/868/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

Since Hash Table is preferred and there are racing condition in dynamic resource while using Set, this PR will be closed. The hash table PR is #4911 .

@guoyuhong guoyuhong closed this Jun 3, 2019
@guoyuhong guoyuhong deleted the clientTable2set branch June 3, 2019 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants