-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-25815 RSGroupBasedLoadBalancer online status never updates after being set to true for the first time #3606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
5d8583d
4fa64e3
fd62630
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,7 +32,9 @@ | |
| import java.util.Set; | ||
| import java.util.SortedSet; | ||
| import java.util.TreeSet; | ||
| import java.util.concurrent.CompletableFuture; | ||
| import java.util.concurrent.Future; | ||
| import java.util.concurrent.TimeUnit; | ||
| import java.util.function.Function; | ||
| import java.util.stream.Collectors; | ||
| import org.apache.commons.lang3.StringUtils; | ||
|
|
@@ -661,12 +663,14 @@ private synchronized void flushConfig(Map<String, RSGroupInfo> newGroupMap) thro | |
| return; | ||
| } | ||
|
|
||
| // Make changes visible | ||
| resetRSGroupMap(newGroupMap); | ||
|
|
||
| /* For online mode, persist to hbase:rsgroup and Zookeeper */ | ||
| flushConfigTable(newGroupMap); | ||
|
|
||
| // Make changes visible after having been persisted to the source of truth | ||
| resetRSGroupMap(newGroupMap); | ||
| saveRSGroupMapToZK(newGroupMap); | ||
|
|
||
| // Update previous map | ||
| updateCacheOfRSGroups(newGroupMap.keySet()); | ||
| } | ||
|
|
||
|
|
@@ -825,6 +829,20 @@ private void createRSGroupTable() throws IOException { | |
| } | ||
|
|
||
| public boolean isOnline() { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How often is this check run?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @saintstack It's called by
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess the intention here is we will not come back to offline mode after online?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess that was the original intention, but it seems misleading/strange. We have encountered errors in prod resulting from trying to flush to
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this part of code is written by me but I can not recall if it was already like this or I changed the implementation to make it only online once. Give some time to check code for different branches...
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, it was not me. It was like this when we first introduced this class in HBASE-6721. Even on branch-1 backport, the implementation is just return the isOnline flag. So if you want to change the implementation, please explain a bit about the reason?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason is that if we do not periodically update the Instead of being blocked waiting like this, we can go through an "offline" code path. There already is an offline code path in Please correct me if I misunderstood anything. What do you think about this rationale?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Apache9 any thoughts?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The logic seems good to me. @Apache9 Would you like to verify? |
||
| if (isMasterRunning(masterServices)) { | ||
| try { | ||
| // try reading from the table | ||
| CompletableFuture<Result> read = conn.getTable(RSGROUP_TABLE_NAME).get(new Get(ROW_KEY)); | ||
| if (read.get(10000, TimeUnit.MILLISECONDS) != null) { | ||
| online = true; | ||
| } | ||
| } catch (Exception e) { | ||
| LOG.warn("Failed to read from " + RSGROUP_TABLE_NAME+ "; setting online = false"); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: use log placeholder |
||
| online = false; | ||
| } | ||
| } else { | ||
| online = false; | ||
| } | ||
| return online; | ||
| } | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because generally it is the case to update in-memory before persistent storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caroliney14 even if we don't change this order, are we still good with the main problem? I was wondering if this particular order of in-memory vs persistent update could be taken up in follow-up PR as well if we are good with the latest change in
isOnline()method.