HBASE-24286: HMaster won't become healthy after cloning or crea… #2113
Conversation
| // mismatches with hbase:meta. in fact if HBCKSCP finds any in-memory region states, | ||
| // HBCKSCP is basically same as SCP. | ||
| lastReport.unknownServers.stream().forEach(regionInfoServerNamePair -> { | ||
| LOG.warn("Submitting HBCKSCP for Unknown Region Server {}", |
should this be at INFO level? if it is at WARN what can an operator do to prevent future warnings? if this is something that is going to happen as a matter of course then INFO is more appropriate.
+1, this is expected, the log is just for info purposes
good point, I will change it.
| @@ -0,0 +1,185 @@ | |||
| /** | |||
nit: this should be a non-javadoc comment
fixed, sorry, I copied it from other files and I found many places in hbase have this header lol
| TEST_UTIL.shutdownMiniHBaseCluster(); | ||
| TEST_UTIL.getDFSCluster().getFileSystem().delete( | ||
| new Path(walRootPath.toString(), HConstants.HREGION_LOGDIR_NAME), true); | ||
| TEST_UTIL.getDFSCluster().getFileSystem().delete( |
should we have some check before this that the cluster successfully cleanly shut down and thus has no procedures pending?
if we are expressly trying to test that we can come up safely after having the master procedure wal destroyed while things are in-flight then we should make that a test case.
refactored to wait until procedures have completed before deleting the WALs and the Master procedure WALs.
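For the refactor, a rough sketch of that wait, assuming the test's existing TEST_UTIL (exact API names may differ across branches; this is not the final test code):

```java
// Rough sketch, not the final test code: make sure no master procedure is
// still running before shutting the mini cluster down and wiping the WAL dirs.
ProcedureExecutor<MasterProcedureEnv> procExec =
    TEST_UTIL.getHBaseCluster().getMaster().getMasterProcedureExecutor();
TEST_UTIL.waitFor(60000,
    () -> procExec.getProcedures().stream().allMatch(Procedure::isFinished));
TEST_UTIL.shutdownMiniHBaseCluster();
// only after that delete HREGION_LOGDIR_NAME and the MasterProcWALs directory
```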
|
I'm not sure if we want to discuss here or on #2114, but, copying from #2114 (review),
I think the code is the same between the two, so let's discuss here since it seems all the info is here (at the moment).
In what cases can a RS be marked as "unknown"? If we think this is a transient state, we can always add a ttl before reassigning (but that will add considerable recovery time).
If hbase:meta points to the correct RS, theoretically, no writes should come to this RS, right? In that case, the effects shouldn't be too adverse... if that RS crashes, the Region won't be reassigned because HMaster/ZK didn't know about it.
I need to double-check the failing unit test; I will update here only once I have the correct patch (something is hanging the region restarts).
Correct me if I'm wrong: if an unknown server joins an hbase cluster, since it's not online or dead, the assignment manager should not take those. Basically, …
No. A client which spans the lifetime of your cluster (prior to shutdown, destruction, and recreation) could potentially have a cached region location. This is why fencing (e.g. HDFS lease recovery) is super-important for us. A client could presumably continue to try to write to a RS that has gone haywire.
Doesn't a SCP trigger log splitting (and therefore recoverLease) which would handle this case?
So, I feel like you're asking a different question than what I was concerned about (I was concerned about making sure all "old" RegionServers are actually down before we reassign regions onto new servers). This worries me because we rely on SCP's which we are acknowledging are gone in this scenario. Do we just have to make an external "requirement" that the system stopping old hardware ensures all previous RegionServers are fully dead before proceeding with creating new ones that point at the same data? To the question you asked, what is the definition of an "unknown" server in your case: a ServerName listed in meta as a region's assigned location which is not in the AssignmentManagers set of live RegionServers? If that's the case, yes, that's how I understand AM to work today -- the presence of an "unknown" server as an assignment indicates a failure in the system. That is, we lost a MasterProcWAL which had an SCP for a RS. I think that's why this is a "fix by hand" kind of scenario today.
Yup, that's the crux of it. This has been a nagging problem in the back of my mind that turns my stomach. This is what I think the situation is (to try to come up with some common terminology):
For HBase's consistency/safety, before we start any RS in …
That's my point. We don't have the SCP because the proc wals were deleted. We normally do an SCP when we receive the RS ephemeral node deletion in ZK. Since we don't have either of these, we just have to be super sure that it's actually safe to submit that SCP. I agree with you if that we did submit an SCP, the system should recover. This makes me wonder... do we have any analogous situations in a "normal" cluster (with hardware). For example..
Do we submit an SCP for that RS today? Or, only when the new instance of that RS is started? I think this is a comparable situation -- maybe there's something I've not considered that we can still pull "state" from (e.g. we store something in the proc wals) |
|
Ah, ignore my comment... I see that we aren't really running a SCP for those cases, we're just cleaning them up from meta? |
|
Also from my investigation, I found that this comment should be updated if we handle this in CatalogJanitor: https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/HBCKServerCrashProcedure.java#L53 |
Wouldn't there still be a znode in this case? That would probably trigger a SCP. Maybe you would get that situation if you added a 3b. delete zNode/clear out ZK |
Not sure, honestly :). I know the znodes that the Master watches for cluster membership are ephemeral (not persistent). However, maybe there is something else. I'd have to look. I can try to do this tmrw if you or Stephen don't trace through :) |
|
Thanks Josh, and honestly I didn't know the logic till now. Here are the findings for both situations you're concerned about (first case and second case): there are three key parts in the normal system to handle this. If MasterProcWALs/MasterRegion both exist after a cluster restarts, when … Then, in the case of deleting MasterProcWALs (or MasterRegion in branch-2.3+) while keeping the ZK nodes, even though there is no procedure restored from MasterProcWALs, as long as we have the WAL for the previous host, we can still schedule an SCP for it. But if MasterProcWALs and WAL are both deleted, neither the first nor the second case will operate normally. The case we were originally trying to solve falls into the situation where MasterProcWALs and WAL are deleted after the cluster restarted: we don't have the WAL, MasterProcWALs/MasterRegion, or Zookeeper data, only HFiles, so those servers are unknown and their regions cannot be reassigned. About the unit test failure: I'm hitting a strange issue. My tests work fine if I delete the WAL, MasterProcWALs, and the ZK baseZNode on branch-2.2. However, with the same setup on branch-2.3+ and master, master initialization hangs if the ZK baseZNode is deleted, with or without my changes. (What has changed in branch-2.3? I found MasterRegion but I'm not sure why that's related to ZK data; is it a bug?) Interestingly, my fix works if I keep the baseZnode, so I'm trying to figure out the right way to clean up zookeeper so it matches the cloud use case where the WAL on HDFS and ZK data are also deleted when the HBase cluster is terminated.
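To make the "key parts" above concrete, here is a conceptual sketch of why a previous host can only be recovered if some trace of it survives; listWalDirServerNames, isOnline and scheduleSCP are hypothetical helpers, not real HBase APIs:

```java
// Conceptual sketch only; helper names are hypothetical. The master can only
// schedule recovery for a previous host if some trace of it survives: a
// restored SCP from the MasterProcWALs, a WAL directory named after the host,
// or its znode in ZooKeeper.
for (ServerName previousHost : listWalDirServerNames(fs, walRootDir)) {
  if (!isOnline(previousHost)) {
    // The host left WALs behind and never rejoined: schedule recovery for it.
    scheduleSCP(previousHost);
  }
}
// If the WAL directories, MasterProcWALs and ZK data are all gone, none of
// these traces exist, so hbase:meta keeps pointing at "unknown" servers.
```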
|
while waiting for the unit test runs, I want to bring up two extra topics and may follow up with new JIRA(s)
|
I haven't looked at the patch yet, but what I want to say is that it is not a good idea to use HBCK stuff in automatic failure recovery. The design of HBCK is that the operation is dangerous, so only operators can perform it, manually, and the operators need to know the risk. Will report back later after reviewing the patch; I also need to learn what the problem is that we want to solve. Thanks.
|
thanks @Apache9 we can switch to use standard |
I think the old servers will be the unknown servers, not the new servers? I do not think there is a guarantee that if you change the internal filesystem layout of HBase, the HBase cluster will still be functional. Even if sometimes it could, as you said, on 1.4, it does not mean that we will always keep this behavior in new versions. For your scenario, it is OK as you can confirm that there is no data loss because you manually flushed all the data. But what if another user just configures the wrong WAL directory? In this case, if we schedule SCPs automatically, there will be data loss. In general, in HBase, we only know that something strange happened, but we do not know how we came into this situation, so it is not safe for us to just recover automatically. Only the end user knows, so I still prefer we just provide tools for them to recover, not add some 'unsafe' options in HBase... Thanks.
Agree with Duo. That might be a better path for such a use case where the cluster is deleted (after persisting all data) and later recreated pointing to the existing FS.
Ya, I am also agreeing with this point.
|
So the use case here is starting a new cluster in the cloud where HDFS (WAL) data on the previous cluster will not be available. One of the benefits of storing the data off the cluster (in our case, S3), is to not have to replicate data (and just create a new cluster pointed to the same root directory). IMO, in this case we shouldn't need the WAL directories to exist just to tell us to reassign and this is a valid use case. I get that there is pushback for enabling this in the catalog cleaner, and I think that's fine. For this case, it's a one time issue, not something that periodically needs fixing. (there might be other unknown server cases that would require that, but that isn't blocking us at the moment). So, instead maybe a 1-time run to cleanup old servers/schedule SCP for them (this is what the code that was removed in HBASE-20708 actually did) makes the most sense? I understand that it was removed to simplify the assignment, but it has a very different behavior. In fact it looks like we don't even try to read hbase:meta if it is found (without SCP/WAL) and simply just delete the directory[1]. What problem is being solved by deleting instead of trying to assign it if data is there? [1] hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/InitMetaProcedure.java Lines 75 to 81 in bad2d4e
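For reference, the InitMetaProcedure#writeFsLayout block referred to above is essentially the snippet quoted later in this review: any pre-existing hbase:meta directory is deleted as a presumed leftover of a failed bootstrap before the fresh layout is written (paraphrased here, not copied from the linked commit):

```java
// Paraphrased from the InitMetaProcedure#writeFsLayout snippet quoted later in
// this thread: an existing hbase:meta table directory is treated as a partially
// created meta table and removed before the new FS layout is written.
FileSystem fs = rootDir.getFileSystem(conf);
Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME);
if (fs.exists(tableDir) && !fs.delete(tableDir, true)) {
  LOG.warn("Can not delete partial created meta table, continue...");
}
```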
In this scenario, regardless of what we do, there will be data loss unless the correct WAL directory is (again) specified. In fact, I don't believe you can change the WAL dir without restarting servers (I also don't think it works with rolling restart). I don't think this is a valid scenario for this issue.
No, there will be no silent data loss: the user will notice that no region is online and the cluster is not in a good state, just the same as what you described here. And again, this is not a normal operation in HBase; we do not expect that the WAL directories can be removed without SCP. I wonder why our SCP can even pass without a WAL directory. We should hang there I suppose. Only HBCKSCP can do the dangerous operation.
|
In general, if you touch the internal of HBase directly, it may lead to data loss, unexpected behavior, etc. As I said above, the current design is to compare the WAL directories and the region server znodes on zookeeper to detect dead region servers when master starts up. If you just removed the WAL directories then HBase will have unexpected behaviors. Any addendums to solve the problem here should be considered as dangerous operations, which should only be in HBCK. If you want to solve the problem automatically, you should find another way to detect the dead region servers when master starts up, to make sure we do not rely on the WAL directories. But I'm still a bit nervous that when SCP notices that there is no WAL directory for a dead region server, what should it do. It is not the expected behavior in HBase... |
|
Thanks everyone, let me give a try on adding |
| Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME); | ||
| if (fs.exists(tableDir) && !fs.delete(tableDir, true)) { | ||
| LOG.warn("Can not delete partial created meta table, continue..."); | ||
| if (fs.exists(tableDir)) { |
This is another issue, separate from the unknown RS issue.
In the cluster-recreate cloud case, the zk data is not there and so there is no META region location. So for the HM, this is a cluster bootstrap, and as part of that it inits meta and its FS layout.
HBASE-24471 added this extra code of deleting an existing META table FS path. It is done by considering it a partially created meta table from some previous failed bootstrap cc @Apache9
So this issue is because the zk data is not in the new cluster. The unknown server issue is because of the loss of WALs (to be precise, the master proc wal)
as Zack and I highlighted in the comments above, we found the change came from HBASE-24471 and it's a bit different from the behavior before that change (I agree that change is solving the meta startup/bootstrap problem).
let me send another update and add a feature flag to be able to turn off deleting the partial meta (default is to delete it). Then for the cloud use case where ZK data has been deleted, we can turn it off until we come up with a better way to tell what is partial. (sorry that I don't have a good way to validate whether meta is partial from a previous bootstrap. any suggestion?)
I agree that in the cloud cluster-recreate case, this META FS delete is a bigger concern. At least for the other case of unknown servers, we can make the table disabled at least.. But here I fear no such workaround is even possible. Thanks for pointing it out.
Now taking a step back. All these decision points in HM start regarding bootstrapping or AM work are based on the assumption that HBase is a continuously running cluster. So the cluster comes up first and then data gets persisted in the FS over the run. Now in the cloud there is a very useful feature of deleting a cluster, keeping the data in a blob store, and later, when needed, recreating the cluster pointing to the existing data. Many of the decisions made in HM startup do not hold at that time! (Like the one above, which thinks that if there is no META location in zk, it means the meta table was never online and has no data in it.) So what we need is a way to know whether this cluster start is over existing data (cluster recreate).. All these decisions can be based on that check result (?) Even the unknown server handling also, so that it will happen ONLY in this special cloud case and only once.
…ting a new cluster pointing at the same file system
HBase currently does not handle `Unknown Servers` automatically and requires users to run hbck2 `scheduleRecoveries` when one sees unknown servers on the HBase report UI. This became an issue for HBase2 adoption, especially when a table wasn't disabled before shutting down an HBase cluster in the cloud, where WALs and Zookeeper data are removed and hostnames change frequently in a dynamic environment. Once the cluster restarts, hbase:meta keeps the old hostnames/IPs for the region servers that were running in the last cluster. Those region servers become `Unknown Servers` and regions on those region servers are never reassigned automatically. Our fix here is to trigger a repair immediately after hbase:meta is loaded and online, finding any non-offline regions of enabled tables on `Unknown Servers` so that they can be reassigned to other online servers.
- Also introduce a feature that skips the removal of the hbase:meta table directory when InitMetaProcedure#writeFsLayout runs, especially if the ZNode is fresh but the hbase:meta table exists
| FileSystem fs = rootDir.getFileSystem(conf); | ||
| Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME); | ||
| if (fs.exists(tableDir) && !fs.delete(tableDir, true)) { | ||
| boolean removeMeta = conf.getBoolean(HConstants.REMOVE_META_ON_RESTART, |
// Checking if meta needs initializing.
status.setStatus("Initializing meta table if this is a new deploy");
InitMetaProcedure initMetaProc = null;
// Print out state of hbase:meta on startup; helps debugging.
RegionState rs = this.assignmentManager.getRegionStates().
getRegionState(RegionInfoBuilder.FIRST_META_REGIONINFO);
LOG.info("hbase:meta {}", rs);
if (rs != null && rs.isOffline()) {
Optional<InitMetaProcedure> optProc = procedureExecutor.getProcedures().stream()
.filter(p -> p instanceof InitMetaProcedure).map(o -> (InitMetaProcedure) o).findAny();
initMetaProc = optProc.orElseGet(() -> {
// schedule an init meta procedure if meta has not been deployed yet
InitMetaProcedure temp = new InitMetaProcedure();
procedureExecutor.submitProcedure(temp);
return temp;
});
}
So the checks we do see that the META location is not there in zk, and so it thinks it's a new deploy. So here is what we need to tackle.
In the cloud redeploy case we will see a pattern where we have a clusterId in the FS and not in zk. This can be used as an indicator? IMO we should find out (using this way or another) that it's a redeploy on an existing dataset, and in all these places we need to consider that also to decide whether we need such bootstrap steps.
We should not be doing that with a config IMO. Because then in a cloud-based deploy, what if the first-time start fails and there is a need for this bootstrap cleaning of the META FS dir?
Even the other unknown server case also. Let's identify clearly this redeploy case and act then only.
Can we pls have that discussion and conclude on a solution for that and then move forward?
In cloud redeploy case we will see a pattern where we will have a clusterId in the FS and not in zk. This can be used as an indicator?
I will come back on this later tomorrow, but I agree with you that we should check explicitly how we define partial bootstrap and that partial meta needs some cleanup.
also, do you mean that if the clusterID wasn't written to ZK, it is partial during bootstrap?
also, do you mean that if the clusterID wasn't written to ZK, it is partial during bootstrap?
I am not sure whether that can really be used. I need to check the code. We need a way to identify the fact that it's a cluster redeploy, not use some config to identify that.. The HBase system should be smart enough. So I was just wondering whether we can use this to know that. Maybe not.. Need to see. So my thinking is that we will make the feature of recreating a cluster on top of existing data a 1st class feature for HBase itself.
That would be great, let's find a good way to differentiate this case.
sorry for the late reply; I have reread the code and come up with the following.
First of all, the partial meta in the current logic should mean that a Procedure WAL of InitMetaProcedure did not succeed and INIT_META_ASSIGN_META was not completed. Currently, even if the meta table can be read and a Table Descriptor can be retrieved but not assigned, it is still considered to be partial (correct me if I'm wrong). So, in short, a partial meta table cannot be defined by reading the tableinfo or storefile itself.
Further, a combination of looking at WALs, Procedure WALs and Zookeeper data is the requirement used to define partial meta in the normal cases. But for the cloud use case, or other use cases where one of the requirements is missing, we will need a different discussion. For example:
- partial meta on the HDFS long-running cluster cases
a. if we have WALs and have ZK, it will be able to reassign normally
b. if we have WALs but no ZK, it will not submit a new InitMetaProcedure / enter into any of its states because it found the old InitMetaProcedure in the WAL. Then the old server is not handled by submitting any SCP and the assignment manager does nothing, so the Master hangs and does not finish initialization. (this is a different problem from the cloud case)
c. if we have no WALs but have ZK, state=OPEN remains for hbase:meta when opening an existing meta region, and InitMetaProcedure will not be submitted/entered either (see this section in HMaster). The master will hang and not finish initialization. (this is a different problem from the cloud case)
Therefore, for this PR, if we only focus on the cloud use cases, the unknown servers and partial meta will be much simpler. E.g. when running InitMetaProcedure, the clusterID in zookeeper (suggested by Anoop) can be used to indicate whether it's a partial meta: fresh ZK data implies the Region WALs and the procedure WAL of InitMetaProcedure may not exist. And if the WAL and procedure WAL do exist, it falls into the same failures as case 1b above (out of scope for this PR).
- partial meta on Cloud without WALs and ZK
a. if we're in INIT_META_WRITE_FS_LAYOUT and continue, then ZK should have existed when the master restarts. Otherwise, for the case of having WALs and no ZK, we fall back to case 1b and we don't handle it within this PR.
b. if there is no WAL and no ZK, it submits an InitMetaProcedure and the procedure lands in INIT_META_WRITE_FS_LAYOUT
  - during INIT_META_WRITE_FS_LAYOUT, we check: if ZK data does not exist and there is an existing meta directory, we should trust it and try to open it.
  - we're running this state of INIT_META_WRITE_FS_LAYOUT only when ZK data does not exist or INIT_META_WRITE_FS_LAYOUT didn't finish previously.
So, we're fixing case 2b in this PR, and I have come up with a prototype; unit tests are running off this PR now (TestClusterRestartFailoverSplitWithoutZk is failing even without our changes on branch-2).
The proposed changes are:
- Only perform region reassignment for regions on an unknown server when there are no PE WALs, no Region WALs and no ZK data
- Do not recreate the meta table directory if the restarted procedure of InitMetaProcedure#INIT_META_WRITE_FS_LAYOUT comes with no ZK data (or maybe no WAL as well); a rough sketch of this check follows below.
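A rough sketch of the second bullet, assuming a hypothetical zkClusterIdExists helper, with fs/rootDir/zooKeeper/LOG assumed to be in scope, and following the writeFsLayout shape quoted earlier in this thread (this is the proposal under discussion, not committed code):

```java
// Sketch of the proposed INIT_META_WRITE_FS_LAYOUT behavior; zkClusterIdExists
// is a hypothetical helper and this is not the actual committed code.
Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME);
if (fs.exists(tableDir)) {
  if (!zkClusterIdExists(zooKeeper)) {
    // ZK data is fresh but a meta directory already exists on the FS: this
    // looks like a cluster recreated over existing data, so keep (or sideline)
    // the existing meta directory instead of deleting it.
    LOG.info("Found existing hbase:meta directory with fresh ZK data, keeping it");
  } else if (!fs.delete(tableDir, true)) {
    // Normal bootstrap path: treat it as a partially created meta table from a
    // previous failed bootstrap and remove it.
    LOG.warn("Can not delete partial created meta table, continue...");
  }
}
```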
Sorry for the delay. Need to read through the analysis you put above.. What you mention about the start hanging after a cluster recreate because META is not getting assigned is correct. Can u pls create a sub issue for this case? i.e. knowing whether we are starting HM after a recreate (create cluster over existing data)
In case of HM start and bootstrap, we create the ClusterID and write it to the FS, then to zk, and then create the META table FS layout. So in a cluster recreate, we will see that the clusterID is there in the FS along with the META FS layout, but there is no clusterID in zk. Ya, seems we can use this as an indication of a cluster recreate over existing data. In HM start, this is something we need to check at first itself and track. If this mode is true, later when (if) we do INIT_META_WRITE_FS_LAYOUT, we should not delete the META dir. As part of the bootstrap, when we write that proc to MasterProcWal, we can include this mode (boolean) info also. This is a protobuf message anyway. So even if this HM got killed and restarted (at a point where the clusterId was written to zk but the Meta FS layout part was not reached), we can use the info added as part of the bootstrap wal entry and make sure NOT to delete the meta dir.
Can we do this part alone in a sub task and provide a patch pls? This is a very key part.. That is why it's better we fine tune this with all the diff testcases. Sounds good?
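To make that indicator concrete, a minimal sketch of the detection described above (readClusterIdFromZk is a hypothetical helper, fs/rootDir/zooKeeper are assumed to be in scope, and this is not actual HBase code):

```java
// Sketch only. During bootstrap the clusterId is written to the FS first, then
// to ZK, and only then is the META FS layout created, so "clusterId on the FS
// but not in ZK" suggests we are starting over existing data.
boolean clusterIdOnFs = fs.exists(new Path(rootDir, HConstants.CLUSTER_ID_FILE_NAME));
boolean clusterIdInZk = readClusterIdFromZk(zooKeeper) != null;  // hypothetical helper
boolean clusterRecreateOverExistingData = clusterIdOnFs && !clusterIdInZk;
// Track this mode early in HMaster startup; later, INIT_META_WRITE_FS_LAYOUT
// can consult it (e.g. via a boolean carried in the InitMetaProcedure protobuf
// state) and skip deleting the existing META directory when it is true.
```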
sounds right to me; as you suggested, we will put this PR on hold and depend on the new sub-task. I will try to send another JIRA and PR out in a few days and refer to the conversation we discussed here.
Thanks again Anoop
we should not delete the META dir.
Sorry for harping on an implementation detail: let's sideline meta and not delete please :).
Can we do this part alone in a sub task and a provide a patch pls? This is very key part..
This seems like a very reasonable starting point. Like Anoop points out, if we can be very sure that we will only trigger this case when we are absolutely sure we're in the "cloud recreate" situation, that will bring a lot of confidence.
I will try to send another JIRA and PR out in a few days and refer to the conversation we discussed here.
Lazy Josh: did you get a new Jira created already for this?
@joshelser the new JIRA is HBASE-24833 and the discussion is mainly on the new PR#2237. I may need to send an email to the dev@ list for a broader discussion on whether we should stop depending on the data in zookeeper (that would help us avoid deleting the meta directory)
Am catching up.. but I think Duo's comments here hit on a lot of the worry that I had. How can we be 100% certain that we don't hit this other code-path when we are not in the "cloud recreation" path? Is this better served by HBCK2-type automation which can be run when stuff is being "recreated"? Just thinking out loud. Making a pass through the code changes you have so far, Stephen. Looks like some good reviews by Anoop already :) |
| .filter(s -> !s.isOffline()) | ||
| .filter(s -> isTableEnabled(s.getRegion().getTable())) | ||
| .filter(s -> !regionStates.isRegionInTransition(s.getRegion())) | ||
| .filter(s -> { | ||
| ServerName serverName = regionStates.getRegionServerOfRegion(s.getRegion()); | ||
| if (serverName == null) { | ||
| return false; | ||
| } | ||
| return master.getServerManager().isServerKnownAndOnline(serverName) | ||
| .equals(ServerManager.ServerLiveState.UNKNOWN); |
Collapse these down into one method so we don't end up making 4 iterations over a list of (potentially) a lot of regions.
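A minimal sketch of collapsing the chained filters into a single check; the method name and the RegionStateNode parameter type are assumptions, while the individual conditions mirror the diff above:

```java
// Sketch only: fold the chained filters into one helper so each region state
// is evaluated by a single method, e.g.
//   .filter(s -> isEnabledRegionOnUnknownServer(s, regionStates))
private boolean isEnabledRegionOnUnknownServer(RegionStateNode s, RegionStates regionStates) {
  if (s.isOffline()
      || !isTableEnabled(s.getRegion().getTable())
      || regionStates.isRegionInTransition(s.getRegion())) {
    return false;
  }
  ServerName serverName = regionStates.getRegionServerOfRegion(s.getRegion());
  return serverName != null
      && master.getServerManager().isServerKnownAndOnline(serverName)
          .equals(ServerManager.ServerLiveState.UNKNOWN);
}
```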
|
On the @joshelser good question of 'I was concerned about making sure all "old" RegionServers are actually down before we reassign regions onto new servers', SCP should probably call expire on the ServerName it is passed. It'd be redundant in most cases. I thought it did this already but it does not. Queuing an SCP adds the server to the 'Dead Servers' list (I think -- check) so if it arrives at any time subsequent, it will be told 'YouAreDead..' and it will shut itself down. On the @Apache9 question:
Currently HBCK2 does not have special handling for 'Unknown Servers'. The 'HBCK Report' page that reports 'Unknown Servers' found by a CatalogJanitor run suggests:
So, the 'fix' for 'Unknown Servers' as exercised by myself recently was to parse the 'HBCK Report' page to make a list of all 'Unknown Servers' and then script a call to 'hbck2 scheduleRecoveries' for each one. We should be able to do better than this -- either add handling of 'Unknown Servers' to the set of issues 'fixed' when we run 'hbck2 fixMeta' or as is done here, scheduling an SCP for any 'Unknown Server' found when CatalogJanitor runs. On the latter auto-fix, there is understandable reluctance. I think this comes of 'Unknown Servers' being an ill-defined entity-type; the auto-fix can wait on the concept hardening. I like this comment of @Apache9:
But there should be 'safe' means of attaining your ends @taklwu . Perhaps of help is a little known utility, hbase.master.maintenance_mode config, where you can start the Master in 'maintenance' mode (HBASE-21073): Master comes up, assigns meta but nothing else... it is so you can ask Master to make edits of state/procedures/meta. Perhaps you could script moving cluster to new location, starting Master in new location in maintenance mode, edit meta (a scp that doesn't assign?), then shut it down followed by normal restart. |
|
Mentioning here per Zach's recommendation: I'm trying to see if we can get an answer as to whether we think a default=false configuration option to automatically schedule SCPs when unknown servers are seen, as described in #2114, is acceptable. I agree/acknowledge that other solutions to this also exist (like Stack nicely wrote up), but those would require a bit more automation to implement. I don't want to bulldoze the issue, but this is an open wound for me that keeps getting more salt rubbed into it :)
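For illustration only, such a gate might look roughly like the following; the configuration key, helper names and placement are hypothetical and not part of any committed patch:

```java
// Hypothetical config gate (the key name is invented for illustration): only
// schedule SCPs for unknown servers automatically when the operator opted in.
boolean autoRecoverUnknownServers =
    conf.getBoolean("hbase.master.scp.auto.schedule.unknown.servers", false);  // default=false
if (autoRecoverUnknownServers) {
  for (ServerName unknownServer : findUnknownServers()) {  // hypothetical helper
    // roughly what hbck2 scheduleRecoveries does per server today
    scheduleRecovery(unknownServer);
  }
}
```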
|
I'm closing it; there's not much movement we can make here. If anyone has a better solution or hits the issue again, please link this discussion on it.
…ting a new cluster pointing at the same file system
HBase currently does not handle `Unknown Servers` automatically and requires
users to run hbck2 `scheduleRecoveries` when one sees unknown servers on
the HBase report UI.
This became a blocker on HBase2 adoption, especially when a table wasn't
disabled before shutting down an HBase cluster in the cloud or any dynamic
environment where hostnames may change frequently. Once the cluster restarts,
hbase:meta keeps the old hostnames/IPs for the previous cluster,
and those region servers become `Unknown Servers` and are never recycled.
Our fix here is to trigger a repair immediately after the CatalogJanitor
figures out any `Unknown Servers`, submitting a HBCKServerCrashProcedure
such that regions on an `Unknown Server` can be reassigned to other online
servers.