-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-22617 Recovered WAL directories not getting cleaned up #330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
Fixed the failed UTs. In the new patch, will delete the wrong region wal directory after successfully opening a region. And also, in GCRegionProcedure we will delete the region wal directory, and in DeleteTableProcedure.deleteFromFs(which will be called from many places), we will delete the table wal directory. |
|
💔 -1 overall
This message was automatically generated. |
|
Let me check the failed UTs. |
|
TestHbck.testRecoverSplitAfterMetaUpdated is still failing. Will dig more. |
| */ | ||
| public Path getRegionDir(RegionInfo region) { | ||
| return FSUtils.getRegionDir(FSUtils.getTableDir(getRootDir(), region.getTable()), region); | ||
| return FSUtils.getRegionDirFromRootDir(getRootDir(), region); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this? The old impl should be getRegionDirFromTableDir?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same...
You can see the implementation of getRegionDirFromRootDir, it will first generate the tableDir...
| // Cleanup the directories on WAL filesystem also | ||
| Path regionWALDir = FSUtils.getWALRegionDir(env.getMasterConfiguration(), | ||
| getRegion().getTable(), getRegion().getEncodedName()); | ||
| walFs.delete(regionWALDir, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regionWALDir should already be deleted when archive region dir?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They could be on different file system. The table data are on the normal file system and these are on the wal file system. By default they are the same but they could be different, i.e, when deploying HBase on S3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. But for the wrong regionWALDir, how about exist first and then delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 check exists first
|
|
||
| // Delete the directory on wal filesystem | ||
| FileSystem walFs = mfs.getWALFileSystem(); | ||
| Path tableWALDir = FSUtils.getWALTableDir(env.getMasterConfiguration(), tableName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should already be deleted when archive table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They could be on different file system. The table data are on the normal file system and these are on the wal file system. By default they are the same but they could be different, i.e, when deploying HBase on S3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
| if (!createDir(splitdir)) { | ||
| throw new IOException("Failed create of " + splitdir); | ||
| } | ||
| Path daughterATmpDir = getSplitsDir(daughterA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to create the directories for daughter regions here, as it is possible that we do not have any store files to split, which means the directories may not be created when we call commitDaugherRegion, and then we will miss the .regioninfo file and cause TestHbck to fail. Not sure why this does not have problem in the past but it is no harm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
💔 -1 overall
This message was automatically generated. |
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DeleteTableProcedure.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitUtil.java
Outdated
Show resolved
Hide resolved
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
apurtell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm assuming all tests pass.
I may be back. Using this to make a branch-1 patch. Might find a small detail later.
| // Cleanup the directories on WAL filesystem also | ||
| Path regionWALDir = FSUtils.getWALRegionDir(env.getMasterConfiguration(), | ||
| getRegion().getTable(), getRegion().getEncodedName()); | ||
| walFs.delete(regionWALDir, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 check exists first
| } | ||
| } | ||
| for (Path file: files) { | ||
| for (Path file : filesUnderWrongRegionWALDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small code smell. Now that we have three lists, why not create one List, add to the one lists the files we find in any of the three locations we must check, then process the one list all at once. The only difference here is the log messages are slightly different but operators and us won't care if "root" or "wrong". The error or debug log lines all contain the file path, which is enough, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are on different file systems so at least we need two loops...
| return new Path(tabledir, name); | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good
|
After the fix is ready we should also have a unit test that confirms that no files are written outside of where we expect, so we can catch this kind of error in the future. Could be done as part of this work or as a follow up PR and JIRA, but we should have that additional test coverage in place going forward. |
| Path tableWALDir = FSUtils.getWALTableDir(env.getMasterConfiguration(), tableName); | ||
| if (walFs.exists(tableWALDir) && !walFs.delete(tableWALDir, true)) { | ||
| throw new IOException("Couldn't delete table dir on wal filesystem" + tableWALDir); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not be checking for wrong wal directory here too and deleting that if it exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary I think, at least there is no way to delete the namespace directories... So the users hava to use a script to delete the empty directories manually...
| // Cleanup the directories on WAL filesystem also | ||
| Path regionWALDir = FSUtils.getWALRegionDir(env.getMasterConfiguration(), | ||
| getRegion().getTable(), getRegion().getEncodedName()); | ||
| walFs.delete(regionWALDir, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log at DEBUG if delete fails
| walFs.delete(regionWALDir, true); | ||
| Path wrongRegionWALDir = FSUtils.getWrongWALRegionDir(env.getMasterConfiguration(), | ||
| getRegion().getTable(), getRegion().getEncodedName()); | ||
| walFs.delete(wrongRegionWALDir, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log at DEBUG if delete fails
| getRegionInfo().getEncodedName()); | ||
| FileSystem walFs = getWalFileSystem(); | ||
| if (walFs.exists(wrongRegionWALDir)) { | ||
| walFs.delete(wrongRegionWALDir, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log at DEBUG if delete fails
Let's do this in a follow up, as the 'outside of where we expect' is not a clear description, we need to discuss how to implement the test, as well as how to check the result. Let me address the review comments. |
|
💔 -1 overall
This message was automatically generated. |
Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
) Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
| for (Path file: files) { | ||
| if (!walFS.delete(file, false)) { | ||
| LOG.error("Failed delete of " + file); | ||
| for (Path file : filesUnderRootDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, if the region is read-only mode for HbaseTableSnapshotInputFormat, it try to delete the filesUnderRootDir and no access, it will throw exception and failed
) Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]>
) Signed-off-by: Guanghao Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> (cherry picked from commit 2b7a7da) Change-Id: I01752de3d258ac5e68fe0cd735bafadedb7547c3
No description provided.