-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-23181 Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us" #739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@saintstack @binlijin PTAL. This is another way to fix the problem of HBASE-23157. The idea is simple, just remove everything in SequenceIdAccounting when we write a close region marker. Why introducing a new parameter instread of checking the edit when appending is that, the append is single threaded so we'd better do less in it, and also passing a protobuf message back from a byte array can throw IOException, which is a bit annoying as we do not expect IOExceptions for lots of methods. And maybe this could also solve HBASE-23181, but I still think we need to find out the root cause for HBASE-23181 and add a UT for it, maybe there are still other problems. |
|
💔 -1 overall
This message was automatically generated. |
|
The idea is good, it can solve the problem that make regionservers have too many wals, but the master may still have the warn message: Looks like the warn do not make any problem? |
|
The warning does not lead to actual problem but I think we should still avoid it. Maybe another issue? I think we should find a better way to find the proper flushed sequence id for reporting to master. |
|
I thought about doing this but thought it too obnoxious. It also punts on what the actual issue is over in HBASE-23181, on why on close an extra edit can arrive after we've moved aside the Region Map in SequenceIdAccounting. We could apply as a workaround or patch until we figure real issues? @binlijin So that is why we get that message? I didn't think of that. Good one. |
|
But if we use ASYNC_WAL, the patch on HBASE-23181 can not solve the problem, as by design, ASYNC_WAL will not wait for the SyncFuture to complete... So we have to manually remove the records in SequenceIdAccounting, after we make sure that the region has been closed and all data have been flushed out. |
saintstack
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question is if the close will cut off any edits in flight?
| return Bytes.toString(b); | ||
| } | ||
|
|
||
| public String toStringBinary() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just make this a toString rather than add a new method?
| System.currentTimeMillis(), value)); | ||
| long txid = log.append(info, getWalKeyImpl(System.currentTimeMillis(), scopes), edit, true); | ||
| long txid = | ||
| log.append(info, getWalKeyImpl(System.currentTimeMillis(), scopes), edit, true, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense adding an override for append rather than make change everywhere append is called?
| final boolean inMemstore, boolean closeRegion, ServerCall<?> rpcCall) { | ||
| super(key, edit); | ||
| this.inMemstore = inMemstore; | ||
| this.closeRegion = closeRegion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to add a flag? See WALUtil where we have createFlushWALEdit which uses special family and qualifier to denote an edit a FLUSH edit. Could do same here and skip the extra flag?
| .collect(Collectors.joining(",", "{", "}"))); | ||
| } | ||
| } | ||
| this.highestSequenceIds.remove(encodedRegionName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. This is like the purge method over in HBASE-23181.
| * so we can finish an unfinished compaction -- it is not an edit for memstore. | ||
| * @param closeRegion Whether this is a region close marker, i.e, the last wal edit for this | ||
| * region on this region server. The WAL implementation should remove all the related | ||
| * stuff, for example, the sequence id accounting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good.
|
💔 -1 overall
This message was automatically generated. |
|
As proposed on slack in the direct message to @saintstack , let me try to introduce two methods in WAL interfacce, one is for appending data, and another is for appending marker, to avoid too many flags. |
|
@Apache9 Yes, it another problem we need to find the root cause and fix it in another issue. |
|
@saintstack i need to study it and figure out why we get that message. |
|
Hi @binlijin , please follow the steps on this page https://cwiki.apache.org/confluence/display/OPENWHISK/Accessing+Apache+GitHub+as+a+Committer to combine your github account and the asf account? You should be a member of HBase so you can merge pull requests. Now you are listed as a contributor... |
|
@Apache9 OK. @Apache9 @saintstack After debuging i find this message caused by the same problem. |
|
💔 -1 overall
This message was automatically generated. |
|
Thanks for bringing up the " indicates a last flushed sequence id" message on the master. Good one. It is related. At least for the HBASE-23181 case, it makes sense because an entry goes into SequenceIdAccounting with a sequenceid that is one less than the flush sequenceid (somehow it is getting in after the flush the flush). Once it goes in, the Master starts complaining. You added a whole new flushcache method to HRegion in TestMaxFlushedSeqId.patch. How is it different from the original? The test looks good. What do you see as the output? You can make it so a flush gets a sequenceid that is less than the last edit's sequenceid? The last edit does not make it into the flush? Its out of order? Nice work. |
|
@binlijin See above sir. |
|
I think this could be a follow on. It is the same problem, but we need to discuss how to fix it. If we think the current solution, where we remove all the related records when closing a region, is fine, then I think we could add some sequence ids back to the unflushedSequenceIds in completeCacheFlush, if we found that the current sequence id in the unflushedSequenceIds is even smaller then the flushedSequenceId. As we will write a flush marker and wait for it to complete, so after flush we can make sure that there are no pending WAL entries for this region any more, so I think the solution above can work. |
|
Or let's change this PR against HBASE-23181, and use HBASE-23157 for addressing the flushed sequence id problem. |
|
What would be a follow on sir? Diff between 23181 and this is a async wal. The patch on there at moment is bad I realize now |
Like I proposed above, we could do something in the completeCacheFlush to prevent the flushed sequence id go backwards. And this does not lead to data loss, only a warning log on master side which makes people a bit confusing, so I think it is fine to address it in a follow on issue. Issuing a wal sync while holding the writeLock of updateLock is not good, and we have been doing a lot of works to avoid this... I've already changed the title to HBASE-23181, if we all agree that this is fine, let's merge this, and then let's use Lijin Bin's new UT to address the flushed sequence id problem using HBASE-23157. Thanks. |
|
@saintstack |
|
Let's close this one, will open a new PR against HBASE-23181. |
|
@binlijin we should fix that. It does not seem to be the behavior seen in HBASE-23181 though, right? @Apache9 you see the comments above? Ok on continuing on a new PR. |
|
@saintstack Yes, they are not the same behavior, but for the same reason. |
No description provided.