HDDS-10141. [hsync] Support hard limit and auto recovery for hsync file. #6033
jojochuang
left a comment
Mostly makes sense. Just one comment:
} else if (isHsync && openKeyInfo.getModificationTime() <= expiredLeaseTimestamp &&
    !openKeyInfo.getMetadata().containsKey(OzoneConsts.LEASE_RECOVERY)) {
Suggested change:
- } else if (isHsync && openKeyInfo.getModificationTime() <= expiredLeaseTimestamp &&
-     !openKeyInfo.getMetadata().containsKey(OzoneConsts.LEASE_RECOVERY)) {
+ } else if (isHsync && openKeyInfo.getModificationTime() <= expiredLeaseTimestamp) {
What if the client performing lease recovery crashes before committing the final change? The file would become recovery-in-progress forever and can't be closed.
When manual recovery has been started on a file, we skip that file during auto recovery. Such files can be manually recovered at any time. We skip them to avoid data loss, since the exact length may not get updated during auto recovery.
@ChenSammi can you please help to confirm if we can remove this check?
Yes. If the manual recovery client crashes before the final file commit, the file is kept in the openKeyTable. The user can rerun the manual recovery through the CLI to recover the file later.
The reason we leave this file in the openKeyTable is that auto-recovery will inevitably lose some data in the file's last block. So if the user has shown the intent to recover the file manually, we need to preserve that chance for the user.
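To make the behavior discussed in this thread concrete, here is a minimal standalone sketch of the decision for an open key. It is not the code from OpenKeyCleanupService: the class name, the Action enum, and the string value of the metadata key are assumptions for illustration only (the PR refers to OzoneConsts.LEASE_RECOVERY, whose value is not shown here).

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch; not the actual OpenKeyCleanupService logic. */
final class ExpiredHsyncKeySketch {
  // Assumed metadata key; stands in for OzoneConsts.LEASE_RECOVERY.
  static final String LEASE_RECOVERY = "leaseRecovery";

  enum Action {
    NOT_AN_EXPIRED_HSYNC_KEY,   // handled by other branches, out of scope here
    KEEP_FOR_MANUAL_RECOVERY,   // stays in the open key table
    AUTO_COMMIT                 // committed as-is; tail of last block may be lost
  }

  static Action decide(boolean isHsync, long modificationTime,
      long expiredLeaseTimestamp, Map<String, String> metadata) {
    // Only hsync keys whose last modification is at or before the
    // expiry timestamp are candidates for auto recovery.
    if (!isHsync || modificationTime > expiredLeaseTimestamp) {
      return Action.NOT_AN_EXPIRED_HSYNC_KEY;
    }
    // A key already under explicit lease recovery is left alone so the
    // user can rerun the manual recovery later.
    if (metadata.containsKey(LEASE_RECOVERY)) {
      return Action.KEEP_FOR_MANUAL_RECOVERY;
    }
    return Action.AUTO_COMMIT;
  }

  public static void main(String[] args) {
    Map<String, String> recovering =
        Collections.singletonMap(LEASE_RECOVERY, "true");
    System.out.println(decide(true, 100L, 200L, Collections.emptyMap()));
    System.out.println(decide(true, 100L, 200L, recovering));
  }
}
```

The trade-off raised in the review is visible in the second branch: a key whose recovery client crashed is never auto-committed under this policy, and the only way out is another manual recovery.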
<description>
Controls how long an open hsync key is considered as active. Specifically, if a hsync key
has been open longer than the value of this config entry, that open hsync key is considered as
expired (e.g. due to client crash). Unit could be defined with postfix (ns,ms,s,m,h,d)
What happens if an admin misconfigures this to 7ms? A key will almost immediately get committed after a client calls hsync() if OpenKeyCleanupService is triggered at that moment?
The point is, should we have a lower limit on this config?
We can add a check that the hard limit is not less than the soft limit.
For most duration properties in Ozone, when some of them are related, we check whether one is smaller than another. As for misconfiguration in general, it's hard to draw a line.
@ChenSammi I agree that it is generally hard to prevent all misconfigurations.
I was prompted when I saw these ns,ms units in the description. :D
Added a check that the hard limit is not less than the soft limit; if it is, the hard limit is set equal to the soft limit.
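For readers following along, the check described above amounts to clamping the hard limit to the soft limit. A rough sketch, using java.time.Duration and hypothetical names (the actual patch reads the values from the Ozone configuration, and its code is not shown in this conversation):

```java
import java.time.Duration;

/** Hypothetical sketch of the soft/hard lease limit check. */
final class LeaseLimitSketch {

  /** Returns the hard limit, raised to the soft limit if it was smaller. */
  static Duration effectiveHardLimit(Duration softLimit, Duration hardLimit) {
    return hardLimit.compareTo(softLimit) < 0 ? softLimit : hardLimit;
  }

  public static void main(String[] args) {
    // Example values only; the soft limit here is an assumption.
    Duration soft = Duration.ofSeconds(60);
    Duration misconfiguredHard = Duration.ofMillis(7);
    System.out.println(effectiveHardLimit(soft, misconfiguredHard));
    System.out.println(effectiveHardLimit(soft, Duration.ofDays(7)));
  }
}
```

With this clamp, the 7ms misconfiguration raised earlier can no longer expire a key sooner than the soft limit allows, although it does not impose any absolute lower bound.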
The last patch looks good to me.
@jojochuang, @smengcl, would you like to take another look?
What changes were proposed in this pull request?
A new property ozone.om.lease.hard.limit (default 7d) is added for hsync open files.
If a file's lifetime is already beyond the ozone.om.lease.hard.limit threshold, there are two cases:
1. If it has the leaseRecovery flag set, which means it is under an explicit lease recovery, then OpenKeyCleanupService will keep it in OpenFileTable.
2. If it doesn't have the leaseRecovery flag set, then OpenKeyCleanupService will auto commit it with the KeyInfo in openFileTable. In this case, the final block might lose some hsynced data at the end of the block.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10141
How was this patch tested?
New and existing unit tests.