HDDS-10527. Rewrite key atomically #6385
Conversation
adoroszlai
left a comment
Thanks @sodonnel for the patch.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKey.java
// TODO - if we are effectively cloning a key, probably the ACLs should
// be copied over server side. I am not too sure how this works.
.setAcls(getAclList())
Good point. I think we should omit metadata and ACLs, if possible, and let OM copy them.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneBucket.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/protocol/ClientProtocol.java
…ID is unchanged, and metadata is copied over to the open key
Why not just introduce an object generation field and expose this API (in the future, for applications themselves)?
I think it would be possible to extend to this in the future, but it may need more thought about all the operations it impacts. Implementing this sort of generic approach probably needs to cover key delete and create (overwrite) at least, and perhaps move / rename too. We would need to think through what it would mean for FSO buckets (perhaps nothing different, but I have not given it deep thought). The workings of the proposed solution here are hidden behind the new overwriteKey API on the bucket object, and that is the only external API I have changed. If we were able to get the generation concept added, the overwrite key could be changed to use it and further enhance this sort of feature.
Created https://issues.apache.org/jira/browse/HDDS-10558 for the idea suggested by @kerneltime, as it is worth exploring but probably needs a bit more design work than just this PR.
It would not make sense to have generation and also use objectID. It is essentially the same capability: one uses just the objectID, the other a more developer-friendly generation ID. I would recommend introducing a generation ID here instead of using objectID. It would not make sense to expose the object ID to developers.
ObjectID is already there, as is updateID. They have not been introduced here, and they are already persisted and managed in OM. This PR only publicly exposes the new method on the bucket, with no intention of exposing it further. If we are going to expose GenerationID through the APIs in the way the Google Cloud docs indicate, then we need to decide if that is something we want to start with. E.g. Google Cloud supports it, but AWS does not. It's more test surface, more code to write and expose via all the interfaces, and most importantly more to support going forward. We would also have to worry about forward / backward compatibility if we are adding another ID to be persisted in OM. Old keys will not have a generationID, but perhaps it can be derived from the object / update ID. How does it tie in with the object version? If it is a new ID to be stored, it will add a small storage / memory overhead on OM. I don't believe we should implement something like that without giving it due consideration. What is in this PR could easily be changed to use a generationID if one was introduced later and this feature sees some adoption, as the use of object / updateID is contained and hidden from users of the API. That is why I would suggest exploring GenerationID in the other Jira I raised. I am trying to make the smallest useful change possible to allow for atomic key overwrite, without closing any paths for future improvements. Therefore I would like to move any design around GenerationID into the other Jira and move ahead with this one in its current direction.
The generationID in Google Cloud appears to be like the version in AWS and in Ozone to some extent (I believe versioning is not fully implemented in Ozone). From the docs:
From that, it is not clear whether each generationID is unique across the bucket, or whether two different keys can have the same generationID. In Ozone, if you create a key with objectID=5, and then create a new key of the same name, the objectID, I believe, remains the same, but the update_id is incremented. If you deleted the key (without versioning enabled) and recreated it, the object_ID will change. UpdateID comes from the Ratis transaction ID (if Ratis is enabled), so it is probably unique on its own without the objectID, but I am not sure that can be trusted without Ratis enabled. Also based on comments in WithObjectID.java. Therefore I think that to guarantee an object has not changed, we need both the object and update ID. For now, I don't think we should expose any of this via the S3 or Hadoop compatible filesystems. We could also tag the new bucket.overwriteKey() as experimental for now, giving us scope to remove or change it later if we decide it is not useful. However, I still think it could be adapted to a new approach easily behind the public interface. Basically, I think the whole Ozone version story and version ID, along with object and update ID, needs to be fully worked out before we expose anything more widely than is done in this PR.
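The check described above, comparing the (objectID, updateID) pair captured when the key was read against the key's current pair, can be sketched as follows. This is an illustrative toy, not Ozone's actual classes or API:

```java
// Illustrative sketch, not Ozone code: the optimistic check is a plain
// comparison of the IDs captured at read time against the current IDs.
public class KeyChangeCheck {
  record KeyIds(long objectId, long updateId) { }

  static boolean unchangedSinceRead(KeyIds atRead, KeyIds current) {
    // objectID catches delete + recreate; updateID catches in-place changes
    // such as ACL or metadata updates.
    return atRead.objectId() == current.objectId()
        && atRead.updateId() == current.updateId();
  }

  public static void main(String[] args) {
    KeyIds read = new KeyIds(5, 10);
    if (!unchangedSinceRead(read, new KeyIds(5, 10))   // nothing changed
        || unchangedSinceRead(read, new KeyIds(5, 12)) // updated in place
        || unchangedSinceRead(read, new KeyIds(7, 12))) { // deleted + recreated
      throw new AssertionError("unexpected result");
    }
    System.out.println("ok");
  }
}
```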
Versioning != generation ID. Generation is just a monotonically increasing number without preserving the older generations.
Each object has its own generation field, it is not based on the bucket.
devmadhuu
left a comment
Thanks @sodonnel for working on this quick patch. Overall LGTM +1. However, since the overwrite key write flow is the same as a normal first-time key create, I think it would be good to add some extra information to the audit logs for the overwrite flow to differentiate them.
@devmadhuu Thanks for the review. I added one more commit that adds details to the audits. @kerneltime In Google Cloud, the generationID is used to access versions of an object, so there is surely some overlap with the Ozone version concept and how AWS does things. Objects that are not versioned also have a generationID, so it's not tied to versioning, but it is used in versioning.
Object ID is derived from the Ratis transaction index of the create request. See the method doing the calculation that is called from all create requests. Note the extra space created around each object ID to account for directories that might need to be created with a file create request. Creating a new key with the same name will have a different object ID. Update ID is also set on key create, but is additionally changed when metadata like native ACLs are changed. This is just the Ratis transaction index with no additional modifications. See this method and everywhere it is used for what operations modify the update ID. I'm not sure what you mean by "update ID is incremented". The update ID of the new key will be arbitrarily larger than the old key's, since it is a Ratis transaction index, but its value is not related to the update ID of the previous version of the key.
@errose28 I think that for HSync (HBase related changes), when a key is appended or synced, the update ID should be increased while the objectID stays the same. When I first read your comment, I thought maybe I could cut this down to just checking the objectID, but I think the append / hsync case changes the picture slightly, needing the update ID to be checked too if we want to ensure there are no changes. By "update ID is incremented" I just meant it was changed, not that it is increased by 1. I was aware that both IDs are derived from the Ratis transaction ID.
Actually I'm thinking we can just check the update ID to determine if the key changed. The case where object ID alone would be useful is if only a key's native ACLs were updated and we wanted to disregard this change. However, I think we want this API to preserve native ACLs in the original key. This makes it consistent with Ranger ACLs, which will remain the same as long as the key is rewritten to the same location.
I don't think a key with an open lease being hsync'ed should be eligible for overwrite. I think this overwrite case is already blocked by existing hsync changes on the feature branch but we should double check to make sure we don't have issues later. I believe append will require an OM side update every time it is called, so yes that should increment the update ID. |
A key being hsync'ed is "open but visible" in Ozone and hence should have a lease which blocks other writers. I think the "hsync / hbase" work also allows for a key to be appended, i.e. the key is closed and committed, then a writer reopens it, appends some new data, and commits / closes it again. While the first scenario should be blocked, I need to ensure we do the right thing if an append happens to a closed key that is currently being overwritten, and whether that is even possible! I will ask around about that. I think you could be right that update_id is all we need, as it will change on key append, hsync, and object delete and recreate. It would be much nicer to only need to use one field.
Thanks for working on this @sodonnel. I left some minor comments and have a few ideas on the design I think we can look into:
- Can we just use updateID to check for changes as discussed above?
- Can we avoid persisting the overwrite ID to the open key proto on key create and have the client hold it in memory, supply it on commit, and check it then?
- I think we want native ACLs of the original key preserved when using this API. We should add tests for this.
- Since the PR description states FSO is out of scope for now, let's have the code explicitly enforce this.
- Currently overwrite will fail if the key is renamed or its immediate parent is changed. It will not fail if a directory farther up the file's parent tree is moved or renamed.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneBucket.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKey.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKey.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
// Optimistic locking validation has passed. Now set the overwrite fields to null so they are
// not persisted in the key table.
omKeyInfo.setOverwriteUpdateID(null);
omKeyInfo.setOverwriteObjectID(null);
Do we even need these persisted to the open key table? The client could just hold the update ID in memory from the original read, and pass it as the overwrite update ID on commit for the OM to use only during the in-memory portion of the commit. If the client dies before finishing the overwrite it will lose its client ID and not be able to access the open key to resume writing anyways.
It seems like we can move all new logic to the commit phase without modifying key create requests or the protos that are written to the DB.
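The alternative suggested here can be sketched with a toy model (illustrative names, not Ozone code): the client keeps the update ID it read in memory, and the server checks it only during the commit, so nothing extra is persisted to the open key table.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a commit-time-only check; names are illustrative.
public class CommitTimeCheckSketch {
  static final Map<String, Long> keyTable = new HashMap<>(); // name -> updateID

  // Server-side commit: succeeds only if the key's current update ID still
  // matches the one the client read before it started writing.
  static boolean commitKey(String name, long expectedUpdateId, long newUpdateId) {
    Long current = keyTable.get(name);
    if (current == null || current.longValue() != expectedUpdateId) {
      return false; // key was deleted or changed since the client read it
    }
    keyTable.put(name, newUpdateId);
    return true;
  }

  public static void main(String[] args) {
    keyTable.put("key1", 10L);
    long readUpdateId = keyTable.get("key1"); // client reads the key

    if (!commitKey("key1", readUpdateId, 11L)) {
      throw new AssertionError("commit should succeed when nothing changed");
    }

    keyTable.put("key1", 12L); // another writer changes the key
    if (commitKey("key1", readUpdateId, 13L)) {
      throw new AssertionError("commit should fail after a concurrent change");
    }
    System.out.println("ok");
  }
}
```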
Adding the overwrite IDs to the commit phase is much more difficult, as the commit is performed by various sub-classes on the client, e.g. Ratis vs EC vs any new write class. I did look into adding them only to the commit, but concluded it was much easier to persist them to the open key table.
Therefore I prefer to keep it the way it is now.
You could also argue, why persist any of the key meta data on key open (created date, ACLs, etc). I think persisting the overwriteID in the open key table keeps with the existing pattern.
You could also argue, why persist any of the key meta data on key open (created date, ACLs, etc).
After looking this over a bit I think these are actually bugs that we need to fix:
- Create time: This one is more of a debugging/auditing inconvenience. We set create time in the create phase before the object is visible, and then set modification time in the commit phase when the key is visible. Ideally for a key that has just been created, I would think ctime and mtime would be expected to be the same. However, since both create and commit show up in the audit log, I guess you could argue that the current implementation is correct as well.
- ACLs: This one looks more concerning. I haven't tested this yet, but it looks like the ACLs at the time of create are what are also committed to the final key, without checking if the key being replaced had ACL updates in the meantime. For example:
  - key1 exists with acl1
  - key1' is created at the same path as key1
  - ACLs for key1 are updated to acl2 by another user/admin.
  - key1' is committed with acl1 that was read at create time.

Now the ACLs have gone back in time without the admin or user intending to make this change.
Looking at it from this angle, the existing approach looks like it should be fixed to write metadata like ACLs and create time at the time of commit. Once these bugs are fixed, persisting the overwrite update ID at the time of create does not make much sense in context either.
Any change to the key, including ACLs, should change the key's updateID, as that is what we are relying on to test if the key has changed. That would cause the optimistic commit to fail.
From the client code perspective, I think it's easier to set the ACLs etc. at the time of key open, as otherwise they potentially need to be passed down to various sub-classes. This also adds different testing paths for each type of key (Ratis, EC). That is why I am persisting the overwriteUpdateID in the openKey part. It should be passed as part of open key so we can check the key has not changed between when it was read and when the key write starts. It costs basically nothing to persist it in the open key table and saves some complexity on commit.
If you think there are bugs around the existing ACLs / created time / modification time handling then please raise Jiras for them. We cannot change that as part of this PR, as it would not make sense in this context. A different PR should take care of that.
On the created time / modification time front - I think it could be argued either way. What is the creation time of a file in Linux? You open the file, write a series of bytes over several minutes, and close the file. The ctime is probably the time the file was opened. The mtime is the time the file was last changed - they can easily be different. With Ozone it's not as clear cut, as the key traditionally has not been visible until it is committed. Then the difference between ctime and mtime is really the time it took to write the bytes. After HBase allows uncommitted keys to be visible, the behavior is more like Linux and hence is quite possibly correct as it stands.
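The approach argued for above can be sketched with a toy model (names and table shapes are illustrative, not the real OM structures): the update ID the client read is persisted with the open key entry, and commit validates it and then drops the overwrite field so it never reaches the key table.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of persisting the expected update ID in the open key entry.
public class OpenKeyRewriteSketch {
  record OpenKey(long clientId, Long overwriteUpdateId) { }

  static final Map<String, Long> keyTable = new HashMap<>();     // name -> updateID
  static final Map<String, OpenKey> openKeyTable = new HashMap<>();

  static void openForRewrite(String name, long clientId) {
    // The updateID the client read rides along in the open key entry.
    openKeyTable.put(name, new OpenKey(clientId, keyTable.get(name)));
  }

  static boolean commit(String name, long newUpdateId) {
    OpenKey open = openKeyTable.remove(name);
    Long current = keyTable.get(name);
    if (open == null || open.overwriteUpdateId() == null
        || current == null || !current.equals(open.overwriteUpdateId())) {
      return false; // key changed or deleted since the rewrite began
    }
    // Only the plain key info is written back; the overwrite field is dropped.
    keyTable.put(name, newUpdateId);
    return true;
  }

  public static void main(String[] args) {
    keyTable.put("key1", 5L);
    openForRewrite("key1", 42L);
    keyTable.put("key1", 6L); // concurrent change (e.g. an ACL update)
    if (commit("key1", 7L)) {
      throw new AssertionError("stale rewrite should be rejected");
    }
    openForRewrite("key1", 42L); // client re-reads and retries
    if (!commit("key1", 8L) || keyTable.get("key1") != 8L) {
      throw new AssertionError("retried rewrite should succeed");
    }
    System.out.println("ok");
  }
}
```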
...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyRequest.java
Converted to draft while #6482 is outstanding.
… the client as with create key
Conflicts:
- hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneBucket.java
- hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKey.java
- hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKeyDetails.java
- hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
- hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/OmKeyArgs.java
- hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/OmKeyInfo.java
- hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
- hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/RequestAuditor.java
- hadoop-ozone/s3gateway/src/test/java/org/apache/hadoop/ozone/client/OzoneBucketStub.java
@errose28 @kerneltime I have updated this PR to match what we discussed in the design doc PR. The only things in the design doc that are not included here are:
I also plan to send this PR to a branch - it will not be committed to master. I need to create the branch and then see if we can re-point the PR at the branch.
@errose28 @kerneltime, @adoroszlai has created the branch HDDS-10656-atomic-key-overwrite and pointed this PR at that branch. Please let us know if you are happy with the changes here, or if you have further comments, so we can progress toward committing this onto the branch.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneKey.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
@errose28 @kerneltime @adoroszlai have you got any further comments on this PR?
adoroszlai
left a comment
Thanks @sodonnel for updating the patch.
errose28
left a comment
Thanks for updating this @sodonnel.
For the unit and integration tests, I think we need one static checker in a test util class that takes the old and new KeyInfo objects from a rewrite operation and checks that metadata is either altered or unaltered accordingly. This lets the code document and enforce exactly which fields are and are not supposed to be modified by rewrite. Currently metadata checks are done ad-hoc in unit and integration checks and I think some fields like mtime and key owner are not tested.
Also there are still some lingering references to "overwrite" in the change, but I think "rewrite" is the terminology we are using now. For clarity it would be good to just do a find/replace to make those match.
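The shared checker suggested above might look roughly like this. The class name, field list, and which fields are expected to be preserved or changed are assumptions for illustration, not the project's settled behavior:

```java
import java.util.Map;

// Sketch of a shared rewrite checker for unit and integration tests.
public class RewriteChecker {
  record KeyInfo(long objectId, long updateId, long modificationTime,
                 String owner, Map<String, String> metadata) { }

  static void assertValidRewrite(KeyInfo before, KeyInfo after) {
    // Fields assumed to be preserved by a rewrite of the same key.
    check(before.objectId() == after.objectId(), "objectID must be preserved");
    check(before.owner().equals(after.owner()), "owner must be preserved");
    check(before.metadata().equals(after.metadata()), "metadata must be preserved");
    // Fields expected to change.
    check(before.updateId() != after.updateId(), "updateID must change");
    check(after.modificationTime() >= before.modificationTime(), "mtime must advance");
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    KeyInfo before = new KeyInfo(1, 5, 100, "alice", Map.of("k", "v"));
    KeyInfo after = new KeyInfo(1, 8, 200, "alice", Map.of("k", "v"));
    assertValidRewrite(before, after);
    System.out.println("ok");
  }
}
```

Centralizing the checks this way means each test only has to capture the before and after KeyInfo, and the complete list of preserved fields lives in one place.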
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/OzoneBucket.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/WithObjectID.java
public static final String BUCKET_LAYOUT = "bucketLayout";
public static final String TENANT = "tenant";
public static final String USER_PREFIX = "userPrefix";
public static final String OVERWRITE_GENERATION = "overwriteGeneration";
Should we just call this generation? We may use it for something other than overwrites in the future.
+1 on this, similarly for OmKeyArgs.overwriteGeneration (edit: the OmKeyArgs.overwriteGeneration change can be ignored). Although I haven't looked deeply, I think the "generation" concept can be reused in the future. For example, we can ensure follower reads adhere to "read-your-own-writes" consistency by passing the generation that was just written to the follower, to ensure that the value has been updated there (instead of a linearizable read, which has higher latency since it requires the follower to contact the leader as well).
This constant is only used in the audit log. I think it makes sense to call it overwrite (or probably rewrite) generation to make it distinct from the current key generation. Calling the rewrite generation just "generation" could be confusing if we have a current generation called "generation" and then this atomic rewrite parameter also called "generation".
This was meant as a more general comment on the PR which probably wasn't clear. I think there are some places where rewriteGeneration makes sense, like the audit log here and the KeyArgs that are passed in by the client. Basically the places where the client is instructing the server to use the generation for rewrite purposes. However, when the value is stored or returned like in OmKeyInfo and the KeyInfo proto I think we should just call it generation. At that point the value could technically be used for anything, not just a rewrite.
...one-manager/src/test/java/org/apache/hadoop/ozone/om/request/key/TestOMKeyCommitRequest.java
...ration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestOzoneRpcClientAbstract.java
}

@Test
public void testAtomicRewrite() throws Exception {
Optional suggestion:
The large multi-scenario test methods used in the three request testing classes here would be clearer if each scenario was split into its own test case, like testRewriteWhenKeyDeleted, testRewriteWhenKeyUpdated, testRewritePreservesMetadata etc. These smaller units make it easier to see what is actually being tested, and what scenarios may have been missed.
That said there's technically nothing wrong with the way it is implemented here so you can disregard this if you think there would be code duplication in test setup or other factors that make this suboptimal.
...one-manager/src/test/java/org/apache/hadoop/ozone/om/request/key/TestOMKeyCreateRequest.java
@errose28 I believe I have addressed your further comments. Please have a look and let me know what you think.
Looking good overall, I think we just need to determine when to call it
An object should have a
@kerneltime @errose28 Can you please clarify exactly what you want it to be called in this PR, keeping in mind we don't have metaGeneration at the moment? Is it to be "generation", "dataGeneration", or perhaps "expectedGeneration", which makes sense when making a call that should only succeed when the given expected generation is present? It seems strange to me to have simply "generation", as ideally that would be the current generation of a key, which is currently "updateID", as we are reusing it. If we ever added metaGeneration, I doubt we would call it metaUpdateID, but perhaps we would, to keep consistency.
I have created https://issues.apache.org/jira/browse/HDDS-10843 for the test improvements. |
@kerneltime @errose28 Following up on the naming of the "expectedGeneration" field in the protobuf: can you review my thoughts above so we can agree on what this should be called? This is the only outstanding item in this PR.
I have raised https://issues.apache.org/jira/browse/HDDS-10857 to decide on the naming of the passed generation in the API. Please comment on the Jira so we can decide on the naming. As @adoroszlai has given a +1 and @errose28 voiced approval pending this last decision, I will commit this PR onto the branch tomorrow. All the follow-up items are under the epic and will be taken care of before merging the branch to master.
What changes were proposed in this pull request?
This change introduces the ability to re-create / overwrite a key in Ozone using an optimistic locking technique.
Say there is a desire to replace a key with some additional data added somewhere in the key, or perhaps change its replication type from Ratis to EC. To do this, you can read the current key data, write a new key with the same name, and then on commitKey, the new key version will be visible.
However, there is a possibility that some other client deletes the original key, or re-writes it at the same time, resulting in potential lost updates.
To replace a key in this way, the proposal is to use the existing objectID and updateID on the key to ensure the key has not changed since it was read. The flow would be:
This technique is similar to optimistic locking used in relational databases, to avoid holding a lock on an object for a long period of time.
Notably, there are no additional locks needed on OM and no additional calls or RocksDB reads required to implement this - passing and storing the IDs in the openKey table is all that is required. The overwrite IDs don't need to be stored in the keyTable.
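The flow described above can be reduced to a single-threaded toy simulation (illustrative names, not Ozone's API): read the key's current generation, rewrite the data, and on a conflicting concurrent change re-read and retry rather than hold a lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy simulation of optimistic key rewrite with a retry loop.
public class OptimisticRewriteLoop {
  static final AtomicLong txnIndex = new AtomicLong();      // stand-in transaction counter
  static final Map<String, Long> store = new HashMap<>();   // key name -> generation

  static boolean rewriteIfUnchanged(String name, long expectedGeneration) {
    Long current = store.get(name);
    if (current == null || current.longValue() != expectedGeneration) {
      return false; // conflicting delete or update since the read
    }
    store.put(name, txnIndex.incrementAndGet());
    return true;
  }

  static boolean rewriteWithRetry(String name, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Long generation = store.get(name);
      if (generation == null) {
        return false; // key was deleted; nothing left to rewrite
      }
      // ... the client would rewrite the key data here (e.g. Ratis -> EC) ...
      if (rewriteIfUnchanged(name, generation)) {
        return true;
      }
      // Conflict: another writer changed the key since the read; retry.
    }
    return false;
  }

  public static void main(String[] args) {
    store.put("key1", txnIndex.incrementAndGet());
    if (!rewriteWithRetry("key1", 3)) {
      throw new AssertionError("rewrite should succeed with no contention");
    }
    System.out.println("ok");
  }
}
```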
This change only adds the feature for Object Store buckets for now.
Additionally, there is a question over what to do about metadata and ACLs: should they be copied from the existing key, or passed from the client?
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10527
How was this patch tested?
New integration and unit tests added.