From 3726dc8aa955704df80c66cfa90ec8244d0a3ed9 Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Fri, 5 Apr 2024 18:11:55 +0100
Subject: [PATCH 01/13] HDDS-10657. Design Doc for overwriting a key if it has not changed

---
 .../design/overwrite-key-only-if-unchanged.md | 125 ++++++++++++++++++
 1 file changed, 125 insertions(+)
 create mode 100644 hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
new file mode 100644
index 000000000000..faf0c54b1565
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -0,0 +1,125 @@
+Ozone offers write semantics where the last writer to commit a key wins. Therefore multiple writers can concurrently write the same key, and whichever commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being replaced.
+
+For any key, but especially a large key, it can take significant time to read and write it. There are scenarios where it would be desirable to replace a key in Ozone, but only if the key has not changed since it was read. In the absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and an updateID, which are stored in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the updateID is changed. It comes from the Ratis transactionID and is generally an increasing number.
+
+When an existing key is overwritten, its existing metadata, including the objectID and ACLs, is mirrored onto the new key version. The only metadata which is replaced is any custom metadata stored on the key by the user. Upon commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a three-step process:
+
+1. 
The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update counter similar to the updateID for a key in Ozone. The data record can be read and displayed on a UI to be edited, and then written back to the database. However, another user could have made an edit to the same record in the meantime, and if the record is written back without any checks, those edits could be lost.
+
+To combat this, "optimistic locking" is used. With optimistic locking, no locks are actually involved. The client reads the data along with the update counter. When it attempts to write the data back, it validates the record has not changed by including the update counter in the update statement, e.g.:
+
+```
+update customerDetails
+set customerName = :b0,
+    updateCounter = updateCounter + 1
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended to include the existing updateID as it is currently not passed to the client.
+2. The client opens a new key for writing with the same key name as the original, passing the previously read updateID in a new field. Call this new field overwriteExpectedUpdateID.
+3. On OM, it receives the openKey request as usual and detects the presence of the overwriteExpectedUpdateID.
+4. On OM, it first ensures that a key is present with the given key name and an updateID == overwriteExpectedUpdateID. If so, it opens the key and stores the details, including the overwriteExpectedUpdateID, in the openKeyTable. 
As things stand, the other existing key metadata copied from the original key is stored in the openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the overwriteExpectedUpdateID again, as the open key contains it.
+7. On OM, on commit key, it validates that the key still exists with the given key name and its updateID is unchanged. If so, the key is committed, otherwise an error is returned to the client.
+
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The overwriteExpectedUpdateID needs to be added to the KeyInfo protobuf object so it can be stored in the openKey table.
+2. The overwriteExpectedUpdateID needs to be added to the keyArgs protobuf object, which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers will receive the new overwriteExpectedUpdateID and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are existing locks taken to ensure the key open / commit is atomic. The new checks are performed under those locks, and come down to a couple of long comparisons, so they add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID of an existing key to be accessible when the existing key details are read, by adding it to OzoneKey and OzoneKeyDetails. These are internal object changes and do not impact any APIs.
+ 2. 
To pass the overwriteExpectedUpdateID to OM on key open, it would be possible to overload the existing OzoneBucket.createKey() method, which already has several overloaded versions, or create a new explicit method on OzoneBucket called replaceKeyIfUnchanged, passing either the OzoneKeyDetails of the existing key (which includes the key name and existing updateID), or the key name and updateID explicitly, e.g.:
+
+ ```
+ public OzoneOutputStream replaceKeyIfUnchanged(OzoneKeyDetails keyToOverwrite, ReplicationConfig replicationConfigOfNewKey)
+     throws IOException
+
+// Alternatively or additionally
+
+ public OzoneOutputStream replaceKeyIfUnchanged(String volumeName, String bucketName, String keyName, long size, long expectedUpdateID, ReplicationConfig replicationConfigOfNewKey)
+     throws IOException
+ ```
+This specification is roughly in line with the existing createKey method:
+
+```
+ public OzoneOutputStream createKey(
+     String volumeName, String bucketName, String keyName, long size,
+     ReplicationConfig replicationConfig,
+     Map<String, String> metadata)
+```
+
+An alternative is to create a new overloaded createKey:
+
+```
+ public OzoneOutputStream createKey(
+     String volumeName, String bucketName, String keyName, long size,
+     ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+Note the omission of the metadata map, as the intention of this API is to copy that from what already exists on the server.
+
+The intended usage of this API is that the existing key details are read, then used to open the new key, and then the data is written, e.g.:
+
+```
+OzoneKeyDetails existingKey = bucket.getKey(keyName);
+try (OzoneInputStream is = bucket.readKey(keyName);
+     OzoneOutputStream os = bucket.replaceKeyIfUnchanged(existingKey, newRepConfig)) {
+  IOUtils.copy(is, os);
+}
+```
+
+## Other Storage Systems
+
+Amazon S3 does not offer a facility like this.
+
+Google Cloud has a concept of a generationID which is used in various [API calls](https://cloud.google.com/storage/docs/json_api/v1/objects/update). 
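The generation-match pattern these systems share can be sketched in miniature. The following is an illustrative toy (an in-memory store, not the Ozone or GCS API; all names are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory store illustrating a generation-match conditional replace.
class GenerationStore {
  static final class Entry {
    final byte[] data;
    final long generation;
    Entry(byte[] data, long generation) { this.data = data; this.generation = generation; }
  }

  private final Map<String, Entry> keys = new HashMap<>();
  private long nextGeneration = 1;

  void put(String key, byte[] data) {
    keys.put(key, new Entry(data, nextGeneration++));
  }

  Entry get(String key) {
    return keys.get(key);
  }

  // Replace the key only if its generation still matches what the caller read.
  boolean replaceIfGenerationMatches(String key, byte[] data, long expectedGeneration) {
    Entry current = keys.get(key);
    if (current == null || current.generation != expectedGeneration) {
      return false; // key was deleted or changed since it was read
    }
    keys.put(key, new Entry(data, nextGeneration++));
    return true;
  }
}
```

A stale generation simply makes the replace a no-op the caller can detect, which is the behaviour this proposal needs from Ozone.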
## Further Ideas
+
+The intention of this initial design is to make as few changes to Ozone as possible to enable overwriting a key if it has not changed.
+
+It would be possible to have separate updateIDs for metadata changes and data changes to give a more fine-grained approach.
+
+It would also be possible to expose these IDs over the S3 interface as well as the Java interface.
+
+However, both these options would require more changes to Ozone and more API surface to test and support.
+
+The changes suggested here are small, and carry little risk to existing operations if the new field is not passed. They also do not rule out extending the idea to cover a separate metadata updateID if such a thing is desired by enough users.

From 6991baf13cb7624888fc4574c265ef62d23bc07e Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Fri, 5 Apr 2024 18:21:09 +0100
Subject: [PATCH 02/13] Added header

---
 .../design/overwrite-key-only-if-unchanged.md | 24 +++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index faf0c54b1565..aaa6ea2fa8e1 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -1,3 +1,27 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if it has not changed since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+
+
 Ozone offers write semantics where the last writer to commit a key wins. Therefore multiple writers can concurrently write the same key, and whichever commits last will effectively overwrite all data that came before it.
 
 As an extension of this, there is no "locking" on a key which is being replaced. 
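This last-writer-wins behaviour can be shown with a trivial sketch (illustrative only, not Ozone code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: commits carry no version check, so the last commit silently wins.
class LastWriterWins {
  static final Map<String, String> bucket = new HashMap<>();

  static void commitKey(String key, String data) {
    bucket.put(key, data); // no lock and no generation check
  }
}
```

Two writers committing the same key leave only the later writer's data visible, with no error reported to the earlier one.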
From ccfd868019991fdd6b02b7d20724e489134a6b88 Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Tue, 9 Apr 2024 12:01:56 +0100
Subject: [PATCH 03/13] State updateID is changed by a key rewrite

---
 .../docs/content/design/overwrite-key-only-if-unchanged.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index aaa6ea2fa8e1..450b284835d3 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -69,6 +69,8 @@ To do this:
 6. On commit key, the client does not need to send the overwriteExpectedUpdateID again, as the open key contains it.
 7. On OM, on commit key, it validates the key still exists with the given key name and its updateID is unchanged. If so, the key is committed, otherwise an error is returned to the client.
 
+Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. Note this also offers protection against concurrent rewrites.
+
 ## Changes Required
 
 In order to enable the above steps on Ozone, several small changes are needed. 
From 5ec73a44124210c2925fc7c4226f4b1f5c1dc95e Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Tue, 9 Apr 2024 12:11:08 +0100
Subject: [PATCH 04/13] Highlight problem with existing metadata loss when writing a new key version

---
 .../docs/content/design/overwrite-key-only-if-unchanged.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index 450b284835d3..3daeae1ca6bc 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -42,6 +42,13 @@ Writing a key in Ozone is a three-step process:
 2. The client writes data to the data nodes
 3. The client commits the key to OM via a Commit Key call.
 
+Note that, as things stand, it is possible to lose metadata updates (e.g. ACL changes) when a key is overwritten:
+
+1. If the key exists, then a new copy of the key is opened for writing.
+2. While the new copy is open, another process updates the ACLs for the key.
+3. On commit, the new ACLs are not copied to the new key, as the new key made a copy of the existing metadata at the time the key was opened.
+
+With the technique described in the next section, that problem is removed in this design, as the ACL update will change the updateID, and the key will not be committed. 
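The race above, and how the updateID check closes it, can be sketched as a toy model (illustrative only, not OM code; field names are invented):

```java
// Toy model of a single key's OM metadata: any change bumps the updateID,
// so a rewrite opened against an older updateID is rejected at commit time.
class KeyRecord {
  long updateID = 1;
  String acl = "user1:rw";

  void setAcl(String newAcl) {
    acl = newAcl;
    updateID++; // existing behaviour: metadata changes bump the updateID
  }

  // Commit the rewrite only if the key is unchanged since it was read.
  boolean commitRewrite(long expectedUpdateID) {
    if (updateID != expectedUpdateID) {
      return false; // the concurrent ACL update is detected; nothing is lost
    }
    updateID++; // a committed rewrite also bumps the updateID
    return true;
  }
}
```

Here the rewrite that would have clobbered the new ACLs fails instead, and the client can re-read the key and retry.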
## Atomic Key Replacement

From 1123159a2b16e24b588084752178874a2b7477e6 Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Tue, 9 Apr 2024 12:40:20 +0100
Subject: [PATCH 05/13] Added details of alternative proposal

---
 .../design/overwrite-key-only-if-unchanged.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index 3daeae1ca6bc..8aec6c561e4c 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -78,6 +78,22 @@ To do this:
 
 Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. Note this also offers protection against concurrent rewrites.
 
+### Alternative Proposal
+
+1. Pass the expected updateID to the rewrite API, which passes it down to the relevant key stream, effectively saving it on the client.
+2. The client attaches the update ID to the commit request to indicate a rewrite instead of a put.
+3. OM checks the update ID if present and returns the corresponding success/fail result.
+
+The advantage of this alternative approach is that it does not require the overwriteExpectedUpdateID to be stored in the openKey table.
+
+However, the client code required to implement this appears more complex, because the key commit logic differs between Ratis and EC and the parameter would need to be passed through many method calls.
+
+The existing implementation for key creation stores various attributes (metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so storing the overwriteExpectedUpdateID keeps with that convention, which is less confusing for future developers.
+
+In terms of forward / backward compatibility, both solutions are equivalent. Only a new parameter is required within the KeyArgs passed to the create and commit key calls. 
+ +If an upgraded server is rolled back, it will still be able to deal with an openKey entry containing overWriteUpdateID, but it will not process it atomically. + ## Changes Required In order to enable the above steps on Ozone, several small changes are needed. From a974d0bb664304dd1c936c1b3686428ad0b36d2f Mon Sep 17 00:00:00 2001 From: S O'Donnell Date: Tue, 9 Apr 2024 12:49:39 +0100 Subject: [PATCH 06/13] Added upgrade / downgrade section --- .../design/overwrite-key-only-if-unchanged.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md index 8aec6c561e4c..d333df66a159 100644 --- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md +++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md @@ -94,6 +94,10 @@ In terms of forward / backward compatibility both solutions are equivalent. Only If an upgraded server is rolled back, it will still be able to deal with an openKey entry containing overWriteUpdateID, but it will not process it atomically. +### Scope + +The intention is to first implement this for OBS buckets. Then address FSO buckets. + ## Changes Required In order to enable the above steps on Ozone, several small changes are needed. @@ -155,6 +159,18 @@ try (OutputStream os = bucket.replaceKeyIfUnchanged(existingKey, newRepConfig) { } ``` +## Upgrade and Compatibility + +If a newer client is talking to an older server, it could call the new atomic API but the server will ignore it without error. This is the case for any API change. + +There are no changes to protobuf methods. + +A single extra field is added to the KeyArgs object, which is passed from the client to OM on key open and commit. This is a new field, so it will be null if not set, and the server will ignore it if it does not expect it. 
+ +A single extra field is added to the OMKeyInfo object which is stored in the openKey table. This is a new field, so it will be null if not set, and the server will ignore it if it does not expect it. + +There should be not impact on upgrade / downgrade with the new field added in this way. + ## Other Storage Systems Amazon S3 does not offer a facility like this. From 1504668beda0b99553a22ac2fbd52bbcc549e645 Mon Sep 17 00:00:00 2001 From: S O'Donnell Date: Tue, 9 Apr 2024 12:52:40 +0100 Subject: [PATCH 07/13] Add note about metadata map to the APIs --- .../docs/content/design/overwrite-key-only-if-unchanged.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md index d333df66a159..49df8fccda92 100644 --- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md +++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md @@ -128,6 +128,9 @@ No new locks are needed on OM. As part of the openKey and commitKey, there are e public OzoneOutputStream replaceKeyIfUnchanged(String volumeName, String bucketName, String keyName, long size, long expectedUpdateID, ReplicationConfig replicationConfigOfNewKey) throws IOException + +// Can also add an overloaded version of these methods to pass a metadata map, as with the existing +// create key method. ``` @@ -148,8 +151,6 @@ An alternative, is to create a new overloaded createKey: ReplicationConfig replicationConfig, long expectedUpdateID) ``` -Note the omission of the metaData map, as the intention of this API is to copy that from what already exisits on the server. 
- The intended usage of this API, is that the existing key details are read, then used to open the new key, and then data is written, eg: ``` From bed659e1a4cd2f620b4c8dd4a77f648473599f13 Mon Sep 17 00:00:00 2001 From: S O'Donnell Date: Tue, 23 Apr 2024 13:42:43 +0100 Subject: [PATCH 08/13] Exclude multi-part keys --- .../docs/content/design/overwrite-key-only-if-unchanged.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md index 49df8fccda92..308b34d5de81 100644 --- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md +++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md @@ -98,6 +98,8 @@ If an upgraded server is rolled back, it will still be able to deal with an open The intention is to first implement this for OBS buckets. Then address FSO buckets. +Multi-part keys need more investigation and hence are also excluded in the initial version. + ## Changes Required In order to enable the above steps on Ozone, several small changes are needed. From d64e95c47201fc4cbd8c9a87487bcc2e4ab2dd96 Mon Sep 17 00:00:00 2001 From: S O'Donnell Date: Tue, 23 Apr 2024 13:58:03 +0100 Subject: [PATCH 09/13] Rename updateID to generation --- .../design/overwrite-key-only-if-unchanged.md | 51 +++++++++---------- 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md index 308b34d5de81..196ef05ee5e1 100644 --- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md +++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md @@ -68,31 +68,31 @@ In Ozone the same concept can be used to perform an atomic update of a key only To do this: -1. The client reads the key details as usual. 
The key details can be extended to include the existing updateID as it is currently not passed to the client. This field already exists, but when exposed to the client it will be referred to as the key generation.
2. The client opens a new key for writing with the same key name as the original, passing the previously read generation in a new field. Call this new field expectedGeneration.
3. On OM, it receives the openKey request as usual and detects the presence of the expectedGeneration field.
4. On OM, it first ensures that a key is present with the given key name and an updateID == expectedGeneration. If so, it opens the key and stores the details, including the expectedGeneration, in the openKeyTable. As things stand, the other existing key metadata copied from the original key is stored in the openKeyTable too.
5. The client continues to write the data as usual.
6. 
On commit key, the client does not need to send the expectedGeneration again, as the open key contains it.
7. On OM, on commit key, it validates that the key still exists with the given key name and its stored updateID is unchanged when compared with the expectedGeneration. If so, the key is committed, otherwise an error is returned to the client.

Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. Note this also offers protection against concurrent rewrites.

### Alternative Proposal

1. Pass the expectedGeneration to the rewrite API, which passes it down to the relevant key stream, effectively saving it on the client.
2. The client attaches the expectedGeneration to the commit request to indicate a rewrite instead of a put.
3. OM checks the passed generation against the stored update ID and returns the corresponding success/fail result.

The advantage of this alternative approach is that it does not require the expectedGeneration to be stored in the openKey table.

However, the client code required to implement this appears more complex, because the key commit logic differs between Ratis and EC and the parameter would need to be passed through many method calls. 
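The chosen flow — validating the expectedGeneration at open, persisting it in the openKey table, and re-checking it at commit — can be sketched as a toy model (not OM code; all names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the chosen flow: the expectedGeneration is validated at open,
// stored in the open-key table, and re-checked at commit.
class ToyOm {
  private final Map<String, Long> keyTable = new HashMap<>();   // key name -> updateID
  private final Map<Long, Long> openKeyTable = new HashMap<>(); // open id -> expectedGeneration
  private long nextTxn = 1;
  private long nextOpenId = 1;

  long createKey(String key) {
    long generation = nextTxn++;
    keyTable.put(key, generation);
    return generation;
  }

  long generation(String key) {
    return keyTable.get(key);
  }

  // Open for rewrite: fail fast if the key is missing or has already changed.
  long openRewrite(String key, long expectedGeneration) {
    Long current = keyTable.get(key);
    if (current == null || current != expectedGeneration) {
      throw new IllegalStateException("key missing or generation mismatch");
    }
    long openId = nextOpenId++;
    openKeyTable.put(openId, expectedGeneration); // persisted with the open key
    return openId;
  }

  // Commit: the client sends nothing extra; the stored value is re-checked.
  boolean commitRewrite(String key, long openId) {
    long expected = openKeyTable.remove(openId);
    Long current = keyTable.get(key);
    if (current == null || current != expected) {
      return false; // the key changed while the rewrite was in flight
    }
    keyTable.put(key, nextTxn++); // a successful commit bumps the updateID
    return true;
  }
}
```

Because the expectedGeneration lives server-side with the open key, the commit request itself needs no extra parameter — the trade-off discussed above.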
-The existing implementation for key creation stores various attributes (metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so storing the overwriteExpectedUpdateID keeps with that convention, which is less confusing for future developers. +The existing implementation for key creation stores various attributes (metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so storing the expectedGeneration keeps with that convention, which is less confusing for future developers. In terms of forward / backward compatibility both solutions are equivalent. Only a new parameter is required within the KeyArgs passed to create and commit Key. -If an upgraded server is rolled back, it will still be able to deal with an openKey entry containing overWriteUpdateID, but it will not process it atomically. +If an upgraded server is rolled back, it will still be able to deal with an openKey entry containing expectedGeneration, but it will not process it atomically. ### Scope @@ -106,29 +106,25 @@ In order to enable the above steps on Ozone, several small changes are needed. ### Wire Protocol -1. The overwriteExpectedUpdateID needs to be added to the KeyInfo protobuf object so it can be stored in the openKey table. -2. The overwriteExpectedUpdateID needs to be added to the keyArgs protobuf object, which is passed from the client to OM when creating a key. +1. The expectedGeneration needs to be added to the KeyInfo protobuf object so it can be stored in the openKey table. +2. The expectedGeneration needs to be added to the keyArgs protobuf object, which is passed from the client to OM when creating a key. No new messages need to be defined. ### On OM -No new OM handlers are needed. The existing OpenKey and CommitKey handlers will receive the new overwriteExpectedUpdateID and perform the checked. +No new OM handlers are needed. The existing OpenKey and CommitKey handlers will receive the new expectedGeneration and perform the checks. 
No new locks are needed on OM. As part of the openKey and commitKey, there are existing locks taken to ensure the key open / commit is atomic. The new checks are performed under those locks, and come down to a couple of long comparisons, so they add negligible overhead.

### On The Client

 1. We need to allow the updateID (called generation on the client) of an existing key to be accessible when the existing key details are read, by adding it to OzoneKey and OzoneKeyDetails. These are internal object changes and do not impact any APIs.
 2. 
To pass the expectedGeneration to OM on key open, it would be possible to overload the existing OzoneBucket.createKey() method, which already has several overloaded versions, or create a new explicit method on OzoneBucket called rewriteKey, passing the expectedGeneration, e.g.:

 ```
 public OzoneOutputStream rewriteKey(String volumeName, String bucketName, String keyName, long size, long expectedGeneration, ReplicationConfig replicationConfigOfNewKey)
     throws IOException

// Can also add an overloaded version of these methods to pass a metadata map, as with the existing
// create key method.
 ```

This specification is roughly in line with the existing createKey method:

```
 public OzoneOutputStream createKey(
     String volumeName, String bucketName, String keyName, long size,
     ReplicationConfig replicationConfig,
     Map<String, String> metadata)
```

An alternative is to create a new overloaded createKey, but it is probably less confusing to have the new rewriteKey method:

```
 public OzoneOutputStream createKey(
     String volumeName, String bucketName, String keyName, long size,
     ReplicationConfig replicationConfig, long expectedGeneration)
```

The intended usage of this API is that the existing key details are read, then used to open the new key, and then the data is written, e.g.:

```
OzoneKeyDetails existingKey = bucket.getKey(keyName);
try (OzoneInputStream is = bucket.readKey(keyName);
     OzoneOutputStream os = bucket.rewriteKey(existingKey.getVolumeName(), existingKey.getBucketName(),
         existingKey.getName(), existingKey.getDataSize(), existingKey.getGeneration(), newRepConfig)) {
  IOUtils.copy(is, os);
}
```

@@ -184,10 +181,10 @@ Google Cloud has a concept of a generationID which is used in various [API calls

The intention of this initial design is to make as few changes to Ozone as possible to enable overwriting a key if it has not changed. 
-It would be possible to have separate UpdateIDs for metadata changes and data changes to give a more fine grained approach. +It would be possible to have separate generation IDs for metadata changes and data changes to give a more fine grained approach. It would also be possible to expose these IDs over the S3 interface as well as the Java interface. However both these options required more changes to Ozone and more API surface to test and support. -The changes suggested here are small, and carry little risk to existing operations if the new field is not passed. They also do not rule out extending the idea to cover a separate metadata UpdateID if such a thing is desired by enough users. +The changes suggested here are small, and carry little risk to existing operations if the new field is not passed. They also do not rule out extending the idea to cover a separate metadata generation if such a thing is desired by enough users. From 44121d453fd9174503d3d2010f8c543c08c40660 Mon Sep 17 00:00:00 2001 From: S O'Donnell Date: Tue, 23 Apr 2024 13:59:41 +0100 Subject: [PATCH 10/13] Added a note about scope on FSO buckets --- .../docs/content/design/overwrite-key-only-if-unchanged.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md index 196ef05ee5e1..b2fbffc5c31f 100644 --- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md +++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md @@ -96,7 +96,7 @@ If an upgraded server is rolled back, it will still be able to deal with an open ### Scope -The intention is to first implement this for OBS buckets. Then address FSO buckets. +The intention is to first implement this for OBS buckets. Then address FSO buckets. FSO bucket handling will reuse the same fields, but the handlers on OM are different. 
We also need to decide on what should happen if a key is renamed or moved between folders during the rewrite.

Multi-part keys need more investigation and hence are also excluded in the initial version.

From 3f8ad5a4bfc56d222c95423d3ac4b79892c2c93c Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Wed, 24 Apr 2024 15:30:48 +0100
Subject: [PATCH 11/13] Update compatibility section

---
 .../content/design/overwrite-key-only-if-unchanged.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index b2fbffc5c31f..8896bc81e769 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -161,15 +161,19 @@ try (OutputStream os = bucket.rewriteKey(existingKey.getBucket, existingKey.getV

## Upgrade and Compatibility

### Client Server Protocol

If a newer client is talking to an older server, it could call the new atomic API, but the server would ignore it without error. The client-server versioning framework can be used to avoid this problem.

No new protobuf messages are needed, and hence no new client-to-OM APIs, as the existing APIs are used with an additional parameter.

A single extra field is added to the KeyArgs object, which is passed from the client to OM on key open and commit. This is a new field, so it will be unset if not provided, and the server will ignore it if it does not expect it.

### Disk Layout

A single extra field is added to the OMKeyInfo object which is stored in the openKey table. This is a new field, so it will be unset if not provided, and the server will ignore it if it does not expect it. 
-There should be not impact on upgrade / downgrade with the new field added in this way.
+There should be no impact on upgrade / downgrade with the new field added in this way.

 ## Other Storage Systems

From 85a516328e09944f10f0623f990f4058f06d4753 Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Thu, 25 Apr 2024 12:47:44 +0100
Subject: [PATCH 12/13] Be more explicit that rewriting may only be required after inspecting the original key data

---
 .../design/overwrite-key-only-if-unchanged.md | 27 ++++++++++---------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index 8896bc81e769..eec6b7a11528 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -69,12 +69,13 @@ In Ozone the same concept can be used to perform an atomic update of a key only

 To do this:

 1. The client reads the key details as usual. The key details can be extended to include the existing updateID as it is currently not passed to the client. This field already exists, but when exposed to the client it will be referred to as the key generation.
-2. The client opens a new key for writing with the same key name as the original, passing the previously read generation in a new field. Call this new field expectedGeneration.
-3. On OM, it receives the openKey request as usual and detects the presence of the expectedGeneration field.
-4. On OM, it first ensures that a key is present with the given key name and having a updateID == expectedGeneration. If so, it opens the key and stored the details including the expectedGeneration in the openKeyTable. As things stand, the other existing key metadata copied from the original key is stored in the openKeyTable too.
-5. The client continues to write the data as usual.
-6. On commit key, the client does not need to send the expectedGeneration again, as the open key contains it.
-7. On OM, on commit key, it validates the key still exists with the given key name and its stored updateID is unchanged when compared with the expectedGeneration. If so the key is committed, otherwise an error is returned to the client.
+1. The client can inspect the read key details and decide if it wants to replace the key.
+1. The client opens a new key for writing with the same key name as the original, passing the previously read generation in a new field. Call this new field expectedGeneration.
+1. OM receives the openKey request as usual and detects the presence of the expectedGeneration field.
+1. OM first ensures that a key is present with the given key name and with an updateID == expectedGeneration. If so, it opens the key and stores the details, including the expectedGeneration, in the openKeyTable. As things stand, the other existing key metadata copied from the original key is stored in the openKeyTable too.
+1. The client continues to write the data as usual. This can be the same data in a different format (e.g. a Ratis to EC conversion), or new data, depending on the application's needs.
+1. On commit key, the client does not need to send the expectedGeneration again, as the open key contains it.
+1. On commit key, OM validates that the key still exists with the given key name and that its stored updateID is unchanged when compared with the expectedGeneration. If so, the key is committed; otherwise an error is returned to the client.

 Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. Note this also offers protection against concurrent rewrites.
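The steps in the hunk above amount to a compare-and-set on the key's generation at commit time. The following is a minimal sketch of that logic only — every class and method name here (OmSketch, KeyRecord, commitIfUnchanged) is invented for illustration and is not Ozone's real OM code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the generation check described in the steps above.
class OmSketch {

  static final class KeyRecord {
    final String data;
    final long updateID; // exposed to clients as the key "generation"

    KeyRecord(String data, long updateID) {
      this.data = data;
      this.updateID = updateID;
    }
  }

  private final Map<String, KeyRecord> keyTable = new HashMap<>();
  private long nextTxnID = 1; // stands in for the Ratis transaction ID

  // Unconditional commit: last writer wins, as Ozone behaves today.
  void commit(String keyName, String data) {
    keyTable.put(keyName, new KeyRecord(data, nextTxnID++));
  }

  KeyRecord read(String keyName) {
    return keyTable.get(keyName);
  }

  // Conditional commit: succeeds only if the key still exists and its
  // updateID matches the expectedGeneration the client read earlier.
  boolean commitIfUnchanged(String keyName, String data, long expectedGeneration) {
    KeyRecord current = keyTable.get(keyName);
    if (current == null || current.updateID != expectedGeneration) {
      return false; // OM would return an error to the client here
    }
    keyTable.put(keyName, new KeyRecord(data, nextTxnID++));
    return true;
  }
}
```

If a second writer commits between the client's read and its conditional commit, the stored updateID no longer matches the expectedGeneration and the conditional commit is rejected, mirroring the final step above.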
@@ -88,6 +89,8 @@ The advantage of this alternative approach is that it does not require the expec

 However the client code required to implement this appears more complex due to having different key commit logic for Ratis and EC and the parameter needing to be passed through many method calls.

+PR [#5524](https://github.com/apache/ozone/pull/5524) illustrates this approach for the atomicKeyCreation feature which was added to S3.
+
 The existing implementation for key creation stores various attributes (metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so storing the expectedGeneration is in keeping with that convention, which is less confusing for future developers.

 In terms of forward / backward compatibility both solutions are equivalent. Only a new parameter is required within the KeyArgs passed to create and commit Key.

@@ -123,14 +126,11 @@ No new locks are needed on OM. As part of the openKey and commitKey, there are e

 2. To pass the expectedGeneration to OM on key open, it would be possible to overload the existing OzoneBucket.createKey() method, which already has several overloaded versions, or create a new explicit method on OzoneBucket called rewriteKey, passing the expectedGeneration, e.g.:

 ```
-
 public OzoneOutputStream rewriteKey(String volumeName, String bucketName, String keyName, long size, long expectedGeneration, ReplicationConfig replicationConfigOfNewKey) throws IOException

 // Can also add an overloaded version of these methods to pass a metadata map, as with the existing
-// create key method.
-
-
+// create key method.
 ```

 This specification is roughly in line with the existing createKey method:

@@ -149,11 +149,14 @@ An alternative, is to create a new overloaded createKey, but it is probably less

     ReplicationConfig replicationConfig, long expectedUpdateID)
 ```

-The intended usage of this API, is that the existing key details are read, then used to open the new key, and then data is written, eg:
+The intended usage of this API is that the existing key details are read, perhaps inspected, and then used to open the new key before the data is written. In this example, the key is overwritten with the same data in a different replication format. Equally, the key could be rewritten with the original data modified in some application-specific way. The atomic check guards against lost updates if another application thread is attempting to update the same key in a different way.

 ```
 OzoneKeyDetails existingKey = bucket.getKey(keyName);
-try (OutputStream os = bucket.rewriteKey(existingKey.getBucket, existingKey.getVolume,
+// Inspect the key and decide if an overwrite is desired:
+boolean shouldOverwrite = ...
+if (shouldOverwrite) {
+  try (OutputStream os = bucket.rewriteKey(existingKey.getVolume(), existingKey.getBucket(),
       existingKey.getKeyName(), existingKey.getSize(), existingKey.getGeneration(), newRepConfig)) {
     // Copy the original key's data into the rewritten key
     bucket.readKey(keyName).transferTo(os);
   }
+}

From 7c4c3b4113b0973c38c99a8514121c30fc932a71 Mon Sep 17 00:00:00 2001
From: S O'Donnell
Date: Thu, 25 Apr 2024 12:50:23 +0100
Subject: [PATCH 13/13] Add note about failing earlier on block allocation

---
 .../docs/content/design/overwrite-key-only-if-unchanged.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
index eec6b7a11528..c4d4211cabfa 100644
--- a/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
+++ b/hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md
@@ -77,7 +77,9 @@ To do this:

 1. On commit key, the client does not need to send the expectedGeneration again, as the open key contains it.
 1. On commit key, OM validates that the key still exists with the given key name and that its stored updateID is unchanged when compared with the expectedGeneration. If so, the key is committed; otherwise an error is returned to the client.

-Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. Note this also offers protection against concurrent rewrites.
+Note that any change to a key will change the updateID. This is existing behaviour, and committing a rewritten key will also modify the updateID. This also offers protection against concurrent rewrites.
+
+An optional enhancement for large keys is that, on each block allocation, the expectedGeneration can be checked against the key's current generation to ensure it has not changed. This would allow the rewrite to fail early if a large multi-block key is modified while it is being rewritten.

 ### Alternative Proposal
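The optional early-fail check added by the last patch can be sketched as a per-block guard: before handing out each new block for an open rewrite, the stored expectedGeneration is compared with the key's current generation, so a long multi-block rewrite aborts as soon as the key changes underneath it rather than at commit. This is an illustrative model only — BlockAllocGuard and allocateBlock are invented names, not Ozone's real block-allocation code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the optional per-block generation check.
class BlockAllocGuard {

  // keyName -> generation (updateID) of the currently committed key
  final ConcurrentHashMap<String, Long> committedGeneration = new ConcurrentHashMap<>();

  // Called once per block allocated for an open rewrite. Throws as soon
  // as the key no longer matches the generation recorded at open time,
  // instead of letting the client stream all remaining blocks first.
  void allocateBlock(String keyName, long expectedGeneration) {
    Long current = committedGeneration.get(keyName);
    if (current == null || current != expectedGeneration) {
      throw new IllegalStateException(
          "Key " + keyName + " changed during rewrite; failing early");
    }
    // ...normal block allocation would continue here...
  }
}
```

The trade-off is one extra key-table lookup per block allocation in exchange for not writing the remaining blocks of a rewrite that is already doomed to fail at commit.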