Conversation

@sodonnel (Contributor) commented Apr 5, 2024

What changes were proposed in this pull request?

Design doc - see content in the PR.

For any key, but especially a large key, it can take significant time to read and write it. There are scenarios where it would be desirable to replace a key in Ozone, but only if the key has not changed since it was read. This design outlines a minimal change to allow this feature in the Ozone API.
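The conditional-replace semantics being proposed can be sketched with a toy in-memory model (all names here are illustrative; this is not the Ozone client or OM code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model of conditional key replacement: a key is overwritten
// only if its generation still matches the one the caller observed when it
// read the key. Hypothetical sketch, not the Ozone implementation.
public class ConditionalReplaceSketch {
    static class Entry {
        final String value;
        final long generation;
        Entry(String value, long generation) { this.value = value; this.generation = generation; }
    }

    private final Map<String, Entry> keyTable = new HashMap<>();

    /** Plain put: always succeeds and bumps the generation. */
    void put(String key, String value) {
        Entry old = keyTable.get(key);
        keyTable.put(key, new Entry(value, old == null ? 0 : old.generation + 1));
    }

    Entry get(String key) { return keyTable.get(key); }

    /** Replace the key only if it has not changed since the caller read it. */
    boolean rewrite(String key, String newValue, long expectedGeneration) {
        Entry current = keyTable.get(key);
        if (current == null || current.generation != expectedGeneration) {
            return false; // the key changed (or was deleted) in the meantime
        }
        keyTable.put(key, new Entry(newValue, current.generation + 1));
        return true;
    }

    public static void main(String[] args) {
        ConditionalReplaceSketch table = new ConditionalReplaceSketch();
        table.put("k1", "v1");
        long observed = table.get("k1").generation;

        table.put("k1", "v2"); // a concurrent writer changes the key
        System.out.println(table.rewrite("k1", "new", observed)); // false: stale generation

        observed = table.get("k1").generation; // re-read the key
        System.out.println(table.rewrite("k1", "new", observed)); // true: generation matches
    }
}
```

The design below maps this idea onto the real Ozone create/commit protocol, where the write is a stream rather than a single call.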

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10657

@adoroszlai (Contributor) left a comment


Thanks @sodonnel for the design doc. Makes sense to me.

@adoroszlai adoroszlai added documentation Improvements or additions to documentation design labels Apr 8, 2024
@errose28 (Contributor) left a comment


(sorry for the delayed response. I reviewed this Friday and forgot to hit the publish button)

Thanks for writing this up @sodonnel, this will be helpful both now and in the future for people following this change.

  • Can you add upgrade/cross compatibility concerns and handling to the document?
  • We should probably restrict this to a single bucket to allow sharding in the future. Let's call that out explicitly.
  • I think we are only handling this for OBS buckets right now, but with plans to have handling for FSO later. Maybe a "scope" section or something like that would help to define what we are designing here and what will be handled in later phases.

@sodonnel (Contributor, Author) commented Apr 9, 2024

We should probably restrict this to a single bucket to allow sharding in the future. Let's call that out explicitly.

I don't understand what this means. A key is in a single bucket, and the operations are on a single key ...

I think we are only handling this for OBS buckets right now, but with plans to have handling for FSO later.

The intention is to handle it for OBS and FSO buckets. OBS is to be worked on first.


### Scope

The intention is to first implement this for OBS buckets. Then address FSO buckets.
Contributor

What is the additional complexity to do it for both?

Contributor Author

With FSO we need to decide on things such as what happens if the key is moved to a new location. I feel there is enough to figure out with OBS buckets without getting involved in FSO buckets at this stage. It is very much the intention to figure out FSO buckets with an addition to this design after we get OBS buckets working.

Contributor Author

I added a note to the doc about this.

@kerneltime (Contributor)

The general approach seems fine. To elaborate on my previous feedback in the code PR: I think the internal implementation choices leak out too much into the public/client-facing APIs. I would like this PR to be the basis of a first-class feature that we can expose via S3 APIs, analogous to Google's object store. For now, the main change I would like to focus on is nomenclature cleanup (use "generation" — it is well understood in this context in other storage systems, whereas "updateID" is a new name we are introducing; we can choose to treat this as a building block for future features) and API name cleanup.

@sodonnel (Contributor, Author)

@kerneltime @errose28 I think I have addressed all comments and added them to the design. Please take another look and let me know if there is anything else you would like changed or added.

@errose28 (Contributor) left a comment


Thanks for the updates @sodonnel

"I don't understand what this means. A key is in a single bucket, and the operations are on a single key ... "

I was thinking of this case, but actually this should be fine:

genID = getInfo(/v1/b1/k1)
ostream = create(/v1/b1/k1, newRepType, genID)
read /v2/b2/k2 into ostream
commit(/v1/b1/k1)

Overall you can disregard this comment.


### Wire Protocol

1. The expectedGeneration needs to be added to the KeyInfo protobuf object so it can be stored in the openKey table.
Contributor

OmKeyInfo is used in many places outside of just the open key table:

  • All open key, committed key, deleted key tables. I wouldn't really consider these "wire protocol" since they aren't part of the network.
  • On the client as part of RpcClient#getKeyInfo, where it is then wrapped/converted to OzoneKeyDetails

Comment on lines 156 to 158
try (OutputStream os = bucket.rewriteKey(existingKey.getBucket(), existingKey.getVolume(),
    existingKey.getKeyName(), existingKey.getSize(), existingKey.getGeneration(), newRepConfig)) {
  os.write(bucket.readKey(keyName));
Contributor

Wouldn't it be easier to just give rewriteKey the path to the key and the fields you want to change, and have the method do the get and put operations inside of it? This seems like a lot of parameter copying for the common use case.

Contributor

Basically size and generation parameters could be removed and the method could pull them itself.

Contributor

ostream rewriteKey(key, repType) {
    genID = getInfo(key)
    return create(key, genID)
}

Contributor Author

The key is returned to the client as an object (KeyInfo) - in an earlier version I simply passed the keyInfo, but you and @kerneltime didn't seem to like that, so I changed it to look like the existing create API.

In the general case, whatever application is using the rewrite API has to pull a key's details and then decide, based on the key metadata or content, whether it wants to rewrite it or not. Therefore having the rewrite method pull the key details will result in:

  1. Two pulls from OM to get the key info when one would have done.
  2. The potential for different key details to be returned between 1 and 2, and the details from 2 may not want to be overwritten by the application.

The point of the API is that the application pulls some details and then makes a decision based on those - this key needs to be updated. And it determines that "this key" has not changed via the generation / updateID it received at that time.

Therefore I believe the method must be passed the keyInfo or the details of the key it is to overwrite.

Contributor

Therefore I believe the method must be passed the keyInfo or the details of the key it is to overwrite.

Ok I see my comment here wasn't clear. I was trying to abstract generation ID inside the rewrite method, not get rid of the key parameters. Key parameters here are fine.

in an earlier version I simply passed the keyInfo, but you and @kerneltime didn't seem to like that

This makes the objection sound arbitrary. There was a technical justification for that point. Separating the key parameters like this is consistent with RpcClient#createKey. This also leaves OmKeyInfo as an output given to the client, matching its current usage, rather than input from the client to OM. That's what KeyArgs would be for.

The point of the API is that the application pulls some details and then makes a decision based on those - this key needs to be updated.

This actually relates to both this comment and this one

This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten

I think I see the confusion. They do work for the use case presented here in the document, where there are no reads between when the generation ID is read and the rewrite starts. "Use Cases" should really be its own section in the doc with more thorough examples, as it helps answer questions like this.
Where the suggestion here doesn't work is for a use case that your comments seem to imply but the document does not define:

OzoneKeyDetails existingKey = bucket.getKey(keyName);
if (existingKey.getReplicationType() == RATIS) { // Important addition that changes the guarantees required from the client methods and API
  try (OutputStream os = bucket.rewriteKey(existingKey.getBucket(), existingKey.getVolume(),
      existingKey.getKeyName(), existingKey.getSize(), existingKey.getGeneration(), EC)) {
    os.write(bucket.readKey(keyName));
  }
}

So the proposal here looks good, but a use cases section will help illustrate both now and in the future why certain decisions were made instead of others.

2. Client attaches the expectedGeneration to the commit request to indicate a rewrite instead of a put
3. OM checks the passed generation against the stored update ID and returns the corresponding success/fail result

The advantage of this alternative approach is that it does not require the expectedGeneration to be stored in the openKey table.
Contributor

More generally, it is not required to be stored in OmKeyInfo, which is stored in all key related tables. I know an empty protobuf field will not take up extra space, but it still reduces the scope of the change.


The advantage of this alternative approach is that it does not require the expectedGeneration to be stored in the openKey table.

However, the client code required to implement this appears more complex, due to having different key commit logic for Ratis and EC and the parameter needing to be passed through many method calls.
@errose28 (Contributor) commented Apr 23, 2024

This needs to be quantified. "appears complex" seems like actual investigation of this approach was not done. The doc can cite #5524 and the atomicKeyCreation field added there. Only 3 files were changed to add this field:

  • ECKeyOutputStream
  • KeyDataStreamOutput
  • KeyOutputStream

Now whether that is considered an excessive amount of change to rule out this approach is debatable, but at least the doc provides readers with all the information.

Contributor Author

I added a reference to that PR.

@errose28 (Contributor)

I've been able to think about this a bit more and I think a good way to differentiate the approaches is by who manages the update ID during the write. The two most intuitive options would be that either the client manages the update ID (currently the second proposal), or the server manages the update ID (not yet discussed). In the first proposal listed in the doc, both the client and the server are managing the update ID at different parts of the operation and I think this is why it feels "off" to me. Hopefully defining the options in this way can clarify the differences:

  • Server manages the update ID:

    1. The key create request would take a flag indicating that this should be an atomic replacement of an existing key.
    2. The server saves the update ID at the time of create in the open key table, and returns an outputstream to the client.
    3. The client reads, writes, and commits the data to rewrite the same as before.
    4. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    ostream = create(/v1/b1/k1, newRepType, rewrite=true)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1)
    
  • Client manages the update ID (currently proposal 2 in the doc):

    1. The key create request would return an outputstream to the client the same as before.
    2. The client gets the update ID of the key to overwrite.
    3. The client reads and writes the data to rewrite the same as before.
    4. The client commits the key, including the update ID
    5. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    stream = create(/v1/b1/k1, newRepType)
    genID = getInfo(/v1/b1/k1)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1, genID)
    
  • Client and server manage the update ID (currently proposal 1 in the doc):

    1. The client gets the update ID of the key to overwrite.
    2. The key create request would take this update ID and return an outputstream to the client.
    3. The client reads and writes the data to rewrite the same as before.
    4. The client commits the key, the same as before.
    5. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    genID = getInfo(/v1/b1/k1)
    ostream = create(/v1/b1/k1, newRepType, genID)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1)
    

To me, either of the first two options, where only one side is responsible for storing the update ID as the write is ongoing, makes sense. The third option is a mashup of the others, which IMO is the least intuitive option. The client reads something from the server and then immediately gives the same thing back for the server to manage. It also unnecessarily spreads the update ID into the client/server and server/disk protocols when it only needs to be in one or the other.

@sodonnel (Contributor, Author)

Server manages the update ID

This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten. The key on the server could have changed in the meantime, resulting in lost updates. The client must pass the generation it expects to overwrite based on what it has read. It cannot just trust that whatever is currently on the server has not changed. That is the entire point of this change.
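The lost-update race described here can be sketched as a minimal timeline simulation (hypothetical generation numbers; not Ozone code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal simulation of the lost-update race when the server snapshots the
// generation itself at create time, instead of the client passing the
// generation it observed when it inspected the key. Illustrative only.
public class LostUpdateSketch {
    public static void main(String[] args) {
        AtomicLong generation = new AtomicLong(5);  // server-side generation of k1

        long observedByClient = generation.get();   // t0: client reads k1 and inspects it
        generation.incrementAndGet();               // t1: a concurrent writer updates k1
        long snapshotAtCreate = generation.get();   // t2: server snapshots on create(rewrite=true)

        // t3: commit-time check.
        boolean serverManagedCommit = generation.get() == snapshotAtCreate;  // passes: t1's update is silently lost
        boolean clientSuppliedCommit = generation.get() == observedByClient; // fails: the change is detected

        System.out.println(serverManagedCommit);  // true  -> unsafe: lost update
        System.out.println(clientSuppliedCommit); // false -> safe: rewrite is rejected
    }
}
```

Only the generation observed by the client at inspection time (t0) can detect a write that lands between the inspection and the create call.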

Client manages the update ID (currently proposal 2 in the doc):

  • The key create request would return an outputstream to the client the same as before.
  • The client gets the update ID of the key to overwrite.
  • The client reads and writes the data to rewrite the same as before.

This is doable, but not quite as you described. The client would need to read the existing key first to get its metadata. Then, ideally, it passes the generation on key open so it can fail fast. If the key has already changed, there is little point in going ahead and writing it all out only to fail at the end.

An addition which I had not yet considered is that even on block allocation the generation could be checked against that which is in the key table, so for a large object it could be checked at each block boundary too. I have not looked at the block allocation code, but I think it must persist the allocated blocks in the open key table along with the key to allow them to be garbage collected later if the client should crash. I am also not sure what the block allocation protocol looks like, but by storing the expectedGeneration on the server, we avoid any changes to the block allocation protocol and gain this feature.
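The fail-fast idea could look roughly like this, under the assumption that the expected generation is stored server-side with the open key (all names here are made up for illustration):

```java
// Sketch of checking the expected generation at every block allocation, so a
// large rewrite fails fast instead of only at commit time. Hypothetical names;
// not the Ozone block allocation code.
public class BlockAllocationCheckSketch {
    static long currentGeneration = 7;         // generation in the key table
    static final long expectedGeneration = 7;  // stored with the open key at create time

    /** Called for each new block; aborts as soon as the key has changed. */
    static long allocateBlock(int blockIndex) {
        if (currentGeneration != expectedGeneration) {
            throw new IllegalStateException("key changed, aborting rewrite at block " + blockIndex);
        }
        return 1000 + blockIndex; // pretend block ID
    }

    public static void main(String[] args) {
        System.out.println(allocateBlock(0));  // generation still matches
        currentGeneration = 8;                 // key rewritten by another client mid-write
        try {
            allocateBlock(1);                  // fails fast, sparing the remaining writes
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This only works if the server holds the expected generation for the duration of the write, which is what makes the server-stored approach attractive for large keys.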

To me, either of the first two options where only one side is responsible for storing the update ID as the write is ongoing make sense. The third option is a mashup of the others, which IMO is the least intuitive option.

But the third option is how things currently work for the other metadata fields in a key. Doing it differently is less intuitive, as this solution would then go against how all the other fields are stored. To give an analogy from web development: the current structure keeps the session store on the server, rather than in a cookie. What you want is to split this new area into something like a cookie session alongside the server-side session we already have. You have already said that you don't like the current approach, but we are not going to change that. Sometimes it is better to stick with the conventions already in place, rather than going in a new direction that is possibly better, but possibly not. In my opinion, both have their pros and cons and there is no clear best answer.

The HSync code has added information to the openKey table. It added it to the MetaData map, so it avoided adding an extra field to the protobuf at all. That is also something I could consider, but it would be less efficient and sidesteps a lot of the static type checking Java can do for us, so bugs are easier to introduce.

@errose28 (Contributor)

Server manages the update ID
This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten
... That is the entire point of this change.

This comes back to this discussion. You are correct that this doesn't work for when there is an "inspect" between the read and write, but this doesn't happen in the one example provided by the document. It seems the document is missing a section demonstrating "the entire point of the change".

An addition which I had not yet considered, is that even on block allocation the generation could be checked against that which is in the key table, so for a large object it could be check at each block boundary too.

This is a great point. I also had only thought about failing early in the context of create key, not on each block operation. Storing the expected ID on the server makes the check on each block boundary possible. Let's add it to the doc.

Sometimes it is better to stick with the conventions already in place, rather than going in a new direction that is possibly better, but possibly not. In my opinion, both have their pros and cons and there is no clear best answer.

In this context I was trying to look at the approaches from a top-down API level: what does the client see, and is it clear what is happening? While the conventions you mention here are important to consider too, they are internal details that we have already discussed. It seems there has not been much discussion on what things look like to the client, which was the point of this comment.

The reason that the third approach looks strange from the client's perspective is that if you visualize generation ID as a sort of optimistic lock, it looks like the lock is released, and thus loses its guarantees, when create is called. For example:

info, genID = getInfo(key) // lock "acquired" optimistically
if info != expected:
  ostream = rewrite(key, genID) // To the casual reader, it looks like the lock is "released" here.
  write to ostream // Is this safe? Yes, even though it doesn't look like it.
  commit(key) // Is the lock from earlier still respected? Yes, but unintuitive from the API structure due to server "magic".

However, I think your idea for failing early on block allocations is solid and outweighs the odd looks of this API and wider spread proto changes. With some doc updates to outline this case as only possible when the ID is passed on create, I'm ok to go forward with this implementation.

@sodonnel (Contributor, Author)

@errose28 I have enhanced the sections about the use case to make it clearer that "immediate rewrite" is not the only goal. I have also added a note about the "fail early" on block allocation idea.

Please check and let me know if you are happy, and then I think we can commit this design PR and return to the original code PR after moving it to a branch rather than master.

@errose28 (Contributor) left a comment

Thanks for providing this document @sodonnel this will be helpful for others following this feature development and as a spec to use in the code reviews.

For reconciliation we are leaving the design doc PR open until that phase of development is complete so we can easily update the doc if we find problems in the original plan when implementing. If you think that workflow would be helpful you can do the same, or merge it.

Also there is a lot of detail here now. If you could update #6385 / HDDS-10527 with how much of the content is in scope for that one change and what will be done in follow up tasks that will help as well.

@adoroszlai (Contributor)

For reconciliation we are leaving the design doc PR open until that phase of development is complete so we can easily update the doc if we find problems in the original plan when implementing.

My 2 cents: creating follow-up PRs for any design change based on implementation experience makes them more visible. In the single PR case, Git history preserves only the final commit, and readers have to refer to the PR (which is GitHub-specific).

@sodonnel sodonnel merged commit 1eaddc4 into apache:master Apr 29, 2024
@sodonnel (Contributor, Author)

I went ahead and merged it. We can create followup PRs based on the implementation, and I don't like "completed" PRs hanging around in the queue unnecessarily.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024