Conversation

@sodonnel (Contributor) commented Apr 5, 2024

What changes were proposed in this pull request?

Design doc - see content in the PR.

For any key, but especially a large key, it can take significant time to read and write it. There are scenarios where it would be desirable to replace a key in Ozone, but only if the key has not changed since it was read. This design outlines a minimal change to allow this feature in the Ozone API.
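The conditional-replace semantics being proposed can be sketched with a toy in-memory model (all names here are illustrative; this is not the Ozone client or OM code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model of conditional key replacement: a key is overwritten
// only if its generation still matches the one the caller observed when it
// read the key. Hypothetical sketch, not the Ozone implementation.
public class ConditionalReplaceSketch {
    static class Entry {
        final String value;
        final long generation;
        Entry(String value, long generation) { this.value = value; this.generation = generation; }
    }

    private final Map<String, Entry> keyTable = new HashMap<>();

    /** Plain put: always succeeds and bumps the generation. */
    void put(String key, String value) {
        Entry old = keyTable.get(key);
        keyTable.put(key, new Entry(value, old == null ? 0 : old.generation + 1));
    }

    Entry get(String key) { return keyTable.get(key); }

    /** Replace the key only if it has not changed since the caller read it. */
    boolean rewrite(String key, String newValue, long expectedGeneration) {
        Entry current = keyTable.get(key);
        if (current == null || current.generation != expectedGeneration) {
            return false; // the key changed (or was deleted) in the meantime
        }
        keyTable.put(key, new Entry(newValue, current.generation + 1));
        return true;
    }

    public static void main(String[] args) {
        ConditionalReplaceSketch table = new ConditionalReplaceSketch();
        table.put("k1", "v1");
        long observed = table.get("k1").generation;

        table.put("k1", "v2"); // a concurrent writer changes the key
        System.out.println(table.rewrite("k1", "new", observed)); // false: stale generation

        observed = table.get("k1").generation; // re-read the key
        System.out.println(table.rewrite("k1", "new", observed)); // true: generation matches
    }
}
```

The design below maps this idea onto the real Ozone create/commit protocol, where the write is a stream rather than a single call.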

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10657

@adoroszlai (Contributor) left a comment


Thanks @sodonnel for the design doc. Makes sense to me.

@adoroszlai adoroszlai added documentation Improvements or additions to documentation design labels Apr 8, 2024
@errose28 (Contributor) left a comment


(sorry for the delayed response. I reviewed this Friday and forgot to hit the publish button)

Thanks for writing this up @sodonnel, this will be helpful both now and in the future for people following this change.

  • Can you add upgrade/cross compatibility concerns and handling to the document?
  • We should probably restrict this to a single bucket to allow sharding in the future. Let's call that out explicitly.
  • I think we are only handling this for OBS buckets right now, but with plans to have handling for FSO later. Maybe a "scope" section or something like that would help to define what we are designing here and what will be handled in later phases.

@sodonnel (Contributor, Author) commented Apr 9, 2024

We should probably restrict this to a single bucket to allow sharding in the future. Let's call that out explicitly.

I don't understand what this means. A key is in a single bucket, and the operations are on a single key ...

I think we are only handling this for OBS buckets right now, but with plans to have handling for FSO later.

The intention is to handle it for OBS and FSO buckets. OBS is to be worked on first.


### Scope

The intention is to first implement this for OBS buckets. Then address FSO buckets.
Contributor

What is the additional complexity to do it for both?

Contributor Author

With FSO we need to decide on things such as what happens if the key is moved to a new location. I feel there is enough to figure out with OBS buckets without getting involved in FSO buckets at this stage. It is very much the intention to figure out FSO buckets with an addition to this design after we get OBS buckets working.

Contributor Author

I added a note to the doc about this.

@kerneltime (Contributor)

The general approach seems fine. To elaborate on my previous feedback in the code PR: I think the internal implementation choices leak out too much into the public/client-facing APIs. I would like this PR to be the basis of a first-class feature that we can expose via S3 APIs, analogous to Google's object store. For now, the main change I would like to focus on is nomenclature cleanup (use "generation" — it is well understood in this context in other storage systems, whereas "updateID" is a new name we are introducing; we can choose to treat this as a building block for future features) and API name cleanup.

@sodonnel (Contributor, Author)

@kerneltime @errose28 I think I have addressed all comments and added them to the design. Please take another look and let me know if there is anything else you would like changed or added.

@errose28 (Contributor) left a comment


Thanks for the updates @sodonnel

"I don't understand what this means. A key is in a single bucket, and the operations are on a single key ... "

I was thinking of this case, but actually this should be fine:

genID = getInfo(/v1/b1/k1)
ostream = create(/v1/b1/k1, newRepType, genID)
read /v2/b2/k2 into ostream
commit(/v1/b1/k1)

Overall you can disregard this comment.


### Wire Protocol

1. The expectedGeneration needs to be added to the KeyInfo protobuf object so it can be stored in the openKey table.
Contributor

OmKeyInfo is used in many places outside of just the open key table:

  • All open key, committed key, deleted key tables. I wouldn't really consider these "wire protocol" since they aren't part of the network.
  • On the client as part of RpcClient#getKeyInfo, where it is then wrapped/converted to OzoneKeyDetails

Comment on lines 156 to 158
try (OutputStream os = bucket.rewriteKey(existingKey.getBucket(), existingKey.getVolume(),
    existingKey.getKeyName(), existingKey.getSize(), existingKey.getGeneration(), newRepConfig)) {
  os.write(bucket.readKey(keyName));
Contributor

Wouldn't it be easier to just give rewriteKey the path to the key and the fields you want to change, and have the method do the get and put operations inside of it? This seems like a lot of parameter copying for the common use case.

Contributor

Basically size and generation parameters could be removed and the method could pull them itself.

Contributor

ostream rewriteKey(key, repType) {
    genID = getInfo(key)
    return create(key, genID)
}

Contributor Author

The key is returned to the client as an object (KeyInfo) - in an earlier version I simply passed the keyInfo, but you and @kerneltime didn't seem to like that, so I changed it to look like the existing create API.

In the general case, whatever application is using the rewrite API has to pull a key's details and then decide, based on the key metadata or content, whether it wants to rewrite it or not. Therefore having the rewrite method pull the key details will result in:

  1. Two pulls from OM to get the key info when one would have done.
  2. The potential for different key details to be returned between 1 and 2, and the details from 2 may not want to be overwritten by the application.

The point of the API is that the application pulls some details and then makes a decision based on those - this key needs to be updated. And it determines that "this key" has not changed via the generation / updateID it received at that time.

Therefore I believe the method must be passed the keyInfo or the details of the key it is to overwrite.

Contributor

Therefore I believe the method must be passed the keyInfo or the details of the key it is to overwrite.

Ok I see my comment here wasn't clear. I was trying to abstract generation ID inside the rewrite method, not get rid of the key parameters. Key parameters here are fine.

in an earlier version I simply passed the keyInfo, but you and @kerneltime didn't seem to like that

This makes the objection sound arbitrary. There was a technical justification for that point. Separating the key parameters like this is consistent with RpcClient#createKey. This also leaves OmKeyInfo as an output given to the client, matching its current usage, rather than input from the client to OM. That's what KeyArgs would be for.

The point of the API is that the application pulls some details and then makes a decision based on those - this key needs to be updated.

This actually relates to both this comment and this one

This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten

I think I see the confusion. They do work for the use case presented here in the document, where there are no reads between when the generation ID is read and the rewrite starts. "Use Cases" should really be its own section in the doc with more thorough examples, as it helps answer questions like this.
Where the suggestion here doesn't work is for a use case that your comments seem to imply but the document does not define:

OzoneKeyDetails existingKey = bucket.getKey(keyName);
if (existingKey.getReplicationType() == RATIS) { // Important addition that changes the guarantees required from the client methods and API
  try (OutputStream os = bucket.rewriteKey(existingKey.getBucket(), existingKey.getVolume(),
      existingKey.getKeyName(), existingKey.getSize(), existingKey.getGeneration(), EC)) {
    os.write(bucket.readKey(keyName));
  }
}

So the proposal here looks good, but a use cases section will help illustrate both now and in the future why certain decisions were made instead of others.

2. Client attaches the expectedGeneration to the commit request to indicate a rewrite instead of a put
3. OM checks the passed generation against the stored update ID and returns the corresponding success/fail result

The advantage of this alternative approach is that it does not require the expectedGeneration to be stored in the openKey table.
Contributor

More generally, it is not required to be stored in OmKeyInfo, which is stored in all key related tables. I know an empty protobuf field will not take up extra space, but it still reduces the scope of the change.


The advantage of this alternative approach is that it does not require the expectedGeneration to be stored in the openKey table.

However, the client code required to implement this appears more complex, due to having different key commit logic for Ratis and EC and the parameter needing to be passed through many method calls.
@errose28 (Contributor) commented Apr 23, 2024

This needs to be quantified. "appears complex" seems like actual investigation of this approach was not done. The doc can cite #5524 and the atomicKeyCreation field added there. Only 3 files were changed to add this field:

  • ECKeyOutputStream
  • KeyDataStreamOutput
  • KeyOutputStream

Now whether that is considered an excessive amount of change to rule out this approach is debatable, but at least the doc provides readers with all the information.

Contributor Author

I added a reference to that PR.

@errose28 (Contributor)

I've been able to think about this a bit more and I think a good way to differentiate the approaches is by who manages the update ID during the write. The two most intuitive options would be that either the client manages the update ID (currently the second proposal), or the server manages the update ID (not yet discussed). In the first proposal listed in the doc, both the client and the server are managing the update ID at different parts of the operation and I think this is why it feels "off" to me. Hopefully defining the options in this way can clarify the differences:

  • Server manages the update ID:

    1. The key create request would take a flag indicating that this should be an atomic replacement of an existing key.
    2. The server saves the update ID at the time of create in the open key table, and returns an outputstream to the client.
    3. The client reads, writes, and commits the data to rewrite the same as before.
    4. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    ostream = create(/v1/b1/k1, newRepType, rewrite=true)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1)
    
  • Client manages the update ID (currently proposal 2 in the doc):

    1. The key create request would return an outputstream to the client the same as before.
    2. The client gets the update ID of the key to overwrite.
    3. The client reads and writes the data to rewrite the same as before.
    4. The client commits the key, including the update ID
    5. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    stream = create(/v1/b1/k1, newRepType)
    genID = getInfo(/v1/b1/k1)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1, genID)
    
  • Client and server manage the update ID (currently proposal 1 in the doc):

    1. The client gets the update ID of the key to overwrite.
    2. The key create request would take this update ID and return an outputstream to the client.
    3. The client reads and writes the data to rewrite the same as before.
    4. The client commits the key, the same as before.
    5. The server checks the update ID saved with the open key on commit.
    • Pseudocode:
    genID = getInfo(/v1/b1/k1)
    ostream = create(/v1/b1/k1, newRepType, genID)
    read /v1/b1/k1 into ostream
    commit(/v1/b1/k1)
    

To me, either of the first two options, where only one side is responsible for storing the update ID as the write is ongoing, makes sense. The third option is a mashup of the others, which IMO is the least intuitive option. The client reads something from the server and then immediately gives the same thing back for the server to manage. It also unnecessarily spreads the update ID into the client/server and server/disk protocols when it only needs to be in one or the other.

@sodonnel (Contributor, Author)

Server manages the update ID

This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten. The key on the server could have changed in the meantime, resulting in lost updates. The client must pass the generation it expects to overwrite based on what it has read. It cannot just trust that whatever is currently on the server has not changed. That is the entire point of this change.
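The lost-update race described here can be sketched as a minimal timeline simulation (hypothetical generation numbers; not Ozone code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal simulation of the lost-update race when the server snapshots the
// generation itself at create time, instead of the client passing the
// generation it observed when it inspected the key. Illustrative only.
public class LostUpdateSketch {
    public static void main(String[] args) {
        AtomicLong generation = new AtomicLong(5);  // server-side generation of k1

        long observedByClient = generation.get();   // t0: client reads k1 and inspects it
        generation.incrementAndGet();               // t1: a concurrent writer updates k1
        long snapshotAtCreate = generation.get();   // t2: server snapshots on create(rewrite=true)

        // t3: commit-time check.
        boolean serverManagedCommit = generation.get() == snapshotAtCreate;  // passes: t1's update is silently lost
        boolean clientSuppliedCommit = generation.get() == observedByClient; // fails: the change is detected

        System.out.println(serverManagedCommit);  // true  -> unsafe: lost update
        System.out.println(clientSuppliedCommit); // false -> safe: rewrite is rejected
    }
}
```

Only the generation observed by the client at inspection time (t0) can detect a write that lands between the inspection and the create call.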

Client manages the update ID (currently proposal 2 in the doc):

  • The key create request would return an outputstream to the client the same as before.
  • The client gets the update ID of the key to overwrite.
  • The client reads and writes the data to rewrite the same as before.

This is doable, but not quite as you described. The client would need to read the existing key first to get its metadata. Then, ideally, it passes the generation on key open so it can fail fast. If the key has already changed, there is little point in going ahead and writing it all out only to fail at the end.

An addition which I had not yet considered is that even on block allocation the generation could be checked against that which is in the key table, so for a large object it could be checked at each block boundary too. I have not looked at the block allocation code, but I think it must persist the allocated blocks in the open key table along with the key to allow them to be garbage collected later if the client should crash. I am also not sure what the block allocation protocol looks like, but by storing the expectedGeneration on the server, we avoid any changes to the block allocation protocol and gain this feature.
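The fail-fast idea could look roughly like this, under the assumption that the expected generation is stored server-side with the open key (all names here are made up for illustration):

```java
// Sketch of checking the expected generation at every block allocation, so a
// large rewrite fails fast instead of only at commit time. Hypothetical names;
// not the Ozone block allocation code.
public class BlockAllocationCheckSketch {
    static long currentGeneration = 7;         // generation in the key table
    static final long expectedGeneration = 7;  // stored with the open key at create time

    /** Called for each new block; aborts as soon as the key has changed. */
    static long allocateBlock(int blockIndex) {
        if (currentGeneration != expectedGeneration) {
            throw new IllegalStateException("key changed, aborting rewrite at block " + blockIndex);
        }
        return 1000 + blockIndex; // pretend block ID
    }

    public static void main(String[] args) {
        System.out.println(allocateBlock(0));  // generation still matches
        currentGeneration = 8;                 // key rewritten by another client mid-write
        try {
            allocateBlock(1);                  // fails fast, sparing the remaining writes
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This only works if the server holds the expected generation for the duration of the write, which is what makes the server-stored approach attractive for large keys.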

To me, either of the first two options where only one side is responsible for storing the update ID as the write is ongoing make sense. The third option is a mashup of the others, which IMO is the least intuitive option.

But the third option is how things currently work for the other metadata fields in a key. Doing it differently is less intuitive, as this solution would then go against how all the other fields are stored. To give an analogy from web development: the current structure keeps the session store on the server, rather than in a cookie. What you want is to split this new area into something like a cookie session alongside the server-side session we already have. You have already said that you don't like the current approach, but we are not going to change that. Sometimes it is better to stick with the conventions already in place, rather than going in a new direction that is possibly better, but possibly not. In my opinion, both have their pros and cons and there is no clear best answer.

The HSync code has added information to the openKey table. It added it to the MetaData map, so it avoided adding an extra field to the protobuf at all. That is also something I could consider, but it would be less efficient and sidesteps a lot of the static type checking Java can do for us, so bugs are easier to introduce.

@errose28 (Contributor)

Server manages the update ID
This does not work in the general case, where a client reads a key, inspects it and decides that it needs to be rewritten
... That is the entire point of this change.

This comes back to this discussion. You are correct that this doesn't work for when there is an "inspect" between the read and write, but this doesn't happen in the one example provided by the document. It seems the document is missing a section demonstrating "the entire point of the change".

An addition which I had not yet considered, is that even on block allocation the generation could be checked against that which is in the key table, so for a large object it could be check at each block boundary too.

This is a great point. I also had only thought about failing early in the context of create key, not on each block operation. Storing the expected ID on the server makes the check on each block boundary possible. Let's add it to the doc.

Sometimes it is better to stick with the conventions already in place, rather than going in a new direction that is possibly better, but possibly not. In my opinion, both have their pros and cons and there is no clear best answer.

In this context I was trying to look at the approaches from a top-down API level: what does the client see, and is it clear what is happening? While the conventions you mention here are important to consider too, they are internal details that we have already discussed. It seems there has not been much discussion on what things look like to the client, which was the point of this comment.

The reason that the third approach looks strange from the client's perspective is that if you visualize generation ID as a sort of optimistic lock, it looks like the lock is released, and thus loses its guarantees, when create is called. For example:

info, genID = getInfo(key) // lock "acquired" optimistically
if info != expected:
  ostream = rewrite(key, genID) // To the casual reader, it looks like the lock is "released" here.
  write to ostream // Is this safe? Yes, even though it doesn't look like it.
  commit(key) // Is the lock from earlier still respected? Yes, but unintuitive from the API structure due to server "magic".

However, I think your idea for failing early on block allocations is solid and outweighs the odd looks of this API and wider spread proto changes. With some doc updates to outline this case as only possible when the ID is passed on create, I'm ok to go forward with this implementation.

@sodonnel (Contributor, Author)

@errose28 I have enhanced the sections about the use case to make it clearer that "immediate rewrite" is not the only goal. I have also added a note about the "fail early" on block allocation idea.

Please check and let me know if you are happy, and then I think we can commit this design PR and return to the original code PR after moving it to a branch rather than master.

@errose28 (Contributor) left a comment

Thanks for providing this document @sodonnel this will be helpful for others following this feature development and as a spec to use in the code reviews.

For reconciliation we are leaving the design doc PR open until that phase of development is complete so we can easily update the doc if we find problems in the original plan when implementing. If you think that workflow would be helpful you can do the same, or merge it.

Also there is a lot of detail here now. If you could update #6385 / HDDS-10527 with how much of the content is in scope for that one change and what will be done in follow up tasks that will help as well.

@adoroszlai (Contributor)

For reconciliation we are leaving the design doc PR open until that phase of development is complete so we can easily update the doc if we find problems in the original plan when implementing.

My 2 cents: creating follow-up PRs for any design change based on implementation experience makes them more visible. In the single PR case, Git history preserves only the final commit, and readers have to refer to the PR (which is GitHub-specific).

@sodonnel sodonnel merged commit 1eaddc4 into apache:master Apr 29, 2024
@sodonnel (Contributor, Author)

I went ahead and merged it. We can create followup PRs based on the implementation, and I don't like "completed" PRs hanging around in the queue unnecessarily.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024