HDDS-13513 Ozone Event Notification Design #8871
base: master
Conversation
> Example configuration to provide a kafka instance:
> ```xml
> <configuration>
> ```
Maybe as future work, make it possible to configure notification on a per-bucket basis, which is actually more AWS-like. That way each bucket owner can decide whether to send notifications, or send them to different kafka topics, for example.
Optionally, support AWS's PutBucketNotificationConfiguration and GetBucketNotificationConfiguration
peterxcli
left a comment
We should also consider the multiple OM listener case:
- Listeners should acquire some kind of lease to prevent them from doing the same job at the same time.
- The lastAppliedTxn progress of message submission should reach consensus, not just persist locally on the OM listener; otherwise, if the lease is acquired by another listener, it wouldn't know the actual progress.

Btw, just FYI, there is some discussion in #8449; you can take it as a reference.
Thanks for adding a design for this. Based on the discussion in the community sync this morning, here are some high level comments.

Comments on the specification

Comments on the design
Slight alternative approach: co-locating a listener on each OM

I haven't totally worked out this proposal but wanted to put it out here since it seems to address some of the previous concerns. Instead of a full OM, the listener can be a small process that is co-located on each OM node. If its workload ends up being especially light, it may even be able to be a thread within the OM itself.

The main OM would be configured to move old Ratis log files to a backup directory instead of deleting them. This keeps its working directory clean and will not affect startup time due to a large number of files. I did a quick look through Ratis and it doesn't look like this is supported currently, but it could be added.

The listener can read log entries from the backup dir, and then the main OM dir. As a listener, it will be notified of the cluster's apply index, which it can use to determine which log files correspond to valid events. It will also know the current leader through election events, so the instances running on followers can pause running. This listener can then push events to the plugged-in consumers based on the Ratis logs, and purge them from the backup dir when the consumers have acked them. It does not need to consume the Ratis logs that come through the Ratis listener API since it will use the local copies on the OM.

We would still need to hash out how the at-least-once delivery specification from Ozone to the consumer will fit with leader changes in this model.
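A rough sketch of that co-located listener loop, with hypothetical names (`LogEntry`, `CoLocatedListener`, `on_apply_index`); this is not the Ratis listener API, just an illustration of gating delivery on the apply index and purging entries after consumer ack:

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    index: int
    event: str

class CoLocatedListener:
    """Consumes locally archived Raft log entries up to the cluster's last
    applied index, then purges entries once consumers have acked them.
    All names here are illustrative, not Ozone/Ratis APIs."""

    def __init__(self):
        self.backup = []          # entries moved out of the main Raft log dir
        self.applied_index = 0    # learned via a listener callback
        self.delivered = []       # stands in for pushes to plugged-in consumers

    def on_apply_index(self, index):
        # Only entries at or below the apply index describe committed state.
        self.applied_index = index

    def poll(self):
        ready = [e for e in self.backup if e.index <= self.applied_index]
        for entry in ready:
            self.delivered.append(entry.event)  # push to consumer, await ack
        # Purge acked entries from the backup dir.
        self.backup = [e for e in self.backup if e.index > self.applied_index]
        return len(ready)
```

Entries past the apply index stay in the backup until a later `on_apply_index` makes them deliverable, which is where the leader-change question above comes in.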
> Filtering of events or paths/buckets
> Persistent storage of notification messages
> Asynchronous delivery
High availability and consistency.
It sounds like the design does not consider availability. If the Listener goes down, there's no guarantee the event notifications will be processed.
Consistency: does it guarantee in-order delivery of events? Any delivery latency assumptions or SLAs?
Interesting that S3 does not guarantee event order. It seems like this would make it difficult to build any sort of replication based system around these events. We may be able to support this depending how we are working with the Ratis logs.
> ### At-least-once delivery
>
> At-least-once delivery cannot be guaranteed without some form of persistence; however, we want to avoid persisting the notification messages themselves. One approach to achieving this is by "replaying" the proposals.
> If we persist the lastAppliedTxn after each notification is sent successfully, on a restart we could reset the lastAppliedTxn on the state machine and replay proposals in a "non-committal" mode in order to generate any missing notifications up to the current proposal.
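A minimal sketch of this replay scheme, assuming a hypothetical `Notifier` that advances its persisted `last_applied_txn` only after a send succeeds (none of these names come from Ozone code). Re-delivery after a crash is exactly what gives the at-least-once property:

```python
class Notifier:
    """Illustrative replay-based at-least-once delivery: on restart the
    notifier is constructed with the last persisted txn and proposals are
    replayed through on_txn; anything already notified is skipped."""

    def __init__(self, persisted_last_txn=0):
        self.last_applied_txn = persisted_last_txn
        self.sent = []  # stands in for pushes to the notification target

    def on_txn(self, txn_id, event):
        if txn_id <= self.last_applied_txn:
            return  # already notified before the crash; skip on replay
        self.sent.append(event)         # may re-send => at-least-once, not exactly-once
        self.last_applied_txn = txn_id  # "persist" only after a successful send
```

If the process dies between the send and the persist, the same event is sent again on the next replay, which consumers must tolerate.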
We do persist lastAppliedIndex in the ratis log, and there's a callback to the state machine implementation when a transaction is applied, so it sounds doable.
When we conceived the original design in HDDS-5984, we couldn't do it as the S3 gateway is stateless. Making the logic inside the OM follower makes a lot of sense.
@errose28 - Thanks for the feedback. This approach is interesting, as it is similar to a separate POC we had developed internally. The reason we did not go with that approach was that we didn't think we could guarantee the entry in the ratis log was actually applied successfully without attempting to update the metadata itself first. Your comment seems to imply that should be possible though? Can you explain how we could verify this?

My understanding is that the lastAppliedTxn is just the last txn the leader executed; there is no guarantee it was executed successfully. Is that assumption incorrect? E.g. if the notify process is on txn 2 and the leader is on txn 10, how do we confirm which of txns 2->10 were applied successfully?
FYI, more detail on the event notification schema has been uploaded in a separate doc.
Yes this is true. Actually the raft-log-only proposal would only work in the context of the new leader execution model in HDDS-11897, where the commit index is guaranteed to have its corresponding operations applied to the state machine. We would then track the commit index to know where in the raft logs we need to generate events. The current model will update the applied index even for invalid transactions that don't cause mutations, like a directory already existing. The challenge is that you need an up-to-date OM DB to know whether or not a transaction will succeed.

However, this may require provisioning a whole OM node with hardware just to push some JSON messages to an endpoint, and it does not have at-least-once delivery in the presence of Ratis snapshot installs. This overhead would then become tech debt once the leader execution project finishes.

I think the high level design requirements we seem to be converging on are:
While using Raft logs seems to fit most of these requirements, it is not the only option if other Ratis constraints are getting in the way. Another option is that when the apply happens to the state machine, we write a ledger to disk if that transaction indicates a successful modification. Basically the ledgers become part of the Ratis state machine, not the Ratis log. The notifier then consumes these ledgers and pushes them to the consumers without worrying about indices, because all ledgers represent successful operations. The notifier can prune these ledgers as it goes.

Since we would likely need to add some metadata to the Ratis requests about operation type in HDDS-11897 for most event notification proposals I've seen so far, this should still work for both request models.
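A minimal sketch of the ledger idea under stated assumptions (one JSON file per successful mutation, sequence-numbered filenames so lexicographic order matches apply order; none of this reflects actual Ozone code):

```python
import json
import os

def append_ledger(ledger_dir, seq, record):
    """Called from the state machine apply path, only for transactions
    that indicate a successful modification. The zero-padded sequence
    number keeps lexicographic filename order equal to apply order."""
    path = os.path.join(ledger_dir, f"{seq:020d}.ledger")
    with open(path, "w") as f:
        json.dump(record, f)

def drain_ledgers(ledger_dir, consumer):
    """The notifier consumes ledgers in order and prunes each one after
    the consumer has acked it; no index tracking is needed because every
    ledger represents a successful operation."""
    for name in sorted(os.listdir(ledger_dir)):
        path = os.path.join(ledger_dir, name)
        with open(path) as f:
            consumer(json.load(f))  # block until the consumer acks
        os.remove(path)             # safe to prune once acked
```

A crash between `consumer(...)` and `os.remove(...)` re-delivers that ledger on restart, so this sketch is at-least-once, consistent with the spec discussed above.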
errose28
left a comment
Thanks for adding information on the schema. Based on my current reading I think the S3 schema with extensions looks the most promising, but I'll let others weigh in as well.
> required fields:
> - path (volume + bucket + key)
> - isfile
Seems like this field would be redundant given the event name. Same with CreateDirectoryRequest.
> ## SetTimes
>
> event notifications should be raised to inform consumers that mtime/atime has changed, as per **SetTimesRequest**
We don't support atime (only mtime) because atime turns all read operations into write operations which kills performance.
> RenameKeyRequest would be the fromKeyName and the toKeyName of the *parent* of the directory being renamed (and not the impacted child objects).
Yes this is probably the best way to do it with rename or delete directory generating a single event. These are atomic operations on the Ozone cluster, so ideally the consumer would see them that way as well.
For all the requests we should also track the bucket layout (OBS/FSO). The consumer may get some additional info on the kind of event. The behaviour of events like rename and delete being different might be good info to track.
> - no additional processing when emitting events
>
> Cons:
> - non-standard S3 event notification semantics
I think this is acceptable for FSO buckets. FSO buckets accessed through the S3 API provide compatibility on a best-effort basis, but there are some things that just won't work. For example, writing an object called /bucket/prefix then writing another object called /bucket/prefix/key is valid in S3 and OBS but not FSO (the second object would try to create prefix as a directory). IMO it is ok for S3 event notifications to make similar tradeoffs when used with FSO buckets.
> NOTE: directory rename is just one example of a hierarchical FSO operation which impacts child objects. There may be other Ozone hierarchical FSO operations which will need to be catered for in a similar way (recursive delete?)
I think directory delete and rename are the only two that fit this category. We do have atomic recursive directory delete.
@errose28 With OM leader execution being designed for FSO, the fromPath and toPath might not be defined. I am not sure how we would handle this in the case of FSO. In leader execution, the gatekeeper would do path resolution in parallel, and the only thing valid in that case would be parentId/keyName. I don't think we can really handle this correctly.
I believe we should spend some more time on what events we want to emit in the case of FSO. From what I understand, the order of events is very important. Maybe we have to send the entire hierarchy of objectIds witnessed during path resolution, so that within a ratis transaction batch we can figure out, for all rename requests, the parents being renamed; the events added within a ratis batch have to be ordered.
For instance within a ratis batch say we have:
d1(O1)/d2(O2)/d3(O3)/d4(O4)/F1(O5)
Now within a ratis transaction batch if I have 2 parallel transactions:
mv d1/d2 -> d1/d5
mv d1/d2/d3/d4/F1 -> d1/d2/d3/F1
then the notification order should be either
mv d1/d2/d3/F1 -> d1/d2/F1 and mv d1/d2 -> d1/d5
or
mv d1/d2 -> d1/d5 and mv d1/d5/d3/F1 -> d1/d5/F1
mv d1/d2 -> d1/d5 and mv d1/d2/d3/F1 -> d1/d2/F1 would be invalid
So a ratis txn batch should send the entire objectId hierarchy in the batch request to figure out this change and identify all the path transformations. This could get a bit complex; I experienced this first hand when we were implementing snapshot diff, and we scrapped the idea of making snapshot diff order compliant. But I believe this event notification design cannot be agnostic to the order of events here.
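The ordering argument can be made concrete with a toy model that tracks each object's parent by objectId and resolves paths at emit time, so every event in a batch reflects the renames applied before it (objectIds O1..O5 and names d1..d4/F1 come from the example above; the code itself is purely illustrative):

```python
# Parent/name maps encoding d1(O1)/d2(O2)/d3(O3)/d4(O4)/F1(O5).
parents = {"O1": None, "O2": "O1", "O3": "O2", "O4": "O3", "O5": "O4"}
names = {"O1": "d1", "O2": "d2", "O3": "d3", "O4": "d4", "O5": "F1"}

def path_of(oid):
    """Resolve the current path by walking parent objectIds."""
    parts = []
    while oid is not None:
        parts.append(names[oid])
        oid = parents[oid]
    return "/".join(reversed(parts))

def rename(oid, new_parent, new_name, events):
    """Apply one rename and emit a (src, dst) event; because paths are
    resolved from the id hierarchy at emit time, the event is consistent
    with all earlier renames in the same batch."""
    src = path_of(oid)
    parents[oid], names[oid] = new_parent, new_name
    events.append((src, path_of(oid)))
```

Applying `mv d1/d2 -> d1/d5` first means the second rename's paths are expressed under `d1/d5`, matching one of the two valid orderings above; emitting both events against the pre-batch paths would produce the invalid combination.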
> ## Performance
>
> While this is a synchronous only approach, any latency between the notification target and the OM should not impact the performance of client requests, as the notification does not run on the leader.
I'm not sure "synchronous" is the best term to use here. Ceph's definitions are that synchronous blocks the writer and asynchronous does not. It's probably clearer to use similar definitions here, so the system would be asynchronous. I think the point being made here is that the notifier will block until the consumer acks the event but that this will not block the writer.
Sharing recordings of design discussions that happened in the last two weeks' community sync meetings: https://cloudera.zoom.us/rec/share/TwWtWixjJ3tF8wM5n7o7Gxx3CEeMv32uJASsvqk-UfuoBJIbaU2Z2MIqAELykPI5.5JXR4fzks37Lsi1n?startTime=1753718482000 https://cloudera.zoom.us/rec/share/Sc1klNNvDvIZN0n5yZVxOrV2jqdgT-djeRFke8y6UCPGUnNmzJFV2VMqbD4vcEEb.0myqZ0KY3JuiNXdn?startTime=1754323258000
There was a healthy discussion on this topic today. Can anyone summarize what we plan to do (or at least where we think we're converging)?
Here is the recording of today's discussion:

Here is a high level summary of what was discussed. Feel free to add other details I may have missed. The high level design discussed involves the Ozone Manager persisting a bounded number of events to RocksDB as they occur. Plugins decoupled from the OM can consume these events by iterating the RocksDB column family which stores these events. Plugins would have freedom to make many decisions as needed for their use case:

Internally, the Ozone Manager can persist events to its state machine as part of the

This provides a spec for plugins consuming events: they can be guaranteed to see at least the last

To run plugins, a few options were discussed:

For either approach, we could create a base plugin class that would handle most of the common tasks. The OM already supports dynamically loading plugins that implement a given interface like

The biggest open item at the end of this discussion was whether to run the plugins within the OM as threads or as a separate process.
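A rough sketch of how a decoupled plugin might drain such an event column family, assuming transaction-id-ordered keys and a plugin-owned checkpoint (the dict stands in for a RocksDB iterator; all names here are assumptions, not Ozone APIs):

```python
def drain(event_log, checkpoint):
    """Consume events newer than the plugin's checkpoint in key order.

    event_log: mapping of txn_id -> event, standing in for the RocksDB
    column family holding persisted events. The plugin persists its own
    checkpoint after each ack, so a restart re-reads from the checkpoint
    and delivery is at-least-once."""
    delivered = []
    for txn_id in sorted(k for k in event_log if k > checkpoint):
        delivered.append(event_log[txn_id])  # push to sink, await ack
        checkpoint = txn_id                  # persist only after ack
    return delivered, checkpoint
```

Because each plugin keeps its own checkpoint, multiple plugins can iterate the same bounded column family independently, which is the decoupling discussed above.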
We are just updating the design doc based on the latest feedback and the discussion from last week's community call. I have realized I was left with some uncertainty over the ledger schema, which I thought would be good to flesh out here before re-publishing the design doc. My original strawman schema as discussed in last week's call:

@errose28 made a good point that the use of the extraArgs field (effectively a Map<String, String>) for miscellaneous fields (such as the rename "to" path or acl fields) would lead to brittle schema management. @errose28 you referenced a pattern where such things can be provided as a bunch of "optional" fields (for example, as per OMRequest), but it wasn't clear exactly what you had in mind, so I thought it would be useful to discuss a few possible interpretations:
Initially this is what I inferred that you meant, but I may have misunderstood. At any rate, I'm not sure this is ideal, for a couple of reasons:
This is effectively the same "optional" trick, but rather than copy in the full request (from the OmRequest), we cherry-pick the fields which are exposed to the ledger via explicitly defined helper messages (RenameKeyOperationArgs, CreateFileOperationArgs etc), so consumers have a clear contract as to what they can consume. This provides stronger typing around what extra fields are provided, and it also makes schema changes simpler.

It could be argued that #1 gives more flexibility to consumers of the ledger, but I feel like a more explicit contract (as per #2) is ultimately better, and keeping it minimal to begin with makes sense. Therefore I would have a preference for #2. But I would like to verify that I correctly understood your feedback @errose28, so please let me know what you (and others) think (or if you have a different suggestion).
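For illustration, approach #2 might look roughly like the following proto2-style sketch; every message and field name here is a hypothetical placeholder (only RenameKeyOperationArgs and CreateFileOperationArgs echo names from the comment above), not the proposed schema:

```proto
// Hypothetical sketch only: explicit per-operation optional messages
// instead of a generic extraArgs map<string, string>.
message EventRecord {
  required string path = 1;       // volume + bucket + key
  required string eventName = 2;
  optional RenameKeyOperationArgs renameArgs = 3;
  optional CreateFileOperationArgs createArgs = 4;
}

message RenameKeyOperationArgs {
  required string toPath = 1;
}

message CreateFileOperationArgs {
  optional bool isFile = 1;
}
```

The point of the sketch is the contract: a consumer checks which optional args message is set rather than probing string keys in a map, and adding a new operation means adding a new typed message rather than new undocumented map keys.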
Discussed in the community sync today. We agreed that approach 2 for the proto schema looks better, since it allows us to choose which fields to expose to consumers.
This design proposal LGTM. Exact specifics on items like names of fields in the proto schema and metrics can be worked out in the implementation phase with more context. I will also take a look at the prototype in #9012 for some high level suggestions. From there we can work on splitting the prototype down into mergeable pieces which we can review in more detail.
It would be good to have others sign off on this approach as well before we put code into a branch. Tagging @jojochuang @swamirishi and @sumitagrawl since they have been present in the community meetings where this was discussed, but others please check it out as well.
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types](https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types)
>
> has become a standard for change notifications in S3 compatible storage services such as S3 itself, Ceph, MinIO etc
(Just a passing comment) In the future, we can also support CloudEvents (https://cloudevents.io/), which is also supported by major cloud providers (AWS EventBridge, GCP Eventarc, Azure Event Grid) and ASF projects like EventMesh (https://eventmesh.apache.org/docs/design-document/event-handling-and-integration/cloudevents/)
Nice, I haven't seen this before. With the plugin model we should be able to support any format as needed, and users can bring their own formats if those that end up shipping with Ozone don't cover their needs.
Can we adopt the CloudEvents format from the start?
swamirishi
left a comment
@donalmag thank you for the design doc. I have a few doubts, concerns and suggestions on the design; would love if we can iron out the last mile details here.
> 1. Add a new RocksDB column family e.g. om_event_log.
> 2. Add a hook in the OMRequest execution workflow (post successful commit) to persist required events.
> 3. Implement a plugin framework to run notification publishers.
> 4. Implement a new background service for cleaning the events table, similar to KeyDeletingService, which operates
No, please don't add another ratis request and a background service in the OM unnecessarily. This can be done within the same OM ratis transaction using the RocksDB deleteRange API, as long as the column family key is ordered in the same order as the transaction id. This would be a very cheap operation, as it would just add another tombstone to RocksDB, which should be OK.
Please look at this implementation:
https://github.com/apache/ozone/pull/8779/files#r2510853726
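The deleteRange-based bound can be illustrated with a small simulation: a plain dict stands in for the RocksDB column family, and the range delete is mimicked with plain deletes (`commit_event` and the limit value are illustrative, not Ozone or RocksDB APIs):

```python
def commit_event(event_log, txn_id, event, limit):
    """Persist the event and bound the table inside the same commit,
    instead of running a separate cleanup service.

    Because keys are ordered by transaction id, everything older than
    (txn_id - limit) can be dropped with a single range delete; in real
    RocksDB that deleteRange(start, cutoff) is one cheap tombstone."""
    event_log[txn_id] = event
    cutoff = txn_id - limit
    for k in [k for k in event_log if k < cutoff]:  # mimics deleteRange([min, cutoff))
        del event_log[k]
```

Every write leaves the table holding at most the last `limit` + a few transactions, with no divergence risk from a separate cleanup thread.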
> ## Performance
>
> Writes to the RocksDB table happen synchronously in the OM commit path but are a single put operation.
> Deletes are to be executed by the OM in a separate thread, ensuring the table is bounded to a specified limit.
We have faced issues in the DelegationToken implementation in the past because of this kind of divergence. We don't want event notification to be another cause of this. Look at the comment above, which has the way we can handle this.
@donalmag You can use this feature branch: https://github.com/apache/ozone/tree/HDDS-13513_Event_Notification_FeatureBranch

@swamirishi if you push and then delete a feature branch, please make sure to cancel the corresponding workflow run.

But not the run for the feature branch you intend to keep.
> nice to have fields:
> - recursive (if known)
It would be nice to have 'size to be reclaimed'.
> {"Records":[
> {
> "eventVersion":"2.1",
> "eventSource":"ceph:s3",
Should it be "ozone:s3"?
> {
> "eventVersion":"2.1",
> "eventSource":"ceph:s3",
> "awsRegion":"us-east-1",
How do we set regions for private deployments?
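To make the two questions above concrete, here is a hypothetical helper that builds an S3-style record with an Ozone event source and a configurable region. The field layout follows the AWS S3 event structure quoted above, but the "ozone:s3" value and the region parameter are just the suggestions raised in these comments, not settled design:

```python
from datetime import datetime, timezone

def make_record(bucket, key, event_name, region="ozone-default"):
    """Build a minimal S3-compatible event record. The eventSource and
    region values are assumptions for illustration; a private deployment
    could expose the region as a configuration knob."""
    return {"Records": [{
        "eventVersion": "2.1",
        "eventSource": "ozone:s3",   # instead of "ceph:s3" / "aws:s3"
        "awsRegion": region,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "eventName": event_name,
        "s3": {"bucket": {"name": bucket}, "object": {"key": key}},
    }]}
```

Consumers written against the AWS record shape would then only need to tolerate the non-AWS eventSource and region strings.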
Includes event notification schema design
What changes were proposed in this pull request?
Design doc for an Ozone Manager Notify node to generate Ozone Filesystem event notifications
What is the link to the Apache JIRA?
HDDS-13513
How was this patch tested?
N/A