Skip to content

Conversation

@chungen0126
Copy link
Contributor

What changes were proposed in this pull request?

This is the design proposal for the S3 Event Notification feature. The design document is shared in the Markdown file. Please comment in the document if you have any feedback.

What is the link to the Apache JIRA

HDDS-5984

How was this patch tested?

NA

@adoroszlai adoroszlai added documentation Improvements or additions to documentation design labels May 14, 2025
@jojochuang jojochuang requested review from ChenSammi and ivandika3 May 14, 2025 21:15
@jojochuang
Copy link
Contributor

@ivandika3 please review as it may be related to your S3 real time replication project.

@jojochuang jojochuang added the s3 S3 Gateway label May 16, 2025
peterxcli

This comment was marked as duplicate.

Copy link
Member

@peterxcli peterxcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this design!
I think we can add a NotificationManager, which would receive event message from OMExecutionFlow and store in queue. So if those sender client support batch operation, we would have better throughput. Additionally, the logic between OMExecutionFlow, OMMetadataManager and NotificationSenderManager can move into itself, make OMExecutionFlow cleaner.
(can have a better name, NotificationManager is easy to collide in name)


### Overview

A callback is introduced post-Ratis commit and pre-client response to handle event notification logic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the notification logic would happend after ratis commit and before OMKeyRequest#validateAndUpdateCache here? If there are some exception happened, then the action on that key wouldn't be done, but the notification of it would be sent out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually the callback will run after the request has been completed. Maybe I wasn't clear enough.


Furthermore, the creation and reuse of notification client instances (such as Kafka or RabbitMQ producers) within the Ozone Manager must be carefully controlled. Poor resource management in this area could increase memory or thread usage and degrade OM performance over time.

### Metrics
Copy link
Member

@peterxcli peterxcli May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add some metrics about: event_triggered, event_lost, push_ok, push_fail and push_pending.
ref: https://docs.ceph.com/en/quincy/radosgw/notifications/#notification-performance-stats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between event_lost and push_fail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't notice that I pasted the wrong link. Updated, please take another look~

https://docs.ceph.com/en/quincy/radosgw/notifications/#notification-performance-stats

What's the difference between event_lost and push_fail?

In https://docs.ceph.com/en/quincy/radosgw/notifications/#notification-performance-stats:

  • pubsub_event_lost: running counter of events that had topics associated with them but that were not pushed to any of the endpoints
  • pubsub_push_fail: running counter, for all notifications, of events failed to be pushed to their endpoint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I not sure if the notification here that can be create through java api is similar to topic in ceph pubsub.

And if yes, then:
In ozone(or this design), it would become:

  • event_lost: running counter of events that had notification config associated with them but that were not pushed to any of the target
  • push_fail: running counter, for all notifications, of targets failed to be pushed to their endpoint

Copy link
Member

@peterxcli peterxcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this document!


Furthermore, the creation and reuse of notification client instances (such as Kafka or RabbitMQ producers) within the Ozone Manager must be carefully controlled. Poor resource management in this area could increase memory or thread usage and degrade OM performance over time.

### Metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't notice that I pasted the wrong link. Updated, please take another look~

https://docs.ceph.com/en/quincy/radosgw/notifications/#notification-performance-stats

What's the difference between event_lost and push_fail?

In https://docs.ceph.com/en/quincy/radosgw/notifications/#notification-performance-stats:

  • pubsub_event_lost: running counter of events that had topics associated with them but that were not pushed to any of the endpoints
  • pubsub_push_fail: running counter, for all notifications, of events failed to be pushed to their endpoint


### Overview

A callback is introduced after the Ratis request is completed and before the response is returned to the client, to handle event notification logic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the event notification logic be handled asynchronously? If not, I’m concerned that the sender could slow down the end-to-end response time. This also raises another question: if the event notification fails, should the client still be informed that the original request succeeded, or should the failure be communicated to the client as well?


Furthermore, the creation and reuse of notification client instances (such as Kafka or RabbitMQ producers) within the Ozone Manager must be carefully controlled. Poor resource management in this area could increase memory or thread usage and degrade OM performance over time.

### Metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I not sure if the notification here that can be create through java api is similar to topic in ceph pubsub.

And if yes, then:
In ozone(or this design), it would become:

  • event_lost: running counter of events that had notification config associated with them but that were not pushed to any of the target
  • push_fail: running counter, for all notifications, of targets failed to be pushed to their endpoint

@github-actions
Copy link

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Nov 12, 2025
@github-actions
Copy link

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions github-actions bot closed this Nov 19, 2025
@chungen0126
Copy link
Contributor Author

This PR is now covered by #8871 . Please see/refer to that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

design documentation Improvements or additions to documentation s3 S3 Gateway stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants