
HDDS-13618. Avoid frequent pipeline close action from DN #9024

Merged
sodonnel merged 6 commits into apache:master from sarvekshayr:HDDS-13618
Sep 18, 2025


@sarvekshayr
Contributor

What changes were proposed in this pull request?

When Ratis on a DN detects a problem with a pipeline, it triggers a close pipeline action in the following cases:

  • When a follower is slow
  • When no leader is elected and a node remains in the candidate state for a long duration
  • When any other pipeline issue occurs, such as a failure caused by a disk running out of space

The close pipeline action keeps being triggered even when close commands are already queued in the DN command queue. If pipeline closure on the DN is stuck or delayed, this can result in a large number of such actions being sent on every heartbeat (HB).
An existing optimisation already avoids duplicate pipeline close actions within a single HB.

With this change, repeated triggers across HBs are also prevented by checking the command queue before adding new close actions.
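
For readers unfamiliar with the flow, the two levels of de-duplication described above can be sketched roughly as follows. This is an illustrative model only, not the code from this PR: the class `PipelineCloseActionSketch` and all of its method names are hypothetical, and pipeline IDs are plain strings here. It assumes the DN tracks which pipelines already have a close command waiting in its command queue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Ozone code base.
final class PipelineCloseActionSketch {

  // Pipelines whose close command was received from SCM but not yet handled.
  private final Set<String> queuedCloseCommands = new HashSet<>();

  // Close actions waiting to be sent with the next heartbeat.
  private final Set<String> pendingCloseActions = new LinkedHashSet<>();

  // SCM sent a close pipeline command; it now sits in the command queue.
  synchronized void onCloseCommandQueued(String pipelineId) {
    queuedCloseCommands.add(pipelineId);
  }

  // The command handler finished; the command leaves the queue.
  synchronized void onCloseCommandHandled(String pipelineId) {
    queuedCloseCommands.remove(pipelineId);
  }

  // Called when Ratis reports a pipeline problem.
  // Returns true only if a new close action was recorded.
  synchronized boolean addPipelineCloseAction(String pipelineId) {
    if (queuedCloseCommands.contains(pipelineId)) {
      // Cross-heartbeat check: a close command is already queued, so skip.
      return false;
    }
    // Within one heartbeat the set rejects duplicates.
    return pendingCloseActions.add(pipelineId);
  }

  // Collects and clears the actions to send with the next heartbeat.
  synchronized List<String> drainActionsForHeartbeat() {
    List<String> actions = new ArrayList<>(pendingCloseActions);
    pendingCloseActions.clear();
    return actions;
  }

  public static void main(String[] args) {
    PipelineCloseActionSketch dn = new PipelineCloseActionSketch();
    dn.addPipelineCloseAction("pipeline-1");
    dn.addPipelineCloseAction("pipeline-1"); // duplicate within the same HB
    System.out.println("HB 1 actions: " + dn.drainActionsForHeartbeat());

    dn.onCloseCommandQueued("pipeline-1");   // SCM responded with a close command
    dn.addPipelineCloseAction("pipeline-1"); // suppressed while the command is queued
    System.out.println("HB 2 actions: " + dn.drainActionsForHeartbeat());
  }
}
```

In this model a pipeline can raise a close action again once its queued command has been handled, so a closure that genuinely failed would still be reported to SCM.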

What is the link to the Apache JIRA

HDDS-13618

How was this patch tested?

Added a unit test in TestClosePipelineCommandHandler.

@sarvekshayr
Contributor Author

@sumitagrawl please review this PR.

@sodonnel
Contributor

There is a failing test that might be related - could you check it please?

Also:

With this change, repeated triggers across HBs are also prevented by checking the command queue before adding new close actions.

What if there has been an SCM failover? Do the follower SCMs get these commands and drop them, or store them? If there is a failover, will this change prevent the command from reaching the new leader SCM?

@sarvekshayr
Contributor Author

What if there has been an SCM failover? Do the follower SCMs get these commands and drop them, or store them? If there is a failover, will this change prevent the command from reaching the new leader SCM?

Follower SCMs also receive the command, but they drop it and do not store it.
In this case the DN has already received the close pipeline command in its queue, so there is no need to send it to SCM again as a retry. Failover is therefore not a problem here, and the change only avoids unnecessary resends to SCM.

Contributor

@sumitagrawl left a comment

LGTM

@sarvekshayr
Contributor Author

Addressed the flakiness in the TestPipelineClose test.
CI: https://github.com/sarvekshayr/ozone/actions/runs/17802540731

@sodonnel sodonnel merged commit 2a6e31d into apache:master Sep 18, 2025
42 checks passed
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025