-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates #3453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should still have a check here, and at least log if the pipeline does not have enough nodes and skip it, rather than loading it? I guess we should not get a "bad pipeline" here, as it never should be allowed to be added anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it make sense to leave this preconditions as is. Anyway we are making sure to pass the correct pipeline with other checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this change is required due what we suspect is a bug in the RackScatterPlacementPolicy, where it returns less nodes than expected without giving an exception, I think we should make a change to the RackScatter policy too, so that it throws an exception indicating it is about to return successfully with less than expected. You could add a reference to this Jira in a comment so we can remove it later if we get to the bottom of the suspected bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, yeah I remember we talked about this. Let me add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with stephen. maybe we would better to check and handle this in that certain placement policy. if the placement policy can not choose enough datanodes for the new pipeline , an exception should be thrown from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the check into pipelineFactory#create
c7f1238 to
ca6cb66
Compare
| if (nodesRequiredToChoose != chosenNodes.size()) { | ||
| String reason = "Chosen nodes size: " + chosenNodes | ||
| .size() + ", but required nodes to choose: " + nodesRequiredToChoose | ||
| + " do not match."; | ||
| LOG.warn("Placement policy could not choose the enough nodes." | ||
| + " {} Available nodes count: {}, Excluded nodes count: {}", | ||
| reason, totalNodesCount, excludedNodesCount); | ||
| throw new SCMException(reason, | ||
| SCMException.ResultCodes.FAILED_TO_FIND_HEALTHY_NODES); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if validateContainerPlacement should verify the number of chosen nodes vs. required nodes (either in base class or this specific subclass).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that is more validating the placement with respective to racks. Some how we are still seeing the issue of getting less nodes than requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I see the message is referencing racks.
|
Thanks @adoroszlai for the review |
* master: (87 commits) HDDS-6686. Do Leadship check before SASL token verification. (apache#3382) HDDS-4364: [FSO]List FileStatus : startKey can be a non-existed path (apache#3481) HDDS-6091. Add file checksum to OmKeyInfo (apache#3201) HDDS-6706. Exposing Volume Information Metrics to the DataNode UI (apache#3478) HDDS-6759: Add listblock API in MockDatanodeStorage (apache#3452) HDDS-5821 Container cache management for closing RockDB (apache#3426) HDDS-6683. Refactor OM server bucket layout configuration usage (apache#3477) HDDS-6824. Revert changes made in proto.lock by HDDS-6768. (apache#3480) HDDS-6811. Bucket create message with layout type (apache#3479) HDDS-6810. Add a optional flag to trigger listStatus as part of listKeys for FSO buckets. (apache#3461) HDDS-6828. Revert RockDB version pending leak fixes (apache#3475) HDDS-6764: EC: DN ability to create RECOVERING containers for EC reconstruction. (apache#3458) HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (apache#3453) HDDS-6823. Intermittent failure in TestOzoneECClient#testExcludeOnDNMixed (apache#3476) HDDS-6820. Bucket Layout Post-Finalization Validators for ACL Requests. (apache#3472) HDDS-6819. Add LEGACY to AllowedBucketLayouts in CreateBucketHandler (apache#3473) HDDS-4859. [FSO]ListKeys: seek all the files/dirs from startKey to keyPrefix (apache#3466) HDDS-6705 Add metrics for volume statistics including disk capacity, usage, Reserved (apache#3430) HDDS-6474. Add test to cover the FSO bucket list status with beyond batch boundary and cache. (apache#3379). Contributed by aswinshakil HDDS-6280. Support Container Balancer HA (apache#3423) ...
What changes were proposed in this pull request?
Moved the condition before getting pipeline. We should not have sanity checks after adding pipelines into DB. If there is an issue and failed with this sanity checks, SCM can crash and it will never come back as the same issue can trigger again when loading them from db.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6795
How was this patch tested?
Checking on existing tests.