Conversation

@umamaheswararao
Contributor

What changes were proposed in this pull request?

Moved the condition before getting the pipeline. We should not have sanity checks after adding pipelines to the DB. If such a check fails, SCM can crash and will never come back up, because the same issue triggers again when the pipelines are loaded from the DB.
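
For reference, a minimal sketch of the reordering (not the exact Ozone code; the pipelineStore/pipelineStateMap names and the node-count precondition are illustrative):

// Sketch only: run the sanity check before anything is persisted.
// If the check ran after the DB write, a bad pipeline would be
// reloaded, and would crash SCM again, on every restart.
void addPipeline(Pipeline pipeline) throws IOException {
  int required = pipeline.getReplicationConfig().getRequiredNodes();
  int actual = pipeline.getNodes().size();
  // Validate first, while nothing has been written yet.
  Preconditions.checkArgument(actual == required,
      "Nodes size=%s, replication factor=%s do not match",
      actual, required);
  // Persist and register only after validation succeeds.
  pipelineStore.put(pipeline.getId(), pipeline);  // DB write
  pipelineStateMap.addPipeline(pipeline);         // in-memory state
}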

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6795

How was this patch tested?

Verified with existing tests.

Contributor

Do you think we should still have a check here, and at least log if the pipeline does not have enough nodes and skip it, rather than loading it? I guess we should not get a "bad pipeline" here, as it should never be allowed to be added anyway.
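
A rough sketch of that log-and-skip alternative (the reload loop and the pipelineStore/pipelineStateMap names are assumptions about the loading code, not the actual implementation):

// Hypothetical reload path: log and skip an under-sized pipeline read
// from the DB instead of crashing SCM on a failed precondition.
for (Pipeline pipeline : pipelineStore.getValues()) {
  int required = pipeline.getReplicationConfig().getRequiredNodes();
  if (pipeline.getNodes().size() != required) {
    LOG.warn("Skipping pipeline {}: has {} nodes, expected {}",
        pipeline.getId(), pipeline.getNodes().size(), required);
    continue;
  }
  pipelineStateMap.addPipeline(pipeline);
}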

Contributor Author

I think it makes sense to leave these preconditions as is. In any case, we are making sure the correct pipeline is passed in via the other checks.

Contributor

As this change is required due to what we suspect is a bug in the RackScatterPlacementPolicy, where it returns fewer nodes than expected without throwing an exception, I think we should make a change to the RackScatter policy too, so that it throws an exception instead of returning successfully with fewer nodes than expected. You could add a reference to this Jira in a comment so we can remove it later if we get to the bottom of the suspected bug.

Contributor Author

Oh yeah, I remember we talked about this. Let me add that.

Contributor

Agree with Stephen. Maybe we had better check and handle this in that particular placement policy: if the placement policy cannot choose enough datanodes for the new pipeline, it should throw an exception.

Contributor Author

Moved the check into PipelineFactory#create, as sketched below.
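
Roughly along these lines (the provider lookup and the checkPipeline helper are simplified assumptions, not the exact code):

// Sketch: validate in PipelineFactory#create, before the new pipeline
// ever reaches the state manager or the DB.
public Pipeline create(ReplicationConfig replicationConfig,
    List<DatanodeDetails> excludedNodes,
    List<DatanodeDetails> favoredNodes) throws IOException {
  Pipeline pipeline = providers.get(replicationConfig.getReplicationType())
      .create(replicationConfig, excludedNodes, favoredNodes);
  // Hypothetical helper: throws SCMException if the placement policy
  // returned fewer nodes than the replication config requires.
  checkPipeline(pipeline);
  return pipeline;
}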

@adoroszlai self-requested a review June 2, 2022 16:17
Comment on lines +230 to +239
if (nodesRequiredToChoose != chosenNodes.size()) {
  String reason = "Chosen nodes size: " + chosenNodes.size()
      + " does not match the required nodes count: "
      + nodesRequiredToChoose + ".";
  LOG.warn("Placement policy could not choose enough nodes."
      + " {} Available nodes count: {}, Excluded nodes count: {}",
      reason, totalNodesCount, excludedNodesCount);
  throw new SCMException(reason,
      SCMException.ResultCodes.FAILED_TO_FIND_HEALTHY_NODES);
}
Contributor

I wonder if validateContainerPlacement should verify the number of chosen nodes vs. required nodes (either in the base class or this specific subclass).

Contributor Author

I assumed that is more about validating the placement with respect to racks. Somehow we are still seeing the issue of getting fewer nodes than requested.

Contributor

Makes sense, I see the message is referencing racks.

@umamaheswararao merged commit 15257c7 into apache:master Jun 2, 2022
@umamaheswararao
Contributor Author

Thanks @adoroszlai for the review

errose28 added a commit to errose28/ozone that referenced this pull request Jun 7, 2022
* master: (87 commits)
  HDDS-6686. Do Leadship check before SASL token verification. (apache#3382)
  HDDS-4364: [FSO]List FileStatus : startKey can be a non-existed path (apache#3481)
  HDDS-6091. Add file checksum to OmKeyInfo (apache#3201)
  HDDS-6706. Exposing Volume Information Metrics to the DataNode UI (apache#3478)
  HDDS-6759: Add listblock API in MockDatanodeStorage (apache#3452)
  HDDS-5821 Container cache management for closing RockDB  (apache#3426)
  HDDS-6683. Refactor OM server bucket layout configuration usage (apache#3477)
  HDDS-6824. Revert changes made in proto.lock by HDDS-6768. (apache#3480)
  HDDS-6811. Bucket create message with layout type (apache#3479)
  HDDS-6810. Add a optional flag to trigger listStatus as part of listKeys for FSO buckets. (apache#3461)
  HDDS-6828. Revert RockDB version pending leak fixes (apache#3475)
  HDDS-6764: EC: DN ability to create RECOVERING containers for EC reconstruction. (apache#3458)
  HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (apache#3453)
  HDDS-6823. Intermittent failure in TestOzoneECClient#testExcludeOnDNMixed (apache#3476)
  HDDS-6820. Bucket Layout Post-Finalization Validators for ACL Requests. (apache#3472)
  HDDS-6819. Add LEGACY to AllowedBucketLayouts in CreateBucketHandler (apache#3473)
  HDDS-4859. [FSO]ListKeys: seek all the files/dirs from startKey to keyPrefix (apache#3466)
  HDDS-6705 Add metrics for volume statistics including disk capacity, usage, Reserved (apache#3430)
  HDDS-6474. Add test to cover the FSO bucket list status with beyond batch boundary and cache. (apache#3379). Contributed by aswinshakil
  HDDS-6280. Support Container Balancer HA (apache#3423)
  ...