HDDS-7058. EC: ReplicationManager - Implement ratis container replication check handler #3802

Merged
sodonnel merged 6 commits into apache:master from sodonnel:ec-HDDS-7058-ratis-handler on Oct 19, 2022

@sodonnel sodonnel commented Oct 6, 2022

What changes were proposed in this pull request?

Create a handler for the new replication manager to process Ratis containers and detect under-, over-, and mis-replication issues.

The logic is largely unchanged from the Legacy Replication Manager; it is simply packaged into the new "handler" structure.

At the moment, this code will not be executed by the new replication manager, as all non-EC containers will be directed to the Legacy Replication Manager for processing.

This Jira is part of the work to remove the Legacy Replication Manager.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7058

How was this patch tested?

New unit tests

@kerneltime
Contributor

@aswinshakil

@siddhantsangwan siddhantsangwan self-requested a review October 10, 2022 17:16
Contributor

@swamirishi swamirishi left a comment

I have left some comments inline. I am confused about certain flows and a possible race condition, so it would be great if you could clear up my doubts there. I haven't checked the unit tests yet.

Contributor

Could be changed to getExcessRedundancyCanBeCalled(includePending)>0 to avoid redundancy of logic.

Contributor

Could be changed to getExcessRedundancyCanBeCalled(includePending)>0 to avoid redundancy of logic.

Contributor Author

Good point. I have changed this.

Contributor

Should we have another constructor which initializes the following arguments?
public UnderReplicatedHealthResult(ContainerInfo containerInfo,
    int remainingRedundancy, boolean dueToDecommission,
    boolean replicatedOkWithPending, boolean unrecoverable,
    boolean dueToMisReplication, boolean isMisReplicated,
    boolean isMisReplicatedAfterPending)

Contributor Author

I don't want to change the existing constructor, as then we need to change it everywhere it is used. Adding a new constructor starts a bad pattern where each new parameter needs a new constructor, and what we really need is a builder.

At the moment I think these 3 parameters have good defaults for the common case, and using the setters when needed is a good compromise.
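
The compromise described in this reply can be sketched roughly as follows. This is a minimal illustration using a hypothetical, simplified `HealthResultSketch` class, not the actual `UnderReplicatedHealthResult`; the `false` defaults and the chaining setters are assumptions made for the example.

```java
// Hypothetical, simplified stand-in; not the actual Ozone class.
final class HealthResultSketch {
  private final int remainingRedundancy;

  // Rarely needed flags get defaults (assumed to be false here)
  // instead of being added to the constructor signature.
  private boolean dueToMisReplication = false;
  private boolean misReplicated = false;
  private boolean misReplicatedAfterPending = false;

  // The existing constructor stays unchanged, so current callers are unaffected.
  HealthResultSketch(int remainingRedundancy) {
    this.remainingRedundancy = remainingRedundancy;
  }

  // Setters return this only to allow chaining in the example.
  HealthResultSketch setDueToMisReplication(boolean value) {
    this.dueToMisReplication = value;
    return this;
  }

  HealthResultSketch setMisReplicated(boolean value) {
    this.misReplicated = value;
    return this;
  }

  HealthResultSketch setMisReplicatedAfterPending(boolean value) {
    this.misReplicatedAfterPending = value;
    return this;
  }

  int getRemainingRedundancy() {
    return remainingRedundancy;
  }

  boolean isDueToMisReplication() {
    return dueToMisReplication;
  }

  boolean isMisReplicated() {
    return misReplicated;
  }

  boolean isMisReplicatedAfterPending() {
    return misReplicatedAfterPending;
  }
}
```

In the common case a caller would write only `new HealthResultSketch(1)`; a mis-replication check would add `.setMisReplicated(true)` and leave the other flags at their defaults.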

Contributor

We could add a boolean includePendingAdd parameter as well. I see duplicated logic in sufficientlyReplicated and isOverReplicated.

Contributor Author

Yes, the logic is very similar. I have added a new private method that both can call.
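
As a rough illustration of this kind of refactoring, the sketch below routes both checks through one private helper. The class `ReplicaCountSketch`, its fields, and the exact counting rules are assumptions for the example and may differ from the code in this PR.

```java
// Hypothetical replica counter; the names and rules are illustrative only.
final class ReplicaCountSketch {
  private final int healthy;
  private final int pendingAdd;
  private final int pendingDelete;
  private final int replicationFactor;

  ReplicaCountSketch(int healthy, int pendingAdd, int pendingDelete,
      int replicationFactor) {
    this.healthy = healthy;
    this.pendingAdd = pendingAdd;
    this.pendingDelete = pendingDelete;
    this.replicationFactor = replicationFactor;
  }

  // Shared helper: replicas above the replication factor
  // (negative when the container is short of replicas).
  private int redundancyDelta(boolean includePendingAdd,
      boolean includePendingDelete) {
    int count = healthy;
    if (includePendingAdd) {
      count += pendingAdd;
    }
    if (includePendingDelete) {
      count -= pendingDelete;
    }
    return count - replicationFactor;
  }

  // Assumption: pending deletes always count against sufficiency.
  boolean isSufficientlyReplicated(boolean includePendingAdd) {
    return redundancyDelta(includePendingAdd, true) >= 0;
  }

  // Assumption: pending adds are ignored when checking for excess replicas.
  boolean isOverReplicated(boolean includePendingDelete) {
    return redundancyDelta(false, includePendingDelete) > 0;
  }
}
```

With this shape, a change to how pending operations are counted only has to be made in `redundancyDelta`.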

Contributor

@swamirishi swamirishi Oct 11, 2022

Do we need to make this method public?

Contributor

You could probably add @VisibleForTesting if you want to add unit tests, or even change the access specifier using reflection.

Contributor Author

It will need to be public when we implement the under / over replication handler to process the under / over replicated queue. This is the same as in the ECHandler, where it has this same method public for that reason.

Contributor

Suggested change
- getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);
+ getPlacementStatus(replicas, requiredNodes, Collections.emptyList());

Contributor Author

I fixed this.
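
For context on the suggestion: `Collections.EMPTY_LIST` is a raw `List`, so passing it where a typed list is expected triggers an unchecked-conversion warning, while `Collections.emptyList()` is generic and has its element type inferred. A minimal sketch, assuming Java 8 or later and a hypothetical `countExcluded` method standing in for any method that takes a typed list:

```java
import java.util.Collections;
import java.util.List;

final class EmptyListSketch {
  // Hypothetical stand-in for a method with a typed list parameter.
  static int countExcluded(List<String> excluded) {
    return excluded.size();
  }

  static int withRawConstant() {
    // Raw List assigned to List<String>: compiles, but only with an
    // unchecked warning, which is suppressed here for the demonstration.
    @SuppressWarnings("unchecked")
    List<String> raw = Collections.EMPTY_LIST;
    return countExcluded(raw);
  }

  static int withTypedFactory() {
    // The element type is inferred as String from the parameter type.
    return countExcluded(Collections.emptyList());
  }
}
```

Both calls behave the same at runtime; the difference is only in compile-time type checking.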

Contributor

Can there be a case where there is pending ADD & pending DELETE to the same node? Some kind of race condition.

Contributor

Better to use a Map<DataNodeDetails,Integer> in that case.

Contributor Author

This should not be able to happen: if a node already has a replica, it cannot receive another copy of it, and a delete can only be scheduled on a node that holds a copy, which in turn prevents an add.

However, I will change this to two if statements rather than if ... else if.
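
One way to read the per-node map suggested in this thread is sketched below. It tallies a net pending change for each node using two independent `if` checks, so an ADD and a DELETE for the same node would both be counted. `PendingOpsSketch`, `PendingOp`, and the use of a plain `String` as the node key are hypothetical simplifications, not the PR's actual types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PendingOpsSketch {
  enum OpType { ADD, DELETE }

  // Hypothetical pending operation: a target node name and an operation type.
  static final class PendingOp {
    final String node;
    final OpType type;

    PendingOp(String node, OpType type) {
      this.node = node;
      this.type = type;
    }
  }

  // Net pending change per node: +1 for each ADD, -1 for each DELETE.
  static Map<String, Integer> netChangePerNode(List<PendingOp> ops) {
    Map<String, Integer> net = new HashMap<>();
    for (PendingOp op : ops) {
      int delta = 0;
      // Two independent checks rather than if / else if.
      if (op.type == OpType.ADD) {
        delta += 1;
      }
      if (op.type == OpType.DELETE) {
        delta -= 1;
      }
      net.merge(op.node, delta, Integer::sum);
    }
    return net;
  }
}
```

If the situation the author describes as impossible did occur, the node would simply end up with a net change of zero instead of one operation silently masking the other.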

Contributor

@siddhantsangwan siddhantsangwan left a comment

@sodonnel The handling logic looks good. I haven't checked the tests yet.

Comment on lines 260 to 264
Contributor

Are we deliberately not considering pending deletes here?

Contributor Author

I think you are correct; I missed this. We should be removing the inflight deletes, as in the original method defined just above this one. I will fix this and modify a test to validate it.

Contributor

NIT: Let's add a new line after this brace for better readability.

Contributor Author

ok - I added that in.

@sodonnel sodonnel force-pushed the ec-HDDS-7058-ratis-handler branch from 19657b8 to 520c6b9 Compare October 12, 2022 10:19
Contributor

@siddhantsangwan siddhantsangwan left a comment

Changes look good! I just have 2 minor comments for the tests.

Contributor

Since we're testing the HEALTHY case, let's change the test's name?

Contributor Author

OK - I added IsHealthy to the end of it.

Contributor

The replica index should be 0 instead of 2 I think. (IN_MAINTENANCE, 0)

Contributor Author

yes, well spotted.

Contributor

@siddhantsangwan siddhantsangwan left a comment

LGTM, pending green CI.

Contributor

@swamirishi swamirishi left a comment

LGTM

@sodonnel sodonnel merged commit 237a9a1 into apache:master Oct 19, 2022