-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. #3514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
42fc58c to
20f6b70
Compare
|
Hi @umamaheswararao any updates on this one? |
|
@guihecheng, I have discovered several issues with this. So, I have just kept this issue a side currently as I am focussing on other prioritized tasks. Here just empty put block will not work:
So, the best way could be to create the container's from client on initializing the streams first. With offline recovery, we are creating the container all the time to target nodes first. So, even though no blocks needs to be recovered, we may create the container. So, first recovery task will solve the problem and from then onward we will have empty containers I guess. Only thing we need to make sure is, empty containers should not get delete in EC case from RM. With this, I felt this can be low prioritized, until RM work is done. Doing it as part of close or start of input stream should not have any difference. In both cases, createContainer will fail if already exist. Please note current code is just experimental changes. Not for commit or review. Feel free to suggest or take up if you have any other thoughts which may be simpler and low risk. |
Thanks for the info, it is really a problem that needs to be discuss more about, since the EC on-disk container replica is different from the Ratis-Replicated container replica, then some old assumptions(e.g. create container replica on the first WriteChunk) may not apply. I should say that RM won't be able to trigger Offline Recovery facing a CLOSED tiny container with possibly only one partial stripe since it doesn't have enough container replica reports gathered. So we need to handle this problem specially on the write path or somewhere else. The problem is likely to hit when doing Offline Recovery test with small files(e.g. 1MB) only, it may not be very easy to reveal on real mixed workloads, but still possible. Create container replicas on init of BlockStreams is a workable idea as I think, but it may possibly bring a performance impact. I'm thinking that maybe we could handle this during the OPEN/CLOSE of the container/pipeline on SCM side, but it also has problems that we don't have synchronous view of container replica creation, and we can't wait for the creations. Let's dig more about this later. |
b7c3c32 to
5e76601
Compare
|
Thanks for working on this @umamaheswararao. Actually the normal EC write path doesn't create blocks on non-writing node, that's why |
This is indeed a problem. I would suggest to allow empty EC containers, and generalize the problem. |
…he case of partial single stripe.
…t does not exist.
4e74324 to
e26dd7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @umamaheswararao for the patch, very nice work, LGTM!
i believe this patch can solve the problem caused by the first partial strip of EC container!
| DatanodeBlockID.Builder blkIDBuilder = | ||
| DatanodeBlockID.newBuilder().setContainerID(blockID.getContainerID()) | ||
| .setLocalID(blockID.getLocalID()) | ||
| .setBlockCommitSequenceId(blockID.getBlockCommitSequenceId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blockCommitSequenceID is used for handling quosi_closed ratis container which has a non-determined state. so i don`t think it has any use for EC container. maybe we can remove it for EC case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have not really thought much about BCSID. Currently we are not using it, but we need to think whether we have benefit or take advantage of any other scenarios which we have not covered yet. So, I think it may be too early to discard. Eventually we can ignore that later times if we never find any use of it. I suggest to get set as it is available now. Also we surely don't want to ignore in BlockInputStream as that is non EC code flow too.
|
Thanks a lot @JacksonYao287 for the reviews! |
|
Since I got a +1, I am going ahead to commit this. @nandakumar131 @sodonnel @guihecheng @adoroszlai Let me know if you have concerns on this approach or anything we missed to take care with respective to SCM HA etc. Thanks |
* master: (46 commits) HDDS-6901. Configure HDDS volume reserved as percentage of the volume space. (apache#3532) HDDS-6978. EC: Cleanup RECOVERING container on DN restarts (apache#3585) HDDS-6982. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (apache#3583) HDDS-6968. Addendum: [Multi-Tenant] Fix USER_MISMATCH error even on correct user. (apache#3578) HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514) HDDS-6900. Propagate TimeoutException for all SCM HA Ratis calls. (apache#3564) HDDS-6938. handle NPE when removing prefixAcl (apache#3568) HDDS-6960. EC: Implement the Over-replication Handler (apache#3572) HDDS-6979. Remove unused plexus dependency declaration (apache#3579) HDDS-6957. EC: ReplicationManager - priortise under replicated containers (apache#3574) HDDS-6723. Close Rocks objects properly in OzoneManager (apache#3400) HDDS-6942. Ozone Buckets/Objects created via S3 should not allow group access (apache#3553) HDDS-6965. Increase timeout for basic check (apache#3563) HDDS-6969. Add link to compose directory in smoketest README (apache#3567) HDDS-6970. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (apache#3573) HDDS-6977. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (apache#3575) HDDS-6217. Cleanup XceiverClientGrpc TODOs, and document how the client works and should be used. (apache#3012) HDDS-6773. Cleanup TestRDBTableStore (apache#3434) - fix checkstyle HDDS-6773. Cleanup TestRDBTableStore (apache#3434) HDDS-6676. KeyValueContainerData#getProtoBufMessage() should set block count (apache#3371) ... Conflicts: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
HDDS-6794. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (apache#3514) (cherry picked from commit ab923a3) Change-Id: I7cd3dfb4b91ee9c7e86f34eb18273d794d4731d2
What changes were proposed in this pull request?
This patch proposes to write the empty putBlocks on non writing DNs in the case of padding.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6794
How was this patch tested?
Adding tests.