-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-19180. EC: Fix calculation errors caused by special index order #6813
Conversation
@zhangshuyan0 @haiyang1987 Could you help review this PR? I'm not very familiar with EC, but I've noticed that you have submitted quite a few improvements related to EC. Thank you very much! @zhengchenyu Thank for the contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhengchenyu Thanks for your report! If there is a parity index
smaller than data index
, there will be a bug in decoding calculation. Is this what you mean?
Firstly, I am curious when parity index
will be smaller?
Secondly, if this situation occurs, the reading of EC files may also be affected. So, I think we won't actively modify the code to make this happen.
@zhangshuyan0 |
@zhengchenyu Thanks for your explanation! I got it. This PR LGTM. |
HDFS-15186 has described this problem, but does not fundamentally solve it. |
@zhengchenyu Thanks for your reply. I agree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
Maybe you should move this ticket to HADOOP and move the UT to common.
@ZanderXu |
you can move hdfs to common |
@haiyang1987 thank you for your reminder! |
💔 -1 overall
This message was automatically generated. |
The reported error "Doesn't support SM4 CTR." seems to be not related to this PR. Perhaps the compiler environment has changed? |
💔 -1 overall
This message was automatically generated. |
This ci is blocked by #6973, wait for that pr. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhengchenyu Thanks for your report and PR. LGTM. One nit thing, please check the report by Yetus if failed test is relate to this changes. Thanks again.
@Hexiaoqiao the failed unit test is not related to this issue. I have submit another pr-6987 to fix the unit test. I will rerun this ci until pr-6987 is merged to master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. cc @zhangshuyan0 Please check again, and checkin if no more other concerns. Thanks.
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@zhengchenyu Please check the Yetus report, We need to get one green result before check in. Thanks. |
💔 -1 overall
This message was automatically generated. |
The failed tests are not related to this PR. Committed to trunk. Thanks for your contribution @zhengchenyu @Hexiaoqiao @slfan1989 @haiyang1987 @ZanderXu . |
…dex order (apache#6813). Contributed by zhengchenyu." This reverts commit e5b76dc.
@zhangshuyan0 @Hexiaoqiao @slfan1989 @haiyang1987 @ZanderXu
It is strange! I found that only "precommit-run CentOS 7" will reproduce the failures of TestServiceInterruptHandling:: testRegisterAndRaise. I can not reproduce this in my environment, even though my environment have been started by start-build-env.sh. I will try to fix it in another pr. |
apache#6813). Contributed by zhengchenyu. Reviewed-by: He Xiaoqiao <[email protected]> Signed-off-by: Shuyan Zhang <[email protected]>
apache#6813). Contributed by zhengchenyu. Reviewed-by: He Xiaoqiao <[email protected]> Signed-off-by: Shuyan Zhang <[email protected]>
Description of PR
I found that if the erasedIndexes distribution is such that the parity index is in front of the data index, ec will produce wrong results when decoding.
In fact, HDFS-15186 has described this problem, but does not fundamentally solve it.
The reason is that the code assumes that erasedIndexes is preceded by the data index and followed by parity index. If there is a parity index placed in front of the data index in the incoming code, a calculation error will occur.
When we decode the data unit, we multiply the existing data by the decoding matrix. (Look at the formula doc in 1.2)
When we decode the parity unit, we multiply the existing data by the decoding matrix, get data unit, then multiply by encoding matrix. (Look at the formula doc in 1.1 and 1.2 )
The calculations for parity and block are different. But They calculate in two separate loops, then the code requires that the data index must precede the parity index.
How was this patch tested?
The TestErasureCodingEncodeAndDecode unit test and the erasure_code_test binary were executed on different machines. The test machines include those with isa-l installed and those without isa-l installed.
For code changes: