HDDS-11171. [DN] Add EC Block Recover Audit Log. #6936
Conversation
@sodonnel Can you help review this PR? Thank you very much!
Thanks @slfan1989 for working on this.
The coordinator datanode already logs the status/result of EC reconstruction; search for … in its logs.
The datanode currently only writes an audit log for container state machine commands, in …, with entries of the form:
Start: … Success: … Failure: …
Hi @slfan1989, I agree more diagnostic info would be helpful. As Attila said, the datanode audit log is for requests the datanode receives from clients; that is why it only covers those operations. Also, I see the example success log is at the block level, but EC reconstruction happens at the container level. Having one success or failure log for every block would probably be too verbose. If we can just log once per container on success or failure, that would be more concise.
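For illustration, here is a minimal sketch of what a once-per-container audit entry could look like. It uses Ozone's audit classes (`AuditLogger`, `AuditMessage`, `AuditEventStatus`) as I understand their builder API; the `DNAction` value `EC_RECONSTRUCT_CONTAINER` and the `EcReconstructAudit` helper are hypothetical and do not exist in the codebase:

```java
// Hedged sketch only: EcReconstructAudit and DNAction.EC_RECONSTRUCT_CONTAINER
// are hypothetical; builder method names follow Ozone's audit framework.
import java.util.Map;
import org.apache.hadoop.ozone.audit.AuditEventStatus;
import org.apache.hadoop.ozone.audit.AuditLogger;
import org.apache.hadoop.ozone.audit.AuditLoggerType;
import org.apache.hadoop.ozone.audit.AuditMessage;
import org.apache.hadoop.ozone.audit.DNAction;

final class EcReconstructAudit {
  private static final AuditLogger AUDIT =
      new AuditLogger(AuditLoggerType.DNLOGGER);

  /** Emit exactly one audit entry per reconstructed container. */
  static void logContainerResult(long containerId, Throwable failure) {
    AuditMessage msg = new AuditMessage.Builder()
        .forOperation(DNAction.EC_RECONSTRUCT_CONTAINER) // hypothetical action
        .withParams(Map.of("containerID", String.valueOf(containerId)))
        .withResult(failure == null
            ? AuditEventStatus.SUCCESS : AuditEventStatus.FAILURE)
        .withException(failure)
        .build();
    if (failure == null) {
      AUDIT.logWriteSuccess(msg);
    } else {
      AUDIT.logWriteFailure(msg);
    }
  }
}
```

Whether such an entry belongs in the datanode audit log at all, or should stay in the coordinator's regular log, is exactly the design question raised in this thread.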
@adoroszlai @errose28 Thank you very much for your responses; they have been very helpful and insightful. I understand that adding this operation to the audit log may not be appropriate. The primary reason for wanting it logged is a practical issue: some EC blocks currently fail to recover because their BlockGroupLength values differ, which results in repeated repair attempts on different DataNodes across the cluster. My objective with the audit log is to identify the blocks that failed reconstruction. Then, using the containerId and locationId, I can locate the corresponding files. Since the number of affected files is usually small, my current approach is to retrieve these files using "get" operations and then restore them using "put" operations. Here are the detailed steps of the operation:
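For illustration only, a minimal sketch of the get-then-put repair using Ozone's Java client API; the volume, bucket, and key names are placeholders, and this is an approximation of the described workflow rather than the exact steps:

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneInputStream;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class EcKeyRepair {
  public static void main(String[] args) throws Exception {
    OzoneConfiguration conf = new OzoneConfiguration();
    try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1")       // placeholder volume
          .getBucket("bucket1");   // placeholder bucket
      // Placeholder key, located via containerId/locationId beforehand.
      String key = "path/to/affected-key";

      // "get": read the key while enough replicas are still readable.
      byte[] data;
      try (OzoneInputStream in = bucket.readKey(key)) {
        data = in.readAllBytes();
      }

      // "put": rewrite the key so fresh, consistent EC blocks are created.
      try (OzoneOutputStream out = bucket.createKey(key, data.length)) {
        out.write(data);
      }
    }
  }
}
```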
I understand this process is quite cumbersome, but we frequently encounter inconsistent BlockGroupLength values during reconstruction, which can lead to errors such as the following: … This operation has the benefit of repairing both damaged EC blocks and blocks with an insufficient BlockGroupLength; both issues can be addressed through this process. I am currently working on improving the repair code in ECReconstructionCoordinator to trigger repairs for EC blocks where only one BlockGroupLength differs.
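As a concrete illustration of that idea, here is a hypothetical helper (not code from ECReconstructionCoordinator) that finds the majority block group length across replicas, so a lone outlier can be flagged as a repair candidate:

```java
// Hypothetical helper: finds the majority block group length among replicas
// so that a single disagreeing replica can be singled out for repair.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class BlockGroupLengthCheck {
  /** Returns the majority length, or -1 if no strict majority exists. */
  static long majorityLength(List<Long> replicaLengths) {
    Map<Long, Long> counts = replicaLengths.stream()
        .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
    return counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .filter(e -> e.getValue() > replicaLengths.size() / 2)
        .map(Map.Entry::getKey)
        .orElse(-1L);
  }
}
```

For example, `majorityLength(List.of(1024L, 1024L, 512L, 1024L))` returns 1024, identifying the replica reporting 512 as the one to repair.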
What changes were proposed in this pull request?
In our internal use of Ozone, we rely heavily on EC (Erasure Coding). When a DataNode disk fails, some EC replica data is lost and must be reconstructed on other DataNodes. This reconstruction may either succeed or fail. To quickly understand the outcome of EC block reconstruction, I intend to add an auditing feature dedicated to EC reconstruction logs. This is especially important in cases of failure, to promptly pinpoint the reasons reconstruction failed.
Success log:
Failure log:
What is the link to the Apache JIRA
JIRA: HDDS-11171. [DN] Add EC Block Recover Audit Log. (https://issues.apache.org/jira/browse/HDDS-11171)
How was this patch tested?
Validated in a production environment.