
Conversation

@slfan1989 (Contributor) commented on Jul 13, 2024

What changes were proposed in this pull request?

In our internal use of Ozone, we rely heavily on EC (Erasure Coding). When a DataNode (DN) disk fails, some EC replica data is lost and has to be reconstructed on other DataNodes. This reconstruction process may either succeed or fail. To swiftly grasp the outcome of EC block reconstruction, I intend to implement an auditing feature dedicated to EC reconstruction logs. This is crucial, especially in failure cases, to promptly pinpoint the reasons for reconstruction failures.

Success log:

2024-07-13 12:06:25,371 | INFO  | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK { blockId={blockID={conID: 964637 locID: 113750155051714398 bcsId: 0}, length=4766503, offset=0, token=null, pipeline=Pipeline[ Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=SUCCESS |

Failure log:

2024-07-13 12:06:25,751 | ERROR | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK {blockId={blockID={conID: 964637 locID: 113750155051715549 bcsId: 0}, length=163577856, offset=0, token=null, pipeline=Pipeline[Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | java.lang.IllegalArgumentException: The chunk list has 26 entries, but the checksum chunks has 27 entries. They should be equal in size.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)

What is the link to the Apache JIRA

JIRA: HDDS-11171. [DN] Add EC Block Recover Audit Log.

How was this patch tested?

Validated in a production environment.

@slfan1989 (Contributor, Author) commented:

@sodonnel Can you help review this PR? Thank you very much!

@adoroszlai (Contributor) commented:

Thanks @slfan1989 for working on this.

> reconstruction process may either succeed or fail. To swiftly grasp the outcome of EC block reconstruction

The coordinator datanode already logs the status/result of EC reconstruction; search for reconstructECContainersCommand (see examples below).

> implement an auditing feature dedicated to EC reconstruction logs

The datanode currently only writes audit log entries for container state machine commands (in HddsDispatcher). Auditing commands from SCM might be useful, but:

  • I don't think it should be specific to EC recovery. It would be better added in a generic way in CommandDispatcher (roughly sketched below this list).
  
  • Need to consider both sync and async command handlers.
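
Roughly, something along these lines could work; the class and method names below are purely illustrative and are not the actual CommandDispatcher or handler API:

// Illustrative sketch only: the types and names here are hypothetical and
// do not reflect the real CommandDispatcher or command handler interfaces.
import java.util.function.Consumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class CommandAuditSketch {
  private static final Logger AUDIT = LoggerFactory.getLogger("DNAudit");

  // Wraps any command handler so success/failure is audited in one place,
  // regardless of command type. This covers synchronous handlers; async
  // handlers would need the same wrapping around their completion callback.
  static <C> Consumer<C> audited(String op, Consumer<C> handler) {
    return cmd -> {
      try {
        handler.accept(cmd);
        AUDIT.info("op={} | cmd={} | ret=SUCCESS", op, cmd);
      } catch (RuntimeException e) {
        AUDIT.error("op={} | cmd={} | ret=FAILURE", op, cmd, e);
        throw e;
      }
    };
  }
}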

Start:

2024-07-13 14:27:56,098 [f3ffd8a7-4b5c-47f5-98af-404eba82b826-ContainerReplicationThread-0] INFO reconstruction.ECReconstructionCoordinatorTask: IN_PROGRESS reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=, sources={1=e237cf49-df14-416e-ab15-8469b59d04c8(ozone-datanode-1.ozone_default/172.18.0.9), 3=8ee9d1ce-75a9-493f-905d-b1981f13aa00(ozone-datanode-3.ozone_default/172.18.0.10), 5=033eca1e-07e5-4893-b5a1-68dcccbe8c9c(ozone-datanode-2.ozone_default/172.18.0.8)}, targets={2=f3ffd8a7-4b5c-47f5-98af-404eba82b826(ozone-datanode-4.ozone_default/172.18.0.6), 4=2c4209f5-c826-4ed2-9658-b23961f074d6(ozone-datanode-5.ozone_default/172.18.0.11)}

Success:

2024-07-13 14:28:04,066 [f3ffd8a7-4b5c-47f5-98af-404eba82b826-ContainerReplicationThread-0] INFO reconstruction.ECReconstructionCoordinatorTask: DONE reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=, sources={1=e237cf49-df14-416e-ab15-8469b59d04c8(ozone-datanode-1.ozone_default/172.18.0.9), 3=8ee9d1ce-75a9-493f-905d-b1981f13aa00(ozone-datanode-3.ozone_default/172.18.0.10), 5=033eca1e-07e5-4893-b5a1-68dcccbe8c9c(ozone-datanode-2.ozone_default/172.18.0.8)}, targets={2=f3ffd8a7-4b5c-47f5-98af-404eba82b826(ozone-datanode-4.ozone_default/172.18.0.6), 4=2c4209f5-c826-4ed2-9658-b23961f074d6(ozone-datanode-5.ozone_default/172.18.0.11)} in 7968 ms

Failure:

2024-03-24 12:17:19,479 [nullContainerReplicationThread-1] WARN reconstruction.ECReconstructionCoordinatorTask: FAILED reconstructECContainersCommand: containerID=5, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=165ad6c9-23bb-4b60-a209-0f37fb43c778(ozonesecure_datanode_1.ozonesecure_default/172.19.0.8), 3=33dddf06-b56a-4475-8523-ad54a589c1b3(ozonesecure_datanode_2.ozonesecure_default/172.19.0.10), 4=3575b633-fb5a-4c41-9ece-c2f97994544f(ozonesecure_datanode_4.ozonesecure_default/172.19.0.13), 5=b56dc487-1437-4c61-a277-f7318a61d93d(ozonesecure_datanode_3.ozonesecure_default/172.19.0.11)}, targets={1=ea3e9265-f4d3-4424-b4b2-d14a5cf5c9a7(ozonesecure_datanode_5.ozonesecure_default/172.19.0.13)} after 1090 ms
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: ContainerID 5 does not exist
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:653)
	...
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.listBlock(ContainerProtocolCalls.java:134)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.listBlock(ECContainerOperationClient.java:96)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.getBlockDataMap(ECReconstructionCoordinator.java:505)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:153)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)

@adoroszlai requested a review from @errose28 on July 15, 2024.
@errose28 (Contributor) commented:

Hi @slfan1989, I agree that more diagnostic info would be helpful. As Attila said, the datanode audit log is for requests it receives from clients; that is why it has user and ip fields, which are null in your example. It currently does not have logging for intra-datanode operations. Ozone already has a feature to "audit" changes to containers, added in HDDS-8062, which might be a better fit for this type of thing. The file dn-container.log currently only logs a container as recovered on success, but adding logging for failure scenarios would be useful too. There is logging for failed container scans there as well.

Also, I see the example success log is at the block level, but EC reconstruction happens at the container level. Having one success or failure log for every block would probably be too verbose. If we can just log once per container on success or failure, that would be more concise.

@slfan1989 (Contributor, Author) commented on Jul 16, 2024

@adoroszlai @errose28 Thank you very much for your responses! They have been very helpful and insightful. I understand that adding this operation to the audit log may not be appropriate. The primary reason for logging this is to address a practical issue: some EC blocks currently fail to recover successfully due to variations in their BlockGroupLength, which results in repeated repair attempts on different DataNodes in the cluster. My objective with the audit log is to identify the blocks that failed reconstruction and then, using the containerId and locationId, locate the corresponding files. Given the potentially limited number of affected files, my current approach is to retrieve these files with "get" operations and restore them with "put" operations.

Here are the detailed steps of the operation:

  1. Parse OM's RocksDB data, primarily via the files provided by Recon, to obtain mappings between files, Container IDs, and Location IDs.
  2. Synchronize the full audit logs of the cluster's DataNodes to HDFS and extract the damaged EC blocks' Container IDs and Location IDs with a Spark job (the essence of this extraction is sketched after this list).
  3. Use the mappings from steps 1 and 2 to locate the corresponding files. Then, use a Shell script to fetch these files locally and re-upload them to the cluster.
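
For step 2, the core of the extraction is just matching failed RECOVER_EC_BLOCK entries. Here is a minimal local sketch of that matching (not our actual Spark job, and assuming the audit log format shown in the PR description):

// Minimal local sketch (not our actual Spark job): scans a datanode audit
// log and prints the Container ID and Location ID of every failed EC block
// recovery, assuming the audit log format shown in the PR description.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FailedEcRecoveryScan {
  // Matches the "conID: <n> locID: <n>" portion of a RECOVER_EC_BLOCK entry.
  private static final Pattern BLOCK_ID =
      Pattern.compile("conID: (\\d+) locID: (\\d+)");

  public static void main(String[] args) throws IOException {
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      lines.filter(l -> l.contains("op=RECOVER_EC_BLOCK"))
           .filter(l -> l.contains("ret=FAILURE"))
           .forEach(l -> {
             Matcher m = BLOCK_ID.matcher(l);
             if (m.find()) {
               System.out.println(m.group(1) + "\t" + m.group(2));
             }
           });
    }
  }
}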

I understand this process is quite cumbersome, but we frequently encounter issues with inconsistent BlockGroupLength during reconstruction, which can lead to errors such as the following:

The chunk list has 26 entries, but the checksum chunks has 27 entries. They should be equal in size.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)

This operation has the benefit of addressing both issues: damaged EC blocks and blocks with an insufficient BlockGroupLength.

I am currently working on improving the repair code in ECReconstructionCoordinator to trigger repairs for EC blocks where only one BlockGroupLength is different.
