
Conversation

@slfan1989 (Contributor) commented on Jul 13, 2024

What changes were proposed in this pull request?

In our internal use of Ozone, we rely heavily on EC (Erasure Coding). When a DataNode (DN) disk fails, some EC replica data is lost and has to be reconstructed on other DataNodes. This reconstruction process may either succeed or fail. To swiftly grasp the outcome of EC block reconstruction, I intend to implement an auditing feature dedicated to EC reconstruction logs. This is crucial, especially in failure cases, to promptly pinpoint the reasons for reconstruction failures.

Success log:

2024-07-13 12:06:25,371 | INFO  | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK { blockId={blockID={conID: 964637 locID: 113750155051714398 bcsId: 0}, length=4766503, offset=0, token=null, pipeline=Pipeline[ Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=SUCCESS |

Failure log:

2024-07-13 12:06:25,751 | ERROR | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK {blockId={blockID={conID: 964637 locID: 113750155051715549 bcsId: 0}, length=163577856, offset=0, token=null, pipeline=Pipeline[Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | java.lang.IllegalArgumentException: The chunk list has 26 entries, but the checksum chunks has 27 entries. They should be equal in size.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)

What is the link to the Apache JIRA

JIRA: HDDS-11171. [DN] Add EC Block Recover Audit Log.

How was this patch tested?

Validated in a production environment.

@slfan1989 (Contributor, Author) commented:

@sodonnel Can you help review this PR? Thank you very much!

@adoroszlai (Contributor) commented:

Thanks @slfan1989 for working on this.

> reconstruction process may either succeed or fail. To swiftly grasp the outcome of EC block reconstruction

The coordinator datanode already logs the status/result of EC reconstruction; search for reconstructECContainersCommand (see examples below).

> implement an auditing feature dedicated to EC reconstruction logs

The datanode currently only writes audit log entries for container state machine commands (in HddsDispatcher). Auditing commands from SCM might be useful, but:

  • I don't think it should be specific to EC recovery. It would be better added in a generic way in CommandDispatcher (roughly sketched below this list).
  
  • Need to consider both sync and async command handlers.
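
Roughly, something along these lines could work; the class and method names below are purely illustrative and are not the actual CommandDispatcher or handler API:

// Illustrative sketch only: the types and names here are hypothetical and
// do not reflect the real CommandDispatcher or command handler interfaces.
import java.util.function.Consumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class CommandAuditSketch {
  private static final Logger AUDIT = LoggerFactory.getLogger("DNAudit");

  // Wraps any command handler so success/failure is audited in one place,
  // regardless of command type. This covers synchronous handlers; async
  // handlers would need the same wrapping around their completion callback.
  static <C> Consumer<C> audited(String op, Consumer<C> handler) {
    return cmd -> {
      try {
        handler.accept(cmd);
        AUDIT.info("op={} | cmd={} | ret=SUCCESS", op, cmd);
      } catch (RuntimeException e) {
        AUDIT.error("op={} | cmd={} | ret=FAILURE", op, cmd, e);
        throw e;
      }
    };
  }
}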

Start:

2024-07-13 14:27:56,098 [f3ffd8a7-4b5c-47f5-98af-404eba82b826-ContainerReplicationThread-0] INFO reconstruction.ECReconstructionCoordinatorTask: IN_PROGRESS reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=, sources={1=e237cf49-df14-416e-ab15-8469b59d04c8(ozone-datanode-1.ozone_default/172.18.0.9), 3=8ee9d1ce-75a9-493f-905d-b1981f13aa00(ozone-datanode-3.ozone_default/172.18.0.10), 5=033eca1e-07e5-4893-b5a1-68dcccbe8c9c(ozone-datanode-2.ozone_default/172.18.0.8)}, targets={2=f3ffd8a7-4b5c-47f5-98af-404eba82b826(ozone-datanode-4.ozone_default/172.18.0.6), 4=2c4209f5-c826-4ed2-9658-b23961f074d6(ozone-datanode-5.ozone_default/172.18.0.11)}

Success:

2024-07-13 14:28:04,066 [f3ffd8a7-4b5c-47f5-98af-404eba82b826-ContainerReplicationThread-0] INFO reconstruction.ECReconstructionCoordinatorTask: DONE reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=, sources={1=e237cf49-df14-416e-ab15-8469b59d04c8(ozone-datanode-1.ozone_default/172.18.0.9), 3=8ee9d1ce-75a9-493f-905d-b1981f13aa00(ozone-datanode-3.ozone_default/172.18.0.10), 5=033eca1e-07e5-4893-b5a1-68dcccbe8c9c(ozone-datanode-2.ozone_default/172.18.0.8)}, targets={2=f3ffd8a7-4b5c-47f5-98af-404eba82b826(ozone-datanode-4.ozone_default/172.18.0.6), 4=2c4209f5-c826-4ed2-9658-b23961f074d6(ozone-datanode-5.ozone_default/172.18.0.11)} in 7968 ms

Failure:

2024-03-24 12:17:19,479 [nullContainerReplicationThread-1] WARN reconstruction.ECReconstructionCoordinatorTask: FAILED reconstructECContainersCommand: containerID=5, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=165ad6c9-23bb-4b60-a209-0f37fb43c778(ozonesecure_datanode_1.ozonesecure_default/172.19.0.8), 3=33dddf06-b56a-4475-8523-ad54a589c1b3(ozonesecure_datanode_2.ozonesecure_default/172.19.0.10), 4=3575b633-fb5a-4c41-9ece-c2f97994544f(ozonesecure_datanode_4.ozonesecure_default/172.19.0.13), 5=b56dc487-1437-4c61-a277-f7318a61d93d(ozonesecure_datanode_3.ozonesecure_default/172.19.0.11)}, targets={1=ea3e9265-f4d3-4424-b4b2-d14a5cf5c9a7(ozonesecure_datanode_5.ozonesecure_default/172.19.0.13)} after 1090 ms
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: ContainerID 5 does not exist
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:653)
	...
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.listBlock(ContainerProtocolCalls.java:134)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.listBlock(ECContainerOperationClient.java:96)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.getBlockDataMap(ECReconstructionCoordinator.java:505)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:153)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)

@adoroszlai requested a review from @errose28 on July 15, 2024.
@errose28 (Contributor) commented:

Hi @slfan1989, I agree that more diagnostic info would be helpful. As Attila said, the datanode audit log is for requests it receives from clients; that is why it has user and ip fields, which are null in your example. It currently does not have logging for intra-datanode operations. Ozone already has a feature to "audit" changes to containers, added in HDDS-8062, which might be a better fit for this type of thing. The file dn-container.log currently only logs a container as recovered on success, but adding logging for failure scenarios would be useful too. There is logging for failed container scans there as well.

Also, I see the example success log is at the block level, but EC reconstruction happens at the container level. Having one success or failure log for every block would probably be too verbose. If we can just log once per container on success or failure, that would be more concise.

@slfan1989 (Contributor, Author) commented on Jul 16, 2024

@adoroszlai @errose28 Thank you very much for your responses! They have been very helpful and insightful. I understand that adding this operation to the audit log may not be appropriate. The primary reason for logging this is to address a practical issue: some EC blocks currently fail to recover successfully due to variations in their BlockGroupLength, which results in repeated repair attempts on different DataNodes in the cluster. My objective with the audit log is to identify the blocks that failed reconstruction and then, using the containerId and locationId, locate the corresponding files. Given the potentially limited number of affected files, my current approach is to retrieve these files with "get" operations and restore them with "put" operations.

Here are the detailed steps of the operation:

  1. Parse OM's RocksDB data, primarily via the files provided by Recon, to obtain mappings between files, Container IDs, and Location IDs.
  2. Synchronize the full audit logs of the cluster's DataNodes to HDFS and extract the damaged EC blocks' Container IDs and Location IDs with a Spark job (the essence of this extraction is sketched after this list).
  3. Use the mappings from steps 1 and 2 to locate the corresponding files. Then, use a Shell script to fetch these files locally and re-upload them to the cluster.
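
For step 2, the core of the extraction is just matching failed RECOVER_EC_BLOCK entries. Here is a minimal local sketch of that matching (not our actual Spark job, and assuming the audit log format shown in the PR description):

// Minimal local sketch (not our actual Spark job): scans a datanode audit
// log and prints the Container ID and Location ID of every failed EC block
// recovery, assuming the audit log format shown in the PR description.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FailedEcRecoveryScan {
  // Matches the "conID: <n> locID: <n>" portion of a RECOVER_EC_BLOCK entry.
  private static final Pattern BLOCK_ID =
      Pattern.compile("conID: (\\d+) locID: (\\d+)");

  public static void main(String[] args) throws IOException {
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      lines.filter(l -> l.contains("op=RECOVER_EC_BLOCK"))
           .filter(l -> l.contains("ret=FAILURE"))
           .forEach(l -> {
             Matcher m = BLOCK_ID.matcher(l);
             if (m.find()) {
               System.out.println(m.group(1) + "\t" + m.group(2));
             }
           });
    }
  }
}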

I understand this process is quite cumbersome, but we frequently encounter issues with inconsistent BlockGroupLength during reconstruction, which can lead to errors such as the following:

The chunk list has 26 entries, but the checksum chunks has 27 entries. They should be equal in size.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)

This operation has the benefit of addressing both issues: damaged EC blocks and blocks with an insufficient BlockGroupLength.

I am currently working on improving the repair code in ECReconstructionCoordinator to trigger repairs for EC blocks where only one BlockGroupLength is different.
