HDDS-7787. GetChecksum for EC files can fail intermittently with IndexOutOfBounds exception #4180

sodonnel · 2023-01-16T13:13:46Z

What changes were proposed in this pull request?

When calculating a checksum for an EC file with Rack Topology enabled, you can get the following error intermittently:

ERROR : Failed with exception null
  java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
...
ERROR : FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
...
INFO  : Completed executing command(queryId=hive_20221214035652_bc45477d-98df-408e-b945-a63b4ac6896a); Time taken: 22.167 seconds
  INFO  : OK
  Error: Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
        ...

This happens because the wrong nodes are sometimes used to obtain the stripe checksum: the code does not correctly use the replicaIndex in the pipeline to order the nodes.
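
To illustrate what ordering nodes by replicaIndex means, here is a minimal, self-contained sketch. It does not use the Ozone `Pipeline` class or reproduce the actual patch. The class `ReplicaIndexOrdering`, the helper `orderByReplicaIndex`, the node names, and the map of node to replica index are all hypothetical stand-ins, and the indexes are assumed to be 1-based. The point is only that the order in which nodes are listed (which rack topology can change) should not decide which node is treated as replica 1, 2, and so on.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Ozone client code.
class ReplicaIndexOrdering {

    // Returns the node names sorted by their replica index, ignoring
    // whatever order the nodes happened to be listed in.
    static List<String> orderByReplicaIndex(Map<String, Integer> replicaIndexes) {
        List<String> nodes = new ArrayList<>(replicaIndexes.keySet());
        nodes.sort(Comparator.comparingInt((String node) -> replicaIndexes.get(node)));
        return nodes;
    }

    public static void main(String[] args) {
        // Assumed example: five replicas listed in a topology-influenced order.
        Map<String, Integer> pipeline = new LinkedHashMap<>();
        pipeline.put("dn-c", 3);
        pipeline.put("dn-a", 1);
        pipeline.put("dn-e", 5);
        pipeline.put("dn-b", 2);
        pipeline.put("dn-d", 4);

        // Listing order: position 0 is "dn-c", which holds replica 3, not replica 1.
        System.out.println("listed:  " + new ArrayList<>(pipeline.keySet()));

        // Ordered by replica index: position 0 is the node holding replica 1.
        System.out.println("ordered: " + orderByReplicaIndex(pipeline));
    }
}
```

With the assumed input above, taking the first node in listing order would pick the node holding replica 3, whereas sorting by replica index always puts the node holding replica 1 first.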

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7787

How was this patch tested?

An existing test covers checksum validation, so it confirms that this change has not broken anything. The actual problem is difficult to reproduce in a unit test because rack awareness is not easy to set up in a way that affects the node order in the pipeline. We do have a reproducible test with a Hive workload that triggers the failure, so we can validate the fix that way after this has been committed.

HDDS-7787. GetChecksum for EC files can fail intermittently with IndexOutOfBounds exception

4d77d1a

adoroszlai approved these changes Jan 17, 2023

sodonnel merged commit c61ee3e into apache:master Jan 17, 2023