@sodonnel

What changes were proposed in this pull request?

When calculating a checksum for an EC file with Rack Topology enabled, you can get the following error intermittently:

ERROR : Failed with exception null
  java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
...
ERROR : FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
...
INFO  : Completed executing command(queryId=hive_20221214035652_bc45477d-98df-408e-b945-a63b4ac6896a); Time taken: 22.167 seconds
  INFO  : OK
  Error: Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.computeCompositeCrc(ECBlockChecksumComputer.java:163)
        at org.apache.hadoop.ozone.client.checksum.ECBlockChecksumComputer.compute(ECBlockChecksumComputer.java:65)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.getBlockChecksumFromChunkChecksums(ECFileChecksumHelper.java:148)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlock(ECFileChecksumHelper.java:106)
        at org.apache.hadoop.ozone.client.checksum.ECFileChecksumHelper.checksumBlocks(ECFileChecksumHelper.java:73)
        at org.apache.hadoop.ozone.client.checksum.BaseFileChecksumHelper.compute(BaseFileChecksumHelper.java:220)
        at org.apache.hadoop.fs.ozone.OzoneClientUtils.getFileChecksumWithCombineMode(OzoneClientUtils.java:223)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneClientAdapterImpl.getFileChecksum(BasicRootedOzoneClientAdapterImpl.java:1123)
        at org.apache.hadoop.fs.ozone.BasicRootedOzoneFileSystem.getFileChecksum(BasicRootedOzoneFileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.getFileChecksum(FileSystem.java:2831)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertNonDirectoryInformation(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.addInsertFileInformation(Hive.java:3632)
        ...

This happens because the wrong nodes are sometimes used to obtain the stripe checksum: the nodes are not correctly ordered by their replicaIndex in the pipeline.
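To illustrate the ordering idea, here is a minimal sketch, not the code from this patch. It assumes each node in the pipeline is known by a name and a 1-based replicaIndex; the class `ReplicaOrdering`, the method `orderByReplicaIndex`, and the `dn-*` node names are hypothetical and do not come from the Ozone code base. The point is only that the nodes should be visited in replicaIndex order rather than in whatever order the topology sorting returns them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these names are not part of Ozone.
class ReplicaOrdering {

    // Returns the node names sorted by their replica index (ascending),
    // regardless of the order in which the input map yields them.
    static List<String> orderByReplicaIndex(Map<String, Integer> replicaIndexes) {
        List<String> nodes = new ArrayList<>(replicaIndexes.keySet());
        nodes.sort(Comparator.comparingInt(replicaIndexes::get));
        return nodes;
    }

    public static void main(String[] args) {
        // Assumed example: five nodes whose incoming order does not match
        // their replica indexes.
        Map<String, Integer> replicaIndexes = Map.of(
            "dn-a", 3, "dn-b", 1, "dn-c", 5, "dn-d", 2, "dn-e", 4);

        // prints [dn-b, dn-d, dn-a, dn-e, dn-c]
        System.out.println(orderByReplicaIndex(replicaIndexes));
    }
}
```

With the nodes in this order, a caller that expects the first entries to be particular replicas gets them consistently, whatever order the nodes arrived in.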

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7787

How was this patch tested?

An existing test covers checksum validation, which confirms that this change has not broken anything. The actual problem is difficult to reproduce in a unit test, because rack awareness is not easy to set up in a way that affects the node order in the pipeline. We do have a reproducible test with a Hive workload that triggers the error, so we can validate the fix that way after this has been committed.

@sodonnel sodonnel merged commit c61ee3e into apache:master Jan 17, 2023
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 6, 2023
…xOutOfBounds exception (apache#4180)

(cherry picked from commit c61ee3e)
Change-Id: I0e7964c9ce10aaa29ca3b85c1407bd289ea2c467