
@xichen01 xichen01 commented Jul 3, 2023

What changes were proposed in this pull request?

Currently, when SCM leadership is transferred manually, the old SCM leader may allocate IDs that duplicate IDs allocated by the new leader. This can cause serious problems: if SCM allocates the same ID to the Blocks of different CreateKey requests and those Blocks are placed in the same Container, the Blocks overwrite each other's Chunk files and data is lost.

Reproduce

  • A simple way to reproduce this issue
    Generate a sustained, heavy write load and switch the SCM leader with the command; the following log message then appears on the DN:
2023-07-03 16:17:31,278 [ChunkWriter-227-0] WARN org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: Duplicate write chunk request. Chunk overwrite without explicit request. ChunkInfo{chunkName='109611007626090607_chunk_1, offset=0, len=4096}
  • Generate a constant write load
ozone freon ommg --operation CREATE_KEY -n 100000 -t 20 --runtime 3000 --timebase --size=4096
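
To make the warning above concrete, here is a toy model of how two Blocks with the same ID end up targeting the same Chunk file. It is only a sketch: `ToyContainer` and its methods are hypothetical, and the `<blockLocalId>_chunk_<index>` naming scheme is an assumption inferred from the chunk name in the log line, not taken from the Ozone source.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one container's chunk directory; not Ozone's on-disk layout. */
class ToyContainer {
  // Maps a chunk file name to its contents.
  private final Map<String, String> chunkFiles = new HashMap<>();

  /** Assumed naming scheme, inferred from the DN log: <blockLocalId>_chunk_<index>. */
  static String chunkName(long blockLocalId, int chunkIndex) {
    return blockLocalId + "_chunk_" + chunkIndex;
  }

  /** Writes a chunk and returns the data it replaced, or null if the file is new. */
  String writeChunk(long blockLocalId, int chunkIndex, String data) {
    return chunkFiles.put(chunkName(blockLocalId, chunkIndex), data);
  }

  /** Returns the current contents of a chunk, or null if it does not exist. */
  String readChunk(long blockLocalId, int chunkIndex) {
    return chunkFiles.get(chunkName(blockLocalId, chunkIndex));
  }
}
```

If two keys are given Blocks with the same ID in the same container, the second `writeChunk` call returns the first key's data (it was replaced), and a later `readChunk` for the first key returns the second key's bytes.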

Root Cause

The root cause is that batch.lastId is updated before stateManager.allocateBatch has executed successfully.

As a result, subsequent requests from other threads pass the following check and receive an illegitimate ID:

if (batch.nextId <= batch.lastId) {
  return batch.nextId++;
}
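
The ordering problem can be sketched in isolation. The code below is a simplified illustration, not the actual `SequenceIdGenerator`: `BatchStore`, `EagerUpdateGenerator`, and `ConfirmedUpdateGenerator` are hypothetical names, and it assumes a failed allocation (for example, after leadership has moved) surfaces as an exception. The first generator advances `lastId` before the allocation is confirmed; the second changes its fields only after the allocation succeeds.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the replicated state manager; not the Ozone API. */
interface BatchStore {
  /** Persists the range (prevLastId, newLastId]; assumed to throw on failure. */
  void allocateBatch(long prevLastId, long newLastId);
}

/** Ordering described above: lastId moves before the allocation is confirmed. */
class EagerUpdateGenerator {
  private final ReentrantLock lock = new ReentrantLock();
  private final BatchStore store;
  private final long batchSize;
  private long nextId = 1;
  private long lastId = 0;

  EagerUpdateGenerator(BatchStore store, long batchSize) {
    this.store = store;
    this.batchSize = batchSize;
  }

  long getNextId() {
    lock.lock();
    try {
      if (nextId <= lastId) {
        return nextId++;
      }
      long prevLastId = lastId;
      nextId = prevLastId + 1;
      lastId += batchSize;                     // updated too early
      store.allocateBatch(prevLastId, lastId); // if this throws, lastId stays advanced
      return nextId++;
    } finally {
      lock.unlock();
    }
  }
}

/** Safer ordering: fields change only after the allocation has succeeded. */
class ConfirmedUpdateGenerator {
  private final ReentrantLock lock = new ReentrantLock();
  private final BatchStore store;
  private final long batchSize;
  private long nextId = 1;
  private long lastId = 0;

  ConfirmedUpdateGenerator(BatchStore store, long batchSize) {
    this.store = store;
    this.batchSize = batchSize;
  }

  long getNextId() {
    lock.lock();
    try {
      if (nextId <= lastId) {
        return nextId++;
      }
      long prevLastId = lastId;
      long newLastId = prevLastId + batchSize;
      store.allocateBatch(prevLastId, newLastId); // throws before any field changes
      nextId = prevLastId + 1;
      lastId = newLastId;
      return nextId++;
    } finally {
      lock.unlock();
    }
  }
}

class SequenceIdSketch {
  public static void main(String[] args) {
    // A store that always refuses, modelling a node that is no longer leader.
    BatchStore notLeader = (prev, last) -> {
      throw new IllegalStateException("not leader");
    };
    EagerUpdateGenerator eager = new EagerUpdateGenerator(notLeader, 10);
    try {
      eager.getNextId();
    } catch (IllegalStateException e) {
      System.out.println("eager: allocation failed");
    }
    // The next call passes the nextId <= lastId check despite the failure.
    System.out.println("eager hands out: " + eager.getNextId());
  }
}
```

With the eager ordering, one failed allocation is enough for later callers to receive IDs from a batch that was never persisted. With the confirmed ordering, every call keeps failing until an allocation actually succeeds.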

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8973


How was this patch tested?

unit test

ChenSammi commented Jul 5, 2023

@xichen01, good find! The overall change looks good.

Could you take care of the conflicting files?

# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SequenceIdGenerator.java

@ChenSammi ChenSammi left a comment

Thanks @xichen01 for reporting and fixing the issue.

@ChenSammi ChenSammi left a comment

Thanks @xichen01 for reporting and verifying this issue.

@ChenSammi ChenSammi merged commit 76267a3 into apache:master Jul 6, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jul 10, 2023
* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...