
Conversation

@sarvekshayr (Contributor) commented Jun 12, 2025

What changes were proposed in this pull request?

Added acceptance tests for the ozone debug replicas verify tool.

Checksum verification:

  • Simulated a checksum mismatch by corrupting the data file on one replica.
  • Tested datanode unavailability by marking a datanode as stale.

Block existence check:

  • Modified the container.db to simulate a missing block scenario.
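
A rough shell-level sketch of these scenarios (the container name, block file path, and commands below are illustrative assumptions; the actual robot tests locate the real files inside the compose cluster and may use different mechanics):

# Placeholder names for illustration only.
datanode="ozonesecure-ha-dn1-1"                      # hypothetical datanode container in the compose cluster
block_file="<path to one replica's .block file under /data/hdds>"

# Checksum mismatch: overwrite a few bytes of the block data on one replica
# without changing the file length, so only the checksum verification fails.
docker exec "${datanode}" dd if=/dev/urandom of="${block_file}" bs=1 count=8 seek=1024 conv=notrunc

# Datanode unavailability: stop the datanode so it is eventually reported as stale.
docker stop "${datanode}"

# Missing block: the block's metadata entry is removed from the replica's
# container.db (RocksDB) before running ozone debug replicas verify again.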

Also renamed the robot test files to better reflect the test cases.

What is the link to the Apache JIRA

HDDS-12890

How was this patch tested?

CI: https://github.com/sarvekshayr/ozone/actions/runs/15672437320

Change-Id: I4e45e522a69c8ead36fa53df503f5c149a5fe555
Change-Id: I1f44cae5a7c6a093c5f568a96fde4d1ba743e33f
Change-Id: I3b5c93d325ca1fcbf81eaab2c49ecbcbfbb5662f
@dombizita (Contributor) commented Jun 15, 2025

@adoroszlai could you please help us? Locally this test passes, but on CI it fails: the datanode is not able to come up after a stop. I found this in the DN logs:

2025-06-12 12:29:49,098 [ForkJoinPool.commonPool-worker-1] ERROR ozoneimpl.OzoneContainer: Load db store for HddsVolume /data/hdds/hdds failed
java.io.IOException: Can't init db instance under path /data/hdds/hdds/CID-30e32454-4a0d-413e-9242-916f8be902f5/DS-336374ce-4a59-4b63-b83a-5b18029927b0/container.db for volume DS-336374ce-4a59-4b63-b83a-5b18029927b0
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:446)
	at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:110)
	at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:96)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.io.IOException: Failed to create RDBStore from /data/hdds/hdds/CID-30e32454-4a0d-413e-9242-916f8be902f5/DS-336374ce-4a59-4b63-b83a-5b18029927b0/container.db
	at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:178)
	at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:226)
	at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.initDBStore(AbstractDatanodeStore.java:96)
	at org.apache.hadoop.ozone.container.metadata.AbstractRDBStore.start(AbstractRDBStore.java:75)
	at org.apache.hadoop.ozone.container.metadata.AbstractRDBStore.<init>(AbstractRDBStore.java:56)
	at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:72)
	at org.apache.hadoop.ozone.container.metadata.DatanodeStoreWithIncrementalChunkList.<init>(DatanodeStoreWithIncrementalChunkList.java:53)
	at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaThreeImpl.<init>(DatanodeStoreSchemaThreeImpl.java:73)
	at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(BlockUtils.java:83)
	at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.initPerDiskDBStore(HddsVolumeUtil.java:73)
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:442)
	... 9 more
Caused by: org.apache.hadoop.hdds.utils.db.RocksDatabaseException: IOError: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /data/hdds/hdds/CID-30e32454-4a0d-413e-9242-916f8be902f5/DS-336374ce-4a59-4b63-b83a-5b18029927b0/container.db
	at org.apache.hadoop.hdds.utils.db.RocksDatabase.toRocksDatabaseException(RocksDatabase.java:111)
	at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:180)
	at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:110)
	... 19 more
Caused by: org.rocksdb.RocksDBException: While open a file for appending: /data/hdds/hdds/CID-30e32454-4a0d-413e-9242-916f8be902f5/DS-336374ce-4a59-4b63-b83a-5b18029927b0/container.db/LOG: Permission denied
	at org.rocksdb.RocksDB.open(Native Method)
	at org.rocksdb.RocksDB.open(RocksDB.java:307)
	at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:83)
	at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:174)
	... 20 more

After this, I found the following in the acceptance logs:

Using Docker Compose v2
Executing test ozonesecure-ha/test-debug-tools.sh
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/om2': Operation not permitted
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/dn3': Operation not permitted
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/scm2': Operation not permitted
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/om3': Operation not permitted
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/kms': Operation not permitted
chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/dn1': Operation not permitted
...

This leads us to suspect a problem in this part of the debug tools testing:

# Clean up saved internal state from each container's volume for the next run.
rm -rf "${OZONE_VOLUME}"
mkdir -p "${OZONE_VOLUME}"/{dn1,dn2,dn3,dn4,dn5,om1,om2,om3,scm1,scm2,scm3,recon,s3g,kms}
if [[ -n "${OZONE_VOLUME_OWNER}" ]]; then
  current_user=$(whoami)
  if [[ "${OZONE_VOLUME_OWNER}" != "${current_user}" ]]; then
    chown -R "${OZONE_VOLUME_OWNER}" "${OZONE_VOLUME}" \
      || sudo chown -R "${OZONE_VOLUME_OWNER}" "${OZONE_VOLUME}"
  fi
fi

This approach is used in other acceptance tests as well; do you see any issue with the debug tools testing? Thanks in advance!

@adoroszlai (Contributor) commented:

chown: changing ownership of '/home/runner/work/ozone/ozone/hadoop-ozone/dist/target/ozone-2.1.0-SNAPSHOT/compose/ozonesecure-ha/data/dn1': Operation not permitted

This is fine: chown is retried with sudo, and then it succeeds.

Caused by: org.rocksdb.RocksDBException: While open a file for appending: /data/hdds/hdds/CID-30e32454-4a0d-413e-9242-916f8be902f5/DS-336374ce-4a59-4b63-b83a-5b18029927b0/container.db/LOG: Permission denied

This happens because in GitHub's environment the user ID differs between the host and the container, so overwriting container.db from the host changes its owner inside the container.

docker cp "${local_db_backup_path}/container.db" "${container}:${target_container_dir}/"

Before:

    88 -rw-r--r--   1 hadoop   hadoop      87535 Jun 16 04:04 /data/hdds/hdds/CID-aa9f8325-4197-4d3e-a929-ed6554243fad/DS-a49f9006-5e63-4de4-969f-db0b04c24c27/container.db/LOG

After:

    88 -rw-r--r--   1 om       118         87535 Jun 16 04:04 /data/hdds/hdds/CID-aa9f8325-4197-4d3e-a929-ed6554243fad/DS-a49f9006-5e63-4de4-969f-db0b04c24c27/container.db/LOG

https://github.com/adoroszlai/ozone/actions/runs/15671434462/job/44143341797#step:13:212
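
One way to avoid the mismatch (a hedged sketch, not necessarily what the final patch does) is to restore the expected owner inside the container right after the copy, assuming the datanode process runs as the hadoop user:

docker cp "${local_db_backup_path}/container.db" "${container}:${target_container_dir}/"
# Reset ownership as root inside the container so RocksDB can reopen its LOG file.
docker exec -u 0 "${container}" chown -R hadoop:hadoop "${target_container_dir}/container.db"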

@adoroszlai (Contributor) commented:

@sarvekshayr Sorry, I accidentally pushed my second batch of commits to your fork. Please feel free to restore 72d8c44 by force-push.

@sarvekshayr marked this pull request as ready for review June 16, 2025 04:45

@sarvekshayr (Contributor, Author) commented:

@sarvekshayr Sorry, I accidentally pushed my second batch of commits to your fork. Please feel free to restore 72d8c44 by force-push.

Thank you @adoroszlai for fixing the issue.
I was just about to revert the additional commits myself; thanks for taking care of it!

@dombizita added the tools (Tools that helps with debugging) label Jun 16, 2025
@Tejaskriya (Contributor) left a comment:
Thanks for working on this @sarvekshayr, please find my comments below

@Tejaskriya (Contributor) left a comment:

Thanks for addressing the comments. LGTM

@dombizita (Contributor) left a comment:

Thank you for working on this @sarvekshayr! These are very useful changes; thanks for keeping all the needed changes in mind!

@sarvekshayr (Contributor, Author) commented Jun 26, 2025

The integration check was skipped, so I merged the latest master to run the skipped checks. Since the PR does not change any files that could possibly affect integration tests, the check remains skipped.

@Tejaskriya merged commit bb3d287 into apache:master Jun 27, 2025
27 checks passed
@Tejaskriya (Contributor) commented:

Thanks for the patch @sarvekshayr, and thanks for the reviews and assists @dombizita @adoroszlai.
