Skip to content

Conversation

@xichen01
Copy link
Contributor

What changes were proposed in this pull request?

Some performance issues are mentioned here. https://issues.apache.org/jira/browse/HDDS-8865 so in this PR:
We Make DN DeleteBlocksCommandHandler wait for the lock to timeout. This way, no other transaction will be blocked because the lock cannot be acquired, and for locks that time out, we can retry once on the DN if can not acquired the lock in specified time, transaction will failed and the SCM can resent it after.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8869

How was this patch tested?

unit test

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xichen01 for the patch. I haven't checked in detail, but would like to point out some minor issues noticed.

Comment on lines 74 to 75
@RunWith(Parameterized.class)
@Timeout(300)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: @Timeout is from JUnit5, all other test annotations are from JUnit4. I don't think timeout is enforced with this setup.

Comment on lines 228 to 230
conf, dnConf.getBlockDeleteThreads(),
dnConf.getBlockDeleteQueueLimit()))
dnConf.getBlockDeleteQueueLimit(),
dnConf.getBlockDeleteCommandHandleLockTimeoutMs()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: pass DatanodeConfiguration to DeleteBlocksCommandHandler instead of individual values.

Comment on lines 55 to 56
@Metric(about = "The total number of transactions witch was failed " +
"cause by wait Container lock timeout.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: witch -> which

xichen01 added 2 commits June 20, 2023 00:42
# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/helpers/BlockDeletingServiceMetrics.java
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java
@xichen01
Copy link
Contributor Author

@adoroszlai PTAL. Thanks.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xichen01 for updating the patch.

Note: there is a findbugs failure:

M D UrF: Unread public/protected field: org.apache.hadoop.ozone.container.common.statemachine.commandhandler.TestDeleteBlocksCommandHandler.testTimeout  At TestDeleteBlocksCommandHandler.java:[line 80]

https://github.com/xichen01/ozone/actions/runs/5358921181/jobs/9721855321#step:6:2324

Once you get a clean run, please ask @sumitagrawl for review.

Comment on lines 97 to 104
@Parameterized.Parameters(name = "{index}: Schema Version {0}")
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][]{
{SCHEMA_V1},
{SCHEMA_V2},
{SCHEMA_V3},
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use ContainerTestVersionInfo to parameterize by schema version and layout version.

@xichen01
Copy link
Contributor Author

xichen01 commented Jul 6, 2023

@sumitagrawl PTAL. Thanks

@xichen01
Copy link
Contributor Author

@adoroszlai PTAL, Thanks

* after waiting timeout, the hold time spent waiting for the lock
* is not greater than the time spent operating RocksDB
*/
@Config(key = "block.delete.command.handle.lock.timeout",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest change the property name from block.delete.command.handle.lock.timeout ->
block.delete.max.lock.wait.timeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 , is the slow RocksDB open a major cause behind this patch? I have one question, it takes 200ms to open a rocksdb, you set this timeout to 100ms, and retry once in the block deletion logic, why not set the default value to 200ms, with a simpler handler logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this property be a reconfigurable property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may have caused some misunderstanding, I'll modify it.

The retry does not happen immediately, it waits until all DeleteBlockTransactions have been processed, so the retry may happen after a few seconds or tens of seconds.

Slow RocksDB open (Schema V2 only), rebalancer, BlockDeletingService, etc, may cause here cannot get the lock, retry to increase the chance of success.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 , thanks for the further info.

@Metric(about = "The total number of Container chosen to be deleted.")
private MutableGaugeLong totalContainerChosenCount;

@Metric(about = "The total number of transactions which was failed " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which was failed cause by wait Container lock timeout -> which failed due to container lock wait timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"containerType {}", containerType);
}
} catch (IOException e) {
} catch (IOException | InterruptedException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call Thread.currentThread().interrupt() if it's an InterruptedException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

int newDeletionBlocks, long txnID)
KeyValueContainerData containerData, DeletedBlocksTransaction delTX)
throws IOException {
int newDeletionBlocks = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do some refactor on markBlocksForDeletionTransaction too? This newDeletionBlocks is always 0. It's better to be declared in markBlocksForDeletionTransaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already modified.

@ChenSammi
Copy link
Contributor

ChenSammi commented Sep 22, 2023

Copy link
Contributor

@ChenSammi ChenSammi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xichen01 for the contribution, @adoroszlai for the review.

@ChenSammi ChenSammi merged commit 5b5d352 into apache:master Sep 22, 2023
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants