
Conversation


@sumitagrawl sumitagrawl commented Mar 13, 2023

What changes were proposed in this pull request?

Remove the check that delete transaction IDs must arrive in order, which caused the datanode to skip any delete transaction arriving out of order.
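A minimal sketch of the behavior change, before and after. Only `getTxID()` and `getDeleteTransactionId()` appear in the PR's diff context; the method names and structure below are assumed for illustration, not taken from the actual patch.

```java
// Illustrative sketch of the ordering check removed by HDDS-8139.
// Hypothetical method names; only the txId comparison mirrors the PR.
public class DeleteTxnOrderSketch {

    // Old behavior: a delete transaction whose ID was <= the container's
    // recorded delete transaction ID was dropped outright.
    static boolean oldShouldProcess(long txId, long recordedTxId) {
        return txId > recordedTxId;
    }

    // New behavior: never drop based on transaction ID. A lower or equal
    // ID is most likely an SCM retry, so it is logged and still processed.
    static boolean newShouldProcess(long txId, long recordedTxId) {
        if (txId <= recordedTxId) {
            System.out.println("Delete txn " + txId + " <= recorded "
                + recordedTxId + "; processing anyway (likely a retry)");
        }
        return true;
    }

    public static void main(String[] args) {
        // Transaction 4 arriving after 5 was recorded: dropped before
        // the change, processed after it.
        System.out.println(oldShouldProcess(4, 5)); // false
        System.out.println(newShouldProcess(4, 5)); // true
    }
}
```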

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8139

How was this patch tested?

This was tested for impact using an E2E setup, observing the frequency of logs for out-of-order or retried delete transaction IDs coming again.
Result: the log was observed only once, when there were many blocks, and it did not recur repeatedly. So no impact was observed.

@sumitagrawl sumitagrawl changed the title HDDS-8139. Datanodes should not drop block delete transactions based on transaction I HDDS-8139. Datanodes should not drop block delete transactions based on transaction ID Mar 13, 2023

@szetszwo szetszwo left a comment


+1 the change looks good.

@errose28

With this implementation we are assuming duplicate work from retries will be minimal, hence no explicit retry check is needed. If SCM is retrying the command, it means it did not get an ack back from the datanode. Since these command acks are attached to the DN heartbeats, the node should go stale or dead before retries pile up excessively.

@sumitagrawl it currently looks like we are still sending commands to nodes that may be stale/dead/decommissioning, and nothing significant is done with the result of this check. Could you look into this?

delTX.getTxID(), containerData.getDeleteTransactionId()));
}
b = false;
LOG.info(String.format("Delete blocks for containerId: %d"

Add a metric for this. It is a lot quicker to scan across a cluster.
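A minimal sketch of such a counter, assuming a plain `AtomicLong` rather than Ozone's actual annotated metrics classes; the class and method names here are hypothetical and only illustrate the counting idea.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch: a thread-safe counter incremented each time a
// delete transaction arrives with an ID at or below the one already
// recorded for the container (i.e. a likely retry/out-of-order case).
public class OutOfOrderDeleteTxnCounter {
    private final AtomicLong outOfOrderDeleteTxns = new AtomicLong();

    // Call on every incoming delete transaction; counts only duplicates.
    public void record(long txId, long recordedTxId) {
        if (txId <= recordedTxId) {
            outOfOrderDeleteTxns.incrementAndGet();
        }
    }

    public long value() {
        return outOfOrderDeleteTxns.get();
    }
}
```

Exposed as a cluster-wide metric, this would make the frequency of retries scannable without grepping datanode logs.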

@sumitagrawl sumitagrawl Mar 14, 2023

@kerneltime This metric would not have much value; it only counts a retry, which will not happen frequently and has no impact on the system.


You can add it as part of a separate Jira, but a count of how many times this occurred is interesting.

@sumitagrawl

> With this implementation we are assuming duplicate work from retries will be minimal, hence no explicit retry check is needed. If SCM is retrying the command, it means it did not get an ack back from the datanode. Since these command acks are attached to the DN heartbeats, the node should go stale or dead before retries pile up excessively.
>
> @sumitagrawl it currently looks like we are still sending commands to nodes that may be stale/dead/decommissioning, and nothing significant is done with the result of this check. Could you look into this?

Created a new Jira, HDDS-6548, for this: we need to avoid caching repeated transactions when a DN is not responsive, and need to clean up the cache when DNs are stale/dead/decommissioning.


@errose28 errose28 left a comment


Thanks for the updates @sumitagrawl. I'm following HDDS-8155 that you filed as well. LGTM and I agree a metric might be overkill since this is a niche case that we have handling for and not something we will be tracking over time, but let's wait for consensus from @kerneltime.

@kerneltime kerneltime merged commit ceaaaae into apache:master Mar 21, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Mar 23, 2023
* master: (43 commits)
  HDDS-8148. Improve log for Pipeline creation failure (apache#4385)
  HDDS-7853. Add support for RemoveSCM in SCMRatisServer. (apache#4358)
  HDDS-8042. Display certificate issuer in cert list command. (apache#4429)
  HDDS-8189. [Snapshot] renamedKeyTable should only track keys in buckets that has at least one active snapshot. (apache#4436)
  HDDS-8154. Perf: Reuse Mac instances in S3 token validation (apache#4433)
  HDDS-8245. Info log for keyDeletingService when nonzero number of keys are deleted. (apache#4451)
  HDDS-8233. ReplicationManager: Throttle delete container commands from over-replication handlers (apache#4447)
  HDDS-8220. [Ozone-Streaming] Trigger volume check on IOException in StreamDataChannelBase (apache#4428)
  HDDS-8173. Fix to remove enrties from RocksDB after container gets deleted. (apache#4445)
  HDDS-7975. Rebalance acceptance tests (apache#4437)
  HDDS-8152. Reduce S3 acceptance test setup time (apache#4393)
  HDDS-8172. ECUnderReplicationHandler should consider commands already sent when processing the container (apache#4435)
  HDDS-7883. [Snapshot] Accommodate FSO, key renames and implement OMSnapshotPurgeRequest for SnapshotDeletingService (apache#4407)
  HDDS-8168. Make deadlines inside MoveManager for move commands configurable (apache#4415)
  HDDS-7918. EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (apache#4441)
  HDDS-8222. EndpointBase#getBucket should handle BUCKET_NOT_FOUND (apache#4431)
  HDDS-8068. Fix Exception: JMXJsonServlet, getting attribute RatisRoles of Hadoop:service=OzoneManager. (apache#4352)
  HDDS-8139. Datanodes should not drop block delete transactions based on transaction ID (apache#4384)
  HDDS-8216. EC: OzoneClientConfig is overwritten in ECKeyOutputStream (apache#4425)
  HDDS-8054. Fix NPE in metrics for failed volume (apache#4340)
  ...