
Conversation


@siddhantsangwan (Contributor) commented May 20, 2025

What changes were proposed in this pull request?

This pull request implements a part of the design proposed in HDDS-12929. It only covers detecting a full volume, getting the latest storage report, adding the container action, and then immediately triggering (or throttling) a heartbeat.
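For orientation, here is a rough sketch of that flow, pieced together from the code excerpts discussed below. The method name and the volume accessors (getAvailable, getMinFreeSpace) are illustrative assumptions; only the three report/heartbeat calls are taken verbatim from the patch.

// Illustrative sketch only: method name and volume accessors are assumptions,
// the report/heartbeat calls mirror the excerpt reviewed further down.
private void maybeReportFullVolume(HddsVolume volume) {
  // 1. Detect a full volume: available space has dropped to the min free limit.
  if (volume.getAvailable() - volume.getMinFreeSpace() > 0) {
    return;
  }
  // 2. Throttle so that at most one immediate heartbeat goes out per interval
  //    (see the fullVolumeLastHeartbeatTriggerMs discussion below).
  // 3. Refresh the latest storage report and trigger the heartbeat right away.
  nodeReport = context.getParent().getContainer().getNodeReport();
  context.refreshFullReport(nodeReport);
  context.getParent().triggerHeartbeat();
}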

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13045

How was this patch tested?

Modified existing unit tests. Also did some manual testing using the Ozone docker compose cluster.

a. Simulated a close-to-full volume with a capacity of 2 GB, available space of 150 MB, and min free space of 100 MB. Datanode log:

2025-05-20 09:47:05,899 [main] INFO volume.HddsVolume: HddsVolume: { id=DS-64dd669c-71fe-492f-903c-4fc7dbe4440a dir=/data/hdds/hdds type=DISK capacity=2147268899 used=1990197248 available=157071651 minFree=104857600 committed=0 }

b. Wrote 100 MB of data using freon, with the expectation that an immediate heartbeat would be triggered as soon as the available space dropped to 100 MB. The datanode log shows that this happened at 09:50:52:

2025-05-20 09:50:52,028 [f8714dd7-31fc-4c63-9703-6fdb1a59b5c4-ChunkWriter-7-0] INFO impl.HddsDispatcher: Triggering heartbeat for full volume /data/hdds/hdds, with node report storageReport {
   storageUuid: "DS-bd34474b-8fd4-49be-be78-72e708b543c0"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 2042626048
   remaining: 104642851
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

c. In the SCM, the last storage report BEFORE the write operation was received at 09:50:09:

2025-05-20 09:50:09,399 [IPC Server handler 12 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Node Report storageReport {
storageUuid: "DS-27210be2-ee53-4035-a3a3-63ec8a162456"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

So, the next storage report should be received a minute later, at 09:51:09, unless one is triggered immediately due to a full volume. The SCM log shows that the immediately triggered report was received at 09:50:52, corresponding to the DN log:

2025-05-20 09:50:52,033 [IPC Server handler 4 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Node Report storageReport {
   storageUuid: "DS-bd34474b-8fd4-49be-be78-72e708b543c0"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 2042626048
   remaining: 104642851
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

The next storage report was received at the expected time of 09:51:09, showing that throttling also worked.
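As a quick sanity check of the numbers above, assuming the trigger condition is remaining <= minFree (freeSpaceToSpare), as described in step b:

09:50:09 report: remaining 157,071,651 B >  minFree 104,857,600 B -> volume not full, no immediate heartbeat
09:50:52 report: remaining 104,642,851 B <= minFree 104,857,600 B -> volume full, immediate heartbeat triggered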

Green CI in my fork: https://github.com/siddhantsangwan/ozone/actions/runs/15135787944/job/42547140475

@siddhantsangwan marked this pull request as ready for review on May 21, 2025 05:02
nodeReport = context.getParent().getContainer().getNodeReport();
context.refreshFullReport(nodeReport);
context.getParent().triggerHeartbeat();
LOG.info("Triggering heartbeat for full volume {}, with node report: {}.", volume, nodeReport);
@siddhantsangwan (Contributor, Author):

This is on the write path, so we must be extra careful about performance. An info log will reduce performance, but I wonder if it's okay in this case because this won't happen often. What do others think?

@siddhantsangwan (Contributor, Author):

Moreover, the future plan is to fail the write anyway if the size exceeds the min free and reserved space boundary.

@peterxcli (Member) left a comment:

Thanks @siddhantsangwan for this improvement!

this.slowOpThresholdNs = getSlowOpThresholdMs(conf) * 1000000;
fullVolumeLastHeartbeatTriggerMs = new AtomicLong(-1);
long heartbeatInterval =
config.getTimeDuration("hdds.heartbeat.interval", 30000, TimeUnit.MILLISECONDS);
@ChenSammi (Contributor) commented May 27, 2025:

Can we call HddsServerUtil#getScmHeartbeatInterval instead?

And there is HDDS_NODE_REPORT_INTERVAL for the node report. Shall we use the node report property instead of the heartbeat property?

@siddhantsangwan (Contributor, Author):

HDDS_NODE_REPORT_INTERVAL is 1 minute; might that be too long?

@ChenSammi (Contributor) commented May 30, 2025:

1m or 3s doesn't matter, because you always send out the first heartbeat immediately. This 1m is used to control the throttling, right?

@siddhantsangwan (Contributor, Author):

Yes, it's for throttling.
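For illustration, the throttle being discussed could look roughly like this; the key string and default are assumptions based on this thread (HDDS_NODE_REPORT_INTERVAL, 1 minute), the field name mirrors the excerpt above, and neither is necessarily the final code:

// Illustrative only: key string and default are assumptions based on this thread.
long nodeReportIntervalMs =
    config.getTimeDuration("hdds.node.report.interval", 60000, TimeUnit.MILLISECONDS);
long now = System.currentTimeMillis();
long last = fullVolumeLastHeartbeatTriggerMs.get();
// Send at most one immediate heartbeat per interval; later full-volume hits
// are left to the regular periodic node report.
if (now - last >= nodeReportIntervalMs
    && fullVolumeLastHeartbeatTriggerMs.compareAndSet(last, now)) {
  // trigger the immediate heartbeat here
}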

try {
handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
ContainerUtils.logAndReturnError(LOG, e, msg);
Contributor:

Are we going to return here?

@siddhantsangwan (Contributor, Author):

Good catch, but I'm not sure. There was an exception in getting the node report, but does that mean we should fail the write? Maybe we should still let the write continue here; otherwise, an intermittent or non-severe exception could keep failing writes. What do you think?

Contributor:

It's OK not to return here, but instead of calling ContainerUtils.logAndReturnError, you can probably just log the failure message.
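A minimal sketch of that suggestion (log and continue, instead of building an error response); the exact message is of course up to the author:

try {
  handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
  // Log and continue: an intermittent failure to build the node report
  // should not fail the client's write.
  LOG.warn("Failed to handle full volume while handling request: {}", msg, e);
}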

@siddhantsangwan (Contributor, Author):

To test whether the logging is proper, I added a new test that throws an exception. Here's what the logs look like:

2025-05-30 16:01:08,027 [main] WARN  impl.HddsDispatcher (HddsDispatcher.java:dispatchRequest(354)) - Failed to handle full volume while handling request: cmdType: WriteChunk
containerID: 1
datanodeUuid: "c6842f19-cbc5-47ca-bce0-f5bc859ef807"
writeChunk {
  blockID {
    containerID: 1
    localID: 1
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "36b4d6b58215a7da96e3bf71a602e3ea_stream_1_chunk_1"
    offset: 0
    len: 36
    checksumData {
      type: NONE
      bytesPerChecksum: 0
    }
  }
  data: "b0bc4858-a308-417d-b363-0631e07b97ec"
}

org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Failed to create node report when handling full volume /var/folders/jp/39hcmgjx4yb_kry3ydxb3c7r0000gn/T/junit-110499014917526916. Volume Report: { id=DS-db481691-4055-404b-8790-f375e6d41215 dir=/var/folders/jp/39hcmgjx4yb_kry3ydxb3c7r0000gn/T/junit-110499014917526916/hdds type=DISK capacity=499 used=390 available=109 minFree=100 committed=50 }
	at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.handleFullVolume(HddsDispatcher.java:481)
	at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:352)
	at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$1(HddsDispatcher.java:199)
	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
	at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:198)
	at org.apache.hadoop.ozone.container.common.impl.TestHddsDispatcher.testExceptionHandlingWhenVolumeFull(TestHddsDispatcher.java:430)
...

*/
private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
long current = System.currentTimeMillis();
long last = fullVolumeLastHeartbeatTriggerMs.get();
Contributor:

Consider the case where different volumes get full: for example, /data1 gets full at P0 and /data2 gets full at P1, with (P1 - P0) < interval. Do we expect two immediate container reports, or one report?

@siddhantsangwan (Contributor, Author):

Currently we will only send one report. I think this is fine because the report contains info about all the volumes. However, there's a discussion going on here: #8460 (comment).

Contributor:

I don't have a good answer for this after thinking about it for a while. Ideally, if we want to send an immediate heartbeat when one volume is full, we should respect each volume and send a heartbeat for each volume as it becomes full; but considering the complexity needed to achieve that, I doubt it's worth doing.

Besides the heartbeat sent here, regular node reports with storage info are sent every 60s. If we only send one report regardless of which volume, then we probably only need to send the first one and let the regular periodic node reports handle the rest.

@siddhantsangwan (Contributor, Author):

Ok, let's stick to the current implementation then. I'll change the interval to the node report interval instead of the heartbeat interval.

Contributor:

I think the purpose of sending the full-volume report is to avoid pipeline and container creation. Now the node report is throttled, and hence closing containers is implicitly throttled too. The initial purpose was to close containers immediately, to avoid new block allocation during the heartbeat interval (i.e. 30 seconds).

This may be similar to sending a DN heartbeat; the only advantage here is that the first failure within 1 minute is reported immediately, while all later failures are throttled.

For the node report, there is a configuration discovered at the SCM to avoid new container allocation, "hdds.datanode.storage.utilization.critical.threshold". We need to recheck the overall target of the problem to solve, optimize the configuration, and fix the inconsistency.

cc: @ChenSammi

@siddhantsangwan (Contributor, Author):

> For the node report, there is a configuration discovered at the SCM to avoid new container allocation, "hdds.datanode.storage.utilization.critical.threshold". We need to recheck the overall target of the problem to solve, optimize the configuration, and fix the inconsistency.

As discussed, this is dead code in Ozone and is not used anywhere.

@siddhantsangwan (Contributor, Author):

Thanks for the reviews! I've addressed comments in the latest commit.

@sumitagrawl (Contributor) left a comment:

LGTM

@peterxcli (Member) left a comment:

Thanks @siddhantsangwan for updating the patch, LGTM!

@sumitagrawl (Contributor) left a comment:

LGTM

@siddhantsangwan (Contributor, Author) commented Jun 9, 2025:

> I think the purpose of sending the full-volume report is to avoid pipeline and container creation. Now the node report is throttled, and hence closing containers is implicitly throttled too. The initial purpose was to close containers immediately, to avoid new block allocation during the heartbeat interval (i.e. 30 seconds).
> This may be similar to sending a DN heartbeat; the only advantage here is that the first failure within 1 minute is reported immediately, while all later failures are throttled.

Based on this comment, we decided to trigger a heartbeat immediately when:

  1. The container is (close to) full (the container-full check already exists).
  2. The volume is full EXCLUDING committed space (reserved - available - min free <= 0); see the sketch below. This is because when a volume is full INCLUDING committed space (reserved - available - committed - min free <= 0), open containers can still accept writes, so the current behaviour of sending a close-container action when the volume is full including committed space is a bug.
  3. The container is unhealthy (this is existing behaviour).

We decided not to send volume reports in the immediate heartbeat and to instead rely on regular node reports for that. This allows us to make the throttling per container.
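For illustration only, the two checks in point 2 might look roughly like this, assuming available already has reserved space subtracted and using hypothetical accessor names:

// Hypothetical accessor names; illustrative of the two conditions above, not the actual patch.
long free = volume.getAvailable() - volume.getMinFreeSpace();

// Full EXCLUDING committed space: this should trigger the immediate heartbeat.
boolean fullExcludingCommitted = free <= 0;

// Full INCLUDING committed space: open containers can still write into their
// committed reservation, so this alone should not trigger a close-container action.
boolean fullIncludingCommitted = free - volume.getCommittedBytes() <= 0;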

Closing this PR; opened a new PR instead: #8590
