HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full #8492
Conversation
nodeReport = context.getParent().getContainer().getNodeReport();
context.refreshFullReport(nodeReport);
context.getParent().triggerHeartbeat();
LOG.info("Triggering heartbeat for full volume {}, with node report: {}.", volume, nodeReport);
This is on the write path, so we must be extra careful about performance. An info log will reduce performance, but I wonder if it's ok in this case because this won't happen often? What do others think?
Moreover, the future plan is to fail the write anyway if the size exceeds the min free and reserved space boundary.
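For illustration only, a rough sketch of what such a min-free/reserved-space check could look like. The class, method, and parameter names are assumptions for this sketch, not the actual HddsVolume/HddsDispatcher API; StorageContainerException and ContainerProtos.Result.DISK_OUT_OF_SPACE are the existing types the real check would likely use.

```java
import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
import org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;

// Hypothetical sketch of failing the write once a volume crosses the
// min-free / reserved space boundary. All names here are illustrative.
final class VolumeSpaceCheck {
  private VolumeSpaceCheck() {
  }

  static void ensureSpaceForWrite(long availableBytes, long committedBytes,
      long minFreeBytes, long requestedBytes) throws StorageContainerException {
    // Space that would remain after serving this write, accounting for
    // bytes already committed to open containers on the volume.
    long remaining = availableBytes - committedBytes - requestedBytes;
    if (remaining < minFreeBytes) {
      throw new StorageContainerException(
          "Out of space on volume: remaining=" + remaining
              + " minFree=" + minFreeBytes,
          ContainerProtos.Result.DISK_OUT_OF_SPACE);
    }
  }
}
```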
peterxcli left a comment
Thanks @siddhantsangwan for this improvement!
...iner-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java
this.slowOpThresholdNs = getSlowOpThresholdMs(conf) * 1000000;
fullVolumeLastHeartbeatTriggerMs = new AtomicLong(-1);
long heartbeatInterval =
    config.getTimeDuration("hdds.heartbeat.interval", 30000, TimeUnit.MILLISECONDS);
Can we call HddsServerUtil#getScmHeartbeatInterval instead?
Also, there is HDDS_NODE_REPORT_INTERVAL for the node report. Shall we use the node report property instead of the heartbeat property?
HDDS_NODE_REPORT_INTERVAL is 1 minute; that may be too long?
1m or 3s doesn't matter, because you always send out the first heartbeat immediately. This 1m is used to control the throttling, right?
Yes, it's for throttling
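For readers following the thread, a minimal sketch of how an interval-based throttle like this can work, assuming the interval comes from something like HddsServerUtil#getScmHeartbeatInterval or the node report interval discussed above. The class and field names are illustrative, not the exact code in this PR.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of interval-based throttling for the "volume full" heartbeat.
// Names are illustrative; the PR keeps similar state in HddsDispatcher.
final class HeartbeatThrottle {
  private final AtomicLong lastTriggerMs = new AtomicLong(-1);
  private final long intervalMs;

  HeartbeatThrottle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Returns true if the caller should trigger an immediate heartbeat now. */
  boolean shouldTrigger() {
    long now = System.currentTimeMillis();
    long last = lastTriggerMs.get();
    // Trigger only if the interval has elapsed (or we never triggered before);
    // compareAndSet ensures only one racing writer thread wins.
    return (last < 0 || now - last >= intervalMs)
        && lastTriggerMs.compareAndSet(last, now);
  }
}
```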
try {
  handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
  ContainerUtils.logAndReturnError(LOG, e, msg);
Are we going to return here?
Good catch, but I'm not sure. There was an exception in getting the node report, but does that mean we should fail the write? Maybe we should still let the write continue here. Otherwise, because of an intermittent or non-severe exception, we could keep failing writes. What do you think?
It's OK not to return here, but instead of calling ContainerUtils.logAndReturnError, you can probably just log the failure message.
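Something along these lines, i.e. catching the exception and only logging it so the write itself still proceeds. This is a sketch of the suggestion, not the final code; LOG, msg, and container come from the surrounding dispatchRequest context.

```java
try {
  handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
  // Building or refreshing the node report failed; don't fail the client
  // write for that, just log and continue so an intermittent error doesn't
  // reject traffic.
  LOG.warn("Failed to handle full volume while handling request: {}", msg, e);
}
```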
To test whether the logging is proper, I added a new test that throws an exception. Here's what the logs look like:
2025-05-30 16:01:08,027 [main] WARN impl.HddsDispatcher (HddsDispatcher.java:dispatchRequest(354)) - Failed to handle full volume while handling request: cmdType: WriteChunk
containerID: 1
datanodeUuid: "c6842f19-cbc5-47ca-bce0-f5bc859ef807"
writeChunk {
blockID {
containerID: 1
localID: 1
blockCommitSequenceId: 0
}
chunkData {
chunkName: "36b4d6b58215a7da96e3bf71a602e3ea_stream_1_chunk_1"
offset: 0
len: 36
checksumData {
type: NONE
bytesPerChecksum: 0
}
}
data: "b0bc4858-a308-417d-b363-0631e07b97ec"
}
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Failed to create node report when handling full volume /var/folders/jp/39hcmgjx4yb_kry3ydxb3c7r0000gn/T/junit-110499014917526916. Volume Report: { id=DS-db481691-4055-404b-8790-f375e6d41215 dir=/var/folders/jp/39hcmgjx4yb_kry3ydxb3c7r0000gn/T/junit-110499014917526916/hdds type=DISK capacity=499 used=390 available=109 minFree=100 committed=50 }
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.handleFullVolume(HddsDispatcher.java:481)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:352)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$1(HddsDispatcher.java:199)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:198)
at org.apache.hadoop.ozone.container.common.impl.TestHddsDispatcher.testExceptionHandlingWhenVolumeFull(TestHddsDispatcher.java:430)
...
 */
private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
  long current = System.currentTimeMillis();
  long last = fullVolumeLastHeartbeatTriggerMs.get();
Consider the case where different volumes get full: for example, at P0 /data1 gets full, and at P1 /data2 gets full, with (P1 - P0) < interval. Do we expect two emergent container reports, or one report?
Currently we will only send one report. I think this is fine because in the report we send info about all the volumes. However, there's a discussion going on here: #8460 (comment).
I don't have a good answer for this after thinking about it for a while. The ideal state is that if we want to send an immediate heartbeat when one volume is full, we should respect each volume and send a heartbeat for each volume when it becomes full; but considering the complexity introduced to achieve that, I doubt whether it's worth doing.
Besides the heartbeat sent here, there are regular node reports with storage info sent every 60s. If we only send one report regardless of which volume, then probably we only need to send the first one and let the regular periodic node reports do the rest.
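For comparison, the per-volume alternative discussed above could look roughly like the sketch below, keyed by each volume's storage directory. This is purely illustrative (all names are assumptions, and it is not what the PR implements); as noted, the extra complexity is probably not worth it given the periodic node reports.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of per-volume throttling (the alternative discussed
// above). Keyed by the volume's storage directory.
final class PerVolumeHeartbeatThrottle {
  private final Map<String, Long> lastTriggerMs = new ConcurrentHashMap<>();
  private final long intervalMs;

  PerVolumeHeartbeatThrottle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  boolean shouldTrigger(String volumeDir) {
    final long now = System.currentTimeMillis();
    final boolean[] triggered = {false};
    // compute() gives an atomic per-key read-modify-write, so two threads
    // hitting the same full volume at once still trigger only one heartbeat.
    lastTriggerMs.compute(volumeDir, (dir, last) -> {
      if (last == null || now - last >= intervalMs) {
        triggered[0] = true;
        return now;
      }
      return last;
    });
    return triggered[0];
  }
}
```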
Ok, let's stick to the current implementation then. I'll change the interval to the node report interval instead of the heartbeat interval.
I think the purpose of sending the full volume report is to avoid pipeline and container creation. Now the node report is throttled, so the close-container action is implicitly throttled as well. The initial purpose was to close the container immediately, to avoid new block allocation during the HB interval (i.e. 30 seconds).
This may be similar to sending a DN HB; the only advantage here is that the first failure within 1 minute is reported immediately, while all later failures are throttled.
For the node report, there is a new configuration discovered at SCM to avoid new container allocation, "hdds.datanode.storage.utilization.critical.threshold". We need to recheck the overall target of the problem being solved and optimize the configuration / fix the inconsistency.
cc: @ChenSammi
> for node report, there is a new configuration at SCM discovered to avoid new container allocation, "hdds.datanode.storage.utilization.critical.threshold". We need recheck overall target of problem to solve and optimize configuration / fix inconsistency.

As discussed, this is dead code in Ozone and is not used anywhere.
Thanks for the reviews! I've addressed comments in the latest commit.
sumitagrawl left a comment
LGTM
...iner-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java
peterxcli left a comment
Thanks @siddhantsangwan for updating the patch, LGTM!
sumitagrawl left a comment
LGTM
Based on this comment, we decided to trigger the heartbeat immediately when:
We decided not to send volume reports in the immediate heartbeat and instead rely on regular node reports for that. This allows us to make the throttling per container. Closing this PR; opened a new PR instead: #8590
What changes were proposed in this pull request?
This pull request is for implementing a part of the design proposed in HDDS-12929. This only contains the implementation for detecting a full volume, getting the latest storage report, adding the container action, then immediately triggering (or throttling) a heartbeat.
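Pieced together from the snippets quoted in the review above, the added path looks roughly like the sketch below. The context/parent method names come from the quoted diff; the throttle field, the NodeReportProto type, and the overall shape are paraphrased assumptions, and the container-action step is elided.

```java
// Rough outline of handleFullVolume on the write path (paraphrased sketch,
// not the exact PR code). Error handling and the container action are elided.
private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
  if (!throttle.shouldTrigger()) {
    return; // an immediate heartbeat was already sent within the interval
  }
  // Build a fresh node report so SCM sees the latest per-volume usage,
  // push it into the heartbeat context, then fire the heartbeat right away.
  NodeReportProto nodeReport = context.getParent().getContainer().getNodeReport();
  context.refreshFullReport(nodeReport);
  context.getParent().triggerHeartbeat();
  LOG.info("Triggering heartbeat for full volume {}, with node report: {}.",
      volume, nodeReport);
}
```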
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13045
How was this patch tested?
Modified existing unit tests; a sketch of the throttling expectation such a test can assert is shown after the manual steps below. Also did some manual testing using the ozone docker compose cluster.
a. Simulated a close-to-full volume with a capacity of 2 GB, available space of 150 MB, and min free space of 100 MB. Datanode log:
b. Wrote 100 MB of data using freon, with the expectation that an immediate heartbeat will be triggered as soon as the available space drops to 100 MB. Datanode log shows that this happened at 09:50:52:
c. In the SCM, the last storage report BEFORE the write operation was received at 09:50:09:
So, the next storage report should be received a minute later at 09:51:09, unless it's triggered immediately due to volume full. The SCM log shows that the immediately triggered report was received at 09:50:52, corresponding to the DN log:
The next storage report is received at the expected time of 09:51:09, showing that throttling also worked.
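As mentioned above, the unit-test-level expectation is essentially "the first full-volume event triggers immediately, the next one within the interval is throttled". A hypothetical JUnit 5 sketch against the HeartbeatThrottle sketch from earlier in this thread (not the actual TestHddsDispatcher code):

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Illustrates the throttling expectation only; the real tests live in
// TestHddsDispatcher and exercise the dispatcher end to end.
class HeartbeatThrottleTest {

  @Test
  void firstFullVolumeEventTriggersImmediately() {
    HeartbeatThrottle throttle = new HeartbeatThrottle(60_000); // 1 minute
    assertTrue(throttle.shouldTrigger(), "first trigger should be immediate");
  }

  @Test
  void secondEventWithinIntervalIsThrottled() {
    HeartbeatThrottle throttle = new HeartbeatThrottle(60_000);
    assertTrue(throttle.shouldTrigger());
    assertFalse(throttle.shouldTrigger(), "should be throttled within the interval");
  }
}
```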
Green CI in my fork: https://github.com/siddhantsangwan/ozone/actions/runs/15135787944/job/42547140475