HDDS-10293. IllegalArgumentException: containerSize Negative #6178
@ArafatKhan2198 please wait for a clean CI run in your fork before opening the PR.
@devmadhuu @dombizita Can you please take a look?
szetszwo
left a comment
@ArafatKhan2198, thanks for working on this! Question: why can getUsedBytes() of a container return a negative number?
...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java
devmadhuu
left a comment
Thanks @ArafatKhan2198 for working on this patch. A few comments; please check.
...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java
@szetszwo @ArafatKhan2198 we've seen problems in the past with I checked one of our clusters with the issue, and SCM sees the negative counts too, so it is coming from the datanodes, not Recon. I think Recon should be tolerant of the inaccuracies and probably not log anything alarming like warn/error in this case.
Thanks @errose28 for your explanation. As you mentioned that
@errose28, I do recall such problems. Thanks for pointing it out! The current code prints an error. @ArafatKhan2198, we may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.
Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.
Do not add a new table. Instead, add a new category of unhealthy container to the UNHEALTHY_CONTAINERS table.
Yes, you're correct. We could utilize the existing Unhealthy Container Table. I forgot about that!
Whatever solution we go with, I don't think Recon should show user-facing alerts about the negative container sizes. It will just raise more questions from users about something that is not a serious problem. I think datanodes or SCM may want to log warnings, but Recon would probably be fine with just a debug log and some sort of error handling when classifying by container sizes (maybe just round them to zero).
Yes, agreed: we should not expose this in the UI. Having it in the table with periodic cleanup should be OK, along with an internal API (it need not be a documented one) so that we can easily fetch such bad data using curl. I am suggesting an API because, if we plan to store this in the existing SQL Derby table, nobody should need a SQL client to read it.
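To make the "round them to zero" suggestion above concrete, here is a minimal sketch of size classification that tolerates negative values. The class name, the 1 GiB smallest bin, and the power-of-two binning are assumptions for illustration only; they are not taken from ContainerSizeCountTask or any other Recon code.

```java
import java.util.Map;
import java.util.TreeMap;

final class SizeBinSketch {
  // Hypothetical smallest bin (1 GiB); not taken from the Recon source.
  static final long MIN_BIN_BYTES = 1L << 30;
  // Largest bound we return, so the doubling loop cannot overflow a long.
  static final long MAX_BIN_BYTES = 1L << 62;

  /** Treats a negative reported size as zero instead of throwing. */
  static long sanitize(long usedBytes) {
    return Math.max(0L, usedBytes);
  }

  /** Upper bound of the power-of-two bin that holds the sanitized size. */
  static long upperBound(long usedBytes) {
    long size = sanitize(usedBytes);
    long bound = MIN_BIN_BYTES;
    while (bound <= size && bound < MAX_BIN_BYTES) {
      bound <<= 1;
    }
    return bound;
  }

  /** Counts containers per bin, keyed by the bin's upper bound. */
  static Map<Long, Integer> countBySize(long[] usedBytesPerContainer) {
    Map<Long, Integer> counts = new TreeMap<>();
    for (long usedBytes : usedBytesPerContainer) {
      counts.merge(upperBound(usedBytes), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // One negative report, one empty container, one container of exactly 1 GiB.
    long[] reported = {-4096L, 0L, 1L << 30};
    System.out.println(countBySize(reported));
  }
}
```

With this approach a negative report is counted in the smallest bin rather than aborting the task. The trade-off is that the bad value is silently absorbed, which is why the thread also discusses tracking such containers separately.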
devmadhuu
left a comment
Thanks @ArafatKhan2198 for working on the comments. LGTM +1.
szetszwo
left a comment
@ArafatKhan2198 , thanks for the update! Please see the comments inlined.
...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java
szetszwo
left a comment
+1 the change looks good.
…ve (apache#6178) (cherry picked from commit 45d420a) Change-Id: I48ca5f730aefdf9d045cf2dab4ead4e01fbc46d1
What changes were proposed in this pull request?
Encountered an error in the Recon log: "IllegalArgumentException: containerSize Negative".
The changes proposed in the PR:
Containers reporting negative sizes are recorded in the UnhealthyContainersTable under a new unhealthy state termed NEGATIVE_SIZE.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10293
How was this patch tested?
Conducted unit and manual testing for scenarios where containers reporting negative sizes are skipped in the containerSizeCountTask and tracked in the unhealthyContainerTable.
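The tested behaviour (skip containers with a negative size, and track them as unhealthy) can be sketched as follows. This is an illustration under assumptions, not the code from the patch: the `Container`, `UnhealthyRecord`, and `Result` types are hypothetical stand-ins, and only the `NEGATIVE_SIZE` state name comes from the PR description.

```java
import java.util.ArrayList;
import java.util.List;

final class NegativeSizeTrackingSketch {
  /** Only the state named in this PR is modelled here. */
  enum UnhealthyState { NEGATIVE_SIZE }

  /** Hypothetical, simplified view of a reported container. */
  static final class Container {
    final long id;
    final long usedBytes;

    Container(long id, long usedBytes) {
      this.id = id;
      this.usedBytes = usedBytes;
    }
  }

  /** Hypothetical row destined for the unhealthy-containers table. */
  static final class UnhealthyRecord {
    final long containerId;
    final UnhealthyState state;

    UnhealthyRecord(long containerId, UnhealthyState state) {
      this.containerId = containerId;
      this.state = state;
    }
  }

  /** Outcome of one pass: containers to count, and containers to flag. */
  static final class Result {
    final List<Container> counted = new ArrayList<>();
    final List<UnhealthyRecord> unhealthy = new ArrayList<>();
  }

  /** Skips negative-size containers in the count and records them instead. */
  static Result process(List<Container> containers) {
    Result result = new Result();
    for (Container c : containers) {
      if (c.usedBytes < 0) {
        result.unhealthy.add(new UnhealthyRecord(c.id, UnhealthyState.NEGATIVE_SIZE));
        continue; // do not feed a negative size into the size classification
      }
      result.counted.add(c);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Container> input = new ArrayList<>();
    input.add(new Container(1L, 1024L));
    input.add(new Container(2L, -512L));
    Result r = process(input);
    System.out.println("counted=" + r.counted.size() + ", unhealthy=" + r.unhealthy.size());
  }
}
```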
http://localhost:9888/api/v1/containers/unhealthy:-
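For anyone who wants to query the endpoint above programmatically rather than with curl, a minimal sketch using the JDK's built-in HTTP client (Java 11 or later) is shown here. It assumes a Recon instance is listening on localhost:9888 and that the response is JSON; the class name is hypothetical, and the shape of the response body is not described in this PR, so the sketch only prints it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class UnhealthyContainersQuerySketch {
  // URL taken from the PR description; host and port depend on the deployment.
  static final String ENDPOINT = "http://localhost:9888/api/v1/containers/unhealthy";

  /** Builds a plain GET request for the unhealthy-containers endpoint. */
  static HttpRequest buildRequest() {
    return HttpRequest.newBuilder(URI.create(ENDPOINT))
        .header("Accept", "application/json") // assumption: the API returns JSON
        .GET()
        .build();
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    // Requires a running Recon instance; otherwise send() throws an IOException.
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(buildRequest(), HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```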