HDDS-10293. IllegalArgumentException: containerSize Negative #6178

ArafatKhan2198 · 2024-02-06T05:38:39Z

What changes were proposed in this pull request?

Encounterd an error in recon log IllegalArgumentExpection: containerSize Negative" :-

2024-02-01 19:50:21,727 ERROR [ContainerSizeCountTask]-org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask: FIXME: Failed to process ContainerInfo{id=#3979, state=DELETED, stateEnterTime=2024-02-01T02:45:43.506025Z, pipelineID=PipelineID=0ea98cdd-adf9-4ffa-b7f8-15a3052b155a, owner=om1546336125}
java.lang.IllegalArgumentException: containerSize = -63 < 0
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:204)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeBinIndex(ReconUtils.java:329)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeUpperBound(ReconUtils.java:309)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.getContainerSizeCountKey(ContainerSizeCountTask.java:333)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.updateContainerSizeCount(ContainerSizeCountTask.java:313)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.decrementContainerSizeCount(ContainerSizeCountTask.java:308)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:127)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:168)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.run(ContainerSizeCountTask.java:106)
    at java.base/java.lang.Thread.run(Thread.java:834)
2024-02-01 19:50:21,728 ERROR [ContainerSizeCountTask]-org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask: FIXME: Failed to process ContainerInfo{id=#3985, state=DELETED, stateEnterTime=2024-02-01T02:45:43.509864Z, pipelineID=PipelineID=4c92a12e-7c4d-457a-ab61-0bddd1a23711, owner=om1546336125}
java.lang.IllegalArgumentException: containerSize = -63 < 0
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:204)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeBinIndex(ReconUtils.java:329)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeUpperBound(ReconUtils.java:309)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.getContainerSizeCountKey(ContainerSizeCountTask.java:333)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.updateContainerSizeCount(ContainerSizeCountTask.java:313)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.decrementContainerSizeCount(ContainerSizeCountTask.java:308)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:127)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:168)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.run(ContainerSizeCountTask.java:106)
    at java.base/java.lang.Thread.run(Thread.java:834)

The changes proposed in the PR :-

Skip the containers having negative size due to a separate reason
Skip the containers which has gone into the DELETED state.
Track containers with negative sizes in the UnhealthyContainersTable under a new unhealthy state termed NEGATIVE_SIZE.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10293

How was this patch tested?

Conducted unit and manual testing for scenarios where containers reporting negative sizes are skipped in the containerSizeCountTask and tracked in the unhealthyContainerTable.

http://localhost:9888/api/v1/containers/unhealthy :-

…ner ID apache#3979.

adoroszlai · 2024-02-06T05:40:48Z

@ArafatKhan2198 please wait for clean CI run in your fork before opening the PR

ArafatKhan2198 · 2024-02-06T06:08:48Z

@devmadhuu @dombizita Can you please take a look.

szetszwo

@ArafatKhan2198 , thanks for working on this! Question: why getUsedBytes() a container can return a negative number?

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java

devmadhuu

Thanks @ArafatKhan2198 for working on this patch. Few comments. Pls check.

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java

errose28 · 2024-02-08T05:00:26Z

@szetszwo @ArafatKhan2198 we've seen problems in the past with ContainerInfo#usedBytes going negative on the datanodes when block deletes happen. Block counts had a similar issue. We did some fixes in the past to try to get them more accurate but there may still be issues, or the issues are left over from when this cluster was running an older version. We ended up just not using these for anything requiring exact values. For example, container delete uses the isEmpty flag now, but container close and DU approximations still use usedBytes since those can have some margin of error.

I checked one of our clusters with the issue and SCM sees the negative counts too so it is coming from the datanodes, not Recon. I think Recon should be tolerant of the inaccuracies and probably not log anything alarming like warn/error in this case.

devmadhuu · 2024-02-08T06:50:47Z

@szetszwo @ArafatKhan2198 we've seen problems in the past with ContainerInfo#usedBytes going negative on the datanodes when block deletes happen. Block counts had a similar issue. We did some fixes in the past to try to get them more accurate but there may still be issues, or the issues are left over from when this cluster was running an older version. We ended up just not using these for anything requiring exact values. For example, container delete uses the isEmpty flag now, but container close and DU approximations still use usedBytes since those can have some margin of error.

I checked one of our clusters with the issue and SCM sees the negative counts too so it is coming from the datanodes, not Recon. I think Recon should be tolerant of the inaccuracies and probably not log anything alarming like warn/error in this case.

Thanks @errose28 for your explanation. As you mentioned that isEmpty flag being used in SCM now to know if container bceomes empty, however Recon have a different use case where Recon has a task to categorize containers on different size ranges based on usedBytes, so Recon has to rely on this attribute, but since you mentioned this attribute has still some underlying margin of error, can Recon use something else to know the actual used space in a container. Any suggestions ?

szetszwo · 2024-02-08T16:13:50Z

... we've seen problems in the past with ContainerInfo#usedBytes going negative ...

@errose28 , I do recall such problems. Thanks for pointing it out! The current code prints an error IllegalArgumentException message, skips that container and then continues with the other containers.

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

ArafatKhan2198 · 2024-02-08T16:48:58Z

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

devmadhuu · 2024-02-08T17:58:20Z

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

Do not add a new table. Better add a new category of Unhealthy container in UNHEALTHY_CONTAINERS table.

ArafatKhan2198 · 2024-02-08T17:59:53Z

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

Do not add a new table. Better add a new category of Unhealthy container in UNHEALTHY_CONTAINERS table.

Yes, you're correct. We could utilize the existing Unhealthy Container Table. I forgot about that!

errose28 · 2024-02-09T20:27:11Z

Whatever solution we go with I don't think Recon should show user facing alerts about the negative container sizes. It will just raise more questions to users for something that is not a serious problem. I think datanodes or SCM may want to log warnings, but Recon would probably be fine with just a debug log and some sort of error handling when classifying by container sizes (maybe just round them to zero).

devmadhuu · 2024-02-10T01:59:35Z

Whatever solution we go with I don't think Recon should show user facing alerts about the negative container sizes. It will just raise more questions to users for something that is not a serious problem. I think datanodes or SCM may want to log warnings, but Recon would probably be fine with just a debug log and some sort of error handling when classifying by container sizes (maybe just round them to zero).

Yes agree, we should not expose to UI, having it in table and periodic cleanup should be ok, and some API internally, may not be a documented one , so that using curl, we can easily fetch out such bad data. Why suggesting an API because if we are planning to store in existing SQL derby table , so someone needs not to have a client for sql table.

…erTable

devmadhuu

Thanks @ArafatKhan2198 for working on the comments. LGTM +1.

szetszwo

@ArafatKhan2198 , thanks for the update! Please see the comments inlined.

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java

szetszwo

+1 the change looks good.

…ve (apache#6178) (cherry picked from commit 45d420a) Change-Id: I48ca5f730aefdf9d045cf2dab4ead4e01fbc46d1

…6178) (cherry picked from commit 45d420a)

HDDS-10293. IllegalArgumentException: containerSize Negative - Contai…

ff5fff2

…ner ID apache#3979.

adoroszlai changed the title ~~HDDS-10293. IllegalArgumentException: containerSize Negative - Container ID #3979.~~ HDDS-10293. IllegalArgumentException: containerSize Negative Feb 6, 2024

adoroszlai added the recon label Feb 6, 2024

Merge both the added test methods together

2ec7725

Fixed checkstyle issues

b3f8620

Fixed failing UT

b4bb56e

dombizita requested a review from szetszwo February 6, 2024 07:19

szetszwo reviewed Feb 6, 2024

View reviewed changes

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java Outdated Show resolved Hide resolved

devmadhuu reviewed Feb 7, 2024

View reviewed changes

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java Show resolved Hide resolved

...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerSizeCountTask.java Show resolved Hide resolved

Fixed review comments

63ef581

Persisting the containers with negative sizes in the UnhealthyContain…

b4e2408

…erTable

devmadhuu approved these changes Feb 12, 2024

View reviewed changes

szetszwo reviewed Feb 12, 2024

View reviewed changes

ArafatKhan2198 added 2 commits February 13, 2024 03:15

Finished unit tests for identifying containers with negative size.

edeea8a

Made review chages

345b2aa

dombizita requested a review from szetszwo February 20, 2024 12:27

Treating negative container sizes and zero

75e66da

szetszwo approved these changes Feb 21, 2024

View reviewed changes

Fixed failing UT

3757c65

szetszwo merged commit 45d420a into apache:master Feb 22, 2024

swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Jun 10, 2024

CDPD-66192 HDDS-10293: IllegalArgumentException: containerSize Negati…

0aee918

…ve (apache#6178) (cherry picked from commit 45d420a) Change-Id: I48ca5f730aefdf9d045cf2dab4ead4e01fbc46d1

ArafatKhan2198 mentioned this pull request Sep 8, 2024

HDDS-11389. Incorrect number of deleted containers shown in Recon UI. #7149

Merged

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024

HDDS-10293. IllegalArgumentException: containerSize Negative (apache#…

1237a0f

…6178) (cherry picked from commit 45d420a)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 18, 2024

HDDS-10293. IllegalArgumentException: containerSize Negative (apache#…

2dd0c3a

…6178) (cherry picked from commit 45d420a)

xichen01 mentioned this pull request Sep 19, 2024

[DO NOT MERGE] Backport some fixes and compatibility commits from master to ozone-1.4 #7218

Merged

HDDS-10293. IllegalArgumentException: containerSize Negative #6178

HDDS-10293. IllegalArgumentException: containerSize Negative #6178

Uh oh!

Conversation

ArafatKhan2198 commented Feb 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

adoroszlai commented Feb 6, 2024

Uh oh!

ArafatKhan2198 commented Feb 6, 2024

Uh oh!

szetszwo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

devmadhuu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

errose28 commented Feb 8, 2024

Uh oh!

devmadhuu commented Feb 8, 2024

Uh oh!

szetszwo commented Feb 8, 2024

Uh oh!

ArafatKhan2198 commented Feb 8, 2024

Uh oh!

devmadhuu commented Feb 8, 2024

Uh oh!

ArafatKhan2198 commented Feb 8, 2024

Uh oh!

errose28 commented Feb 9, 2024

Uh oh!

devmadhuu commented Feb 10, 2024

Uh oh!

devmadhuu left a comment

Choose a reason for hiding this comment

Uh oh!

szetszwo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

szetszwo left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ArafatKhan2198 commented Feb 6, 2024 •

edited

Loading