Skip to content

Conversation

@ArafatKhan2198
Copy link
Contributor

@ArafatKhan2198 ArafatKhan2198 commented Feb 6, 2024

What changes were proposed in this pull request?

Encounterd an error in recon log IllegalArgumentExpection: containerSize Negative" :-

2024-02-01 19:50:21,727 ERROR [ContainerSizeCountTask]-org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask: FIXME: Failed to process ContainerInfo{id=#3979, state=DELETED, stateEnterTime=2024-02-01T02:45:43.506025Z, pipelineID=PipelineID=0ea98cdd-adf9-4ffa-b7f8-15a3052b155a, owner=om1546336125}
java.lang.IllegalArgumentException: containerSize = -63 < 0
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:204)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeBinIndex(ReconUtils.java:329)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeUpperBound(ReconUtils.java:309)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.getContainerSizeCountKey(ContainerSizeCountTask.java:333)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.updateContainerSizeCount(ContainerSizeCountTask.java:313)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.decrementContainerSizeCount(ContainerSizeCountTask.java:308)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:127)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:168)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.run(ContainerSizeCountTask.java:106)
    at java.base/java.lang.Thread.run(Thread.java:834)
2024-02-01 19:50:21,728 ERROR [ContainerSizeCountTask]-org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask: FIXME: Failed to process ContainerInfo{id=#3985, state=DELETED, stateEnterTime=2024-02-01T02:45:43.509864Z, pipelineID=PipelineID=4c92a12e-7c4d-457a-ab61-0bddd1a23711, owner=om1546336125}
java.lang.IllegalArgumentException: containerSize = -63 < 0
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:204)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeBinIndex(ReconUtils.java:329)
    at org.apache.hadoop.ozone.recon.ReconUtils.getContainerSizeUpperBound(ReconUtils.java:309)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.getContainerSizeCountKey(ContainerSizeCountTask.java:333)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.updateContainerSizeCount(ContainerSizeCountTask.java:313)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.decrementContainerSizeCount(ContainerSizeCountTask.java:308)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:127)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.process(ContainerSizeCountTask.java:168)
    at org.apache.hadoop.ozone.recon.tasks.ContainerSizeCountTask.run(ContainerSizeCountTask.java:106)
    at java.base/java.lang.Thread.run(Thread.java:834)  

The changes proposed in the PR :-

  1. Skip the containers having negative size due to a separate reason
  2. Skip the containers which has gone into the DELETED state.
  3. Track containers with negative sizes in the UnhealthyContainersTable under a new unhealthy state termed NEGATIVE_SIZE.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10293

How was this patch tested?

Conducted unit and manual testing for scenarios where containers reporting negative sizes are skipped in the containerSizeCountTask and tracked in the unhealthyContainerTable.

http://localhost:9888/api/v1/containers/unhealthy :-

image

@adoroszlai adoroszlai changed the title HDDS-10293. IllegalArgumentException: containerSize Negative - Container ID #3979. HDDS-10293. IllegalArgumentException: containerSize Negative Feb 6, 2024
@adoroszlai
Copy link
Contributor

@ArafatKhan2198 please wait for clean CI run in your fork before opening the PR

@ArafatKhan2198
Copy link
Contributor Author

@devmadhuu @dombizita Can you please take a look.

@dombizita dombizita requested a review from szetszwo February 6, 2024 07:19
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArafatKhan2198 , thanks for working on this! Question: why getUsedBytes() a container can return a negative number?

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for working on this patch. Few comments. Pls check.

@errose28
Copy link
Contributor

errose28 commented Feb 8, 2024

@szetszwo @ArafatKhan2198 we've seen problems in the past with ContainerInfo#usedBytes going negative on the datanodes when block deletes happen. Block counts had a similar issue. We did some fixes in the past to try to get them more accurate but there may still be issues, or the issues are left over from when this cluster was running an older version. We ended up just not using these for anything requiring exact values. For example, container delete uses the isEmpty flag now, but container close and DU approximations still use usedBytes since those can have some margin of error.

I checked one of our clusters with the issue and SCM sees the negative counts too so it is coming from the datanodes, not Recon. I think Recon should be tolerant of the inaccuracies and probably not log anything alarming like warn/error in this case.

@devmadhuu
Copy link
Contributor

@szetszwo @ArafatKhan2198 we've seen problems in the past with ContainerInfo#usedBytes going negative on the datanodes when block deletes happen. Block counts had a similar issue. We did some fixes in the past to try to get them more accurate but there may still be issues, or the issues are left over from when this cluster was running an older version. We ended up just not using these for anything requiring exact values. For example, container delete uses the isEmpty flag now, but container close and DU approximations still use usedBytes since those can have some margin of error.

I checked one of our clusters with the issue and SCM sees the negative counts too so it is coming from the datanodes, not Recon. I think Recon should be tolerant of the inaccuracies and probably not log anything alarming like warn/error in this case.

Thanks @errose28 for your explanation. As you mentioned that isEmpty flag being used in SCM now to know if container bceomes empty, however Recon have a different use case where Recon has a task to categorize containers on different size ranges based on usedBytes, so Recon has to rely on this attribute, but since you mentioned this attribute has still some underlying margin of error, can Recon use something else to know the actual used space in a container. Any suggestions ?

@szetszwo
Copy link
Contributor

szetszwo commented Feb 8, 2024

... we've seen problems in the past with ContainerInfo#usedBytes going negative ...

@errose28 , I do recall such problems. Thanks for pointing it out! The current code prints an error IllegalArgumentException message, skips that container and then continues with the other containers.

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

@ArafatKhan2198
Copy link
Contributor Author

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

@devmadhuu
Copy link
Contributor

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

Do not add a new table. Better add a new category of Unhealthy container in UNHEALTHY_CONTAINERS table.

@ArafatKhan2198
Copy link
Contributor Author

@ArafatKhan2198 , We may consider adding a counter in Recon for the containers with negative usedBytes. It could be a useful feature for debugging and fixing clusters with such problems.

Good idea, we could consolidate all related container information into a table, including container ID, state, and other useful details. This table could then be periodically updated by the container size count task to reflect any changes, especially if a container is found to have negative used bytes.

Do not add a new table. Better add a new category of Unhealthy container in UNHEALTHY_CONTAINERS table.

Yes, you're correct. We could utilize the existing Unhealthy Container Table. I forgot about that!

@errose28
Copy link
Contributor

errose28 commented Feb 9, 2024

Whatever solution we go with I don't think Recon should show user facing alerts about the negative container sizes. It will just raise more questions to users for something that is not a serious problem. I think datanodes or SCM may want to log warnings, but Recon would probably be fine with just a debug log and some sort of error handling when classifying by container sizes (maybe just round them to zero).

@devmadhuu
Copy link
Contributor

Whatever solution we go with I don't think Recon should show user facing alerts about the negative container sizes. It will just raise more questions to users for something that is not a serious problem. I think datanodes or SCM may want to log warnings, but Recon would probably be fine with just a debug log and some sort of error handling when classifying by container sizes (maybe just round them to zero).

Yes agree, we should not expose to UI, having it in table and periodic cleanup should be ok, and some API internally, may not be a documented one , so that using curl, we can easily fetch out such bad data. Why suggesting an API because if we are planning to store in existing SQL derby table , so someone needs not to have a client for sql table.

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for working on the comments. LGTM +1.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArafatKhan2198 , thanks for the update! Please see the comments inlined.

@dombizita dombizita requested a review from szetszwo February 20, 2024 12:27
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 45d420a into apache:master Feb 22, 2024
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Jun 10, 2024
…ve (apache#6178)

(cherry picked from commit 45d420a)
Change-Id: I48ca5f730aefdf9d045cf2dab4ead4e01fbc46d1
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants