HDDS-7926. [hsync] Recon throws ClassCastException. #4266

Conversation
| // Handle RepeatedOmKeyInfo object |
| RepeatedOmKeyInfo repeatedKeyInfo = (RepeatedOmKeyInfo) omKeyInfo; |
| keyInfo = repeatedKeyInfo.getOmKeyInfoList().get(0); |
| oldKeyInfo = repeatedKeyInfo.getOmKeyInfoList().get(0); |
Does it guarantee repeatedKeyInfo has only one key? Also, why are keyInfo and oldKeyInfo the same? Typo?
Thanks for the comment @jojochuang
I was mistaken in believing that all OmKeyInfo objects stored in the RepeatedOmKeyInfo class are similar in size and differ only in their paths. In fact, it is the complete opposite. Once a key is deleted, it is moved to the OM metadata deletedTable. Hence, having a mapping of {label: List<OmKeyInfo>} ensures that if users repeatedly create and delete keys with the exact same URI, all instances of deletion are grouped together under the same key name.
Therefore, I believe we must iterate through the list and call handleDeleteKeyEvent() on each OmKeyInfo object in it. Do let me know if I am making a mistake in my understanding of the RepeatedOmKeyInfo class and the RocksDB deletedTable.
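A minimal sketch of that iteration, assuming a handleDeleteKeyEvent helper along the lines discussed here (the surrounding class and the signatures are illustrative, not the actual FileSizeCountTask code):

```java
import org.apache.hadoop.ozone.om.helpers.OmKeyInfo;
import org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo;

final class RepeatedKeyDeletionSketch {

  void processDeletedValue(Object value) {
    if (value instanceof RepeatedOmKeyInfo) {
      RepeatedOmKeyInfo repeated = (RepeatedOmKeyInfo) value;
      // Each element is a separate deletion of a key that reused the same
      // name, so every instance must be handled, not just index 0.
      for (OmKeyInfo deletedKey : repeated.getOmKeyInfoList()) {
        handleDeleteKeyEvent(deletedKey);
      }
    }
  }

  private void handleDeleteKeyEvent(OmKeyInfo keyInfo) {
    // Placeholder: in FileSizeCountTask this would decrement the
    // file-size bucket counter for the deleted key.
  }
}
```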
Something is confusing me. The fileSizeCount task should only be processing events from the file table and the key table. They both have value type of OmKeyInfo. The only table that has RepeatedOmKeyInfo values is the deletedTable. But those should be skipped here:
ozone/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/FileSizeCountTask.java
Line 129 in c449415
| continue; |
What am I missing?
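For context, a reconstructed sketch of the loop around that line; only the taskTables check and the `continue` are verbatim from the quoted fragments, everything else is an assumed shape:

```java
import java.util.Collection;
import java.util.Iterator;
import org.apache.hadoop.ozone.om.helpers.OmKeyInfo;

// Assumed shape of FileSizeCountTask's event loop, pieced together from
// the fragments quoted in this thread.
void processEvents(Iterator<OMDBUpdateEvent<String, Object>> eventIterator,
    Collection<String> taskTables) {
  while (eventIterator.hasNext()) {
    OMDBUpdateEvent<String, Object> omdbUpdateEvent = eventIterator.next();
    // deletedTable events (value type RepeatedOmKeyInfo) should be rejected
    // here, since taskTables holds only keyTable and fileTable.
    if (!taskTables.contains(omdbUpdateEvent.getTable())) {
      continue;
    }
    // Reaching this cast with a deletedTable event is exactly what produces
    // the reported ClassCastException.
    OmKeyInfo omKeyInfo = (OmKeyInfo) omdbUpdateEvent.getValue();
    // ... update the file-size counters from omKeyInfo ...
  }
}
```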
@GeorgeJahad Just saw your comment here. I agree with you (to be double checked with UT/experiments). I have a short writeup below.
@smengcl Don't you think this check would prevent the deleted table from being referenced? It shouldn't be possible for a RepeatedOmKeyInfo ever to get through.
ozone/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/FileSizeCountTask.java
Line 127 in 59938f9
| if (!taskTables.contains(omdbUpdateEvent.getTable())) { |
@ArafatKhan2198 Looks like it is correctly limiting the events to keyTable.
Then we need to find out how those deletedTable events could have gotten through this check in the first place.
Any chance for a quick repro UT? Generate an event from deletedTable and see if the filter logic is working?
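Something along these lines, as a hedged sketch: only builder.setTable(...) appears verbatim in the thread (in the OMDBUpdatesHandler snippet quoted below); the builder class, the other setters, OMUpdateEventBatch, and the mock setup are assumptions about Recon's test harness, not verified API.

```java
import static org.mockito.Mockito.mock;

import java.util.Collections;
import org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo;
import org.junit.Test;

@Test
public void deletedTableEventShouldBeFilteredOut() {
  // Build an event tagged with deletedTable, whose value type is
  // RepeatedOmKeyInfo rather than OmKeyInfo.
  OMDBUpdateEvent<String, Object> event =
      new OMDBUpdateEvent.OMUpdateEventBuilder<String, Object>()
          .setTable("deletedTable")
          .setKey("/vol1/bucket1/key1")             // example key
          .setValue(mock(RepeatedOmKeyInfo.class))  // assumed setter
          .setAction(OMDBUpdateEvent.OMDBUpdateAction.PUT)
          .build();

  // If the filter works, process() must neither throw ClassCastException
  // nor update any file-size counters for this event.
  fileSizeCountTask.process(
      new OMUpdateEventBatch(Collections.singletonList(event)));
}
```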
I wonder if it could have set the wrong table from the source:
ozone/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/OMDBUpdatesHandler.java
Line 110 in 59938f9
| builder.setTable(tableName); |
@smengcl I made modifications to the UTs for process() in the corresponding test class for FileSizeCountTask, and the filter check successfully prevented deletedTable events from going through.
@ArafatKhan2198 So the possibility of OMDBUpdatesHandler setting the wrong table above still exists. We need to double check its logic.
@jojochuang Do you have an idea how to repro this? Would doing hsync alone trigger the exception on Recon?
Thanks @smengcl. I am not familiar with the Recon implementation. If Recon does respond correctly when a key is removed from the namespace, then I agree the best course of action for this issue is to ignore it.
Oh, it would still be a good idea to track deletedTable (and deletedDirectoryTable) in Recon too, probably separately. We encountered an issue recently where keys were removed from the namespace, but because their number was high, the actual deletes got backlogged, and it took a huge amount of time to track down where they got backlogged (multiple entities are involved: OM, SCM and DN). We managed to track down the deletion by looking at RocksDB entries, tailing logs and metrics, though.
I think that's a great suggestion. We could start by creating an endpoint that exposes the data from the deletedTable and later work on implementing a user interface for it. I'll go ahead and create a JIRA for this task.
@smengcl @jojochuang can you please take a look!
When a key is deleted in an Ozone cluster, the Ozone Manager updates its OM DB to reflect the deletion. This change is then detected by Recon during its next incremental update or full snapshot scan, depending on the mode in which Recon is operating. Once Recon detects the deletion, it updates its metadata store to reflect the change. Specifically, Recon removes the deleted key from its associated index, which ensures that the deleted key is not included in any future searches or queries against the metadata store. Recon also retains a record of the deleted key in its metadata store, inside the deletedTable.

Overall, Recon does respond to key deletions in an Ozone cluster by detecting the change and updating its metadata store to reflect the new state of the cluster.
@ArafatKhan2198, NSSummaryTaskWithLegacy also throws this ClassCastException in the Recon log. It needs mitigation too.
Got it. I was not aware of this new usage of RepeatedOmKeyInfo.
Just want to be sure I understand. Does that mean that the value in the KeyTable here:
ozone/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/codec/OMDBDefinition.java
Lines 89 to 96 in c755e5d
is changing from OmKeyInfo to RepeatedOmKeyInfo?
+1.
I'm fine with ignoring RepeatedOmKeyInfo for now to solve the issue at hand as (IIUC) the new usage of it in OMKeyCommitRequest for hsync seems to be related to deletion.
But would that cause inaccurate counting of the file sizes in FileSizeCountTask? In that case, shall we add back the for loop to process each KeyInfo in RepeatedOmKeyInfo? (84a1e60#diff-b149c501820a6046d4e78940b783982729ffd53a54ebdcf8d44412d477d271c5L137-L151) We can open another JIRA for this discussion.
No. None of the table schemas have changed. The deletion in OmKeyCommit is done to the deletedTable. However, that still doesn't explain why the table filter didn't work. I still suspect the event source is the culprit. If we have a repro, it will be easy to dig into that.
Agree, we should find the root cause if this issue is reproducible. If it's not easily reproducible, I'm fine with this intermediate approach.
LGTM +1.
What changes were proposed in this pull request?
The Recon server is encountering an issue while processing a key in the file system, leading to an unexpected exception. The error message cites a ClassCastException, indicating that the Recon server is attempting to cast an object of type org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo to an object of type org.apache.hadoop.ozone.om.helpers.OmKeyInfo. As a result, the server is unable to properly process the key and accurately display the number of files/keys. So I have made changes to FileSizeCountTask which will handle both RepeatedOmKeyInfo and OmKeyInfo objects.
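Roughly, the handling described above amounts to a type dispatch like this sketch; handleDeleteKeyEvent is named in the discussion, while handlePutKeyEvent and the exact signatures are assumptions, and the merged patch may differ:

```java
import org.apache.hadoop.ozone.om.helpers.OmKeyInfo;
import org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo;

// Hedged sketch of handling both value types in FileSizeCountTask.
void handleValue(Object value) {
  if (value instanceof OmKeyInfo) {
    // keyTable/fileTable events carry a single OmKeyInfo.
    handlePutKeyEvent((OmKeyInfo) value);
  } else if (value instanceof RepeatedOmKeyInfo) {
    // deletedTable values group every deletion of the same key name, so
    // each OmKeyInfo in the list is processed individually.
    for (OmKeyInfo keyInfo : ((RepeatedOmKeyInfo) value).getOmKeyInfoList()) {
      handleDeleteKeyEvent(keyInfo);
    }
  }
}
```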
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7926
How was this patch tested?
UTs ran successfully.