
Conversation

@ArafatKhan2198 (Contributor) commented Feb 10, 2023

What changes were proposed in this pull request?

The Recon server encounters an issue while processing a key in the file system, leading to an unexpected exception. The error is a ClassCastException: the Recon server attempts to cast an object of type org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo to org.apache.hadoop.ozone.om.helpers.OmKeyInfo. As a result, the server cannot properly process the key or accurately display the number of files/keys.

I have made changes to FileSizeCountTask so that it handles both RepeatedOmKeyInfo and OmKeyInfo objects.
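For illustration, the shape of that handling might look roughly like this (a sketch only, not the exact patch; the helper name extractKeyInfos and the java.util imports are assumed here):

// Sketch: normalize the event value. keyTable/fileTable events carry an
// OmKeyInfo, while deletedTable-style values wrap a list in a RepeatedOmKeyInfo.
private List<OmKeyInfo> extractKeyInfos(Object value) {
  if (value instanceof OmKeyInfo) {
    return Collections.singletonList((OmKeyInfo) value);
  } else if (value instanceof RepeatedOmKeyInfo) {
    return ((RepeatedOmKeyInfo) value).getOmKeyInfoList();
  }
  return Collections.emptyList(); // unknown value type: skip instead of casting blindly
}

The per-event handling can then iterate over whatever list comes back instead of casting the value directly.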

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7926

How was this patch tested?

UTs ran successfully.

// Handle RepeatedOmKeyInfo object
RepeatedOmKeyInfo repeatedKeyInfo = (RepeatedOmKeyInfo) omKeyInfo;
keyInfo = repeatedKeyInfo.getOmKeyInfoList().get(0);
oldKeyInfo = repeatedKeyInfo.getOmKeyInfoList().get(0);
Contributor:

Does it guarantee repeatedKeyInfo has only one key? Also, why are keyInfo and oldKeyInfo the same? Typo?

Contributor Author:

Thanks for the comment @jojochuang.
I was mistaken in believing that all OmKeyInfo objects stored in the RepeatedOmKeyInfo class are similar in terms of size and differ only in their paths. In fact, it is the complete opposite. Once a key is deleted, it is moved to the OM metadata deletedTable. Having a mapping of {label: List<OmKeyInfo>} ensures that if users repeatedly create and delete keys with the exact same URI, all instances of deletion are grouped together under the same key name.
Hence I believe we must iterate through the list and call handleDeleteKeyEvent() on each OmKeyInfo object in the list. Do let me know if I am mistaken in my understanding of the RepeatedOmKeyInfo class and the RocksDB deletedTable.
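Roughly, the iteration described above would look like this (a sketch; the exact parameters of handleDeleteKeyEvent() in FileSizeCountTask may differ):

// Sketch: account for every deleted instance of the key, not just the first
// element of the RepeatedOmKeyInfo list.
RepeatedOmKeyInfo repeatedKeyInfo = (RepeatedOmKeyInfo) omdbUpdateEvent.getValue();
for (OmKeyInfo deletedKeyInfo : repeatedKeyInfo.getOmKeyInfoList()) {
  handleDeleteKeyEvent(deletedKeyInfo, fileSizeCountMap); // parameter list assumed
}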

Contributor:

Something is confusing me. The FileSizeCountTask should only be processing events from the file table and the key table. Both have a value type of OmKeyInfo. The only table that has RepeatedOmKeyInfo values is the deletedTable. But those should be skipped here:

What am I missing?

Contributor:

@GeorgeJahad Just saw your comment here. I agree with you (to be double checked with UT/experiments). I have a short writeup below.

Contributor Author:

@smengcl Don't you think this check would prevent the deleted table from being referenced, so it shouldn't be possible for a RepeatedOmKeyInfo ever to be thrown up?

if (!taskTables.contains(omdbUpdateEvent.getTable())) {

Contributor:

@ArafatKhan2198 Looks like it is correctly limiting the events to keyTable.

Then we need to find out how those deletedTable events could have gotten through this check in the first place.

Any chance for a quick repro UT? Generate an event from deletedTable and see if the filter logic is working?
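For example, a repro along these lines might work (a rough sketch using Mockito-style mocks; the constant, the batch accessor, and the process() call are from memory of the Recon task interface and may need adjusting):

// Sketch of a repro UT: feed FileSizeCountTask an event tagged with the
// deletedTable and check that the table filter drops it, i.e. no
// ClassCastException and no counts written for the fake key.
OMDBUpdateEvent deletedKeyEvent = mock(OMDBUpdateEvent.class);
when(deletedKeyEvent.getTable()).thenReturn(OmMetadataManagerImpl.DELETED_TABLE);
when(deletedKeyEvent.getValue()).thenReturn(mock(RepeatedOmKeyInfo.class));

OMUpdateEventBatch batch = mock(OMUpdateEventBatch.class);
when(batch.getEvents()).thenReturn(Collections.singletonList(deletedKeyEvent));

fileSizeCountTask.process(batch);
// Expectation: the event is filtered out before any cast to OmKeyInfo happens.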

Contributor:

I wonder if it could have set the wrong table from the source:

Contributor Author:

@smengcl I made modifications to the UTs for the process() method in the corresponding test class for FileSizeCountTask, and the filter check successfully prevented deletedTable events from going through.

Contributor:

@ArafatKhan2198 So the possibility of OMDBUpdatesHandler setting the wrong table above still exists. We need to double check its logic.

@jojochuang Do you have an idea how to repro this? Would doing hsync alone trigger the exception on Recon?

ArafatKhan2198 marked this pull request as ready for review February 13, 2023 07:47
@smengcl (Contributor) commented Feb 14, 2023

The OMDBUpdateEvent being processed by FileSizeCountTask should have come from processEvent here.

And because deletedTable is the only table that carries those RepeatedOmKeyInfo values, the correct action to take in this case could simply be ignoring the event when its value is of class RepeatedOmKeyInfo -- because those same keys are very likely processed once already when they are "moved" from keyTable to deletedTable during the key delete request, where the former effectively deletes the entries from keyTable. Please double-check if this is the case. CMIIW.
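For illustration, such a defensive skip inside the event loop could look roughly like this (a sketch, not code from the PR; the logger is assumed):

// Sketch: ignore deletedTable-style values instead of casting them.
Object value = omdbUpdateEvent.getValue();
if (value instanceof RepeatedOmKeyInfo) {
  // The key was presumably already counted when it was removed from keyTable/fileTable.
  LOG.debug("Skipping RepeatedOmKeyInfo value for key {}", omdbUpdateEvent.getKey());
  continue;
}
OmKeyInfo omKeyInfo = (OmKeyInfo) value;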

@jojochuang (Contributor):

Thanks @smengcl
This event is caused by the hsync implementation (still under code review), which is implemented by modifying OMKeyCommit. Essentially, this means it could happen during key commit already, if the key to be committed overwrites an existing key (possible today if the key is versioned).

I am not familiar with the Recon implementation. If Recon does respond correctly when a key is removed from the namespace, then I agree the best course of action for this issue is to ignore it.

@jojochuang (Contributor):

Oh, it would still be a good idea to track deletedTable (and deletedDirectoryTable) in Recon too. Probably separately.

We encountered an issue recently where keys were removed from the namespace, but because the number was high, the actual deletion got backlogged, and it took a huge amount of time to track down where it got backlogged (multiple entities are involved: OM, SCM and DN).

We did manage to track down the deletion by looking at RocksDB entries, tailing logs and metrics, though.

@ArafatKhan2198 (Contributor Author):

> The OMDBUpdateEvent being processed by FileSizeCountTask should have come from processEvent here.
>
> And because deletedTable is the only table that carries those RepeatedOmKeyInfo values, the correct action to take in this case could simply be ignoring the event when its value is of class RepeatedOmKeyInfo -- because those same keys are very likely processed once already when they are "moved" from keyTable to deletedTable during the key delete request, where the former effectively deletes the entries from keyTable. Please double-check if this is the case. CMIIW.

  • The deletedTable stores information about deleted keys in Ozone Manager's metadata. Whenever an object is deleted, the OmMetadataManagerImpl class creates an entry in the deletedTable in the form of a RepeatedOmKeyInfo object that contains multiple OmKeyInfo objects with the same URI. Periodic scans by Ozone's background services identify objects eligible for permanent deletion, which are then removed from the deletedTable, and their corresponding data is freed from storage.

  • Since this information is hidden from the user, there is no need for FileSizeCountTask to worry about the contents of the deletedTable. Hence, as you suggested, we should ignore the RepeatedOmKeyInfo objects and only deal with OmKeyInfo objects from the key and file tables.

@ArafatKhan2198 (Contributor Author):

> Oh, it would still be a good idea to track deletedTable (and deletedDirectoryTable) in Recon too. Probably separately.
>
> We encountered an issue recently where keys were removed from the namespace, but because the number was high, the actual deletion got backlogged, and it took a huge amount of time to track down where it got backlogged (multiple entities are involved: OM, SCM and DN).
>
> We did manage to track down the deletion by looking at RocksDB entries, tailing logs and metrics, though.

I think that's a great suggestion. We could start by creating an endpoint that exposes the data from the deletedTable and later work on implementing a user interface for it. I'll go ahead and create a JIRA for this task.

@ArafatKhan2198 (Contributor Author):

@smengcl @jojochuang can you please take a look!

@ArafatKhan2198 (Contributor Author) commented Feb 19, 2023

> I am not familiar with the Recon implementation. If Recon does respond correctly when a key is removed from the namespace, then I agree the best course of action for this issue is to ignore it.

When a key is deleted in an Ozone cluster, the Ozone Manager updates its OM-DB to reflect the deletion. This change is then detected by Recon during its next incremental update or full snapshot scan, depending on the mode in which Recon is operating.

Once Recon detects the deletion, it updates its metadata store to reflect the change. Specifically, Recon removes the deleted key from its index so that the key is not included in any future searches or queries against the metadata store.

Recon also retains a record of the deleted key in its metadata store inside the DeletedTable. This allows administrators to query the metadata store for deleted keys and to see when they were deleted. This information can be useful for auditing, compliance, and troubleshooting purposes.

Overall, Recon does respond to key deletions in an Ozone cluster by detecting the change and updating its metadata store to reflect the new state of the cluster.

@ChenSammi (Contributor):

@ArafatKhan2198, NSSummaryTaskWithLegacy also throws this ClassCastException in the Recon log. It needs mitigation too.

ArafatKhan2198 requested review from ChenSammi and removed request for jojochuang February 27, 2023 07:29
@smengcl (Contributor) commented Feb 27, 2023

> Thanks @smengcl This event is caused by the hsync implementation (still under code review), which is implemented by modifying OMKeyCommit. Essentially, this means it could happen during key commit already, if the key to be committed overwrites an existing key (possible today if the key is versioned).
>
> I am not familiar with the Recon implementation. If Recon does respond correctly when a key is removed from the namespace, then I agree the best course of action for this issue is to ignore it.

Got it. I was not aware of this new usage of RepeatedOmKeyInfo before this. In this case I agree that it should be properly handled like in this PR.

@GeorgeJahad (Contributor):

> Got it. I was not aware of this new usage of RepeatedOmKeyInfo before this. In this case I agree that it should be properly handled like in this PR.

Just want to be sure I understand. Does that mean that the value in the KeyTable here:

KEY_TABLE =
    new DBColumnFamilyDefinition<>(
        OmMetadataManagerImpl.KEY_TABLE,
        String.class,
        new StringCodec(),
        OmKeyInfo.class,
        new OmKeyInfoCodec(true));

is changing from OmKeyInfo to RepeatedOmKeyInfo?

@smengcl (Contributor) left a comment:

+1.

I'm fine with ignoring RepeatedOmKeyInfo for now to solve the issue at hand as (IIUC) the new usage of it in OMKeyCommitRequest for hsync seems to be related to deletion.

But would that cause inaccurate counting of the file size in FileSizeCountTask? In that case, shall we add back the for loop to process each KeyInfo in RepeatedOmKeyInfo? 84a1e60#diff-b149c501820a6046d4e78940b783982729ffd53a54ebdcf8d44412d477d271c5L137-L151 We can open another JIRA for this discussion.

@smengcl (Contributor) commented Feb 27, 2023

> Got it. I was not aware of this new usage of RepeatedOmKeyInfo before this. In this case I agree that it should be properly handled like in this PR.
>
> Just want to be sure I understand. Does that mean that the value in the KeyTable here:
>
> KEY_TABLE =
>     new DBColumnFamilyDefinition<>(
>         OmMetadataManagerImpl.KEY_TABLE,
>         String.class,
>         new StringCodec(),
>         OmKeyInfo.class,
>         new OmKeyInfoCodec(true));
>
> is changing from OmKeyInfo to RepeatedOmKeyInfo?

No. None of the table schemas have changed.

The deletion in OmKeyCommit is done to the deletedTable. However, that still doesn't explain why the table filter didn't work. I still suspect the event source is the culprit. If we have a repro, it will be easy to dig into that.

@ChenSammi (Contributor):

Agreed, we should find the root cause if this issue is reproducible.
I double-checked the code change in OmKeyCommit; although there is new key delete logic, it goes to the deleted table. This exception should not happen at all.

If it's not easily reproducible, I'm fine with this intermediate approach.

ChenSammi merged commit f0c4d41 into apache:master Mar 1, 2023
@ChenSammi (Contributor):

LGTM + 1.
