Skip to content

Conversation

@xBis7
Copy link
Contributor

@xBis7 xBis7 commented Sep 9, 2022

What changes were proposed in this pull request?

In the previous PR HDDS-5504, we refactored NSSummaryEndpoint and NSSummaryTask so that we can extract the bucket specific code to new separated classes. So far, Recon has only been supporting FSO buckets. All the existing bucket related code exists in handlers/FSOBucketHandler and tasks/NSSummaryTaskWithFSO. In order to add support for Legacy buckets, we created two new classes handlers/LegacyBucketHandler and tasks/NSSummaryTaskWithLegacy.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7121

How was this patch tested?

This patch was tested with new unit tests. We added two new test classes, TestNSSummaryEndpointWithLegacy and TestNSSummaryTaskWithLegacy. These two classes are as similar as possible to the corresponding FSO classes. We tried to keep the setups exactly the same with TestNSSummaryEndpointWithFSO and TestNSSummaryTaskWithFSO as if they were handling FSO related code.

@GeorgeJahad
Copy link
Contributor

GeorgeJahad commented Sep 15, 2022

So there is a problem with clearing the NSSummary table during reprocess().

Both the fso and legacy tasks write to the nssummary table. The reprocess() method for FSO clears the nssummary table here The Legacy task doesn't clear the table.

But both tasks run in separate threads in parallel There is no guarantee that FSO will run first.

In addition failed tasks get rerun here If one fails and the other doesn't the table clear won't work correctly.

@GeorgeJahad
Copy link
Contributor

GeorgeJahad commented Sep 15, 2022

Also because the two tasks are writing to the same table from separate threads, it would be good to clarify why no synchronization is needed.

@dombizita
Copy link
Contributor

Hi @xBis7 and @GeorgeJahad, thanks for the patch and the problem description. I believe if we already separated the NSSummary handling for legacy and FSO buckets (here) we can maintain two separate NSSummary tables, one for each bucket layout. This would help with the table clear and if one of them fails and there is a retry. Let me know what you think of this idea. cc @smengcl

@xBis7
Copy link
Contributor Author

xBis7 commented Sep 21, 2022

Hi @dombizita, thanks for the feedback. With two separate tables, what happens when Recon gets statistics for both bucket layouts? Do we iterate both tables and sum the results?

@xBis7
Copy link
Contributor Author

xBis7 commented Sep 21, 2022

@dombizita @smengcl Another solution to this, would be to have a common NSSummary table for both bucket layouts. Instead of calling both tasks, we could add process and reprocess methods to NSSummaryTask and call that class in place of the subclasses. NSSummaryTask.process will call both NSSummaryTaskWithFSO.process and NSSummaryTaskWithLegacy.process and we will do the same for reprocess. NSSummaryTask will also manage common setups or operations for NSSummary table, such as clearing it. What do you think about this approach?

@dombizita
Copy link
Contributor

If we have two separate tables for NSSummarys based on the bucket layout (legacy and FSO) we can iterate through both tables right after each other if there is a task that needs information from both types. I think this approach would be nice to avoid issues as we already separated the classes based on the bucket layout.
But as I looked into it for both bucket types we would maintain a table with the same columns, as we add long value (ID) and a NSSummary in both cases. Based on this it would be better to avoid having two same tables.
Your approach sounds good to me, if I understand correctly the NSSummaryTask.process would call eg. first NSSummaryTaskWithFSO.process and after the NSSummaryTaskWithLegacy.process, right?

@xBis7
Copy link
Contributor Author

xBis7 commented Sep 23, 2022

@dombizita NSSummary isn't bucket specific. We would have an issue if we were gathering info for volumes but we are dealing with buckets.

NSSummaryTask.process would call eg. first NSSummaryTaskWithFSO.process and after the NSSummaryTaskWithLegacy.process, right?

Yes, you are right. This way we can control the order by which we are calling any bucket specific task. NSSummaryTask.process will return new ImmutablePair<>(getTaskName(), true) after both bucket tasks have finished running and getTaskName() will be NSSummaryTask. Since NSSummary isn't specific to bucket layout and the entries for different layouts won't have an attribute that differentiates them, it seems more clean to have one table instead of two tables with entries of the same type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the exception being swallowed? it will cause the bucketid to be returned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the npe being swallowed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the npe being swallowed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version of getBucketHandler will throw an NPE if bucketInfo is null. The other version just returns null. They should do the same thing, probably both throw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the other version to return null, so that we know that the BucketHandler doesn't exist and later return EntityType.UNKNOWN. See here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually good practice to keep the methods consistent. If the null check is required, it would be better to put it in the one that is called by this one, so they both have it.

@xBis7
Copy link
Contributor Author

xBis7 commented Sep 26, 2022

@GeorgeJahad I've made the changes based on your recommendations.

@xBis7
Copy link
Contributor Author

xBis7 commented Sep 26, 2022

@dombizita I've made the changes so that NSSummaryTask gets called in place of NSSummaryTaskWithFSO and NSSummaryTaskWithLegacy, like we discussed. Let me know how it looks. I've also modified the tests to go only over each subclass and not NSSummaryTask. If we decide to move forward with this approach I will cleanup the code and add a new class TestNSSummaryTask that will test the case of having buckets of different layouts.

Copy link
Contributor

@dombizita dombizita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xBis7 for updating your patch, this approach of processing and reprocessing looks good to me, I left a few inline comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are also testing the LEGACY buckets here why did you added a ${BUCKET_LAYOUT} variable below that is set to FILE_SYSTEM_OPTIMIZED? Will you add test cases in this robot test file for LEGACY too?

Copy link
Contributor Author

@xBis7 xBis7 Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dombizita Apart from the Namespace Summary for a directory there is no difference between FSO and LEGACY. In order to avoid duplication, we added the BUCKET_LAYOUT as a parameter and gave it the default value of FSO. The idea is to add an extra command to test.sh to also run the test with a parameter -v BUCKET_LAYOUT:LEGACY.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent object id for legacy records is 0, (which is why we added the setKeyParentId() method.) Let's set them back to 0 in the test code as well, just to be sure that setKeyParentId() is really working.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent object id for legacy records is 0, (which is why we added the setKeyParentId() method.) Let's set them back to 0 in the test code as well, just to be sure that setKeyParentId() is really working.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent object id for legacy records is 0, (which is why we added the setKeyParentId() method.) Let's set them back to 0 in the test code as well, just to be sure that setKeyParentId() is really working.

@GeorgeJahad
Copy link
Contributor

I just tried running: "ozone admin namespace summary /vol1"

It gets a warning:

[Warning] Namespace CLI is only designed for FSO mode.
Bucket being accessed must be of type FILE_SYSTEM_OPTIMIZED bucket layout.

It should be updated to something like this:

[Warning] Namespace CLI is only designed for FSO buckets or LEGACY buckets with "ozone.om.enable.filesystem.paths" set to true.

@xBis7
Copy link
Contributor Author

xBis7 commented Oct 25, 2022

@smengcl CI looks good. Can we merge this?

@smengcl smengcl merged commit 5b7f448 into apache:master Oct 25, 2022
@smengcl
Copy link
Contributor

smengcl commented Oct 25, 2022

Merged. Thanks @xBis7 for the work! Thanks @GeorgeJahad and @dombizita for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Oct 26, 2022
* master: (718 commits)
  HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852)
  HDDS-7413. Fix logging while marking container state unhealthy (apache#3887)
  Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)"
  HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877)
  HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746)
  HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778)
  HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873)
  HDDS-7253. Fix exception when '/' in key name (apache#3774)
  HDDS-7182. Add property to control RocksDB max open files (apache#3843)
  HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801)
  HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869)
  HDDS-7403. README Security Improvement (apache#3879)
  HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872)
  HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837)
  HDDS-7141. Recon: Improve Disk Usage Page (apache#3789)
  HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866)
  HDDS-6210. EC: Add EC metrics (apache#3851)
  HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859)
  HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861)
  HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848)
  ...

Conflicts:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java
smengcl pushed a commit that referenced this pull request Oct 27, 2022
… FS buckets (#3746)

Change-Id: I478fc74ec8b0899441d1c003c2f200d7dd9dcc67
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants