Skip to content

Conversation

@hemantk-12
Copy link
Contributor

What changes were proposed in this pull request?

This change is to create a operational tool to create an image for the compaction log DAG at any given point in time.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8961

How was this patch tested?

Manually tested it on Docker.

@smengcl smengcl added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jul 13, 2023
Comment on lines +4653 to +4652
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a check here to allow only Ozone admins to run this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check explicitly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Otherwise anyone can call this and fill up the /tmp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll raise a PR addressing this comment and @errose28's concern.

Copy link
Contributor

@prashantpogde prashantpogde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@prashantpogde prashantpogde changed the title HDDS-8961. [Snapshot] Print compaction log dag and create an image HDDS-8961. Write a standalone tool to create a visual graph from compaction log files. Jul 14, 2023
@hemantk-12 hemantk-12 marked this pull request as ready for review July 15, 2023 06:58
@prashantpogde prashantpogde merged commit c8e63d2 into apache:master Jul 16, 2023
@prashantpogde
Copy link
Contributor

Thank you @hemantk-12 for this PR and addressing review comments.

@errose28
Copy link
Contributor

@hemantk-12 @smengcl @prashantpogde I think this PR was incorrectly merged with some existing problems:

  • Does this call need to go to the running OM, or can the compaction dag be read by the ozone debug process from disk state if run on the OM host? The later is how most Ozone debug calls work. Having a network call that generates a file on the server is strange, which brings me to my other concerns.

  • The PR was merged with an existing security concern not addressed. This definitely needs to be restricted to admins since it could be invoked indefinitely to fill up the OM OS disk.

  • There is no regard for absolute or relative paths in the filename parameter. Either some server side handling or documentation in the help message should either handle this case or explain that the base path is fixed.

  • Help message does not specify that the image is created on the server side. Users may think the command is broken.

  • Response should include which OM generated the file. It will only exist on the OM who is the leader at the time the command was run.

If this command can be made to run locally on an OM host without depending on the OM process then a lot of the above concerns should be addressed. Otherwise I think these will need to be fixed, preferably with high priority (especially the security concern). We have enough tech debt around Ozone troubleshooting, let's not add more.

@hemantk-12
Copy link
Contributor Author

@hemantk-12 @smengcl @prashantpogde I think this PR was incorrectly merged with some existing problems:

* Does this call need to go to the running OM, or can the compaction dag be read by the `ozone debug` process from disk state if run on the OM host? The later is how most Ozone debug calls work. Having a network call that generates a file on the server is strange, which brings me to my other concerns.

* The PR was merged with an [existing security](https://github.com/apache/ozone/pull/5061#discussion_r1263077404) concern not addressed. This definitely needs to be restricted to admins since it could be invoked indefinitely to fill up the OM OS disk.

* There is no regard for absolute or relative paths in the filename parameter. Either some server side handling or documentation in the help message should either handle this case or explain that the base path is fixed.

* Help message does not specify that the image is created on the server side. Users may think the command is broken.

* Response should include which OM generated the file. It will only exist on the OM who is the leader at the time the command was run.

If this command can be made to run locally on an OM host without depending on the OM process then a lot of the above concerns should be addressed. Otherwise I think these will need to be fixed, preferably with high priority (especially the security concern). We have enough tech debt around Ozone troubleshooting, let's not add more.

Thanks for the review and suggestion @errose28

The reason I made it a OM call because I needed current state of DAG. You can argue that we can get that from compaction log but I had slight doubt that it won't give you exact picture as in memory.
I think most of the concerns can be resolved if I change OM request to return the DAG as java object and construct the image on client side or use compaction log instead of the way it was done.

Should we revert this PR in meanwhile I make the change to address your concern? Or just create new one with fix?

@errose28
Copy link
Contributor

errose28 commented Jul 17, 2023

The reason I made it a OM call because I needed current state of DAG. You can argue that we can get that from compaction log but I had slight doubt that it won't give you exact picture as in memory.

I'm not currently familiar with the snapshot implementation so this is up to you. As a debug tool it might actually be better to return the state from memory like you suggested since that is what the OM is actually using.

I think most of the concerns can be resolved if I change OM request to return the DAG as java object and construct the image on client side or use compaction log instead of the way it was done.

Yeah some way to pass the result as either a java/proto object or a stream of bytes that the client writes as an image file sounds good, assuming the sizes of either response will not be prohibitively large. We may need to check which approach is easier to implement and generates the smallest size response. Another option is to return the bytes as an svg instead of png. png is not as compressed, jpg would be lossy, but svg should give all the info in the most compact form factor. Gzipping as svgz before sending and then unpacking client side could help reduce size on the wire even more.

Should we revert this PR in meanwhile I make the change to address your concern? Or just create new one with fix?

If the fix can be completed in a few days, let's just create a new one with the fix. I don't think we should cut a release with these changes as is, but I don't think that is happening from apache master too soon.

@hemantk-12
Copy link
Contributor Author

@hemantk-12 @smengcl @prashantpogde I think this PR was incorrectly merged with some existing problems:

* Does this call need to go to the running OM, or can the compaction dag be read by the `ozone debug` process from disk state if run on the OM host? The later is how most Ozone debug calls work. Having a network call that generates a file on the server is strange, which brings me to my other concerns.

* The PR was merged with an [existing security](https://github.com/apache/ozone/pull/5061#discussion_r1263077404) concern not addressed. This definitely needs to be restricted to admins since it could be invoked indefinitely to fill up the OM OS disk.

* There is no regard for absolute or relative paths in the filename parameter. Either some server side handling or documentation in the help message should either handle this case or explain that the base path is fixed.

* Help message does not specify that the image is created on the server side. Users may think the command is broken.

* Response should include which OM generated the file. It will only exist on the OM who is the leader at the time the command was run.

If this command can be made to run locally on an OM host without depending on the OM process then a lot of the above concerns should be addressed. Otherwise I think these will need to be fixed, preferably with high priority (especially the security concern). We have enough tech debt around Ozone troubleshooting, let's not add more.

Raised PR: #5087 to address major concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants