fix(periodic): config for maxage/maxsize to prevent recording upload timeouts due to large filesize #96
Conversation
Force-pushed from c690b39 to d7f394d.
Force-pushed from 1c49e60 to 4aab4d7.
I'll take a look.
I assume the maxAge cannot be smaller than the period? I tried testing this by setting the …
Yea, there are certain minimums that the JFR system itself within the JVM will enforce, like the minimum size of a chunk and therefore the minimum number of events that will be included. If that minimum threshold isn't met then the maxage/maxsize policies won't be applied. In order to get a short maxage like that to actually apply, you'd need to be recording on a target application that is generating a lot more events per second.
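For reference, the maxage/maxsize policies being discussed correspond to the standard JDK Flight Recorder disk policies. A minimal sketch using the plain jdk.jfr API (this is the JDK API, not this PR's code) shows where the limits are set:

```java
import java.time.Duration;

import jdk.jfr.Recording;

public class MaxAgePolicyDemo {
    public static void main(String[] args) {
        try (Recording recording = new Recording()) {
            recording.setToDisk(true);
            recording.setMaxAge(Duration.ofSeconds(30)); // prune data older than 30s
            recording.setMaxSize(50L * 1024 * 1024);     // cap retained data at 50 MiB
            recording.start();
            // The JVM prunes whole chunks, so on a quiet application the
            // retained data can exceed these limits, as described above.
            System.out.println(recording.getMaxAge()); // PT30S
        }
    }
}
```

Because pruning happens at chunk granularity, a very short maxage on a low-traffic application may have no visible effect, matching the behaviour described in the comment above.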
Force-pushed from 4aab4d7 to 2ed06ef.
One more question: I've set the harvesting period to 10min and the …
I'll have to give this more thought. I think this reveals a deeper bug about the Agent lifecycle and how the discovery ping is handled with re-registration. The Agent currently handles this by going through deregistration and re-registration internally, but the deregistration is what interrupts the periodic upload schedule and I think is also breaking the onexit upload.
Force-pushed from 2ed06ef to 6c389fa.
Everything looks good now!
Signed-off-by: Andrew Azores <[email protected]>
Force-pushed from 6c389fa to 4b27d45.
Rebased with no changes for commit signing.
Fixes #95
Depends on #99
Adds two new config parameters for the maxage/maxsize JFR properties. These two properties could already be controlled for recordings that are uploaded when the host JVM is exiting, but periodically pushed recordings would always push the entire available JFR repository contents. This likely results in overlapping recording chunks being pushed to the server frequently, wasting network bandwidth and storage capacity. The new config parameters can be tuned to minimize this waste. The maxsize is not applied by default, but the maxage is: the default data age is taken as 1.5x the harvester period. This will still result in some overlap of recording chunks on each push, but probably much less than before for common harvester periods and JFR repository sizes.
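The 1.5x default described above can be sketched as follows (a hypothetical helper with illustrative names, not the PR's actual code):

```java
public class HarvesterDefaults {
    // Hypothetical helper: derive the effective maxage from config,
    // falling back to 1.5x the harvester period when unset.
    static long effectiveMaxAgeMs(long configuredMaxAgeMs, long periodMs) {
        // treat a non-positive configured value as "unset"
        return configuredMaxAgeMs > 0 ? configuredMaxAgeMs : (long) (periodMs * 1.5);
    }

    public static void main(String[] args) {
        // a 5-minute harvester period yields a 7.5-minute default data age
        System.out.println(effectiveMaxAgeMs(0, 300_000)); // 450000
    }
}
```

Retaining slightly more than one period's worth of data guarantees each push has complete coverage of the interval since the last push, at the cost of some chunk overlap.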
Also included are some fixes to ensure that one thread is responsible for managing the harvester and the state that it controls, and to ensure that the harvester does not get into a bad spinning state that I've seen when uploads fail. I have also seen uploads fail when the server sends the registration refresh POST signal, which would cause the spinning behaviour. In this PR single uploads that overlap(?) with handling the registration signal may still fail, but they are handled gracefully and the periodic push resumes as normal on the next scheduled attempt.
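The retry-instead-of-spin behaviour can be sketched roughly like this (illustrative only, not the agent's actual implementation; in the real agent the state would be owned by a single-threaded scheduled executor so only one thread ever mutates it):

```java
import java.util.function.BooleanSupplier;

public class HarvesterLoop {
    private final BooleanSupplier uploader; // returns true if the upload succeeded
    private int consecutiveFailures = 0;    // mutated only by the worker thread

    HarvesterLoop(BooleanSupplier uploader) {
        this.uploader = uploader;
    }

    // Invoked on each scheduled tick: a failed upload is recorded and simply
    // retried on the next tick, rather than retried immediately in a tight
    // loop ("spinning").
    void harvestOnce() {
        boolean ok;
        try {
            ok = uploader.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // treat an exception the same as a failed upload
        }
        consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }

    public static void main(String[] args) {
        HarvesterLoop loop = new HarvesterLoop(() -> false);
        loop.harvestOnce();
        System.out.println(loop.consecutiveFailures());
    }
}
```

The key design point is that failure handling returns control to the scheduler instead of re-entering the upload path, which is what allows the periodic push to resume normally on the next attempt.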
"Spinning" fix testing
Use the following Cryostat smoketest.sh invocation: CRYOSTAT_DISCOVERY_PING_PERIOD=30000 sh smoketest.sh. This sets the discovery callback POST ping signal to occur every 30 seconds. Every 15 seconds the agent should try to push a harvested JFR file to the server. Before this PR, the agent will get into a bad "spinning" state quite easily whenever it needs to reregister itself after the POST signal. After the PR the same root failure can still be observed, but it results in the agent simply trying again later and succeeding.
maxage/maxsize testing
TODO: determine a quicker way to test this. I have observed some issues with the agent trying to push very large files and timing out when leaving the standard smoketest.sh setup running for long periods of time, but I'm not sure exactly how long this takes or if there are other extenuating circumstances that also contribute to the problems I have seen.
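One possible way to speed this up is to shorten the harvester period and set an explicit maxage/maxsize while running the smoketest. A sketch of such an invocation is below; the exact property names are assumptions for illustration and are not confirmed by this PR text:

```shell
# Hypothetical invocation: short harvester period plus explicit retention
# limits (property names assumed, check the agent's config documentation).
CRYOSTAT_DISCOVERY_PING_PERIOD=30000 \
JAVA_OPTS_APPEND="-Dcryostat.agent.harvester.period-ms=15000 \
  -Dcryostat.agent.harvester.max-age-ms=30000 \
  -Dcryostat.agent.harvester.max-size-b=5242880" \
sh smoketest.sh
```

With a small max-size (5 MiB here), file growth and the resulting upload timeouts should surface much sooner than with the default long-running setup.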