Pause optimize on large merges. attempt 2 #3339
Conversation
62ec5b0 to 1ac6a7d
Codecov Report
Base: 92.93% // Head: 27.61% // Decreases project coverage by -65.33%

```diff
@@            Coverage Diff             @@
##           master    #3339       +/-   ##
===========================================
- Coverage   92.93%   27.61%   -65.33%
===========================================
  Files         702      662       -40
  Lines       32256    31064     -1192
===========================================
- Hits        29976     8577    -21399
- Misses       2280    22487    +20207
===========================================
```
For some reason, Github does not detect that
```python
# if there's a merge in progress, wait for it to finish
while is_busy_merging(clickhouse, database, table):
    logger.info(f"busy merging, sleeping for {OPTIMIZE_BASE_SLEEP_TIME}s")
    time.sleep(OPTIMIZE_BASE_SLEEP_TIME)
```
One optimization we could do is to add an upper bound on how long to sleep. Merges usually take about 1.5 hours to complete, so an upper bound would protect against a logic bug in the check. The upper bound could be 2 hours: even if the signal says there is an ongoing merge, you proceed once 2 hours have passed since the check started.
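A minimal sketch of the bounded wait suggested above. The names `wait_for_merges`, `MAX_MERGE_WAIT_SECONDS`, and the default sleep value are assumptions for illustration, not the exact Snuba implementation; `is_busy_merging` is passed in as a callable here for testability.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Assumed values: the PR uses OPTIMIZE_BASE_SLEEP_TIME; the 2-hour cap is the
# reviewer's suggested upper bound.
OPTIMIZE_BASE_SLEEP_TIME = 30
MAX_MERGE_WAIT_SECONDS = 2 * 60 * 60


def wait_for_merges(
    is_busy_merging,
    clickhouse,
    database,
    table,
    sleep_time=OPTIMIZE_BASE_SLEEP_TIME,
    max_wait=MAX_MERGE_WAIT_SECONDS,
):
    """Sleep while large merges are in progress, but never past the upper bound."""
    start = time.monotonic()
    while is_busy_merging(clickhouse, database, table):
        # Protect against a logic bug keeping the job asleep forever: even if
        # the signal still reports an ongoing merge, proceed after max_wait.
        if time.monotonic() - start >= max_wait:
            logger.warning("merge-wait upper bound reached, proceeding anyway")
            break
        logger.info("busy merging, sleeping for %ss", sleep_time)
        time.sleep(sleep_time)
```

The upper bound is checked before each sleep, so a stuck busy signal delays the job by at most `max_wait` plus one final poll.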
That makes sense. Added a check that caps the wait at 2 hours.
This is a retry of #3158 which got reverted
Context
We want to make our Optimize cron job wait for large merges to complete before starting. This avoids spinning up too many optimize jobs especially in cases where K8s reschedules the cron jobs. The main change in this PR is a check for ongoing large merges at the start of the cron job.
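A hedged sketch of what such a check might look like. The query shape, the `MERGE_SIZE_WAIT_THRESHOLD` constant, and the exact function signatures are assumptions, not the PR's actual code; only the function names `get_current_large_merges` and `is_busy_merging` and the use of ClickHouse's `system.merges` table come from the PR itself.

```python
from typing import Sequence, Tuple

# Assumed threshold: merges smaller than this are ignored when deciding to wait.
MERGE_SIZE_WAIT_THRESHOLD = 10 * 1024**3  # 10 GiB


def get_current_large_merges(clickhouse, database: str, table: str) -> Sequence[Tuple]:
    """Return in-progress merges on the table that exceed the size threshold."""
    return clickhouse.execute(
        """
        SELECT result_part_name, elapsed, progress, total_size_bytes_compressed
        FROM system.merges
        WHERE database = %(database)s
          AND table = %(table)s
          AND total_size_bytes_compressed > %(threshold)s
        """,
        {"database": database, "table": table, "threshold": MERGE_SIZE_WAIT_THRESHOLD},
    )


def is_busy_merging(clickhouse, database: str, table: str) -> bool:
    """True while at least one sufficiently large merge is running."""
    return len(get_current_large_merges(clickhouse, database, table)) > 0
```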
In optimize.py, the function `get_current_large_merges` queries ClickHouse to find large merge jobs. The function `is_busy_merging` repeatedly calls `get_current_large_merges` and returns true if any ongoing merges exceed our size threshold for waiting. At the start of the optimize cron job, we repeatedly poll `is_busy_merging`, sleeping each time it returns true and proceeding only once it returns false.

This PR also refactors the optimize code, moving it to `snuba/clickhouse/optimize` to avoid cluttering the root with optimize-specific utility functions.

Blast Radius
This affects the optimize cron job. No impact on calling optimize via other methods.
Before State
The optimize cron job would immediately spin up threads issuing OPTIMIZE queries when started.
After State
The optimize cron job checks at start whether large merges exceeding a threshold are in progress. If they exist, the job repeatedly sleeps until those merges finish.
Testing Notes
It's difficult to create actual large merges, so instead we mocked some of the ClickHouse query responses to `system.merges` with predefined partitions of large size, then checked that sleep was called when the responses indicated large partitions.
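The mocking approach above can be sketched with `unittest.mock`. The helper implementations, the table name, and the mocked row shapes here are stand-ins chosen for illustration, not the PR's actual tests.

```python
import time
from unittest import mock


def is_busy_merging(clickhouse, database, table):
    # Stand-in for the real helper: queries system.merges for large merges.
    return len(clickhouse.execute("SELECT ... FROM system.merges")) > 0


def wait_for_merges(clickhouse, database, table, sleep_time=30):
    # Stand-in for the polling loop at the start of the optimize cron job.
    while is_busy_merging(clickhouse, database, table):
        time.sleep(sleep_time)


def test_sleeps_while_large_merges_exist():
    clickhouse = mock.Mock()
    # First poll reports one large in-progress merge; second reports none.
    clickhouse.execute.side_effect = [
        [("90-20220613_0_1000_3", 20 * 1024**3)],  # predefined large partition
        [],
    ]
    # Patch time.sleep so the test asserts on the sleep call without waiting.
    with mock.patch("time.sleep") as fake_sleep:
        wait_for_merges(clickhouse, "default", "errors_local")
        fake_sleep.assert_called_once_with(30)
```

Patching `time.sleep` lets the test verify the job slept exactly once, when the mocked response indicated a large partition, without actually waiting.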