Pause optimize on large merges. attempt 2 #3339

Merged
dbanda merged 5 commits into master from ref/optimize
Nov 9, 2022

dbanda
Contributor

@dbanda dbanda commented Nov 3, 2022

This is a retry of #3158, which got reverted.

Context

We want our optimize cron job to wait for large merges to complete before starting. This avoids spinning up too many optimize jobs, especially when K8s reschedules the cron jobs. The main change in this PR is a check for ongoing large merges at the start of the cron job.

In optimize.py, the function get_current_large_merges queries ClickHouse for large merges that are currently in progress. The function is_busy_merging calls get_current_large_merges to check whether any existing merges are large, and returns true if any of them pass our threshold for waiting. At the start of the optimize cron job, we poll is_busy_merging repeatedly, sleeping each time it returns true and proceeding only once it returns false.
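As a rough illustration of the two functions described above, here is a minimal sketch. It is not the code in this PR: the MergeInfo fields, the LARGE_MERGE_SIZE_BYTES threshold, and the injected execute callable are all assumptions made so the example runs on its own. The only ClickHouse detail relied on is that system.merges exposes database, table, result_part_name, elapsed, progress, and total_size_bytes_compressed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical threshold; the real value used by the PR is not stated here.
LARGE_MERGE_SIZE_BYTES = 40 * 1024**3

# (database, table, result_part_name, elapsed, progress, size_in_bytes)
Row = Tuple[str, str, str, float, float, int]
# Stand-in for a ClickHouse connection: takes SQL, returns rows.
Execute = Callable[[str], Sequence[Row]]

MERGES_QUERY = (
    "SELECT database, table, result_part_name, elapsed, progress, "
    "total_size_bytes_compressed FROM system.merges"
)


@dataclass(frozen=True)
class MergeInfo:
    result_part_name: str
    elapsed: float
    progress: float
    size: int


def get_current_large_merges(
    execute: Execute,
    database: str,
    table: str,
    min_size: int = LARGE_MERGE_SIZE_BYTES,
) -> List[MergeInfo]:
    """Return in-progress merges on the given table at or above min_size."""
    merges = []
    for db, tbl, part, elapsed, progress, size in execute(MERGES_QUERY):
        # Filter in Python to keep the sketch free of string-built SQL.
        if db == database and tbl == table and size >= min_size:
            merges.append(MergeInfo(part, elapsed, progress, size))
    return merges


def is_busy_merging(execute: Execute, database: str, table: str) -> bool:
    """True if at least one large merge is currently running on the table."""
    return len(get_current_large_merges(execute, database, table)) > 0
```

Passing the query runner in as a callable keeps the sketch independent of any particular client library, and it makes the functions easy to exercise with canned rows.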

This PR also refactors the optimize code to move it to snuba/clickhouse/optimize to avoid cluttering the root with optimize-specific utility functions.

Blast Radius

This affects the optimize cron job. No impact on calling optimize via other methods.

Before State

When started, the optimize cron job would immediately spin up threads that run the OPTIMIZE query.

After State

At start, the optimize cron job checks for ongoing large merges that meet a threshold. If any exist, the job sleeps repeatedly until those merges finish.

Testing Notes

It's difficult to create actual large merges, so instead we mocked some of the ClickHouse query responses to system.merges with predefined partitions of large size, then checked that sleep was called when the responses indicated large partitions.
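The testing approach above could look roughly like the following. This is a sketch under assumptions, not the tests added in this PR: run_optimize_cron, the simplified is_busy_merging, and the two-column row shape are hypothetical stand-ins, and only unittest.mock from the standard library is used.

```python
import time
from unittest import mock

QUERY = "SELECT result_part_name, total_size_bytes_compressed FROM system.merges"


def is_busy_merging(execute, min_size):
    """Simplified stand-in: busy if any reported merge is at least min_size."""
    return any(size >= min_size for _part, size in execute(QUERY))


def run_optimize_cron(execute, min_size, sleep_seconds):
    """Hypothetical entry point: wait out large merges, then start optimizing."""
    while is_busy_merging(execute, min_size):
        time.sleep(sleep_seconds)
    return "optimize started"


def test_sleeps_while_large_merge_is_running():
    # The first two mocked responses report a large part, the third reports none.
    execute = mock.Mock(side_effect=[[("part_a", 100)], [("part_a", 100)], []])
    with mock.patch("time.sleep") as fake_sleep:
        result = run_optimize_cron(execute, min_size=50, sleep_seconds=300)
    assert result == "optimize started"
    assert fake_sleep.call_count == 2
    fake_sleep.assert_called_with(300)


def test_does_not_sleep_for_small_merges():
    execute = mock.Mock(return_value=[("part_b", 10)])
    with mock.patch("time.sleep") as fake_sleep:
        run_optimize_cron(execute, min_size=50, sleep_seconds=300)
    fake_sleep.assert_not_called()
```

Patching time.sleep keeps the test fast and lets the assertions count how many polling rounds happened before the job proceeded.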

@dbanda dbanda force-pushed the ref/optimize branch 2 times, most recently from 62ec5b0 to 1ac6a7d Compare November 3, 2022 20:47
@codecov-commenter

codecov-commenter commented Nov 3, 2022

Codecov Report

Base: 92.93% // Head: 27.61% // Decreases project coverage by -65.32% ⚠️

Coverage data is based on head (c793092) compared to base (54e8a43).
Patch coverage: 9.09% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3339       +/-   ##
===========================================
- Coverage   92.93%   27.61%   -65.33%     
===========================================
  Files         702      662       -40     
  Lines       32256    31064     -1192     
===========================================
- Hits        29976     8577    -21399     
- Misses       2280    22487    +20207     
Impacted Files Coverage Δ
snuba/cli/optimize.py 0.00% <0.00%> (-42.56%) ⬇️
snuba/clickhouse/optimize/optimize.py 0.00% <0.00%> (ø)
snuba/clickhouse/optimize/optimize_scheduler.py 0.00% <ø> (ø)
snuba/clickhouse/optimize/optimize_tracker.py 0.00% <ø> (ø)
snuba/clickhouse/optimize/util.py 0.00% <0.00%> (ø)
snuba/replacer.py 0.00% <0.00%> (-92.65%) ⬇️
tests/datasets/test_errors_replacer.py 0.00% <0.00%> (-99.62%) ⬇️
tests/test_replacer.py 0.00% <0.00%> (-97.20%) ⬇️
snuba/settings/__init__.py 95.51% <100.00%> (+0.11%) ⬆️
tests/base.py 0.00% <0.00%> (-100.00%) ⬇️
... and 619 more


@dbanda
Contributor Author

dbanda commented Nov 5, 2022

For some reason, GitHub does not detect that test_optimize_tracker.py was moved and thinks it is a new file. The only changes there are test_run_optimize_with_ongoing_merges() and test_merge_info().

@dbanda dbanda marked this pull request as ready for review November 5, 2022 04:14
@dbanda dbanda requested a review from a team as a code owner November 5, 2022 04:14
Comment on lines +92 to +95
# if theres a merge in progress wait for it to finish
while is_busy_merging(clickhouse, database, table):
    logger.info(f"busy merging, sleeping for {OPTIMIZE_BASE_SLEEP_TIME}s")
    time.sleep(OPTIMIZE_BASE_SLEEP_TIME)
Member

One optimization we could make is to add an upper bound on how long to sleep. Merges usually take 1.5 hours to complete, so an upper bound would protect against a logic bug in the code. The upper bound could be 2 hours: even if the signal says there is an ongoing merge, you could proceed once 2 hours have passed since the check started.

Contributor Author

That makes sense. Added a check that sets the upper bound to 2 hours.
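For readers following the thread, a bounded version of the wait loop might look like the sketch below. It is an illustration only, not the change that was merged: wait_for_merges, MAX_WAIT_SECONDS, and the injectable clock and sleep parameters are hypothetical names, and the 2 hour cap is taken from the suggestion above.

```python
import time
from typing import Callable

# Cap suggested in the review thread: give up waiting after 2 hours.
MAX_WAIT_SECONDS = 2 * 60 * 60


def wait_for_merges(
    is_busy: Callable[[], bool],
    sleep_seconds: float,
    max_wait_seconds: float = MAX_WAIT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_busy until it returns False or the cap is reached.

    Returns True if the merges finished, False if the wait timed out.
    Either way the caller can go on to start the optimize job.
    """
    deadline = clock() + max_wait_seconds
    while is_busy():
        if clock() >= deadline:
            # Proceed anyway, in case the busy signal is stuck due to a bug.
            return False
        sleep(sleep_seconds)
    return True
```

Using a monotonic clock avoids surprises if the wall clock is adjusted while the job is sleeping, and injecting clock and sleep lets the cap be tested without actually waiting.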

@dbanda dbanda merged commit 8f23e9f into master Nov 9, 2022
@dbanda dbanda deleted the ref/optimize branch November 9, 2022 17:04