Pause optimize on large merges. attempt 2 #3339
Conversation
62ec5b0 to 1ac6a7d
Codecov Report
Base: 92.93% // Head: 27.61% // Decreases project coverage by -65.33%

```diff
@@            Coverage Diff             @@
##           master    #3339       +/-   ##
===========================================
- Coverage   92.93%   27.61%   -65.33%
===========================================
  Files         702      662       -40
  Lines       32256    31064     -1192
===========================================
- Hits        29976     8577    -21399
- Misses       2280    22487    +20207
===========================================
```
For some reason, Github does not detect that
```python
# if there's a merge in progress, wait for it to finish
while is_busy_merging(clickhouse, database, table):
    logger.info(f"busy merging, sleeping for {OPTIMIZE_BASE_SLEEP_TIME}s")
    time.sleep(OPTIMIZE_BASE_SLEEP_TIME)
```
One optimization we could do is to add an upper bound on how long to sleep. Merges usually take about 1.5 hours to complete, so an upper bound would protect against a logic bug in the check. The upper bound could be 2 hours: even if the signal says there is an ongoing merge, you proceed once 2 hours have passed since the check started.
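A minimal sketch of the bounded wait suggested above. The names `wait_for_merges`, `MAX_MERGE_WAIT_SECONDS`, and the default sleep value are assumptions for illustration, not the exact Snuba implementation; `is_busy_merging` is passed in as a callable here for testability.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Assumed values: the PR uses OPTIMIZE_BASE_SLEEP_TIME; the 2-hour cap is the
# reviewer's suggested upper bound.
OPTIMIZE_BASE_SLEEP_TIME = 30
MAX_MERGE_WAIT_SECONDS = 2 * 60 * 60


def wait_for_merges(
    is_busy_merging,
    clickhouse,
    database,
    table,
    sleep_time=OPTIMIZE_BASE_SLEEP_TIME,
    max_wait=MAX_MERGE_WAIT_SECONDS,
):
    """Sleep while large merges are in progress, but never past the upper bound."""
    start = time.monotonic()
    while is_busy_merging(clickhouse, database, table):
        # Protect against a logic bug keeping the job asleep forever: even if
        # the signal still reports an ongoing merge, proceed after max_wait.
        if time.monotonic() - start >= max_wait:
            logger.warning("merge-wait upper bound reached, proceeding anyway")
            break
        logger.info("busy merging, sleeping for %ss", sleep_time)
        time.sleep(sleep_time)
```

The upper bound is checked before each sleep, so a stuck busy signal delays the job by at most `max_wait` plus one final poll.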
That makes sense. Added a check that caps the wait at 2 hours.
This is a retry of #3158 which got reverted
Context
We want to make our Optimize cron job wait for large merges to complete before starting. This avoids spinning up too many optimize jobs especially in cases where K8s reschedules the cron jobs. The main change in this PR is a check for ongoing large merges at the start of the cron job.
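A hedged sketch of what such a check might look like. The query shape, the `MERGE_SIZE_WAIT_THRESHOLD` constant, and the exact function signatures are assumptions, not the PR's actual code; only the function names `get_current_large_merges` and `is_busy_merging` and the use of ClickHouse's `system.merges` table come from the PR itself.

```python
from typing import Sequence, Tuple

# Assumed threshold: merges smaller than this are ignored when deciding to wait.
MERGE_SIZE_WAIT_THRESHOLD = 10 * 1024**3  # 10 GiB


def get_current_large_merges(clickhouse, database: str, table: str) -> Sequence[Tuple]:
    """Return in-progress merges on the table that exceed the size threshold."""
    return clickhouse.execute(
        """
        SELECT result_part_name, elapsed, progress, total_size_bytes_compressed
        FROM system.merges
        WHERE database = %(database)s
          AND table = %(table)s
          AND total_size_bytes_compressed > %(threshold)s
        """,
        {"database": database, "table": table, "threshold": MERGE_SIZE_WAIT_THRESHOLD},
    )


def is_busy_merging(clickhouse, database: str, table: str) -> bool:
    """True while at least one sufficiently large merge is running."""
    return len(get_current_large_merges(clickhouse, database, table)) > 0
```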
In optimize.py, the function `get_current_large_merges` queries ClickHouse to find large merge jobs. The function `is_busy_merging` repeatedly calls `get_current_large_merges` and returns true if any ongoing merges exceed our size threshold for waiting. At the start of the optimize cron job, we repeatedly poll `is_busy_merging`, sleeping each time it returns true and proceeding only once it returns false.

This PR also refactors the optimize code, moving it to `snuba/clickhouse/optimize` to avoid cluttering the root with optimize-specific utility functions.

Blast Radius
This affects the optimize cron job. No impact on calling optimize via other methods.
Before State
The optimize cron job would immediately spin up threads issuing OPTIMIZE queries when started.
After State
The optimize cron job checks at start whether large merges exceeding a threshold are in progress. If they exist, the job repeatedly sleeps until those merges finish.
Testing Notes
It's difficult to create actual large merges, so instead we mocked some of the ClickHouse query responses to `system.merges` with predefined partitions of large size, then checked that sleep was called when the responses indicated large partitions.
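The mocking approach above can be sketched with `unittest.mock`. The helper implementations, the table name, and the mocked row shapes here are stand-ins chosen for illustration, not the PR's actual tests.

```python
import time
from unittest import mock


def is_busy_merging(clickhouse, database, table):
    # Stand-in for the real helper: queries system.merges for large merges.
    return len(clickhouse.execute("SELECT ... FROM system.merges")) > 0


def wait_for_merges(clickhouse, database, table, sleep_time=30):
    # Stand-in for the polling loop at the start of the optimize cron job.
    while is_busy_merging(clickhouse, database, table):
        time.sleep(sleep_time)


def test_sleeps_while_large_merges_exist():
    clickhouse = mock.Mock()
    # First poll reports one large in-progress merge; second reports none.
    clickhouse.execute.side_effect = [
        [("90-20220613_0_1000_3", 20 * 1024**3)],  # predefined large partition
        [],
    ]
    # Patch time.sleep so the test asserts on the sleep call without waiting.
    with mock.patch("time.sleep") as fake_sleep:
        wait_for_merges(clickhouse, "default", "errors_local")
        fake_sleep.assert_called_once_with(30)
```

Patching `time.sleep` lets the test verify the job slept exactly once, when the mocked response indicated a large partition, without actually waiting.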