
[Worker] Implement parallel upload processing #84

Open
matt-codecov opened this issue Jun 30, 2023 · 6 comments
Labels: epic, P0: must do priority 10

matt-codecov commented Jun 30, 2023

UploadTask breaks the list of uploads for a commit into chunks of 3 and then dispatches UploadProcessorTasks for each chunk serially:
https://github.com/codecov/worker/blob/master/tasks/upload.py#L374-L387

UploadProcessorTask acquires a Redis lock for the whole task, fetches the Report for the commit as the previous UploadProcessorTask instance left it, and updates it with the results of processing the current chunk:
https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L80-L84
https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L135-L140
https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L168-L174
https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L191-L192
https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L201-L209

We may be able to improve performance for commits with many uploads by following more of a map/reduce pattern: run all of the chunk-processing tasks in parallel (via a Celery chord) and then synchronize and merge all the reports in a single task at the end.
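As a broker-free sketch of that map/reduce shape (in the real proposal the "map" steps would be Celery chord header tasks and the "reduce" step the chord callback; the function names and the session-count payload here are illustrative, not worker's actual API):

```python
# Stdlib simulation of the proposed chord-style map/reduce flow.
# Names (process_chunk, merge_results) are made up for illustration.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # "map": each chunk of uploads is processed independently
    return {"sessions": len(chunk)}

def merge_results(results):
    # "reduce": one final task merges all intermediate results
    return sum(r["sessions"] for r in results)

def run(uploads, batch_size=3):
    chunks = [uploads[i:i + batch_size] for i in range(0, len(uploads), batch_size)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_chunk, chunks))  # parallel "map"
    return merge_results(results)                        # serial "reduce"
```

With Celery this becomes `chord(header_tasks)(callback_task)`: the callback only runs once every header task has finished, which is the synchronization point the proposal relies on.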

trent-codecov (Contributor) commented

This is a great idea and merits further research. Please investigate and write up a Notion doc on its feasibility. Involve Scott, Dana, and Gio as needed to gain context and discuss the solution.

matt-codecov (Author) commented

The first step is to build something that lets us measure whether the idea works. As the code stands, it isn't dashboardable with statsd timers or Sentry traces. I have a PR up with a solution; once it has been in prod for a little while we can build a dashboard, deploy this proposal, and see how the metrics move.
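A toy illustration of the kind of timing instrumentation that makes a code path dashboardable (the metric name and the in-memory sink are made up; the real PR would emit to a statsd client):

```python
# Toy timing wrapper in the spirit of statsd timers; `metrics` is an
# in-memory stand-in for a real statsd client, and the metric name below
# is illustrative only.
import time
from contextlib import contextmanager

metrics = {}  # metric name -> list of durations in ms

@contextmanager
def timed(name):
    # record how long the wrapped block took, in milliseconds
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.setdefault(name, []).append(elapsed_ms)

# usage:
with timed("worker.upload_processor.batch"):
    pass  # process a batch of uploads here
```

Once every stage of the upload flow is wrapped like this, percentiles (p50/p75/p95) can be charted per stage and compared before and after the change.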

matt-codecov (Author) commented Aug 16, 2023

End-of-sprint update:

  • Actively investigating the approach and the size of the work. Opinion so far: probably possible, but it will need hearty scrutiny from the team to do safely. Will try to conclude the investigation with a PoC or proposal.
  • In parallel, have deployed and built a dashboard for upload-flow metrics. Looking at them, my hypothesis for this change is:
    • It will bring the prod p95 for batch processing, total processing, and notification latency much closer to p50 and p75.
    • It may also speed up p50 and p75, but probably not as dramatically.

matt-codecov (Author) commented

I've spent a couple of days hacking on this now and pushed where I left off to matt/prototype-parallel-processor.

I think it's achievable, but implementation, validation, and rollout will be a lot of work, so I'm going to backburner it for now and revisit later in Q3 or in Q4. The task stays open, but it won't be added to sprints until then.

The high-level approach is:

  • upload.py
    • Break uploads into batches and run them in a chord rather than a chain, so Celery runs them concurrently.
  • upload_processor.py
    • Instead of calling save_report_results(), upload report.to_database() and report.to_archive() to archive storage.
    • Return an intermediate result: a dict with the archive paths just uploaded.
  • upload_finisher.py
    • Download all the incremental results and construct reports from them.
    • Merge all the reports.
    • Call save_report_results() and then continue as normal.
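The handoff between those three pieces can be sketched as follows. The archive client and the report shape here are stand-ins for worker's real ArchiveService and Report classes, and the paths and dict keys are illustrative:

```python
# Sketch of the intermediate-result handoff: processors persist serialized
# reports to archive storage and return only paths; the finisher downloads
# and merges them. FakeArchive is an in-memory stand-in for real storage.
import json

class FakeArchive:
    def __init__(self):
        self.blobs = {}
    def write(self, path, data):
        self.blobs[path] = data
    def read(self, path):
        return self.blobs[path]

def upload_processor(archive, upload_id, report):
    # instead of save_report_results(), persist the serialized report...
    path = f"parallel/{upload_id}/report.json"
    archive.write(path, json.dumps(report))
    # ...and return just the archive path as the intermediate result
    return {"report_path": path}

def upload_finisher(archive, intermediate_results):
    # download every incremental result, reconstruct, and merge
    merged = {"sessions": []}
    for result in intermediate_results:
        partial = json.loads(archive.read(result["report_path"]))
        merged["sessions"].extend(partial["sessions"])
    return merged  # then save_report_results() and continue as normal
```

Because each processor touches only its own archive path, no Redis lock is needed during the "map" phase; contention is confined to the single finisher task.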

The current report-merging code really only supports merging one upload at a time. report.merge(other_report) assumes other_report has only one session, and there's an awkward division of responsibility: worker's raw_upload_processor has an _adjust_sessions() function that inspects a just-added session and decides whether previously added sessions should be removed, plus some label handling. Making sense of this logic is slow going, and it's the main reason I'm backburnering this.

When I come back to it, I think the fastest way to get a demo is to lower the batch size from 3 to 1. Then I should be able to reuse report.merge() and borrow _adjust_sessions() from raw_upload_processor.py.
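The batch-size-1 trick works because each intermediate report then carries exactly one session, which matches report.merge()'s assumption. A minimal model of that (the Report class here is a made-up stand-in that tracks only sessions, not worker's real class):

```python
# Hypothetical minimal model of why batch_size=1 lets the finisher reuse
# the existing one-session-at-a-time merge path.
class Report:
    def __init__(self, sessions=None):
        self.sessions = list(sessions or [])

    def merge(self, other):
        # mirrors the assumption described above: the incoming report
        # must contain exactly one session
        assert len(other.sessions) == 1, "merge() expects a single-session report"
        self.sessions.extend(other.sessions)

def finish(partial_reports):
    merged = Report()
    for partial in partial_reports:
        merged.merge(partial)  # the real code would also run _adjust_sessions()
    return merged
```

With larger batches, each partial report could carry several sessions, and merge() (and _adjust_sessions()) would need the riskier generalization the comment above describes.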

For a real implementation, there's more thinking to do:

  • What is the right batch size? A batch size of 1 may let us skip risky changes to the report-merging code, but paying task overhead for every single upload may neutralize the benefits.
  • Can we use asyncio-powered minio/GCP client libraries? This would help bulk downloads like those in upload_finisher.py, and probably other places too.
  • I had a third one but forgot it.

matt-codecov (Author) commented Oct 2, 2023

codecov/worker#127 is a draft PR that implements this in worker. It uses a batch size of 1, which Should Be Fine (tm) since we dynamically scale our worker deployment based on queue size, and it uses threadpools in upload_finisher() to download incremental results concurrently despite not using an asyncio-aware GCS client.
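The threadpool approach in the finisher can be sketched like this; fetch_blob is a stand-in for the real storage client's blocking read call, not an actual API:

```python
# Sketch of downloading incremental results concurrently with a threadpool,
# as the draft PR does with the synchronous (non-asyncio) GCS client.
from concurrent.futures import ThreadPoolExecutor

def fetch_blob(path):
    # placeholder for a blocking storage read; network I/O releases the
    # GIL, so threads give real concurrency here even without asyncio
    return f"contents of {path}"

def download_all(paths, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths
        return list(pool.map(fetch_blob, paths))
```

This is why an asyncio-aware client, while nice, isn't strictly required: the downloads are I/O-bound, so a threadpool over the synchronous client gets most of the same win.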

More validation needs to happen before merging:

  • Need to set up a many-upload repo locally, run with and without the feature enabled, and confirm the output is equivalent.
  • Need to write automated tests for the new behavior (at time of writing, tests pass but only cover the old behavior).
  • Need to ensure after_n_builds still works.
    • Previously, the lock around upload processing ensured the Nth upload to arrive was the Nth to finish processing. Without the lock, the Nth upload may finish while the (N-1)th is still running; in that case, a notification should not be sent. Make sure that's still the case.
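One way to make after_n_builds robust to out-of-order completion is to count finished uploads rather than relying on arrival order; the finisher notifies only when the finishing upload completes the full set. A sketch, where the counter class is a made-up in-process stand-in for an atomic Redis counter (e.g. INCR):

```python
# Hypothetical after_n_builds guard that tolerates out-of-order completion:
# notify when the Nth upload *finishes*, whichever upload that happens to be.
import threading

class FinishCounter:
    """In-process stand-in for an atomic counter in Redis."""
    def __init__(self):
        self._lock = threading.Lock()
        self._done = 0

    def mark_done(self):
        # atomic increment-and-read; INCR would do this in Redis
        with self._lock:
            self._done += 1
            return self._done

def should_notify(counter, after_n_builds):
    # fires exactly once, when the finishing upload completes the set,
    # regardless of the order in which uploads finished
    return counter.mark_done() == after_n_builds
```

The equality check (rather than `>=`) means late stragglers beyond N don't retrigger the notification; whether they should instead refresh it is a product decision the real implementation would have to make.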

Going to backburner this for now, for two reasons:

  • Focusing on my metrics KR, since completing it will help us roll this out safely.
  • Looking through GCP logs for commits with >100 uploads, they're processed in groups of 22, then 19, then 14, then 18, and so on, and it's the uploads within each group that are processed in parallel. Making each group process faster would be valuable, but it can't influence the spacing between groups, which may account for a large part of the delay in getting the final, most accurate PR comment.

@rohan-at-sentry rohan-at-sentry added this to the Q2'24 milestone Apr 18, 2024
@codecov-hooky codecov-hooky bot added the epic this label is used to mark issues as epics label Apr 25, 2024
@rohan-at-sentry rohan-at-sentry added the carryover Carryover from a previous sprint label Jul 11, 2024
@thomasrockhu-codecov thomasrockhu-codecov modified the milestones: Q2'24, Q3'24 Jul 17, 2024
@trent-codecov trent-codecov added P0: must do priority 10 and removed investigation carryover Carryover from a previous sprint labels Jul 31, 2024
michelletran-codecov commented
Linking this issue: https://github.com/codecov/internal-issues/issues/699 as I think this can be resolved with parallelization.

@trent-codecov trent-codecov changed the title Check feasibility of running chained UploadProcessor tasks in parallel [Worker] Implement parallel upload processing Aug 28, 2024