[BUG] Use shared thread pool for multiple running instances of df on pyrunner #2502

Merged
jaychia merged 3 commits into main from jay/share-thread-pool on Jul 18, 2024

Conversation

@jaychia (Contributor) commented on Jul 11, 2024:

  1. Fixes the thread pool aspect of #2493 (Do correct resource accounting if running multiple dataframe collections in the pyrunner) by lifting the thread pool from a local per-generator variable to a shared self._thread_pool on the singleton runner.
  2. Fixes the shared resources aspect of #2493 by lifting inflight_tasks_resources and inflight_tasks to shared variables on the singleton runner (see the sketch after this list).
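
A minimal sketch of the lifted state, assuming illustrative names (the class shape, num_threads, and the dict layouts are placeholders modeled on the description above, not the actual Daft PyRunner internals):

```python
# Sketch only: names are illustrative, modeled on the PR description.
from concurrent import futures


class PyRunner:
    """Singleton runner: one thread pool and one set of bookkeeping dicts
    shared by every dataframe execution running in the process."""

    def __init__(self, num_threads: int) -> None:
        # Previously each execution generator created its own pool; lifting it
        # here means concurrent executions share the same worker threads.
        self._thread_pool = futures.ThreadPoolExecutor(max_workers=num_threads)
        # Shared accounting for all in-flight tasks across executions.
        self._inflight_tasks = {}            # task id -> task
        self._inflight_tasks_resources = {}  # task id -> resource request
```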

The second part is a little delicate because of the way these variables are used, which could introduce a race condition:

  1. self._can_admit_task checks against the inflight_tasks_resources to see if a task can be admitted
  2. If so, we then proceed with submitting a task to the thread pool and updating inflight_tasks_resources

If another iterator were somehow able to update inflight_tasks_resources between steps 1 and 2, our resource accounting would be wrong. However, I believe this should never happen, because our generators are not pre-emptable, so the race is not actually possible.
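
A sketch of that check-then-submit sequence with the window called out; _try_dispatch, _can_admit_task, _run_task, and next_step's id()/resource_request() are hypothetical names standing in for the actual runner code:

```python
# Sketch only: illustrates the admit-then-submit window described above;
# names mirror the PR description but are not the actual Daft code.
def _try_dispatch(self, next_step) -> bool:
    resource_request = next_step.resource_request()

    # Step 1: check the shared accounting to see whether the task fits.
    if not self._can_admit_task(resource_request):
        return False

    # If another execution's generator could run between the check above and
    # the updates below, the check would be stale. The PR's argument is that
    # these generator steps are driven serially and never pre-empted here,
    # so that interleaving does not happen in practice.

    # Step 2: submit the task and record it in the shared accounting.
    task_id = next_step.id()
    assert task_id not in self._inflight_tasks_resources
    self._thread_pool.submit(self._run_task, next_step)
    self._inflight_tasks[task_id] = next_step
    self._inflight_tasks_resources[task_id] = resource_request
    return True
```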

@jaychia jaychia requested a review from colin-ho July 12, 2024 20:48
# Register the inflight task and resources used.
future_to_task[future] = next_step.id()

self._inflight_tasks[next_step.id()] = next_step
A Member commented on the diff above:

I don't know if the id is guaranteed to be unique across dataframe executions. Let's add an assert that:

next_step.id() not in self._inflight_tasks_resources

jaychia (Contributor, Author) replied:

Just checked: I think it should be unique.

We have a singleton ID_GEN = itertools.count() that is used to generate the IDs, and it is supposedly threadsafe thanks to the GIL.
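
A minimal sketch of that ID scheme; next_task_id is a hypothetical wrapper shown only to illustrate the point:

```python
# Sketch only: a process-wide monotonic ID source, as described in the comment.
import itertools

ID_GEN = itertools.count()


def next_task_id() -> int:
    # next() on an itertools.count object completes in a single C-level call,
    # so under CPython's GIL concurrent callers still receive distinct values;
    # IDs therefore stay unique across dataframe executions in one process.
    return next(ID_GEN)
```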

jaychia (Contributor, Author) added:

(Also added the assert)

@jaychia jaychia enabled auto-merge (squash) July 18, 2024 01:21
@jaychia jaychia merged commit afcfecd into main Jul 18, 2024
44 checks passed
@jaychia jaychia deleted the jay/share-thread-pool branch July 18, 2024 01:38
Labels: bug (Something isn't working)
2 participants