Do correct resource accounting if running multiple dataframe collections in the pyrunner #2493

Open
samster25 opened this issue Jul 9, 2024 · 1 comment

Comments

@samster25
Member

  • Currently our pyrunner does not account for other dataframe collections that could be running in parallel.
  • Today our pyrunner spawns a threadpool for each `collect` / `iter_partitions` and assumes it has all the cores at its disposal (see the sketch after the failure case below).
  • Instead we should have a single shared threadpool with centralized resource accounting.

Failure case:

```python
iter1 = daft.read_parquet(...).iter_partitions()
iter2 = daft.read_parquet(...).iter_partitions()
```
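For context, a minimal sketch of why this is a problem; the `iter_partitions` helper and `tasks` argument below are illustrative stand-ins, not Daft's actual internals. Each call builds its own pool sized to all cores, so the two iterators above can oversubscribe the machine, and neither pool sees the resources the other has claimed.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

NUM_CPUS = multiprocessing.cpu_count()

def iter_partitions(tasks):
    # Illustrative only: every call creates its own pool sized to all cores,
    # so two concurrent iterators can have up to 2 * NUM_CPUS tasks running.
    with ThreadPoolExecutor(max_workers=NUM_CPUS) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:
            yield future.result()
```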
@jaychia
Contributor

jaychia commented Jul 11, 2024

I think #2502 should fix most of this. However, there is still a potential problem when multiple dataframe iterators are running in parallel.

Our current implementation doesn't place any hard limit on how many in-flight partitions each generator can have running. That is to say, if we run `it1 = df1.iter_partitions()` alongside `it2 = df2.iter_partitions()` and call `next(it1); next(it2)`, we will most likely end up in a situation where `it1` schedules a burst of tasks (until it either exhausts the "inflight resources" or some partition task has results available), potentially taking up all of the "inflight resources" and starving `it2`.

To fix this, we need something like #2279 to help us limit the number of in-flight tasks per iterator.
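One way such a limit could look, assuming a semaphore-based cap (the `BoundedIterator` name and `max_inflight` knob are made up for illustration; #2279 may take a different approach):

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

class BoundedIterator:
    """Illustrative sketch: cap how many tasks one iterator may have in flight."""

    def __init__(self, pool: ThreadPoolExecutor, max_inflight: int):
        self._pool = pool
        self._slots = threading.Semaphore(max_inflight)

    def run(self, tasks):
        for task in tasks:
            self._slots.acquire()  # blocks this iterator instead of starving others
            future: Future = self._pool.submit(task)
            future.add_done_callback(lambda _: self._slots.release())
            yield future
```

With a cap like this, `it1` can only hold `max_inflight` slots of the shared pool at a time, leaving room for `it2` to make progress.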

jaychia added a commit that referenced this issue Jul 18, 2024
…pyrunner (#2502)

1. Fixes the thread pool aspect of #2493 by lifting the `thread_pool`
from a local per-generator variable to a shared `self._thread_pool` in
the singleton runner
2. Fixes the shared resources aspect of #2493 by lifting `inflight_tasks_resources` and `inflight_tasks` to shared variables in the singleton runner (a simplified sketch follows)
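Roughly, the state moves from per-generator locals onto the singleton runner so that every `collect` / `iter_partitions` shares it. A simplified sketch of that shape (not the actual PyRunner code):

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

class PyRunner:
    # Simplified sketch of the shared-state layout, not Daft's real implementation.
    def __init__(self):
        # One pool for the whole process instead of one per generator.
        self._thread_pool = ThreadPoolExecutor(max_workers=multiprocessing.cpu_count())
        # Accounting shared by all concurrently running dataframe collections.
        self._inflight_tasks_resources = {}  # task_id -> resources reserved
        self._inflight_tasks = {}            # task_id -> in-flight partition task
```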

The second part of this is a little dangerous because of the way these variables are used, which can introduce a potential race condition:

1. `self._can_admit_task` checks against the `inflight_tasks_resources`
to see if a task can be admitted
2. If so, we then proceed with submitting a task to the thread pool and
updating `inflight_tasks_resources`

If another iterator were somehow able to update `inflight_tasks_resources` between steps 1 and 2, our resource accounting would fail. However, I think this should never happen: our generators are not pre-emptible, so I don't believe this race condition is actually possible.
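For illustration, if that non-preemptibility assumption ever stopped holding, making the check-and-reserve step atomic with a lock would close the window. A hedged sketch of that pattern (the `task` attributes and the `_can_admit_task` signature are simplified guesses, not the real API):

```python
import threading

_admission_lock = threading.Lock()

def admit_and_submit(runner, task):
    # Makes steps 1 and 2 atomic: no other iterator can touch the shared
    # accounting between the capacity check and the reservation.
    with _admission_lock:
        if not runner._can_admit_task(task):
            return None
        runner._inflight_tasks_resources[task.id] = task.resource_request
        runner._inflight_tasks[task.id] = task
    return runner._thread_pool.submit(task.run)
```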

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>