Do correct resource accounting if running multiple dataframe collections in the pyrunner #2493

Open
samster25 opened this issue Jul 9, 2024 · 1 comment

Comments

@samster25
Member

  • Currently our pyrunner does not account for other dataframe collections that could be running in parallel.
  • Today our pyrunner spawns a threadpool for each `collect` / `iter_partitions` and assumes it has all the cores at its disposal (see the sketch after the failure case below).
  • Instead we should have a single shared threadpool with centralized resource accounting.

Failure case:

```python
iter1 = daft.read_parquet(...).iter_partitions()
iter2 = daft.read_parquet(...).iter_partitions()
```
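For context, a minimal sketch of why this is a problem; the `iter_partitions` helper and `tasks` argument below are illustrative stand-ins, not Daft's actual internals. Each call builds its own pool sized to all cores, so the two iterators above can oversubscribe the machine, and neither pool sees the resources the other has claimed.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

NUM_CPUS = multiprocessing.cpu_count()

def iter_partitions(tasks):
    # Illustrative only: every call creates its own pool sized to all cores,
    # so two concurrent iterators can have up to 2 * NUM_CPUS tasks running.
    with ThreadPoolExecutor(max_workers=NUM_CPUS) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:
            yield future.result()
```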
@jaychia
Contributor

jaychia commented Jul 11, 2024

I think #2502 should fix most of this. However, there is still a potential problem when multiple dataframe iterators are running in parallel.

Our current implementation doesn't place any hard limit on how many in-flight partitions each generator can have running. That is to say, if we run `it1 = df1.iter_partitions()` alongside `it2 = df2.iter_partitions()` and call `next(it1); next(it2)`, we will most likely end up in a situation where `it1` schedules a burst of tasks (until it either exhausts the "inflight resources" or some partition task has results available), potentially taking up all of the "inflight resources" and starving `it2`.

To fix this, we need something like #2279 to help us limit the number of in-flight tasks per iterator.
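One way such a limit could look, assuming a semaphore-based cap (the `BoundedIterator` name and `max_inflight` knob are made up for illustration; #2279 may take a different approach):

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

class BoundedIterator:
    """Illustrative sketch: cap how many tasks one iterator may have in flight."""

    def __init__(self, pool: ThreadPoolExecutor, max_inflight: int):
        self._pool = pool
        self._slots = threading.Semaphore(max_inflight)

    def run(self, tasks):
        for task in tasks:
            self._slots.acquire()  # blocks this iterator instead of starving others
            future: Future = self._pool.submit(task)
            future.add_done_callback(lambda _: self._slots.release())
            yield future
```

With a cap like this, `it1` can only hold `max_inflight` slots of the shared pool at a time, leaving room for `it2` to make progress.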

jaychia added a commit that referenced this issue Jul 18, 2024
…pyrunner (#2502)

1. Fixes the thread pool aspect of #2493 by lifting the `thread_pool`
from a local per-generator variable to a shared `self._thread_pool` in
the singleton runner
2. Fixes the shared resources aspect of #2493 by lifting `inflight_tasks_resources` and `inflight_tasks` to shared variables in the singleton runner (a simplified sketch follows)
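Roughly, the state moves from per-generator locals onto the singleton runner so that every `collect` / `iter_partitions` shares it. A simplified sketch of that shape (not the actual PyRunner code):

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

class PyRunner:
    # Simplified sketch of the shared-state layout, not Daft's real implementation.
    def __init__(self):
        # One pool for the whole process instead of one per generator.
        self._thread_pool = ThreadPoolExecutor(max_workers=multiprocessing.cpu_count())
        # Accounting shared by all concurrently running dataframe collections.
        self._inflight_tasks_resources = {}  # task_id -> resources reserved
        self._inflight_tasks = {}            # task_id -> in-flight partition task
```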

The second part of this is a little dangerous because of the way these variables are used, which can introduce a potential race condition:

1. `self._can_admit_task` checks against the `inflight_tasks_resources`
to see if a task can be admitted
2. If so, we then proceed with submitting a task to the thread pool and
updating `inflight_tasks_resources`

If another iterator were somehow able to update `inflight_tasks_resources` between steps 1 and 2, our resource accounting would fail. However, I think this should never happen: our generators are not pre-emptible, so I don't believe this race condition is actually possible.
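For illustration, if that non-preemptibility assumption ever stopped holding, making the check-and-reserve step atomic with a lock would close the window. A hedged sketch of that pattern (the `task` attributes and the `_can_admit_task` signature are simplified guesses, not the real API):

```python
import threading

_admission_lock = threading.Lock()

def admit_and_submit(runner, task):
    # Makes steps 1 and 2 atomic: no other iterator can touch the shared
    # accounting between the capacity check and the reservation.
    with _admission_lock:
        if not runner._can_admit_task(task):
            return None
        runner._inflight_tasks_resources[task.id] = task.resource_request
        runner._inflight_tasks[task.id] = task
    return runner._thread_pool.submit(task.run)
```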

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>