[BUG] Use shared thread pool for multiple running instances of df on pyrunner #2502

Merged
jaychia merged 3 commits into main from jay/share-thread-pool on Jul 18, 2024

Conversation

@jaychia (Contributor) commented on Jul 11, 2024:

  1. Fixes the thread pool aspect of #2493 (Do correct resource accounting if running multiple dataframe collections in the pyrunner) by lifting the thread pool from a local per-generator variable to a shared self._thread_pool on the singleton runner.
  2. Fixes the shared resources aspect of #2493 by lifting inflight_tasks_resources and inflight_tasks to shared variables on the singleton runner (see the sketch after this list).
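
A minimal sketch of the lifted state, assuming illustrative names (the class shape, num_threads, and the dict layouts are placeholders modeled on the description above, not the actual Daft PyRunner internals):

```python
# Sketch only: names are illustrative, modeled on the PR description.
from concurrent import futures


class PyRunner:
    """Singleton runner: one thread pool and one set of bookkeeping dicts
    shared by every dataframe execution running in the process."""

    def __init__(self, num_threads: int) -> None:
        # Previously each execution generator created its own pool; lifting it
        # here means concurrent executions share the same worker threads.
        self._thread_pool = futures.ThreadPoolExecutor(max_workers=num_threads)
        # Shared accounting for all in-flight tasks across executions.
        self._inflight_tasks = {}            # task id -> task
        self._inflight_tasks_resources = {}  # task id -> resource request
```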

The second part is a little delicate because of the way these variables are used, which could introduce a race condition:

  1. self._can_admit_task checks against the inflight_tasks_resources to see if a task can be admitted
  2. If so, we then proceed with submitting a task to the thread pool and updating inflight_tasks_resources

If another iterator were somehow able to update inflight_tasks_resources between steps 1 and 2, our resource accounting would be wrong. However, I believe this should never happen, because our generators are not pre-emptable, so the race is not actually possible.
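
A sketch of that check-then-submit sequence with the window called out; _try_dispatch, _can_admit_task, _run_task, and next_step's id()/resource_request() are hypothetical names standing in for the actual runner code:

```python
# Sketch only: illustrates the admit-then-submit window described above;
# names mirror the PR description but are not the actual Daft code.
def _try_dispatch(self, next_step) -> bool:
    resource_request = next_step.resource_request()

    # Step 1: check the shared accounting to see whether the task fits.
    if not self._can_admit_task(resource_request):
        return False

    # If another execution's generator could run between the check above and
    # the updates below, the check would be stale. The PR's argument is that
    # these generator steps are driven serially and never pre-empted here,
    # so that interleaving does not happen in practice.

    # Step 2: submit the task and record it in the shared accounting.
    task_id = next_step.id()
    assert task_id not in self._inflight_tasks_resources
    self._thread_pool.submit(self._run_task, next_step)
    self._inflight_tasks[task_id] = next_step
    self._inflight_tasks_resources[task_id] = resource_request
    return True
```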

@jaychia jaychia requested a review from colin-ho July 12, 2024 20:48
# Register the inflight task and resources used.
future_to_task[future] = next_step.id()

self._inflight_tasks[next_step.id()] = next_step
A Member commented on the diff above:

I don't know if the id is guaranteed to be unique across dataframe executions. Let's add an assert that:

next_step.id() not in self._inflight_tasks_resources

jaychia (Contributor, Author) replied:

Just checked: I think it should be unique.

We have a singleton ID_GEN = itertools.count() that is used to generate the IDs, and it is supposedly threadsafe thanks to the GIL.
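
A minimal sketch of that ID scheme; next_task_id is a hypothetical wrapper shown only to illustrate the point:

```python
# Sketch only: a process-wide monotonic ID source, as described in the comment.
import itertools

ID_GEN = itertools.count()


def next_task_id() -> int:
    # next() on an itertools.count object completes in a single C-level call,
    # so under CPython's GIL concurrent callers still receive distinct values;
    # IDs therefore stay unique across dataframe executions in one process.
    return next(ID_GEN)
```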

jaychia (Contributor, Author) added:

(Also added the assert)

@jaychia jaychia enabled auto-merge (squash) July 18, 2024 01:21
@jaychia jaychia merged commit afcfecd into main Jul 18, 2024
44 checks passed
@jaychia jaychia deleted the jay/share-thread-pool branch July 18, 2024 01:38
Labels: bug (Something isn't working)
2 participants