PERF-#4182, FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes #4391
This is a case where we could pass the right partitions and their respective call queues into the remote calls.
We can, but I don't see any benefit in applying the right partition's call queue inside the remote call here. That would be useful with lazy execution (and even that is questionable, as described below), but here we use `partition.apply`, which executes the remote calls immediately.

In this line we start draining the call queue of each `right_partition` early, in the background, and immediately update `right_partition._data`. Later, while the serial loop calls `left_partition.apply` for each partition, the remote calls from each `right_partition`'s call queue will, with some probability, already have finished.

If we go your way, the call queue of `right_partition` will be drained only after the call queue of `left_partition` has been drained in the remote call.

Also, by passing the right partition's call queue together with `right_partition._data` into `left_partition.apply`, we don't drain `right_partition.call_queue` and don't update `right_partition._data`. If we work with the right dataframe later in the code, we will run `drain_call_queue` again for each partition of the right dataframe. As a result, the same calls are executed twice, in different places.

What benefits do you see in your suggestion? How would we resolve the issue I described?
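To make the double-work concern concrete, here is a minimal sketch using a toy partition class. It is not Modin's actual partition implementation: `ToyPartition`, `run_queue`, `double`, and `add` are hypothetical names, and the "remote" call is simulated with a plain function call. It only illustrates how passing an undrained queue can lead to the same deferred call running twice.

```python
class ToyPartition:
    """Toy stand-in for a partition with a lazy call queue (not Modin's class)."""

    def __init__(self, data, call_queue=None):
        self._data = data
        self.call_queue = list(call_queue or [])

    def add_to_apply_calls(self, func):
        # Defer func: return a new partition with func appended to the queue.
        return ToyPartition(self._data, self.call_queue + [func])

    def drain_call_queue(self):
        # Run all deferred calls once and store the result in _data.
        for func in self.call_queue:
            self._data = func(self._data)
        self.call_queue = []


def run_queue(data, call_queue):
    # What a remote task would do if it received data plus a call queue.
    # The partition that owns the queue is not updated.
    for func in call_queue:
        data = func(data)
    return data


calls = []  # records every execution of the deferred function


def double(values):
    calls.append("double")
    return [v * 2 for v in values]


def add(left, right):
    return [a + b for a, b in zip(left, right)]


# Flow A: drain the right partition first, then pass only its data.
right_a = ToyPartition([1, 2, 3]).add_to_apply_calls(double)
right_a.drain_call_queue()
result_a = add([10, 20, 30], right_a._data)
right_a.drain_call_queue()  # later use of the right frame: nothing left to do
calls_a = len(calls)

# Flow B: pass the undrained data and the queue into the "remote" call.
calls.clear()
right_b = ToyPartition([1, 2, 3]).add_to_apply_calls(double)
result_b = add([10, 20, 30], run_queue(right_b._data, right_b.call_queue))
right_b.drain_call_queue()  # later use of the right frame repeats the work
calls_b = len(calls)

print(result_a, calls_a)
print(result_b, calls_b)
```

Both flows produce the same result, but in this toy model flow B runs `double` once inside the simulated remote call and once more when the right partition is drained later.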
These are all valid points, but there are other factors that can affect performance. For instance, if we consider Ray, the following factors (probably among others) can affect performance: 1. how many partitions the right array contains; 2. how much materialized data of the right partitions is stored in the in-process memory of the driver; 3. how much materialized data of the right partitions is stored in plasma.
We can experiment with this case in a separate issue if we see good timings without passing the call queues of the right partitions into the remote calls.
I still believe that the main issue with the implementation you suggest is the duplicated processing for the right partition: the call queue is drained first in the remote call from the left partition, and then again when the right partition is drained directly in another part of the code.
I mean the flow described in the sentences above.
So I suggest experimenting with this new execution flow in a separate issue, because it possibly requires a lot of architectural changes.