ARROW-10582: [Rust] [DataFusion] Implement "repartition" operator #8982
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #8982      +/-   ##
==========================================
- Coverage   82.64%   82.54%   -0.11%
==========================================
  Files         200      201       +1
  Lines       49730    49983     +253
==========================================
+ Hits        41098    41256     +158
- Misses       8632     8727      +95
```
Continue to review full report at Codecov.
alamb
left a comment
I like where this is headed @andygrove -- 👍
Not that you asked, but if I had to pick two of these three schemes to implement, I would pick RoundRobinBatch and Hash and leave RoundRobinRow until later.
The rationale is that I suspect the RoundRobinRow use case is much less common (e.g. maybe re-evening the output of joins or filters, but I would expect most operators to respect the requested batch size, if possible, when creating their output).
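The RoundRobinBatch scheme under discussion can be sketched outside of DataFusion; this is a minimal illustration (all names here are hypothetical, not DataFusion's actual API), showing how each incoming batch is routed to output partition `i % num_output_partitions`:

```rust
// Hypothetical sketch of RoundRobinBatch assignment: each incoming batch
// (represented here just by its index) is routed to the output partition
// numbered `i % num_output_partitions`. No rows are split; whole batches
// are moved, which is why this scheme is cheaper than RoundRobinRow.
fn round_robin_targets(num_batches: usize, num_output_partitions: usize) -> Vec<usize> {
    (0..num_batches).map(|i| i % num_output_partitions).collect()
}

fn main() {
    // 5 input batches spread over 3 output partitions
    let targets = round_robin_targets(5, 3);
    assert_eq!(targets, vec![0, 1, 2, 0, 1]);
}
```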
Thanks. That makes sense and I have removed RoundRobinRow now.
Agree. I was trying to implement the RoundRobinRow functionality independently and was going down a route similar to the StructBuilder vector-of-builders approach: https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1600. Staying at the RecordBatch level is much more sensible.
It seems to me that the biggest buffer we would want is the total number of cores available for processing. Any larger and we are just wasting memory and cache if the producer can create batches faster than the consumer can consume them.
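The core-count-sized buffer idea can be sketched with a bounded channel; this is a standalone illustration using `std` rather than the async channels DataFusion actually uses, and the numbers are made up:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Hypothetical sketch: size the channel buffer to the number of cores,
    // so a fast producer blocks instead of buffering without bound.
    // `available_parallelism` stands in for "cores available for processing".
    let capacity = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let (tx, rx) = sync_channel::<u32>(capacity);

    let producer = thread::spawn(move || {
        for batch_id in 0..100 {
            // `send` blocks once `capacity` batches are in flight
            tx.send(batch_id).unwrap();
        }
        // `tx` is dropped here, which ends the receiver's iterator
    });

    let received: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    assert_eq!(received.len(), 100);
}
```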
So this turned out to be the really challenging part. Input partitions send to multiple output partitions, but those output partitions could be read in order, and this results in deadlocks if the buffer is too small. I switched to using unbounded channels for now to make this functional, but I know this isn't a great solution. I think I need to sleep on it and have another look tomorrow.
I am adding unit tests now that will be easily modifiable to demonstrate this issue.
I would expect the deadlock problem to be most acute when trying to keep the data sorted (e.g. a traditional merge). I didn't think we had any operators like that (yet) in DataFusion.
Maybe we need to use try_recv when reading from channels rather than recv, so as not to block on empty channels.
When we do actually have something that is trying to keep the data sorted, the behavior you want is "keep producing until every output channel has at least one record batch".
Using round-robin repartitioning, you can probably avoid unbounded channels. Using hash repartitioning, however, I don't think there is in general any way to ensure you have evenly distributed rows.
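The try_recv idea can be sketched in isolation; this is a toy illustration with `std` channels rather than the channels used in the PR, and the channel contents are made up:

```rust
use std::sync::mpsc::{channel, TryRecvError};

fn main() {
    // Hypothetical sketch: poll several partition channels round-robin with
    // `try_recv` instead of blocking on one that happens to be empty, so the
    // consumer never deadlocks waiting on a channel whose producer is itself
    // blocked sending to a different, full channel.
    let (tx0, rx0) = channel::<i32>();
    let (tx1, rx1) = channel::<i32>();
    tx1.send(42).unwrap(); // only channel 1 has data so far
    // dropping the senders marks both channels as finished
    drop(tx0);
    drop(tx1);

    let receivers = [rx0, rx1];
    let mut collected = Vec::new();
    let mut open = receivers.len();
    while open > 0 {
        open = 0;
        for rx in &receivers {
            match rx.try_recv() {
                Ok(v) => {
                    collected.push(v);
                    open += 1; // channel may still have more data
                }
                Err(TryRecvError::Empty) => open += 1, // no data yet; retry later
                Err(TryRecvError::Disconnected) => {}  // channel finished
            }
        }
    }
    assert_eq!(collected, vec![42]);
}
```

A real operator would yield to the scheduler on `Empty` rather than spin, but the non-blocking shape is the same.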
@alamb @jorgecarleitao I have not implemented a I have a
@alamb @jorgecarleitao never mind ... switching to crossbeam did the trick
Force-pushed 8209284 to d845f92
@alamb @jorgecarleitao @seddonm1 @Dandandan This is ready for review now
```rust
let mut rx = self.rx.lock().await;
// ...
let num_input_partitions = self.input.output_partitioning().partition_count();
let num_output_partition = self.partitioning.partition_count();
```
Bikeshedding, but renaming `num_output_partition` to `num_output_partitions` would help readability.
```rust
let mut counter = 0;
while let Some(result) = stream.next().await {
    match partitioning {
        Partitioning::RoundRobinBatch(_) => {
```
The hash partition is not yet implemented here?
No. I filed https://issues.apache.org/jira/browse/ARROW-11011 to implement hash partitioning as a separate PR since it will be quite a lot of work.
`RepartitionExec::try_new` returns a `DataFusionError::NotImplemented` error if you try to create it with the hash partitioning scheme.
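The "reject unimplemented schemes in the constructor" pattern described here can be sketched in a self-contained form; the types below are simplified stand-ins, not DataFusion's real definitions:

```rust
// Hypothetical, simplified sketch of a fallible constructor that rejects
// an unimplemented partitioning scheme, mirroring the behavior described
// for RepartitionExec::try_new. Real DataFusion types differ.
#[derive(Debug)]
enum Partitioning {
    RoundRobinBatch(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug)]
enum DataFusionError {
    NotImplemented(String),
}

#[allow(dead_code)]
struct RepartitionExec {
    partitioning: Partitioning,
}

impl RepartitionExec {
    fn try_new(partitioning: Partitioning) -> Result<Self, DataFusionError> {
        match partitioning {
            Partitioning::RoundRobinBatch(_) => Ok(Self { partitioning }),
            Partitioning::Hash(..) => Err(DataFusionError::NotImplemented(
                "hash partitioning is not yet supported".to_string(),
            )),
        }
    }
}

fn main() {
    assert!(RepartitionExec::try_new(Partitioning::RoundRobinBatch(4)).is_ok());
    assert!(RepartitionExec::try_new(Partitioning::Hash(vec!["a".into()], 4)).is_err());
}
```

Failing fast at construction time keeps the error at plan-building rather than mid-execution.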
Ok 👍 makes sense!
I have old hash repartitioning code in a branch from a previous attempt. It is quite old by now, but I can definitely put it together for this (like I did for the join). I think we now actually have the framework in place to use it.
jorgecarleitao
left a comment
Great work, @andygrove . Good as is. Left minor comments.
```rust
fn repartition(
    &self,
    partitioning_scheme: Partitioning,
```
nit: this introduces a new name, partitioning_scheme.
We already have: `partition`, `partitioning`, `partitioning_scheme`, `repartition`, `part`.
I do not know the common notation, but we could try to reduce the number of different names we use.
In my (little) understanding:
- data is partitioned according to a `partition`
- partitioned data is divided in `part`s
- we can `repartition` it according to a new `partition`
In this understanding, I would replace `partitioning` and `partitioning_scheme` by `partition`.
Even if this understanding is not correct, maybe we could reduce the number of different names?
I agree keeping the number of different names low is important.
I suggest using:
- `partition` to refer to an actual portion of the data (a bunch of `RecordBatch`es)
- `partitioning` to refer to the "scheme" of how the data is divided into `partition`s (the use of the `Partitioning` enum now)
Thus we would `repartition` the data into a new `partitioning`.
rust/datafusion/src/dataframe.rs
```rust
/// let mut ctx = ExecutionContext::new();
/// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
/// let df1 = df.repartition(Partitioning::RoundRobinBatch(4))?;
/// let df2 = df.repartition(Partitioning::Hash(vec![col("a")], 4))?;
```
I would not place it in an example since we do not support it yet.
```rust
pub enum Partitioning {
    /// Allocate batches using a round-robin algorithm
    RoundRobinBatch(usize),
    /// Allocate rows based on a hash of one or more expressions
```
Document usize? (Number of parts?)
Maybe also add a comment here that Hash partitioning is not yet completely implemented so as to avoid runtime disappointment for someone who sees this enum in the code
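The documentation the reviewers ask for might look something like this; the wording and the `Vec<String>` stand-in for hash expressions are illustrative, not the actual DataFusion definitions:

```rust
/// Sketch of the suggested doc comments: document what each `usize`
/// means and warn that hash partitioning is not fully implemented yet.
pub enum Partitioning {
    /// Allocate batches using a round-robin algorithm; the `usize` is
    /// the number of output partitions
    RoundRobinBatch(usize),
    /// Allocate rows based on a hash of one or more expressions, into
    /// the given number of output partitions. Note: not yet completely
    /// implemented; constructing a `RepartitionExec` with this variant
    /// currently returns a NotImplemented error (see ARROW-11011).
    Hash(Vec<String>, usize),
}

fn main() {
    // Both variants carry the output partition count in the same position.
    let parts = |p: &Partitioning| match p {
        Partitioning::RoundRobinBatch(n) | Partitioning::Hash(_, n) => *n,
    };
    assert_eq!(parts(&Partitioning::RoundRobinBatch(4)), 4);
    assert_eq!(parts(&Partitioning::Hash(vec!["a".to_string()], 8)), 8);
}
```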
alamb
left a comment
This is really nicely done @andygrove 👍
```rust
// partitions to be blocked when sending data to output receivers that are not
// being read yet. This may cause high memory usage if the next operator is
// reading output partitions in order rather than concurrently. One workaround
// for this would be to add spill-to-disk capabilities.
```
I think the other workaround is to ensure that any operator that reads from multiple partitions doesn't block waiting for data from one partition's channel if other partitions can produce data.
With that invariant, the only operators that would need spill-to-disk would be ones that maintain sortedness (e.g. a classic merge).
Co-authored-by: Andrew Lamb <[email protected]>
Thanks for the reviews. I have pushed changes to address feedback:
…void potential deadlocks

# Rationale

As spotted / articulated by @edrevo in #9523 (comment), the intermixing of `crossbeam` channels (not designed for `async`, and able to block task threads) and `async` code such as DataFusion can lead to deadlock. At least one of the crossbeam uses predates DataFusion being async (e.g. the one in the parquet reader). The use of crossbeam in the repartition operator in #8982 may have resulted from the re-use of the same pattern.

# Changes

1. Removes the use of crossbeam channels from DataFusion (in `RepartitionExec` and `ParquetExec`) and replaces them with tokio channels (which are designed for async code).
2. Removes the `crossbeam` dependency entirely.
3. Removes the use of the `multi_thread`ed executor in tests (e.g. `#[tokio::test(flavor = "multi_thread")]`), which can mask hangs.

# Kudos / Thanks

This PR incorporates the work of @seddonm1 from #9603 and @edrevo in https://github.com/edrevo/arrow/tree/remove-crossbeam (namely 97c256c4f76b8185311f36a7b27e317588904a3a). A big thanks to both of them for their help in this endeavor.

Closes #9605 from alamb/alamb/remove_hang

Lead-authored-by: Ximo Guanter <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This PR adds support for the `repartition` operator, which is plumbed through from the `DataFrame` API all the way through to execution. The benchmark crate's TPC-H file conversion utility has been updated to take advantage of this new operator.
I can break this down into smaller PRs if that helps.