Ensure global stream partitioning contains node partitioning columns #11287
sopel39 merged 1 commit into prestodb:master
Global stream partitioning must not contradict node partitioning. Otherwise AddExchanges could make invalid decisions.
@sopel39 Karol, could you share some context behind this change? Did you run into an issue that prompted you to add this check? Could you give some examples of when node-partitioning symbols are a subset and a superset of the stream-partitioning symbols?
The current model, in theory, allows for node-level partitioning according to a set of columns and stream-level partition according to a different set of columns (e.g., partitioned across nodes on |
The context of this assertion relates to the "global streaming partitioning property", which I believe describes how a stream is partitioned at the global (cluster) level. This is different Take a look at: if global streaming partitioning were incompatible with global node partitioning, then we would make an invalid exchange decision.
A table scan might provide "global stream partitioning" but not "node partitioning". An Exchange node provides both "global stream partitioning" and "node partitioning" (com/facebook/presto/sql/planner/optimizations/PropertyDerivations.java:538).
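To make the invariant in the PR title concrete, here is a minimal sketch of such a check. It is not the code from this PR: the class `GlobalPartitioningCheck`, its methods, and the use of plain column-name sets are all assumptions made for illustration. It only shows the idea that, when both properties are present, the stream partitioning columns should contain the node partitioning columns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model; Presto's real property classes are richer than this.
public class GlobalPartitioningCheck {
    // Returns true when the two global properties do not contradict each other.
    public static boolean isConsistent(Optional<Set<String>> nodeColumns, Optional<Set<String>> streamColumns) {
        if (!nodeColumns.isPresent() || !streamColumns.isPresent()) {
            // Nothing to compare, e.g. a table scan that reports only stream partitioning.
            return true;
        }
        // Assumed rule (from the PR title): stream columns must contain node columns.
        return streamColumns.get().containsAll(nodeColumns.get());
    }

    // Fails fast so that a contradictory combination cannot go unnoticed.
    public static void verify(Optional<Set<String>> nodeColumns, Optional<Set<String>> streamColumns) {
        if (!isConsistent(nodeColumns, streamColumns)) {
            throw new IllegalStateException("Global stream partitioning columns " + streamColumns.get()
                    + " must contain node partitioning columns " + nodeColumns.get());
        }
    }

    public static void main(String[] args) {
        Set<String> node = new HashSet<>(Arrays.asList("a"));
        Set<String> stream = new HashSet<>(Arrays.asList("a", "b"));
        verify(Optional.of(node), Optional.of(stream)); // passes: {a, b} contains {a}
        verify(Optional.empty(), Optional.of(stream));  // passes: no node partitioning reported
        System.out.println(isConsistent(Optional.of(stream), Optional.of(node)));
    }
}
```

Treating an absent property as consistent mirrors the table scan case described above, where only stream partitioning is available.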
@sopel39 Karol, I'm still trying to understand this. Global properties appear to have two settings: node partitioning and stream partitioning. Stream partitioning appears to describe split partitioning, and I assume it is only used to capture source partitioning. I'm further assuming that the scheduling code somehow converts source partitioning into node partitioning by scheduling splits on separate nodes, though I don't know how that can be achieved if there are more splits than nodes. CC: @dain Hence, Furthermore, partitioned_on(a) implies partitioned_on(a, b, c), right? Hence, if the node partitioning columns are a superset of the split partitioning columns, then the node partitioning information is redundant. Similarly, if the split partitioning columns are a superset of the node partitioning columns, then the split partitioning information is redundant. So the new check asserts that either the node or the stream partitioning info is redundant. If that's the intent, then why not drop one of these and add a requirement that only node or stream partitioning can be specified?
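The claim that partitioned_on(a) implies partitioned_on(a, b, c) can be illustrated with a small sketch. Everything here is an assumption made for the example: the class `PartitionImplication`, rows modeled as `int` arrays `{a, b, c}`, and a partition chosen from column `a` alone. "Partitioned on" is taken to mean that rows with equal values in the key columns always land in the same partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Presto.
public class PartitionImplication {
    // Rows are {a, b, c}; the partition is derived from column a (index 0) only.
    public static int partitionOf(int[] row, int partitionCount) {
        return Math.floorMod(row[0], partitionCount);
    }

    // True if all rows that share the same values in keyColumns fall into one partition.
    public static boolean isPartitionedOn(List<int[]> rows, int partitionCount, int... keyColumns) {
        Map<List<Integer>, Integer> partitionByKey = new HashMap<>();
        for (int[] row : rows) {
            List<Integer> key = new ArrayList<>();
            for (int column : keyColumns) {
                key.add(row[column]);
            }
            int partition = partitionOf(row, partitionCount);
            Integer previous = partitionByKey.putIfAbsent(key, partition);
            if (previous != null && previous != partition) {
                return false; // same key seen in two different partitions
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[] {1, 10, 100},
                new int[] {1, 20, 200},
                new int[] {2, 10, 100},
                new int[] {5, 10, 100});
        System.out.println(isPartitionedOn(rows, 4, 0));       // key (a)
        System.out.println(isPartitionedOn(rows, 4, 0, 1, 2)); // key (a, b, c)
        System.out.println(isPartitionedOn(rows, 4, 1));       // key (b)
    }
}
```

Rows that agree on (a, b, c) also agree on a, so they cannot be split across partitions chosen from a. The converse does not hold: in this data, rows with the same b end up in different partitions.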
That doesn't seem to be the case (see:
That's the intent. That would probably work too. I just wanted to add an assertion so that no bogus plans are produced and go unnoticed in this tricky logic. The idea comes from #11262, which adds new logic.
mbasmanova left a comment
@sopel39 Karol, given the discussion in this PR the check itself seems fine, but my preference would be to address the underlying issue of having two global partitioning schemes. I think it is impossible to reason about more than one partitioning scheme and it would help to remove one of these.