Spark SystemFunctions are not pushed down during JOIN #9233
tmnd1991 wants to merge 1 commit into apache:main
Conversation
Hi @ConeyLiu, this still needs some refinement (mostly wrt testing), but do you think the change makes sense? I'd rather not put more work into it if I'm way off ;)
} else {
  filter.copy(condition = newCondition)
}
case j @ Join(_, _, _, Some(condition), _) =>
Here the join condition can be pushed to the leaf node by the Spark optimizer, right? I don't think this can cover the COW/MOR cases; COW/MOR needs some special handling here. I plan to do it, but I've been quite busy lately.
I discovered the bug while working with a MERGE statement, and this actually works with both CoW and MoR. I have it running on my cluster like that, and it correctly prunes all the partitions.
cc @nastra @dramaticlly @advancedxy for review
advancedxy
left a comment
How did you determine that the SystemFunctions are not pushed down?
Spark will push down predicates (which include predicates containing system functions) through joins (except for full outer joins), see: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1912 . So I don't think you need to handle joins specifically in ReplaceStaticInvoke.
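For reference, the dispatch in that optimizer rule can be approximated in a few lines. This is a rough Python sketch, not Spark code: the set of pushable join types mirrors Spark's `PushPredicateThroughJoin.canPushThrough`, while all the names here are mine.

```python
# Rough sketch of Spark's PushPredicateThroughJoin dispatch: join-condition
# predicates may be pushed below every join type except FULL OUTER.
PUSHABLE_JOIN_TYPES = {
    "inner", "cross", "left_outer", "right_outer",
    "left_semi", "left_anti", "existence",
}

def can_push_through(join_type: str) -> bool:
    """True if a join-condition predicate may be pushed below this join type."""
    return join_type.lower() in PUSHABLE_JOIN_TYPES

# A full outer join must keep the condition at the join node, because both
# sides can emit non-matching rows that the predicate would drop too early.
print(can_push_through("left_semi"))   # True
print(can_push_through("full_outer"))  # False
```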
Thanks @advancedxy, that explains a lot of what I was observing in my project.
Could you elaborate a bit more? The planning tree string/DAG of the Spark SQL would be helpful.
Sure, let me add a bit of context:
I started running a MERGE statement, taking advantage of SPJ. The resulting physical plan was creating 33 (+10 to exchange the file names) tasks for the subquery and 33 tasks for the second join. Therefore, knowing exactly which partitions I hit beforehand, I tried to help Iceberg/Spark a little by enumerating the partition values that are actually hit. To my surprise, the plan was exactly the same. Then I fixed this issue and also #9191 locally (adding an optimizer rule to my Spark session) and the scans actually changed. With this plan I obtain 25 (+10 for the shuffle) + 25 tasks, actually hitting only the minimum number of partitions. Given the context, I think I probably highlighted 2 "bugs": this one and #9191.
If the join type is full outer, it means there are notMatchedActions, so your MERGE INTO command should have a WHEN NOT MATCHED clause. Could you give the full plan tree or DAG for this changed plan? Is the join type still full outer? This is quite strange; I'm not sure why a Filter would be pushed down to the data source for a full outer join. You may set
Yes, sorry, there's also a WHEN NOT MATCHED statement. I can't attach the plan, but I'll push a reproducer soon.
Finally I got a reproducer inside the codebase; you can find it in TestSPJWithBucketing. Anyway, the more I work on this, the more I think the issue should be solved directly in the Scan, not by adding conditions manually. All the info should be available to Spark beforehand, am I right?
I did some quick debugging. The reason why Spark 3.4 succeeded is that
This question is answered; it's covered by
The main in my local clone is at d6eba2a; I applied the diff from this PR to my local main. Same with 3.4. I assume the test is expected to pass with the changes in
long affectedPartitions =
    sql(spark, "SELECT DISTINCT(partition) FROM %s.files", sourceTableName).count();
int shufflePartitions = Integer.parseInt(spark.conf().get("spark.sql.shuffle.partitions"));
Assertions.assertThat(tasks).isEqualTo(affectedPartitions * 2 + shufflePartitions);
Can you please explain the reasoning behind this assertion?
Sure. The target table is created with the following partitions (year_month, day, bucket(4, id)):
- 202306/01/0
- 202306/01/1
- 202306/01/2
- 202306/01/3
- 202306/02/0
- 202306/02/1
- 202307/01/3
the source table is created with the following partitions:
- 202306/01/0
- 202306/01/1
- 202306/02/0
- 202307/01/3
So the source table partitions are a subset of the target table partitions.
Spark statically knows that info, because it's part of the metadata that Iceberg keeps.
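The subset claim can be checked mechanically from the partition tuples listed above. A small sketch, where the tuples are copied from the two lists and the code itself is mine:

```python
# Partition tuples (year_month, day, bucket) copied from the lists above.
target_partitions = {
    ("202306", "01", 0), ("202306", "01", 1), ("202306", "01", 2),
    ("202306", "01", 3), ("202306", "02", 0), ("202306", "02", 1),
    ("202307", "01", 3),
}
source_partitions = {
    ("202306", "01", 0), ("202306", "01", 1),
    ("202306", "02", 0), ("202307", "01", 3),
}

# The source partitions are a subset of the target's, so only these
# partitions can possibly be touched by the merge.
assert source_partitions <= target_partitions
affected = source_partitions & target_partitions
print(len(affected))  # 4
```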
So a copy-on-write "merge" consists of 2 jobs:
- a left-semi join to understand which files are affected by the merge
- a full-outer join where the left side discards all the files not found while executing job 1
In our particular case (where we know that the source table partitions are a subset of the target table partitions), if we do that with a Storage Partitioned Join, the most efficient way to do it is to:
- create 1 task for each partition that will change; read all the files from both tables, join locally, collect the file names
- create 1 task for each partition that will change; for each task, read all the files from the target table partition except the ones that will not change (that's the effect of the IN), read all the files from the source table, join and apply the merge logic locally, write out new files, add these files to the snapshot, and remove the original files from the snapshot
Disclaimer: I know very little about the internals, and I can only imagine how hard this would be to actually do, but I'm quite sure it is "logically" doable 😄
So the reasoning behind the number of tasks is:
- 1 task per partition that is going to change, to collect the affected files
- spark.sql.shuffle.partitions tasks to shuffle the file list (which I thought could be broadcast, but I think it's not important right now)
- 1 task per partition that is going to change, to actually rewrite it
Let me know if there is any fallacy in my reasoning
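The arithmetic behind the test assertion follows directly from that reasoning. A minimal sketch, using the 25-partition / 10-shuffle-partition numbers from the plan described earlier in the thread:

```python
# Expected Spark task count for the copy-on-write MERGE, per the reasoning above:
#   affected * 2 -> one scan task plus one rewrite task per affected partition
#   + shuffle    -> spark.sql.shuffle.partitions tasks to exchange the file list
def expected_tasks(affected_partitions: int, shuffle_partitions: int) -> int:
    return affected_partitions * 2 + shuffle_partitions

print(expected_tasks(25, 10))  # 60 tasks: 25 scan + 10 shuffle + 25 rewrite
```

This is the same formula the test asserts: tasks == affectedPartitions * 2 + shufflePartitions.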
Forgot to add: if I add the condition (after the patch to ReplaceStaticInvoke), it actually prunes the partitions (and tasks) in 3.4 (but not in 3.5).
Another thing: I know that the tasks that get created are actually very fast (I would say almost skipped), but if the target table has 400,000 partitions, even the scheduling of those no-op tasks kills the performance of my job.
Force-pushed from 4650308 to 1a24ea6
I just rebased and tested on top of the latest main (2eea697): all tests (TestSPJWithBucketing) fail on 3.5, while on 3.4 testMergeSPJwithCondition passes and testMergeSPJwithoutCondition does not (which is expected to me, because merge/SPJ is not smart enough imho, so with the guidance of the condition it works, otherwise it does not).
@tmnd1991 are you saying that TestSPJWithBucketing is supposed to fail? I thought the idea was to write a test that fails without the change in this PR but passes with it.
The test TestSPJWithBucketing#testMergeSPJwithCondition on 3.4 passes with the patch and fails without it; the other one always fails because I wanted to highlight another fallacy. Probably I should remove it and address it in another PR (or directly in the Spark repo?).
I ran
I was wrong in the above hypothesis. Nevertheless, why the difference in
It was just me playing around; now they are identical, but the test results are the same as before (i.e. different between 3.4 and 3.5).
Commits:
- reformat
- The fix works but the test does not :)
- Change test
- Replace only full outer joins
- Add merge SPJ reproducer
- Align ReplaceStaticInvoke between 3.4 and 3.5
@aokolnychyi I see you fixed part of this in #9873, but Spark 3.4 still looks bugged on main. Do you have any time to give me feedback on this?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
PR to verify the bug reported in issue #9232.
With some guidance, I'm open to working on the fix too.