Add a nested schema pruning test case for Update #1371
Conversation
If this method is acceptable I have a working version for delete, and merge might be able to benefit from this as well.
```scala
// If no files need to be updated, the observation never gets initialized,
```
Can you reword this comment a little bit? It was confusing at first because I thought the following code would be the case when no files need to be updated.
I see what you mean, how about:
```scala
// Only get the metrics if there are files to be updated. If there are no files that need
// to be updated, the observation may never get initialized and `.get` will hang.
```
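A minimal sketch of the guard being discussed, assuming Spark's `Observation` API (Spark 3.3+); the names `metricsObservation`, `candidateFiles`, and `updateCondition` are hypothetical stand-ins, not the PR's actual code:

```scala
import org.apache.spark.sql.{Observation, functions => F}

// Hypothetical sketch: `Observation.get` blocks until an action on the
// observed Dataset finishes, so it must only be read on code paths where
// that action is guaranteed to have run.
val metricsObservation = Observation("update_metrics")
val touchedFiles: Seq[String] =
  if (candidateFiles.isEmpty) {
    // No files can match the condition, so the scan is skipped entirely.
    // The observed action never runs here; reading the observation would hang.
    Nil
  } else {
    data
      .observe(metricsObservation, F.count(F.lit(1)).as("numTouchedRows"))
      .filter(updateCondition)
      .select(F.input_file_name())
      .distinct()
      .collect()
      .map(_.getString(0))
      .toSeq
  }

// Only get the metrics if there are files to be updated.
val numTouchedRows: Long =
  if (candidateFiles.nonEmpty) metricsObservation.get("numTouchedRows").asInstanceOf[Long]
  else 0L
```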
My understanding is that this uses the Observable API, which is based on query listener metrics, so perhaps it's a bit more brittle for file pruning. I've seen cases where the query listener can miss some events when `spark.scheduler.listenerbus.eventqueue.capacity` is set.
Yeah, I guess you're right; it's all based on the QueryExecutionListener, which fires async events that could get dropped. I'm trying to create a separate wrapper that can use the same CollectMetrics expression but pulls the metrics directly off the queryExecution.
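One way such a wrapper might look (a sketch, assuming Spark's `QueryExecution.observedMetrics` map, which is populated once the action completes; the method name `collectWithMetrics` is made up):

```scala
import org.apache.spark.sql.{DataFrame, Row, functions => F}

// Hypothetical util: name the metric via Dataset.observe (which inserts a
// CollectMetrics node into the plan), run the action, then read the result
// synchronously from the plan's QueryExecution instead of waiting on an
// async QueryExecutionListener event that could be dropped.
def collectWithMetrics(df: DataFrame, metricName: String): (Array[Row], Option[Row]) = {
  val observed = df.observe(metricName, F.count(F.lit(1)).as("numRows"))
  val rows = observed.collect()
  // observedMetrics is keyed by the observation name given to observe().
  val metrics = observed.queryExecution.observedMetrics.get(metricName)
  (rows, metrics)
}
```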
Created a util method to pull the metrics directly from
```scala
withStatusCode("DELTA", UpdateCommand.FINDING_TOUCHED_FILES_MSG) {
  data.filter(new Column(updateCondition))
    .select(input_file_name())
    .filter(updatedRowUdf())
```
Do you know why Spark doesn't do the column pruning properly? Although we can change the code to work around it, it's better to fix it on the Spark side as well.
I think it's because PhysicalOperation is used to determine which nested fields to prune, and non-deterministic Projects aren't included in PhysicalOperations. I feel like there are a lot of optimizations that just don't apply to non-deterministic functions, so if there are ways to avoid using them it's probably good, especially when it's just to collect metrics.
Or in this case, non-deterministic filter expressions don't get included in PhysicalOperations. I'm not exactly sure when/how the empty Project gets added to the plan in the description. Interestingly, nested fields don't get pushed through a CollectMetrics operation, which prevents full schema pruning with this method for merge, but I already have a working Spark patch to enable that for the next Spark release if we go down the CollectMetrics vs. non-deterministic UDF approach.
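An illustrative repro of the pruning behavior described here (the path and schema are made up, and `rand()` stands in for a non-deterministic metrics UDF):

```scala
import org.apache.spark.sql.functions.{col, rand}

// Hypothetical table with schema: a: struct<b: int, c: string>
val df = spark.read.parquet("/tmp/nested")

// Deterministic plan: PhysicalOperation collects the project/filter and the
// scan can be pruned down to just a.b.
df.filter(col("a.b") > 0).select(col("a.b")).explain(true)

// With a non-deterministic expression in the way, PhysicalOperation stops
// collecting, so the scan ends up reading the full struct `a`.
df.filter(col("a.b") > 0 && rand() >= 0).select(col("a.b")).explain(true)
```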
And actually it's not just the metrics, but also the input_file_name; I think it's the two combined, or one of them alone, that causes the problem; it's hard to say exactly. But you could potentially use metadata fields instead of input_file_name in the future to help address these weird cases caused by non-deterministic functions (which is a problem for merge).
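For reference, the metadata-field alternative mentioned above could look like this (the hidden `_metadata` column is only available on file-source tables in Spark 3.3+; treat this as a sketch):

```scala
import org.apache.spark.sql.functions.{col, input_file_name}

// input_file_name() is a non-deterministic expression, which can block
// nested schema pruning and other optimizations:
val pathsViaUdf = df.select(input_file_name().as("path"))

// _metadata.file_path carries the same information as an ordinary
// (deterministic) attribute, so it avoids those limitations:
val pathsViaMetadata = df.select(col("_metadata.file_path").as("path"))
```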
> I think it's because PhysicalOperation is used to determine which nested fields to prune, and non-deterministic Projects aren't included in PhysicalOperations.

IIRC, the non-deterministic UDF makes Spark unable to know which columns are needed? Technically, Spark could still do nested column pruning around a non-deterministic UDF if the UDF doesn't read those columns, right? Do you know the exact optimization rule?
Trying to understand the entire problem better so that we can apply the same approach to other places that are using non-deterministic UDFs.
Yeah, still trying to figure out the exact reason/code path. But I already have a fix for Delete, which has the same problem, and a partial fix for Merge (which has additional issues that might only be fixable come Spark 3.4); those are the only other places where the non-deterministic-function-for-metrics pattern is used.
Ok, so while digging into the why, I was testing on Spark master and couldn't recreate the issue. After a lot of digging, I'm pretty sure apache/spark#37176 will fix this issue in Spark 3.4+. Gonna try to test all the cases (update/delete/merge) with the latest Spark master and see if it addresses all the current limitations.
Delete still needs the same update as #1202, because even with the PhysicalOperation updates, columns still can't get pushed through a Filter - Filter(non-deterministic) - Project combo. I still haven't tested merge behavior with the updates. I can either close this or add a new UT with the current behavior (no nested pruning), with a comment that it should fail when upgrading to Spark 3.4.
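The plan combo being described is roughly the following shape (illustrative, not actual explain output); nested-column pushdown stops at the non-deterministic filter in the middle:

```
Filter <further condition>
+- Filter updatedRowUdf()        // non-deterministic
   +- Project [input_file_name(), ...]
      +- Relation ... parquet
```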
This sounds better. It's worth having a unit test here, but not worth changing the prod code since it's going to be fixed soon.
Sounds good. I'll update this PR and then make a separate issue/PR for delete, and see if anything needs to be done with merge.
Force-pushed 3f52832 to b0c61f8.
Just have the updated test here now.
Made #1412 for delete.
Could you resolve the conflicts?
Conflicts: core/src/test/scala/org/apache/spark/sql/delta/UpdateSuiteBase.scala
Oops, sorry, I meant to do that when I updated it; done.
zsxwing
left a comment
LGTM
Closing this. It has been merged: acd0bd4
Description
#1370 will be fixed shortly after we upgrade to Spark 3.4.0. This PR just adds a test case.
How was this patch tested?
New UT.
Does this PR introduce any user-facing changes?
No