Conversation

@Kimahriman
Contributor

@Kimahriman Kimahriman commented Sep 4, 2022

Description

#1370 will be fixed shortly after we upgrade to Spark 3.4.0. This PR just adds a test case.

How was this patch tested?

New UT.

Does this PR introduce any user-facing changes?

No

@Kimahriman
Contributor Author

Kimahriman commented Sep 4, 2022

If this method is acceptable, I have a working version for delete, and merge might be able to benefit from this as well.

@tdas tdas requested review from allisonport-db and tdas September 6, 2022 18:08
.collect()
}

// If no files need to be updated, the observation never gets initialized,
Collaborator

Can you reword this comment a little bit? It was confusing at first because I thought the following code would be the case when no files need to be updated.

Contributor Author

I see what you mean, how about:

// Only get the metrics if there are files to be updated. If there are no files that need
// to be updated, the observation may never get initialized and `.get` will hang.
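The hang being described can be sketched in isolation; a minimal, hypothetical example (not this PR's code) of why the guard matters, assuming a local SparkSession:

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

// Sketch (assumed local setup): Observation.get blocks until the observed
// CollectMetrics operator actually executes and publishes its metrics.
val spark = SparkSession.builder().master("local[2]").getOrCreate()
val observation = Observation("touched_files")

val observed = spark.range(100)
  .observe(observation, count(lit(1)).as("numTouchedRows"))
observed.collect()            // the action runs, so metrics get published

val metrics = observation.get // safe here, since the action above completed
// If the surrounding command short-circuits and never runs the action
// (e.g. no files matched the condition), this call would block forever --
// hence only fetching the metrics when there are files to update.
```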

@Tagar
Contributor

Tagar commented Sep 8, 2022

My understanding is that this uses the Observable API, which is based on query listener metrics, so it may be a bit more brittle for file pruning. I've seen cases where the query listener can miss events when spark.scheduler.listenerbus.eventqueue.capacity is set too low on larger clusters with a lot of tasks.
cc @hvanhovell who I think was the original author of the Observable API.
Should this improvement be behind a feature flag rather than on by default?
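The capacity knob mentioned above is a static configuration, so it can only be raised before the context starts; a hedged sketch (values illustrative):

```scala
import org.apache.spark.SparkConf

// spark.scheduler.listenerbus.eventqueue.capacity is a static conf and must
// be set before the SparkContext is created (the default is 10000 events per
// queue). Raising it reduces, but cannot eliminate, the chance of dropped
// listener events on busy clusters.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
```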

@Kimahriman
Contributor Author

Yeah, I guess you're right; it's all based on QueryExecutionListener, which processes async events that could get dropped. I'm trying to create a separate wrapper that can use the same CollectMetrics expression but pulls the metrics directly off the queryExecution.

@Kimahriman
Contributor Author

Created a util method to pull the metrics directly from queryExecution rather than through the Observation class, so that should avoid any reliance on the listener bus.
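A sketch of what such a util could look like (a hypothetical helper, not the PR's actual code), assuming Spark's `QueryExecution.observedMetrics` map:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical helper: run an action and read the CollectMetrics result
// synchronously from the finished QueryExecution, with no listener involved.
def collectWithMetrics(df: DataFrame, metricName: String): (Array[Row], Option[Row]) = {
  val observed = df.observe(metricName, count(lit(1)).as("numTouchedRows"))
  val rows = observed.collect()
  // observedMetrics is populated once the action completes, synchronously,
  // so there is no event queue that can drop the result.
  val metrics = observed.queryExecution.observedMetrics.get(metricName)
  (rows, metrics)
}
```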

@allisonport-db allisonport-db self-requested a review September 14, 2022 18:14
@zsxwing
Member

zsxwing commented Sep 14, 2022

uses Observable API which is based on query listener metrics

Do you mean reading the metrics from StreamingQueryListener? Here we are not doing that. This is a batch query and observe API should just return values from its internal accumulators. NVM. I was reading the latest change but your comment was for the old commit.

withStatusCode("DELTA", UpdateCommand.FINDING_TOUCHED_FILES_MSG) {
data.filter(new Column(updateCondition))
.select(input_file_name())
.filter(updatedRowUdf())
Member

Do you know why Spark doesn't do the column pruning properly? Although we can change the code to work around it, it's better to also fix it in Spark side as well.

Contributor Author

I think it's because PhysicalOperation is used to determine which nested fields to prune, and non-deterministic Projects aren't included in a PhysicalOperation. A lot of optimizations just don't apply to non-deterministic functions, so if there are ways to avoid using them it's probably good, especially when it's just to collect metrics.
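A hedged illustration of the claim (the dataframe and schema are hypothetical, and the exact pruning behavior depends on the Spark version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[2]").getOrCreate()
// Assume a nested schema such as: a: struct<b: int, c: string>
val df = spark.read.parquet("/tmp/nested_table")

// Deterministic filter: PhysicalOperation can collect the whole
// Project/Filter chain, so only a.b needs to be read from the files.
df.filter("a.b > 0").select("a.b")

// Non-deterministic filter: the planner stops collecting at this node,
// nested schema pruning doesn't kick in, and the full struct is read.
val ndTrue = udf(() => true).asNondeterministic()
df.filter(ndTrue()).select("a.b")
```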

Contributor Author

Or in this case, non-deterministic filter expressions don't get included in a PhysicalOperation. I'm not exactly sure when/how the empty Project gets added to the plan in the description. Interestingly, nested fields don't get pushed through a CollectMetrics operation, which prevents full schema pruning with this method for merge, but I already have a working Spark patch to enable that for the next Spark release if we go down the CollectMetrics vs. non-deterministic UDF approach.

Contributor Author

@Kimahriman Kimahriman Sep 14, 2022


And actually it's not just the metrics, but also the input_file_name that I think, either combined or on its own, causes the problem; it's hard to say exactly. But you could potentially use metadata fields instead of input_file_name in the future to help address these weird cases caused by non-deterministic functions (which is a problem for merge).
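A sketch of the metadata-field alternative (Spark 3.2+ exposes the hidden `_metadata` column for file sources; the path and condition here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val df = spark.read.parquet("/tmp/table_data")

// _metadata.file_path is a deterministic hidden metadata field, unlike
// input_file_name(), so it doesn't block pruning and other optimizations.
val touchedFiles = df
  .filter(col("value") > 0)              // stand-in for the update condition
  .select(col("_metadata.file_path"))
  .distinct()
```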

Member

I think it's because PhysicalOperation is used to determine which nested fields to prune, and non-deterministic Projects aren't included in a PhysicalOperation.

IIRC, the non-deterministic UDF makes Spark unable to know which columns are needed? Technically, Spark could still do nested column pruning with the non-deterministic UDF if the UDF doesn't read those columns, right? Do you know the exact optimization rule?

Trying to understand the entire problem better so that we can apply the same approach to other places that are using non-deterministic UDFs.

Contributor Author

Yeah, still trying to figure out the exact reason/code path. But I already have a fix for Delete, which has the same problem, and a partial fix for Merge (which has additional issues that might only be fixable come Spark 3.4); those are the only other places where it looks like the non-deterministic UDF-for-metrics approach is used.

Contributor Author

Ok, so while digging into the why, I was testing on Spark master and couldn't recreate the issue. After a lot of digging, I'm pretty sure apache/spark#37176 will fix this issue in Spark 3.4+. Gonna try to test all the cases (update/delete/merge) with the latest Spark master and see if it addresses all the current limitations.

@Kimahriman
Contributor Author

Delete still needs the same update as #1202, because even with the PhysicalOperation updates, columns still can't get pushed through a Filter - Filter(non-deterministic) - Project combo. I still haven't tested merge behavior with the updates. I can either close this or add a new UT for the current behavior (no nested pruning), with a comment that it should fail when upgrading to Spark 3.4.

@zsxwing
Member

zsxwing commented Oct 3, 2022

add a new UT with the current behavior (no nested pruning), with a comment that it should fail when upgrading to Spark 3.4.

This sounds better. It's worth having a unit test here, but not worth changing the prod code since it's going to be fixed soon.

@Kimahriman
Contributor Author

Sounds good. I'll update this PR and then make a separate issue/PR for delete, and try to see if anything needs to be done for merge.

@Kimahriman Kimahriman force-pushed the feature/update-nested-pruning branch from 3f52832 to b0c61f8 Compare October 4, 2022 11:47
@Kimahriman
Contributor Author

Just have the updated test here now.

@Kimahriman
Contributor Author

Made #1412 for delete

@zsxwing
Member

zsxwing commented Oct 4, 2022

Could you resolve the conflicts?

Conflicts:
	core/src/test/scala/org/apache/spark/sql/delta/UpdateSuiteBase.scala
@Kimahriman
Contributor Author

Oops, sorry, I meant to do that when I updated it. Done.

@zsxwing zsxwing changed the title Allow nested schema pruning in update operations first pass Add a nested schema pruning test case for Update Oct 4, 2022
Member

@zsxwing zsxwing left a comment


LGTM

@allisonport-db allisonport-db removed their request for review October 5, 2022 01:43
@zsxwing zsxwing added this to the 2.2.0 milestone Oct 5, 2022
@zsxwing
Member

zsxwing commented Oct 18, 2022

Closing this. It has been merged: acd0bd4

@zsxwing zsxwing closed this Oct 18, 2022