
Allow for schema pruning during update check for files to touch#1202

Closed
Kimahriman wants to merge 2 commits into delta-io:master from Kimahriman:update-schema-pruning

Conversation

@Kimahriman
Contributor

Description

Resolves #1201

Allows for schema pruning in the first part of an update to check for files to touch.

Code snippet I ran:

```python
>>> import pyspark.sql.functions as F
>>> from delta.tables import DeltaTable
>>> table = DeltaTable.forPath(spark, "test")
>>> table.toDF().printSchema()
root
 |-- key: string (nullable = true)
 |-- value: long (nullable = true)

>>> table.update("key = 'c'", set={'value': F.lit(6)})
```

The execution plan for the scan that finds the files to update, before:

```
(1) Scan parquet
Output [2]: [key#526, value#527L]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string,value:bigint>
```

after:

```
(1) Scan parquet
Output [1]: [key#686]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string>
```

Only `key` is read, not `value` as well. The line swap should result in the same behavior, but doing the select before the non-deterministic UDF allows schema pruning to happen.

How was this patch tested?

Existing UTs plus screenshot of execution plan.

Does this PR introduce any user-facing changes?

Performance improvement for update with data predicate.

@Kimahriman
Contributor Author

I've randomly noticed Update taking an incredibly long time for certain tables. Today it was taking ~10 minutes to find the files to update for a 1 GB table, and another 20-30 minutes to try to rewrite it. This was the first thing I found while trying to look more into that.

@scottsand-db
Collaborator

> I've randomly noticed Update taking an incredibly long time for certain tables. Today it was taking ~10 minutes to find the files to update for a 1 GB table, and another 20-30 minutes to try to rewrite it. This was the first thing I found while trying to look more into that.

@Kimahriman - want to make a separate issue for this? with the full stack trace; so we can help debug.

@scottsand-db
Collaborator

This change LGTM, but I'd like @vkorukanti to take a look as well.

Were you able to do any performance testing with this change on the slow tables/queries you mentioned?

@Kimahriman
Contributor Author

No, I haven't; I was only able to verify the read schema change.

@scottsand-db scottsand-db self-requested a review June 15, 2022 18:12
@vkorukanti
Collaborator

@Kimahriman It makes sense, the non-deterministic function is preventing the schema pruning from happening. Is it possible to add a test? There is DeltaTestUtils.withLogicalPlansCaptured that you can try (though I'm not sure whether the schema pruning happens in the logical planning itself or in physical planning).

@Kimahriman
Contributor Author

Yeah, it would be during the optimization stage. Not sure if that can be captured; I'll see.

@Kimahriman
Contributor Author

Had to add a physical-plan variant of that, but got a test working.

@scottsand-db scottsand-db added the enhancement New feature or request label Jun 24, 2022
@scottsand-db
Collaborator

Hi @Kimahriman - thanks for adding that test! I'm going to wait for @vkorukanti to take a look before I merge.

Since we are so busy with the next release of Delta Lake, and with the Data and AI summit next week, he might not get to this for a week or so. Thanks!

ganeshchand pushed a commit to ganeshchand/delta that referenced this pull request Aug 10, 2022
Closes delta-io#1202

Signed-off-by: Venki Korukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: a4a52a19fa1d18f0727d1dd134e7d38d4cbabfc3
@allisonport-db allisonport-db added this to the 2.1.0 milestone Aug 28, 2022

Labels

enhancement (New feature or request), waiting for merge


Development

Successfully merging this pull request may close these issues.

[BUG] Update doesn't schema prune when finding files to update

5 participants