
Allow for schema pruning during update check for files to touch#1202

Closed
Kimahriman wants to merge 2 commits into delta-io:master from Kimahriman:update-schema-pruning

Conversation

@Kimahriman
Contributor

Description

Resolves #1201

Allows for schema pruning in the first part of an update to check for files to touch.

Code snippet I ran:

```python
>>> import pyspark.sql.functions as F
>>> from delta.tables import DeltaTable
>>> table = DeltaTable.forPath(spark, "test")
>>> table.toDF().printSchema()
root
 |-- key: string (nullable = true)
 |-- value: long (nullable = true)

>>> table.update("key = 'c'", set={'value': F.lit(6)})
```

The execution plan for the scan that finds the files to update, before:

```
(1) Scan parquet
Output [2]: [key#526, value#527L]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string,value:bigint>
```

after:

```
(1) Scan parquet
Output [1]: [key#686]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string>
```

Only `key` is read, not `value` as well. The line swap should result in the same behavior, but doing the select before the non-deterministic UDF allows schema pruning to happen.

How was this patch tested?

Existing UTs plus screenshot of execution plan.

Does this PR introduce any user-facing changes?

Performance improvement for update with data predicate.

@Kimahriman
Contributor Author

I've randomly noticed Update taking an incredibly long time for certain tables. Today it was taking ~10 minutes to find the files to update for a 1 GB table, and another 20-30 minutes to try to rewrite it. This was the first thing I found while trying to look more into that.

@scottsand-db
Collaborator

> I've randomly noticed Update taking an incredibly long time for certain tables. Today it was taking ~10 minutes to find the files to update for a 1 GB table, and another 20-30 minutes to try to rewrite it. This was the first thing I found while trying to look more into that.

@Kimahriman - want to make a separate issue for this? with the full stack trace; so we can help debug.

@scottsand-db
Collaborator

This change LGTM, but I'd like @vkorukanti to take a look as well.

Were you able to do any performance testing with this change on the slow tables/queries you mentioned?

@Kimahriman
Contributor Author

No, I haven't; I was only able to verify the read schema change.

@scottsand-db scottsand-db self-requested a review June 15, 2022 18:12
@vkorukanti
Collaborator

@Kimahriman It makes sense, the non-deterministic function is preventing the schema pruning from happening. Is it possible to add a test? There is DeltaTestUtils.withLogicalPlansCaptured that you can try (though I'm not sure whether the schema pruning happens in the logical planning itself or in physical planning).

@Kimahriman
Contributor Author

Yeah, it would be during the optimization stage. Not sure if that can be captured; I'll see.

@Kimahriman
Contributor Author

Had to add a physical-plan variant of that, but got a test working.

@scottsand-db scottsand-db added the enhancement New feature or request label Jun 24, 2022
@scottsand-db
Collaborator

Hi @Kimahriman - thanks for adding that test! I'm going to wait for @vkorukanti to take a look before I merge.

Since we are so busy with the next release of Delta Lake, and with the Data and AI summit next week, he might not get to this for a week or so. Thanks!

ganeshchand pushed a commit to ganeshchand/delta that referenced this pull request Aug 10, 2022
Closes delta-io#1202

Signed-off-by: Venki Korukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: a4a52a19fa1d18f0727d1dd134e7d38d4cbabfc3
@allisonport-db allisonport-db added this to the 2.1.0 milestone Aug 28, 2022

Labels

enhancement (New feature or request), waiting for merge


Development

Successfully merging this pull request may close these issues.

[BUG] Update doesn't schema prune when finding files to update

5 participants