Allow for schema pruning during update check for files to touch#1202
Kimahriman wants to merge 2 commits into delta-io:master
Conversation
I've randomly noticed Update taking an incredibly long time for certain tables. Today it was taking ~10 minutes to find the files to update for a 1 GB table, and another 20-30 minutes to try to rewrite it. This was the first thing I found while trying to look more into that.
@Kimahriman - want to make a separate issue for this, with the full stack trace, so we can help debug?
scottsand-db left a comment
This change LGTM, but I'd like @vkorukanti to take a look as well.
Were you able to do any performance testing using this change? On the slow tables/queries you mentioned.
No I haven't; I was only able to verify the read schema change.
@Kimahriman It makes sense; the non-deterministic function is preventing the schema pruning from happening. Is it possible to add a test? There is
Yeah, it would be during the optimization stage; not sure if that can be captured. I can see
Had to add a physical plan variant of that, but got a test working.
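(For anyone wanting to poke at this from Python rather than the Scala test suite, here is a minimal sketch of checking the scan's read schema through the physical plan. It assumes a Delta table at `test` with `key` and `value` columns, and it relies on internal Spark APIs (`_jdf`, `queryExecution`), so it is not the test added in this PR.)

```python
# Minimal sketch, assuming a Delta table at "test" with columns key and value.
# Uses internal Spark APIs (_jdf, queryExecution); not the Scala test added here.
df = spark.read.format("delta").load("test").filter("key = 'c'").select("key")
plan_text = df._jdf.queryExecution().executedPlan().toString()
# If pruning kicked in, the file scan should only request the key column.
assert "ReadSchema: struct<key:string>" in plan_text
```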
Hi @Kimahriman - thanks for adding that test! I'm going to wait for @vkorukanti to take a look before I merge. Since we are so busy with the next release of Delta Lake and with the Data and AI Summit next week, he might not get to this for a week or so. Thanks!
Description
Resolves #1201
Allows for schema pruning in the first part of an update, which checks for files to touch.
Code snippet I ran:

```python
>>> import pyspark.sql.functions as F
>>> from delta.tables import DeltaTable
>>> table = DeltaTable.forPath(spark, "test")
>>> table.toDF().printSchema()
root
 |-- key: string (nullable = true)
 |-- value: long (nullable = true)
>>> table.update("key = 'c'", set={'value': F.lit(6)})
```

The execution plan for the find-files-to-update scan:
before:

```
(1) Scan parquet
Output: [key#526, value#527L]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string,value:bigint>
```

after:

```
(1) Scan parquet
Output: [key#686]
Batched: true
Location: TahoeBatchFileIndex [file:.../projects/delta/test]
PushedFilters: [IsNotNull(key), EqualTo(key,c)]
ReadSchema: struct<key:string>
```

Only `key` is read, not `value` as well. The line swap should result in the same behavior, but doing the select before the non-deterministic UDF allows schema pruning to happen.
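For illustration, a rough PySpark sketch of the ordering change described above. This is not the actual Scala change in UpdateCommand; `touched_udf` is just a stand-in for the command's non-deterministic UDF, and `table` is the DeltaTable from the snippet above.

```python
import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType

# Stand-in for the non-deterministic UDF used while finding touched files.
touched_udf = F.udf(lambda: True, BooleanType()).asNondeterministic()
cond = F.col("key") == "c"
df = table.toDF()  # DeltaTable from the snippet above

# "before" ordering: the non-deterministic filter sits between the scan and the
# final select; per the plan above, the scan keeps reading both key and value.
before = df.filter(cond).filter(touched_udf()).select(F.input_file_name())

# "after" ordering: select first, then apply the non-deterministic filter; per the
# plan above, the scan can be pruned down to just key.
after = df.filter(cond).select(F.input_file_name()).filter(touched_udf())

before.explain()
after.explain()
```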
How was this patch tested?
Existing UTs, plus a screenshot of the execution plan.
Does this PR introduce any user-facing changes?
Performance improvement for update with a data predicate.