[Spark] Disable implicit casting in Delta streaming sink #3691
FWIW, we noticed something similar recently with inserts on merge: the implicit cast failed if struct fields were missing.
I believe that's expected in that case: unless you enable schema evolution, MERGE / UPDATE rejects writes with missing struct fields. Batch (non-streaming) INSERT also rejects that, although I've recently looked closer at batch insert behavior and the truth is that it's a bit all over the place. Rejecting missing fields is a reasonable behavior when schema evolution is disabled, as it means we instead fall back to schema enforcement. Streaming writes didn't behave that way though, which I missed.
This is missing fields in the source, not the target, so it's not related to schema evolution. Easy example:
import pyspark.sql.functions as F
from delta import DeltaTable
# Create a table with a nested struct<id: long, value: long> column
spark.range(10).select(F.struct('id', (F.col('id') * 10).alias('value')).alias('nested')).write.format('delta').save('/tmp/merge-test')
# Appending works without "nested.value"
spark.range(10, 20).select(F.struct('id').alias('nested')).write.format('delta').mode('append').save('/tmp/merge-test')
table = DeltaTable.forPath(spark, '/tmp/merge-test')
# Merging fails with "Cannot cast struct<id:bigint> to struct<id:bigint,value:bigint>. All nested columns must match"
source = spark.range(20, 30).select(F.struct('id').alias('nested')).alias('source')
table.alias('target').merge(source, 'target.nested.id = source.nested.id').whenNotMatchedInsertAll().execute()
Right, we do handle SQL insert by name and SQL/DF insert by position in DeltaAnalysis and apply schema enforcement, but let That's what I meant by insert behavior is a bit all over the place, I looked at it a couple of weeks ago to see if it could be fixed - and allow implicit casting in
Would it make more sense to do a
That seems too limited, The column reordering part does make sense though, although the issue isn't so much how to apply it as much as making sure we're not breaking existing workloads if we start doing it in more cases
So it seems like the whole issue stems from the fact that
+1, I have it on my todo list to look into this.
@Kimahriman I finally got around to implementing some of the changes we discussed previously E.p.
Description
#3443 introduced implicit casting when writing to a Delta table using a streaming query.
We are disabling this change for now as it regresses behavior when a struct field is missing in the input data. This previously succeeded, filling the missing fields with null, but would now fail with:
Note: batch INSERT fails in this scenario with:
but since streaming writes allowed this, we have to preserve that behavior.
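To make the two behaviors concrete, here is a minimal pure-Python sketch. It is a toy model only, not Delta's implementation: structs are modeled as dicts, the target schema as a list of field names, and the helper names fill_missing_fields and strict_cast are hypothetical. The first helper mimics the previous streaming behavior (missing fields become null), the second mimics a strict cast that rejects a source struct with missing fields.

```python
from typing import Any, Dict, List


def fill_missing_fields(value: Dict[str, Any], target_fields: List[str]) -> Dict[str, Any]:
    """Lenient write (toy model): fields absent from the input become None (null)."""
    return {name: value.get(name) for name in target_fields}


def strict_cast(value: Dict[str, Any], target_fields: List[str]) -> Dict[str, Any]:
    """Strict cast (toy model): reject the input if any target field is missing."""
    missing = [name for name in target_fields if name not in value]
    if missing:
        raise ValueError(f"Cannot cast struct: missing fields {missing}")
    return {name: value[name] for name in target_fields}


# Hypothetical target schema for the nested struct<id, value> column.
target = ["id", "value"]
row = {"id": 1}  # input struct without "value"

# Lenient path: succeeds and fills "value" with None.
filled = fill_missing_fields(row, target)

# Strict path: raises because "value" is missing.
try:
    strict_cast(row, target)
    strict_failed = False
except ValueError:
    strict_failed = True
```

In this model, disabling the implicit cast corresponds to streaming writes going back through the lenient path.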
How was this patch tested?
Tests added as part of #3443, in particular with the flag disabled.
Does this PR introduce any user-facing changes?
Disabled behavior change that was to be introduced with #3443.