[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" #31133
attilapiros wants to merge 10 commits into apache:master
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure

jenkins retest this please

Test build #133932 has finished for PR 31133 at commit

Kubernetes integration test starting
Kubernetes integration test status failure
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
Test build #133936 has finished for PR 31133 at commit
cc @gengliangwang too FYI

Kubernetes integration test starting
Kubernetes integration test status success

Test build #133963 has finished for PR 31133 at commit

@dongjoon-hyun I have addressed all of your comments. Is there anything else I can help you with?
xkrogen left a comment:
Good catch @attilapiros, it's great to see this!
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
Test build #134126 has started for PR 31133 at commit

Kubernetes integration test starting
Kubernetes integration test status success

Test build #134136 has finished for PR 31133 at commit

cc @cloud-fan

@dongjoon-hyun may I ask for another review from you?
@xkrogen. That other instance explicitly uses

Thank you for updating, @attilapiros. But I'm still not sure about the other properties in that namespace. So, if you don't mind, may I ask you again to narrow down this PR to

@dongjoon-hyun this sounds good to me.

Thanks for the clarification @dongjoon-hyun! I understand your concern now. The new plan sounds good to me as well.
Kubernetes integration test starting
Kubernetes integration test status failure
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
At the data source layer we have the following schema evolution test coverage. I guess we need at least (1) an add-column and (2) a hide-column test case in this PR. (1) is already included in this PR, so we need (2).

| File Format | Coverage   | Note                                                  |
| ----------- | ---------- | ----------------------------------------------------- |
| TEXT        | N/A        | Schema consists of a single string column.            |
| CSV         | 1, 2, 4    |                                                       |
| JSON        | 1, 2, 3, 4 |                                                       |
| ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
| PARQUET     | 1, 2, 3    |                                                       |
| AVRO        | 1, 2, 3    |                                                       |

If you can also add (3) change a column position, that would be best.
Test build #134889 has finished for PR 31133 at commit

Test build #134893 has finished for PR 31133 at commit

Kubernetes integration test starting
Kubernetes integration test status failure

Test build #134897 has finished for PR 31133 at commit

Kubernetes integration test starting
Kubernetes integration test status success

Test build #134919 has finished for PR 31133 at commit
… tables using "avro.schema.url"

### What changes were proposed in this pull request?

With #31133 Avro schema evolution was introduced for partitioned Hive tables where the schema is given by `avro.schema.literal`. Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.

### Why are the changes needed?

Without this PR the problem described in #31133 can be reproduced with tables where `avro.schema.url` is used, as in that case the property value given at the partition level is always used for `avro.schema.url`. So, for example, when a new column (with a default value) is added to the table, one of the following problems happens:

- when the new field is added after the last one, the cell values will be null instead of the default value
- when the schema is extended somewhere before the last field, values will be listed in the wrong column positions

A similar error happens when one of the fields is removed from the schema. For details please check the attached unit tests, where both cases are covered.

### Does this PR introduce _any_ user-facing change?

Fixes the potential value error.

### How was this patch tested?

The existing unit tests for schema evolution are generalized and reused. New tests:

- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`

Closes #31501 from attilapiros/SPARK-34370.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Before this PR, for a partitioned Avro Hive table, when the SerDe was configured to read the partition data,
the table-level properties were overwritten by the partition-level properties.
This PR reverses that ordering by giving table-level properties higher precedence, so when a new, evolved schema
is set for the table, this new schema is used to read the partition data instead of the original schema that was used for writing the data.
This new behavior is consistent with Apache Hive.
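The precedence change can be illustrated with plain Scala maps (a hypothetical, dependency-free sketch; the actual property merging happens in `TableReader.scala`, and with `Map#++` values from the right-hand operand win on key collisions):

```scala
// Hypothetical sketch of SerDe property precedence (not the real Spark code).
object PropPrecedence extends App {
  // Partition was written with the old (narrower) schema.
  val partitionProps = Map(
    "avro.schema.literal" ->
      """{"type":"record","name":"t","fields":[{"name":"a","type":"int"}]}""")
  // Table now carries the evolved schema with a defaulted new field.
  val tableProps = Map(
    "avro.schema.literal" ->
      """{"type":"record","name":"t","fields":[{"name":"a","type":"int"},{"name":"b","type":"string","default":"x"}]}""")

  // Old behavior: partition-level value overrides the table-level one.
  val oldMerged = tableProps ++ partitionProps
  assert(oldMerged("avro.schema.literal") == partitionProps("avro.schema.literal"))

  // New behavior: table-level properties take precedence.
  val newMerged = partitionProps ++ tableProps
  assert(newMerged("avro.schema.literal") == tableProps("avro.schema.literal"))

  println("table-level schema wins under the new ordering")
}
```

The schema strings and object name here are illustrative only; the point is the direction of the `++` merge.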
See the example used in the unit test SPARK-26836: support Avro schema evolution; in Hive this results in:

Why are the changes needed?
Without this change the old schema would be used. This can cause a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In that case the rows coming from a partition written with the old schema will contain values in the wrong column positions.
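The misalignment can be demonstrated with a small, self-contained Scala sketch (hypothetical column names; not Spark or Avro library code). A purely positional read shifts the old values under the wrong evolved columns, while name-based resolution with field defaults, which is what Avro schema evolution prescribes, places them correctly:

```scala
// Hypothetical sketch of the column-position bug.
object PositionBug extends App {
  val writtenRow  = Seq("1", "alice")          // row written with old schema: (id, name)
  val evolvedCols = Seq("id", "city", "name")  // evolved schema inserts "city" before "name"
  val oldCols     = Seq("id", "name")
  val defaults    = Map("city" -> "unknown")   // default value of the new field

  // Buggy behavior: purely positional mapping of old values onto new columns.
  val naive = evolvedCols.zip(writtenRow.padTo(evolvedCols.size, "null")).toMap
  assert(naive("city") == "alice")             // "alice" lands under the wrong column

  // Correct behavior: resolve by field name, falling back to the default.
  val resolved = evolvedCols.map { c =>
    c -> (oldCols.indexOf(c) match {
      case -1 => defaults(c)
      case i  => writtenRow(i)
    })
  }.toMap
  assert(resolved("name") == "alice" && resolved("city") == "unknown")

  println("resolved by name: city=" + resolved("city"))
}
```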
For example, check the attached unit test
SPARK-26836: support Avro schema evolution. Without this fix the result of the select on the table would be:
With this fix:
Does this PR introduce any user-facing change?
Just fixes the value errors.
When a new column is introduced, even at the last position, the given default will be used instead of null.
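The append-at-the-end case can be sketched the same way (hypothetical column names and default value, not the actual Spark code path): with name-based resolution the defaulted value fills the new trailing column, where the broken behavior produced null.

```scala
// Hypothetical sketch: a field with a default appended at the end of the schema.
object DefaultAtEnd extends App {
  val writtenRow  = Seq("1", "alice")          // row written with old schema: (id, name)
  val evolvedCols = Seq("id", "name", "city")  // "city" appended with default "unknown"
  val oldCols     = Seq("id", "name")
  val defaults    = Map("city" -> "unknown")

  val row = evolvedCols.map { c =>
    oldCols.indexOf(c) match {
      case -1 => defaults(c)   // with the fix: the schema default is used
      case i  => writtenRow(i) // before the fix this cell would have been null
    }
  }
  assert(row == Seq("1", "alice", "unknown"))

  println(row.mkString(","))
}
```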
How was this patch tested?
This was tested with the unit tests included in the PR,
and manually on Apache Spark / Hive.