[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" by attilapiros · Pull Request #31133 · apache/spark

attilapiros · 2021-01-11T14:34:32Z

What changes were proposed in this pull request?

Before this PR for a partitioned Avro Hive table when the SerDe is configured to read the partition data
the table level properties were overwritten by the partition level properties.

This PR changes this ordering by giving table level properties higher precedence thus when a new evolved schema
is set for the table this new schema will be used to read the partition data and not the original schema which was used for writing the data.

This new behavior is consistent with Apache Hive.
See the example used in the unit test SPARK-26836: support Avro schema evolution, in Hive this results in:

0: jdbc:hive2://<IP>:10000> select * from t;
INFO  : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO  : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO  : OK
+---------------+-------------+-------------+
|    t.col1     |   t.col2    |    t.ds     |
+---------------+-------------+-------------+
| col1_default  | col2_value  | 1981-01-07  |
| col1_value    | col2_value  | 1983-04-27  |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)

Why are the changes needed?

Without this change the old schema would be used. This can use a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partition where the old schema was used will contain values in wrong column positions.

For example check the attached unit test SPARK-26836: support Avro schema evolution

Without this fix the result of the select on the table would be:

+----------+----------+----------+
|      col1|      col2|        ds|
+----------+----------+----------+
|col2_value|      null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+

With this fix:

+------------+----------+----------+
|        col1|      col2|        ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
|  col1_value|col2_value|1983-04-27|
+------------+----------+----------+

Does this PR introduce any user-facing change?

Just fixes the value errors.
When a new column is introduced even to the last position then instead of 'null' the given default will be used.

How was this patch tested?

This was tested with the unit tested included to the PR.
And manually on Apache Spark / Hive.

SparkQA · 2021-01-11T15:29:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38519/

SparkQA · 2021-01-11T16:02:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38519/

attilapiros · 2021-01-11T19:34:52Z

jenkins retest this please

SparkQA · 2021-01-11T20:16:07Z

Test build #133932 has finished for PR 31133 at commit a5f8d1a.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-11T20:22:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38524/

SparkQA · 2021-01-11T20:31:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38524/

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

SparkQA · 2021-01-11T23:01:07Z

Test build #133936 has finished for PR 31133 at commit a5f8d1a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-01-12T01:39:08Z

cc @gengliangwang too FYI

SparkQA · 2021-01-12T10:21:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38550/

SparkQA · 2021-01-12T10:55:32Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38550/

SparkQA · 2021-01-12T14:39:23Z

Test build #133963 has finished for PR 31133 at commit 080c29b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int, reduceId: Int) extends BlockId

attilapiros · 2021-01-15T14:23:51Z

@dongjoon-hyun I have addressed all of your comments, is there anything else I can help you with?

xkrogen

Good catch @attilapiros, it's great to see this!

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

SparkQA · 2021-01-15T21:53:16Z

Test build #134126 has started for PR 31133 at commit 232d8c0.

SparkQA · 2021-01-15T22:37:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38709/

SparkQA · 2021-01-15T23:10:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38709/

SparkQA · 2021-01-16T08:55:37Z

Test build #134136 has finished for PR 31133 at commit 8dc2c08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

attilapiros · 2021-01-22T14:30:45Z

cc @cloud-fan

attilapiros · 2021-02-02T08:43:00Z

@dongjoon-hyun may I ask for another review from you?

dongjoon-hyun · 2021-02-04T19:53:34Z

@xkrogen . That another instance explicitly uses AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(), not other property. My suggestion, it seems that we only need avro.schema.literal, was the same.

@dongjoon-hyun can you explain more fully why you don't want to depend on AvroTableProperties? We already have another reference to it in the Spark codebase within HiveShim.

dongjoon-hyun · 2021-02-04T19:57:46Z

Thank you for updating, @attilapiros . But, I'm still not sure about the other properties in those namespace.
For avro.schema.literal, your test case provides the value. For the others, I'm negative if there is no useful test case in the Apache Spark community.

So, if you don't mind, may I ask again you narrow down this PR to avro.schema.literal first? For the other new configurations, please proceed separately if you want.

attilapiros · 2021-02-04T20:02:49Z

@dongjoon-hyun this sound good to me.
Here I will use explicitly AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName() and for the rest I will open separate PRs with new tests proving they are useful.

xkrogen · 2021-02-04T20:05:58Z

Thanks for the clarification @dongjoon-hyun ! I understand your concern now.

New plan sounds good to me as well.

SparkQA · 2021-02-04T21:10:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39476/

SparkQA · 2021-02-04T21:14:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39476/

sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

dongjoon-hyun

At data source layer, we have the following schema evolution test coverage. I guess we need at least (1) add column and (2) hide column test case in this PR. (1) is already included in this PR. So, we need (2).

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

 *   | File Format  | Coverage     | Note                                                   |
 *   | ------------ | ------------ | ------------------------------------------------------ |
 *   | TEXT         | N/A          | Schema consists of a single string column.             |
 *   | CSV          | 1, 2, 4      |                                                        |
 *   | JSON         | 1, 2, 3, 4   |                                                        |
 *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
 *   | PARQUET      | 1, 2, 3      |                                                        |
 *   | AVRO         | 1, 2, 3      |                                                        |

If you can add (3) Change a column position, it's the best.

SparkQA · 2021-02-04T22:26:50Z

Test build #134889 has finished for PR 31133 at commit 6892f31.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-04T23:25:57Z

Test build #134893 has finished for PR 31133 at commit 28140f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-04T23:35:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39480/

SparkQA · 2021-02-05T01:06:38Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39480/

SparkQA · 2021-02-05T02:29:04Z

Test build #134897 has finished for PR 31133 at commit 92abaa3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-05T08:12:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39502/

SparkQA · 2021-02-05T08:43:09Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39502/

SparkQA · 2021-02-05T09:04:41Z

Test build #134919 has finished for PR 31133 at commit 20dc167.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @attilapiros and all.

Merged to master.

cc @gengliangwang , too.

… tables using "avro.schema.url" ### What changes were proposed in this pull request? With #31133 Avro schema evolution is introduce for partitioned hive tables where the schema is given by `avro.schema.literal`. Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`. ### Why are the changes needed? Without this PR the problem described in #31133 can be reproduced by tables where `avro.schema.url` is used. As in this case always the property value given at partition level is used for the `avro.schema.url`. So for example when a new column (with a default value) is added to the table then one the following problem happens: - when the new field is added after the last one the cell values will be null values instead of the default value - when the schema is extended somewhere before the last field then values will be listed for the wrong column positions Similar error will happen when one of the field is removed from the schema. For details please check the attached unit tests where both cases are checked. ### Does this PR introduce _any_ user-facing change? Fixes the potential value error. ### How was this patch tested? The existing unit tests for schema evolution is generalized and reused. New tests: - `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)` - `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)` Closes #31501 from attilapiros/SPARK-34370. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Initial commit.

a5f8d1a

github-actions bot added the SQL label Jan 11, 2021

dongjoon-hyun reviewed Jan 11, 2021

View reviewed changes

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Jan 11, 2021

View reviewed changes

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Jan 11, 2021

View reviewed changes

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Jan 11, 2021

View reviewed changes

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala Show resolved Hide resolved

attilapiros added 2 commits January 12, 2021 10:00

applying review comments

9cba254

Merge branch 'master' into SPARK-26836

080c29b

xkrogen suggested changes Jan 15, 2021

View reviewed changes

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala Outdated Show resolved Hide resolved

sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala Outdated Show resolved Hide resolved

simplify condition about avro properties

232d8c0

attilapiros added 2 commits January 16, 2021 07:21

remove spark.sql.hive.avroSchemaEvolution.enabled

009ecc0

fix indentation

8dc2c08

remove AvroTableProperties.*

6892f31

Narrowing down the scope to avro.schema.literal

28140f5

attilapiros changed the title ~~[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables~~ [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.literal" Feb 4, 2021

attilapiros changed the title ~~[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.literal"~~ [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" Feb 4, 2021