Skip to content

[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal"#31133

Closed
attilapiros wants to merge 10 commits intoapache:masterfrom
attilapiros:SPARK-26836
Closed

[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal"#31133
attilapiros wants to merge 10 commits intoapache:masterfrom
attilapiros:SPARK-26836

Conversation

@attilapiros
Copy link
Contributor

@attilapiros attilapiros commented Jan 11, 2021

What changes were proposed in this pull request?

Before this PR for a partitioned Avro Hive table when the SerDe is configured to read the partition data
the table level properties were overwritten by the partition level properties.

This PR changes this ordering by giving table level properties higher precedence thus when a new evolved schema
is set for the table this new schema will be used to read the partition data and not the original schema which was used for writing the data.

This new behavior is consistent with Apache Hive.
See the example used in the unit test SPARK-26836: support Avro schema evolution, in Hive this results in:

0: jdbc:hive2://<IP>:10000> select * from t;
INFO  : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO  : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO  : OK
+---------------+-------------+-------------+
|    t.col1     |   t.col2    |    t.ds     |
+---------------+-------------+-------------+
| col1_default  | col2_value  | 1981-01-07  |
| col1_value    | col2_value  | 1983-04-27  |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)

Why are the changes needed?

Without this change the old schema would be used. This can use a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partition where the old schema was used will contain values in wrong column positions.

For example check the attached unit test SPARK-26836: support Avro schema evolution

Without this fix the result of the select on the table would be:

+----------+----------+----------+
|      col1|      col2|        ds|
+----------+----------+----------+
|col2_value|      null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+

With this fix:

+------------+----------+----------+
|        col1|      col2|        ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
|  col1_value|col2_value|1983-04-27|
+------------+----------+----------+

Does this PR introduce any user-facing change?

Just fixes the value errors.
When a new column is introduced even to the last position then instead of 'null' the given default will be used.

How was this patch tested?

This was tested with the unit tested included to the PR.
And manually on Apache Spark / Hive.

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38519/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38519/

@github-actions github-actions bot added the SQL label Jan 11, 2021
@attilapiros
Copy link
Contributor Author

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Test build #133932 has finished for PR 31133 at commit a5f8d1a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38524/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38524/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Test build #133936 has finished for PR 31133 at commit a5f8d1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 12, 2021

cc @gengliangwang too FYI

@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38550/

@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38550/

@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Test build #133963 has finished for PR 31133 at commit 080c29b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int, reduceId: Int) extends BlockId

@attilapiros
Copy link
Contributor Author

@dongjoon-hyun I have addressed all of your comments, is there anything else I can help you with?

Copy link
Contributor

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @attilapiros, it's great to see this!

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Test build #134126 has started for PR 31133 at commit 232d8c0.

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38709/

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38709/

@SparkQA
Copy link

SparkQA commented Jan 16, 2021

Test build #134136 has finished for PR 31133 at commit 8dc2c08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Copy link
Contributor Author

cc @cloud-fan

@attilapiros
Copy link
Contributor Author

@dongjoon-hyun may I ask for another review from you?

@dongjoon-hyun
Copy link
Member

@xkrogen . That another instance explicitly uses AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(), not other property. My suggestion, it seems that we only need avro.schema.literal, was the same.

@dongjoon-hyun can you explain more fully why you don't want to depend on AvroTableProperties? We already have another reference to it in the Spark codebase within HiveShim.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 4, 2021

Thank you for updating, @attilapiros . But, I'm still not sure about the other properties in those namespace.
For avro.schema.literal, your test case provides the value. For the others, I'm negative if there is no useful test case in the Apache Spark community.

So, if you don't mind, may I ask again you narrow down this PR to avro.schema.literal first? For the other new configurations, please proceed separately if you want.

@attilapiros
Copy link
Contributor Author

@dongjoon-hyun this sound good to me.
Here I will use explicitly AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName() and for the rest I will open separate PRs with new tests proving they are useful.

@xkrogen
Copy link
Contributor

xkrogen commented Feb 4, 2021

Thanks for the clarification @dongjoon-hyun ! I understand your concern now.

New plan sounds good to me as well.

@attilapiros attilapiros changed the title [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.literal" Feb 4, 2021
@attilapiros attilapiros changed the title [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.literal" [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" Feb 4, 2021
@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39476/

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39476/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At data source layer, we have the following schema evolution test coverage. I guess we need at least (1) add column and (2) hide column test case in this PR. (1) is already included in this PR. So, we need (2).

 *   | File Format  | Coverage     | Note                                                   |
 *   | ------------ | ------------ | ------------------------------------------------------ |
 *   | TEXT         | N/A          | Schema consists of a single string column.             |
 *   | CSV          | 1, 2, 4      |                                                        |
 *   | JSON         | 1, 2, 3, 4   |                                                        |
 *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
 *   | PARQUET      | 1, 2, 3      |                                                        |
 *   | AVRO         | 1, 2, 3      |                                                        |

If you can add (3) Change a column position, it's the best.

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Test build #134889 has finished for PR 31133 at commit 6892f31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Test build #134893 has finished for PR 31133 at commit 28140f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39480/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39480/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Test build #134897 has finished for PR 31133 at commit 92abaa3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39502/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39502/

@SparkQA
Copy link

SparkQA commented Feb 5, 2021

Test build #134919 has finished for PR 31133 at commit 20dc167.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @attilapiros and all.

Merged to master.

cc @gengliangwang , too.

dongjoon-hyun pushed a commit that referenced this pull request Feb 7, 2021
… tables using "avro.schema.url"

### What changes were proposed in this pull request?

With #31133 Avro schema evolution is introduce for partitioned hive tables where the schema is given by `avro.schema.literal`.
Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.

### Why are the changes needed?

Without this PR the problem described in #31133 can be reproduced by tables where `avro.schema.url` is used. As in this case always the property value given at partition level is used for the `avro.schema.url`.

So for example when a new column (with a default value) is added to the table then one the following problem happens:
-  when the new field is added after the last one the cell values will be null values instead of the default value
-  when the schema is extended somewhere before the last field then values will be listed for the wrong column positions

Similar error will happen when one of the field is removed from the schema.

For details please check the attached unit tests where both cases are checked.

### Does this PR introduce _any_ user-facing change?

Fixes the potential value error.

### How was this patch tested?

The existing unit tests for schema evolution is generalized and reused.
New tests:
- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`

Closes #31501 from attilapiros/SPARK-34370.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants