[SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"#31501

Closed
attilapiros wants to merge 2 commits into apache:master from attilapiros:SPARK-34370

attilapiros (Contributor) commented Feb 6, 2021

What changes were proposed in this pull request?

#31133 introduced Avro schema evolution for partitioned Hive tables whose schema is given by avro.schema.literal.
This PR extends that functionality to tables whose schema is defined via avro.schema.url.
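
As a rough illustration of the idea (not the code in this PR), the sketch below merges table-level and partition-level properties so that partition values win in general, while the Avro schema properties are taken from the table level when they are set there. The function name `merge_serde_properties`, the schema file paths, and the extra property values are hypothetical and exist only for this example.

```python
# Hypothetical sketch of the property precedence idea; this is not Spark code.
AVRO_SCHEMA_PROPERTIES = ("avro.schema.literal", "avro.schema.url")


def merge_serde_properties(table_props, partition_props):
    """Merge table and partition properties before setting up a deserializer.

    Partition-level values override table-level ones, except for the Avro
    schema properties: if the table defines one of them, the table-level
    value is kept so old partitions are read with the current table schema.
    """
    merged = dict(table_props)
    merged.update(partition_props)
    for key in AVRO_SCHEMA_PROPERTIES:
        if key in table_props:
            merged[key] = table_props[key]
    return merged


# Assumed example values: the table was evolved to v2, the partition still
# carries the v1 schema URL it was created with.
table_props = {
    "avro.schema.url": "file:/schemas/v2.avsc",
    "serialization.format": "1",
}
partition_props = {
    "avro.schema.url": "file:/schemas/v1.avsc",
    "field.delim": ",",
}

merged = merge_serde_properties(table_props, partition_props)
print(merged["avro.schema.url"])
```

With these inputs the merged properties keep the table-level schema URL, while unrelated partition-level properties are still taken from the partition.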

Why are the changes needed?

Without this PR, the problem described in #31133 can still be reproduced with tables that use avro.schema.url, because in that case the avro.schema.url value given at the partition level is always used.

For example, when a new column (with a default value) is added to the table, one of the following problems occurs:

  • when the new field is added after the last one, its cells contain null instead of the default value
  • when the new field is added anywhere before the last field, values are listed under the wrong columns

A similar error occurs when a field is removed from the schema.

For details, please check the attached unit tests, where both cases are covered.
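
The symptoms above can be modelled with a small, self-contained sketch. It is a simplified model under stated assumptions, not Spark's or Avro's implementation: `read_positionally` pairs the values written with the old schema with the new schema's columns by position, while `resolve_by_name` matches fields by name and fills in defaults, which is the behaviour expected from Avro schema resolution. The field names, values, and both helper functions are made up for illustration.

```python
def read_positionally(values, reader_fields):
    """Pair stored values with reader columns by position (models the bug)."""
    names = [f["name"] for f in reader_fields]
    # Missing trailing values show up as None (null); extra values are dropped.
    padded = list(values) + [None] * (len(names) - len(values))
    return dict(zip(names, padded))


def resolve_by_name(values, writer_fields, reader_fields):
    """Match fields by name and use the reader's default for missing ones."""
    written = dict(zip((f["name"] for f in writer_fields), values))
    row = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            row[name] = written[name]
        elif "default" in field:
            row[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return row


# Assumed schemas: v1 is what the old partition was written with.
v1 = [{"name": "col1"}, {"name": "col2"}]
new_field = {"name": "col3", "default": "col3_default"}
v2_appended = [v1[0], v1[1], new_field]   # new field after the last one
v2_inserted = [v1[0], new_field, v1[1]]   # new field before the last one
old_row = ["a", "b"]

appended_wrong = read_positionally(old_row, v2_appended)
inserted_wrong = read_positionally(old_row, v2_inserted)
appended_ok = resolve_by_name(old_row, v1, v2_appended)
inserted_ok = resolve_by_name(old_row, v1, v2_inserted)

# Removing a field: the partition holds three values, the table reads two.
v3 = [{"name": "col1"}, {"name": "col2"}, {"name": "col3"}]
v3_removed = [{"name": "col1"}, {"name": "col3"}]
removed_wrong = read_positionally(["a", "b", "c"], v3_removed)
removed_ok = resolve_by_name(["a", "b", "c"], v3, v3_removed)
```

In this model the positional reads reproduce the null and shifted values described above, and the name-based reads return the default for the added field and the correct values after a field is removed.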

Does this PR introduce any user-facing change?

It fixes the potentially incorrect values described above (nulls instead of defaults, or values under the wrong columns).

How was this patch tested?

The existing unit tests for schema evolution are generalized and reused.
New tests:

  • SPARK-34370: support Avro schema evolution (add column with avro.schema.url)
  • SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)

@github-actions github-actions bot added the SQL label Feb 6, 2021
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34370] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url" [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url" Feb 6, 2021
dongjoon-hyun (Member) left a comment

+1, LGTM except two comments.

  1. Renaming avroSchemaEvolutionProperties to avroSchemaProperties
  2. Don't create resources/schemaEvolution.

SparkQA commented Feb 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39546/

SparkQA commented Feb 7, 2021

Test build #134962 has finished for PR 31501 at commit f29439c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Merged to master for Apache Spark 3.2.0.

SparkQA commented Feb 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39546/

SparkQA commented Feb 7, 2021

Test build #134965 has finished for PR 31501 at commit e3bd8e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
