[SPARK-25104][SQL] Avro: Validate user specified output schema #22094
gengliangwang wants to merge 2 commits into apache:master from
Conversation
Force-pushed from 065229c to c8e98b1.
Test build #94694 has finished for PR 22094 at commit
LGTM.
    case (FloatType, FLOAT) =>
      (getter, ordinal) => getter.getFloat(ordinal)
-   case DoubleType =>
+   case (DoubleType, DOUBLE) =>
Do we want to allow users to cast up from catalystType to avroType? For example, Catalyst float to Avro double. If so, this can be done in a different PR.
Personally I would like to keep it as simple as this PR proposes.
If data type casting is needed, users can always do it in the DataFrame before writing Avro files.
But if the casting is important, we can work on it.
Yeah, if someone feels it's important, let's do it in a different PR.
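As a minimal sketch of the strict matching this thread settles on (written in Java with hypothetical simplified enums, not Spark's actual Catalyst `DataType` or Avro's `Schema.Type` classes):

```java
// Hypothetical simplified stand-ins for Catalyst and Avro types; the real
// code pattern-matches on Spark's DataType and Avro's Schema.Type.
enum CatalystType { FLOAT_TYPE, DOUBLE_TYPE }
enum AvroType { FLOAT, DOUBLE }

class SchemaMatch {
    // Strict matching: FloatType pairs only with FLOAT. There is no
    // implicit up-cast such as FloatType -> DOUBLE; a user who wants a
    // double in the output casts the DataFrame column before writing.
    static boolean matches(CatalystType c, AvroType a) {
        switch (c) {
            case FLOAT_TYPE:  return a == AvroType.FLOAT;
            case DOUBLE_TYPE: return a == AvroType.DOUBLE;
            default:          return false;
        }
    }
}
```

Under this scheme `matches(FLOAT_TYPE, DOUBLE)` is false, which is exactly the "keep it simple, no casting" behavior the PR proposes.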
    (NullType, NULL),
    (BooleanType, BOOLEAN),
    (ByteType, INT),
    (IntegerType, INT),
Could you add `(ShortType, INT)`, too?
    (DoubleType, DOUBLE),
    (BinaryType, BYTES),
    (DateType, INT),
    (TimestampType, LONG)
If the intention is to be exhaustive, what about decimal types? And primitive to complex, and vice versa?
@dongjoon-hyun Thanks, I have updated the test.
    private def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
-     if (nullable) {
+     if (nullable && avroType.getType != NULL) {
This fixes a trivial bug when avroType is the NULL type.
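A minimal sketch of why the guard matters (Java, with a hypothetical one-enum model instead of Avro's real `Schema` class): wrapping an already-null schema in a nullable union would produce `["null", "null"]`, which Avro rejects because a union may not contain the same type twice.

```java
class NullableResolution {
    // Hypothetical stand-in for Avro's Schema.Type; UNION here represents
    // the two-branch union ["null", <original type>].
    enum Type { NULL, STRING, UNION }

    static Type resolveNullableType(Type avroType, boolean nullable) {
        // Wrap in a union with null only when the type is not already
        // NULL, avoiding the invalid union ["null", "null"].
        if (nullable && avroType != Type.NULL) {
            return Type.UNION;
        }
        return avroType;
    }
}
```

With the old condition (`nullable` alone), a nullable NULL column would have been wrapped and produced an invalid schema; with the guard it is returned unchanged.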
Test build #94720 has finished for PR 22094 at commit
Merged into master. Thanks.
What changes were proposed in this pull request?
With the code changes in #21847, Spark can write out Avro files according to a user-provided output schema.
To make this more robust and user-friendly, we should validate the Avro schema before tasks are launched.
We should also support writing the logical decimal type as BYTES (by default we write it as FIXED).
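As a sketch of the decimal-as-BYTES shape (the field layout follows the Avro specification's decimal logical type; the precision/scale values here are just an example), a Decimal(10, 2) column would map to an Avro schema like:

```json
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 10,
  "scale": 2
}
```

The FIXED alternative carries the same `logicalType`/`precision`/`scale` annotations but on a fixed-size type, which also requires a `name` and a `size`.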
How was this patch tested?
Unit tests.