-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1827: UUID type currently not supported by parquet-mr #778
Conversation
@@ -861,6 +871,36 @@ PrimitiveStringifier valueStringifier(PrimitiveType primitiveType) { | |||
} | |||
} | |||
|
|||
public static class UUIDLogicalTypeAnnotation extends LogicalTypeAnnotation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why there is no implementation for hasCode() and equal()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a singleton the default implementation of equals(Object)
and hashCode()
fits perfectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
private final char[] digit = "0123456789abcdef".toCharArray(); | ||
@Override | ||
public String stringify(Binary value) { | ||
byte[] bytes = value.getBytesUnsafe(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to make 'bytes' final since you are calling getBytesUnsafe()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
would protect the reference only and not the values of the array. Making a reference final in a local scope is usually required in situations where it is accessed from e.g. lambda closures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
assertEquals("ffffffff-ffff-ffff-ffff-ffffffffffff", stringifier.stringify( | ||
toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff))); | ||
|
||
assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the 3 test data are randomly chosen? It seems duplicate coverage.
Could you add some negative tests like incorrect length uuids, invalid characters etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to have 3 kind-of corner cases and 3 common (random but constant) values. What do you mean by duplicate coverage? (I think, we do not need exhaustive testing for the stringifiers because they only used by our tools for debugging purposes.)
The stringifiers do not validate the data they get because of performance reasons. So, if the array is longer than 16 it would simply stringify the first 16 and skip the others. In case of the length is too short then an ArrayIndexOutOfBoundsException
would be thrown. Do you think we should test these cases? They would not reach any additional branches in the parquet code.
Invalid characters are not possible. The full set of values of the 16 bytes array is covered in UUID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By duplicate coverage, I meant the #2, #3 tests seems repeating the same test as #1. The value is different, but when the test executes, they would execute the same code path. So I think they won't provide extra coverage.
From test perspective, negative test does provide values. In this case, we can test the exception is thrown as expected if it is too short.
For "if the array is longer than 16 it would simply stringify the first 16 and skip the others", that could cause silent errors, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to add tests for the edge cases like too short or too long inputs. Though, I would not implement additional validations because of performance issues. A stringify
method would be invoked on each values; an additional check would highly impact performance even if it is only used from the tools and not really in production. A Stringifier
is associated to the value at schema level which means it shall never happen that the value is invalid. That's why the Stringifier
implementations do not validate the values.
@@ -0,0 +1,44 @@ | |||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add it using separate jira?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we separate jiras in similar cases to not make cherry-picking hard in case of the related changed needs to be on another branch as well. In this case this is only a documentation so should not cause any troubles. It would be cleaner if this documentation would have already been existed and I've had to add the docs of the new keys only (which would clearly be part of this change).
If you have a strong opinion to separate this to another change I'm happy to do so, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
|
||
testAvroToParquetConversion(fromAvro, parquet); | ||
testParquetToAvroConversion(toAvro, parquet); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have checkReaderWriterCompatibility() to verify if the parquet and avro schema are compatible for UUID?
There are issues like PARQUET-1681 for avro schema and parquet schema conversion for other types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I am not too familiar with parquet-avro. I've made the changes based on the implementation/test of other logical types. Could you explain it in more details what you would test exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically we found some type of avro schema is not compatible with the parquet schema which it is converted to. This caused problem that the data cannot be read. I have a test here (shangxinli@f80469f#diff-536ca67880a7870cf8df8f95143bd7d7R814) that reproduce the issue for a nested schema. I know likely UUID type won't have this issue but it better to have a test for it. It is pretty easy to add also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is too much effort for doing this, it is OK not to do it. It is a lower priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this just did not have time to work on this PR. Thanks a lot for reviewing. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The testRoundTripConversion
I'm using in testUUIDTypeWithParquetUUID
is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and the result avro schemas (and not only for compatibility). For testUUIDType
, though it is a good idea to check the compatibility of the avro schemas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation