PARQUET-1827: UUID type currently not supported by parquet-mr #778

Merged
merged 2 commits into apache:master on Jun 4, 2020

Conversation

gszadovszky (Contributor):

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@@ -861,6 +871,36 @@ PrimitiveStringifier valueStringifier(PrimitiveType primitiveType) {
}
}

public static class UUIDLogicalTypeAnnotation extends LogicalTypeAnnotation {
Contributor:

Why is there no implementation for hashCode() and equals()?

Contributor Author:

As this is a singleton, the default implementations of equals(Object) and hashCode() fit perfectly.
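
(For illustration only: a minimal sketch of the singleton pattern being referred to, with hypothetical names rather than the actual parquet-mr class. With exactly one shared instance, Object's reference-based equals() and hashCode() are already correct.)

```java
// Hypothetical, simplified singleton annotation; not the real parquet-mr class.
final class UuidAnnotationSketch {
  private static final UuidAnnotationSketch INSTANCE = new UuidAnnotationSketch();

  private UuidAnnotationSketch() {
    // Private constructor: no second instance can ever be created.
  }

  static UuidAnnotationSketch uuidType() {
    return INSTANCE;
  }

  // No equals()/hashCode() overrides needed: with a single instance,
  // reference equality and the identity hash code are exactly right.
}
```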

Contributor:

Sounds good!

private final char[] digit = "0123456789abcdef".toCharArray();
@Override
public String stringify(Binary value) {
byte[] bytes = value.getBytesUnsafe();
Contributor:

Do you want to make 'bytes' final since you are calling getBytesUnsafe()?

Contributor Author:

final would only protect the reference, not the values of the array. Making a local reference final is usually only required when it is accessed from, e.g., a lambda closure.
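
(A plain-Java illustration of the point above, not part of the PR: final fixes the reference, not the array contents, and is mainly relevant when a local is captured by a closure.)

```java
public class FinalReferenceDemo {
  public static void main(String[] args) {
    final byte[] bytes = {1, 2, 3};
    bytes[0] = 42;           // allowed: final protects the reference, not the array contents
    // bytes = new byte[3];  // would not compile: the reference itself cannot be reassigned

    // Capturing the local in a lambda works because it is (effectively) final.
    Runnable r = () -> System.out.println(bytes[0]); // prints 42
    r.run();
  }
}
```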

Contributor:

Sounds good!

assertEquals("ffffffff-ffff-ffff-ffff-ffffffffffff", stringifier.stringify(
toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff)));

assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify(
Contributor:

Were the 3 test values chosen randomly? It seems like duplicate coverage.

Could you add some negative tests, like incorrect-length UUIDs, invalid characters, etc.?

Contributor Author:

The idea is to have 3 kind-of corner cases and 3 common (random but constant) values. What do you mean by duplicate coverage? (I think we do not need exhaustive testing for the stringifiers because they are only used by our tools for debugging purposes.)

The stringifiers do not validate the data they get, for performance reasons. So, if the array is longer than 16 bytes, it would simply stringify the first 16 and skip the others. If the length is too short, an ArrayIndexOutOfBoundsException would be thrown. Do you think we should test these cases? They would not reach any additional branches in the Parquet code.
Invalid characters are not possible: the full set of values of the 16-byte array is covered by UUID.
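
(An illustration of that last point using plain java.util.UUID rather than the PR's stringifier: every possible 16-byte value maps to some valid UUID string, so there is no "invalid character" case to construct.)

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytesDemo {
  public static void main(String[] args) {
    // Any 16 bytes whatsoever form a representable UUID.
    byte[] bytes = {(byte) 0xff, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14};
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    UUID uuid = new UUID(bb.getLong(), bb.getLong());
    System.out.println(uuid); // ff000102-0304-0506-0708-090a0b0c0d0e
  }
}
```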

Contributor:

By duplicate coverage, I meant that the #2 and #3 tests seem to repeat the same test as #1. The values are different, but when the tests execute, they follow the same code path, so I think they won't provide extra coverage.

From a testing perspective, negative tests do provide value. In this case, we can test that the exception is thrown as expected if the input is too short.

For "if the array is longer than 16 it would simply stringify the first 16 and skip the others", that could cause silent errors, right?

Contributor Author:

I'm happy to add tests for edge cases like too-short or too-long inputs. I would not implement additional validation, though, because of performance concerns. A stringify method is invoked on each value; an additional check would significantly impact performance, even though it is only used from the tools and not really in production. A Stringifier is associated with the value at the schema level, which means it should never happen that the value is invalid. That's why the Stringifier implementations do not validate the values.
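
(A sketch of the kind of negative test being discussed, assuming it lives in the same test class as the assertions above, so that stringifier and toBinary refer to the helpers already used there.)

```java
@Test
public void testStringifyTooShortInput() {
  try {
    // Only 4 bytes instead of the 16 a UUID requires; the stringifier does not
    // validate its input, so it is expected to run off the end of the array.
    stringifier.stringify(toBinary(0x0e, 0xb1, 0x49, 0x7c));
    fail("Expected an ArrayIndexOutOfBoundsException for a too-short input");
  } catch (ArrayIndexOutOfBoundsException e) {
    // Expected: no validation is performed for performance reasons.
  }
}
```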

@@ -0,0 +1,44 @@
<!--
Contributor:

Should we add it under a separate Jira?

Contributor Author:

Usually we separate Jiras in similar cases so that cherry-picking is not hard if the related change needs to go to another branch as well. In this case it is only documentation, so it should not cause any trouble. It would be cleaner if this documentation had already existed and I only had to add the docs for the new keys (which would clearly be part of this change).
If you have a strong opinion about separating this into another change, I'm happy to do so, though.

Contributor:

Sounds good


testAvroToParquetConversion(fromAvro, parquet);
testParquetToAvroConversion(toAvro, parquet);
}
Contributor:

Can we have checkReaderWriterCompatibility() to verify that the Parquet and Avro schemas are compatible for UUID?
There are issues like PARQUET-1681 with Avro schema and Parquet schema conversion for other types.

Contributor Author:

To be honest, I am not too familiar with parquet-avro. I've made the changes based on the implementation/tests of other logical types. Could you explain in more detail what exactly you would test?

@shangxinli (Contributor), May 10, 2020:

Basically, we found that some Avro schemas are not compatible with the Parquet schema they are converted to. This caused a problem where the data cannot be read. I have a test here (shangxinli@f80469f#diff-536ca67880a7870cf8df8f95143bd7d7R814) that reproduces the issue for a nested schema. I know the UUID type likely won't have this issue, but it is better to have a test for it. It is also pretty easy to add.

Contributor:

If it is too much effort to do this, it is OK not to do it. It is a lower priority.

Contributor Author:

I'll look into this; I just did not have time to work on this PR. Thanks a lot for reviewing. :)

Contributor Author:

The testRoundTripConversion I'm using in testUUIDTypeWithParquetUUID is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and resulting Avro schemas, not only for compatibility. For testUUIDType, though, it is a good idea to check the compatibility of the Avro schemas.
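
(For readers unfamiliar with the check being discussed, a rough sketch of what such a compatibility assertion could look like, using Avro's SchemaCompatibility API together with parquet-avro's AvroSchemaConverter. The uuid-annotated Avro schema is a made-up example, and the converter may need extra configuration before it emits the Parquet UUID type.)

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class UuidCompatibilitySketch {
  public static void main(String[] args) {
    // Hypothetical record with a string field carrying the uuid logical type.
    Schema original = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"id\",\"type\":{\"type\":\"string\",\"logicalType\":\"uuid\"}}]}");

    // Convert Avro -> Parquet -> Avro, then check that the round-tripped schema
    // can still read data written with the original schema.
    AvroSchemaConverter converter = new AvroSchemaConverter();
    MessageType parquet = converter.convert(original);
    Schema roundTripped = converter.convert(parquet);

    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(roundTripped, original);
    if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
      throw new AssertionError("Round-tripped schema cannot read the original data");
    }
  }
}
```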

Contributor:

Sounds good

@gszadovszky merged commit 84c954d into apache:master on Jun 4, 2020