PARQUET-1827: UUID type currently not supported by parquet-mr #778

Merged
merged 2 commits into apache:master on Jun 4, 2020

Conversation

gszadovszky (Contributor):

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@@ -861,6 +871,36 @@ PrimitiveStringifier valueStringifier(PrimitiveType primitiveType) {
}
}

public static class UUIDLogicalTypeAnnotation extends LogicalTypeAnnotation {
Contributor:

Why is there no implementation for hashCode() and equals()?

Contributor Author:

As this is a singleton, the default implementations of equals(Object) and hashCode() fit perfectly.
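
(For illustration only: a minimal sketch of the singleton pattern being referred to, with hypothetical names rather than the actual parquet-mr class. With exactly one shared instance, Object's reference-based equals() and hashCode() are already correct.)

```java
// Hypothetical, simplified singleton annotation; not the real parquet-mr class.
final class UuidAnnotationSketch {
  private static final UuidAnnotationSketch INSTANCE = new UuidAnnotationSketch();

  private UuidAnnotationSketch() {
    // Private constructor: no second instance can ever be created.
  }

  static UuidAnnotationSketch uuidType() {
    return INSTANCE;
  }

  // No equals()/hashCode() overrides needed: with a single instance,
  // reference equality and the identity hash code are exactly right.
}
```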

Contributor:

Sounds good!

private final char[] digit = "0123456789abcdef".toCharArray();
@Override
public String stringify(Binary value) {
byte[] bytes = value.getBytesUnsafe();
Contributor:

Do you want to make 'bytes' final since you are calling getBytesUnsafe()?

Contributor Author:

final would only protect the reference, not the values of the array. Making a local reference final is usually only required when it is accessed from, e.g., a lambda closure.
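
(A plain-Java illustration of the point above, not part of the PR: final fixes the reference, not the array contents, and is mainly relevant when a local is captured by a closure.)

```java
public class FinalReferenceDemo {
  public static void main(String[] args) {
    final byte[] bytes = {1, 2, 3};
    bytes[0] = 42;           // allowed: final protects the reference, not the array contents
    // bytes = new byte[3];  // would not compile: the reference itself cannot be reassigned

    // Capturing the local in a lambda works because it is (effectively) final.
    Runnable r = () -> System.out.println(bytes[0]); // prints 42
    r.run();
  }
}
```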

Contributor:

Sounds good!

assertEquals("ffffffff-ffff-ffff-ffff-ffffffffffff", stringifier.stringify(
toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff)));

assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify(
Contributor:

Were the 3 test values chosen randomly? It seems like duplicate coverage.

Could you add some negative tests, like incorrect-length UUIDs, invalid characters, etc.?

Contributor Author:

The idea is to have 3 kind-of corner cases and 3 common (random but constant) values. What do you mean by duplicate coverage? (I think we do not need exhaustive testing for the stringifiers because they are only used by our tools for debugging purposes.)

The stringifiers do not validate the data they get, for performance reasons. So, if the array is longer than 16 bytes, it would simply stringify the first 16 and skip the others. If the length is too short, an ArrayIndexOutOfBoundsException would be thrown. Do you think we should test these cases? They would not reach any additional branches in the Parquet code.
Invalid characters are not possible: the full set of values of the 16-byte array is covered by UUID.
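
(An illustration of that last point using plain java.util.UUID rather than the PR's stringifier: every possible 16-byte value maps to some valid UUID string, so there is no "invalid character" case to construct.)

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytesDemo {
  public static void main(String[] args) {
    // Any 16 bytes whatsoever form a representable UUID.
    byte[] bytes = {(byte) 0xff, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14};
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    UUID uuid = new UUID(bb.getLong(), bb.getLong());
    System.out.println(uuid); // ff000102-0304-0506-0708-090a0b0c0d0e
  }
}
```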

Contributor:

By duplicate coverage, I meant that the #2 and #3 tests seem to repeat the same test as #1. The values are different, but when the tests execute, they follow the same code path, so I think they won't provide extra coverage.

From a testing perspective, negative tests do provide value. In this case, we can test that the exception is thrown as expected if the input is too short.

For "if the array is longer than 16 it would simply stringify the first 16 and skip the others", that could cause silent errors, right?

Contributor Author:

I'm happy to add tests for edge cases like too-short or too-long inputs. I would not implement additional validation, though, because of performance concerns. A stringify method is invoked on each value; an additional check would significantly impact performance, even though it is only used from the tools and not really in production. A Stringifier is associated with the value at the schema level, which means it should never happen that the value is invalid. That's why the Stringifier implementations do not validate the values.
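
(A sketch of the kind of negative test being discussed, assuming it lives in the same test class as the assertions above, so that stringifier and toBinary refer to the helpers already used there.)

```java
@Test
public void testStringifyTooShortInput() {
  try {
    // Only 4 bytes instead of the 16 a UUID requires; the stringifier does not
    // validate its input, so it is expected to run off the end of the array.
    stringifier.stringify(toBinary(0x0e, 0xb1, 0x49, 0x7c));
    fail("Expected an ArrayIndexOutOfBoundsException for a too-short input");
  } catch (ArrayIndexOutOfBoundsException e) {
    // Expected: no validation is performed for performance reasons.
  }
}
```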

@@ -0,0 +1,44 @@
<!--
Contributor:

Should we add it under a separate Jira?

Contributor Author:

Usually we separate Jiras in similar cases so that cherry-picking is not hard if the related change needs to go to another branch as well. In this case it is only documentation, so it should not cause any trouble. It would be cleaner if this documentation had already existed and I only had to add the docs for the new keys (which would clearly be part of this change).
If you have a strong opinion about separating this into another change, I'm happy to do so, though.

Contributor:

Sounds good


testAvroToParquetConversion(fromAvro, parquet);
testParquetToAvroConversion(toAvro, parquet);
}
Contributor:

Can we have checkReaderWriterCompatibility() to verify that the Parquet and Avro schemas are compatible for UUID?
There are issues like PARQUET-1681 with Avro schema and Parquet schema conversion for other types.

Contributor Author:

To be honest, I am not too familiar with parquet-avro. I've made the changes based on the implementation/tests of other logical types. Could you explain in more detail what exactly you would test?

@shangxinli (Contributor), May 10, 2020:

Basically, we found that some Avro schemas are not compatible with the Parquet schema they are converted to. This caused a problem where the data cannot be read. I have a test here (shangxinli@f80469f#diff-536ca67880a7870cf8df8f95143bd7d7R814) that reproduces the issue for a nested schema. I know the UUID type likely won't have this issue, but it is better to have a test for it. It is also pretty easy to add.

Contributor:

If it is too much effort to do this, it is OK not to do it. It is a lower priority.

Contributor Author:

I'll look into this; I just did not have time to work on this PR. Thanks a lot for reviewing. :)

Contributor Author:

The testRoundTripConversion I'm using in testUUIDTypeWithParquetUUID is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and resulting Avro schemas, not only for compatibility. For testUUIDType, though, it is a good idea to check the compatibility of the Avro schemas.
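
(For readers unfamiliar with the check being discussed, a rough sketch of what such a compatibility assertion could look like, using Avro's SchemaCompatibility API together with parquet-avro's AvroSchemaConverter. The uuid-annotated Avro schema is a made-up example, and the converter may need extra configuration before it emits the Parquet UUID type.)

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class UuidCompatibilitySketch {
  public static void main(String[] args) {
    // Hypothetical record with a string field carrying the uuid logical type.
    Schema original = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"id\",\"type\":{\"type\":\"string\",\"logicalType\":\"uuid\"}}]}");

    // Convert Avro -> Parquet -> Avro, then check that the round-tripped schema
    // can still read data written with the original schema.
    AvroSchemaConverter converter = new AvroSchemaConverter();
    MessageType parquet = converter.convert(original);
    Schema roundTripped = converter.convert(parquet);

    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(roundTripped, original);
    if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
      throw new AssertionError("Round-tripped schema cannot read the original data");
    }
  }
}
```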

Contributor:

Sounds good

@gszadovszky merged commit 84c954d into apache:master on Jun 4, 2020