Persist proper logical type parameters in parquet files #23388
Conversation
ZacBlanco
left a comment
Please also add a unit test which verifies the behavior for basic timestamp types and for types with nested timestamps.
List<Types.NestedField> updatedFields = icebergSchema.columns().stream()
        .map(field -> {
            if (field.type() instanceof Types.TimestampType) {
                return Types.NestedField.of(
                        field.fieldId(),
                        field.isOptional(),
                        field.name(),
                        Types.fromPrimitiveString(Types.TimestampType.withoutZone().toString()));
            }

            if (field.type().isNestedType()) {
                return field;
            }

            return field;
        })
        .collect(Collectors.toList());
icebergSchema = new Schema(updatedFields);
As far as I can see, the icebergSchema has no need to be translated here. It seems the isAdjustedToUTC information is lost when flushing the Parquet message type to the parquet file's footer, and again when reading the message type back from the footer metadata.
Referring to MessageTypeConverter.toParquetSchema() and MetadataReader.readTypeSchema().
I see your point. I think this issue ultimately boils down to the Apache OriginalType class missing any parameter information, with the conversion methods basically defaulting isAdjustedToUTC. The PrimitiveType class we use in MessageTypeConverter has a logical type annotation, but there is no way to build a LogicalType directly from it.
Within the Apache library there is an interface, LogicalTypeAnnotationVisitor, and its implementation LogicalTypeConverterVisitor does exactly what we need here; however, it is private to ParquetMetadataConverter.
That implementation can only be reached through toParquetMetadata, which I'm not sure why we aren't currently making use of. Instead of setting the schema of our file metadata through a custom method that currently doesn't work for logical types, we could convert the entire FileMetaData object using the method Apache already provides.
The other solution would be to copy the implementation of LogicalTypeConverterVisitor into our message type converter so we can get types from annotations.
TL;DR: Apache doesn't expose a method that lets us keep going with our custom converter implementation; we can either switch to their metadata converter or copy some of their protected code.
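For reference, a minimal sketch of the toParquetMetadata approach, assuming the written messageType and the thrift fileMetaData are in scope as they are in the writer code, and that Guava's ImmutableMap/ImmutableList and parquet-mr's ParquetMetadata are on the classpath:

ParquetMetadataConverter metadataConverter = new ParquetMetadataConverter();
// Wrap the schema we just wrote in a hadoop-side FileMetaData and let Apache's converter
// build the thrift FileMetaData; its SchemaElements carry the full LogicalType,
// including isAdjustedToUTC, which the custom converter was dropping.
org.apache.parquet.hadoop.metadata.FileMetaData hadoopMetadata = new org.apache.parquet.hadoop.metadata.FileMetaData(
        messageType,
        ImmutableMap.of(),
        fileMetaData.getCreated_by());
FileMetaData converted = metadataConverter.toParquetMetadata(1, new ParquetMetadata(hadoopMetadata, ImmutableList.of()));
// Reuse only the converted schema; the rest of our thrift metadata stays as-is.
fileMetaData.setSchema(converted.getSchema());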
Yes, you are right. Our own implementation omitted the logical type annotation. Besides the approaches you mentioned above, which may introduce a lot of change, another shortcut would be to special-case primitive types carrying a TimestampLogicalTypeAnnotation in TypeVisitor.visit(primitiveType) and set the logical type on the resulting SchemaElement, then do the opposite in MetadataReader.readTypeSchema(...). This is not a universal approach, but it is feasible.
I tried coding it up this way on a separate branch...
public void visit(PrimitiveType primitiveType)
{
    SchemaElement element = new SchemaElement(primitiveType.getName());
    element.setRepetition_type(toParquetRepetition(primitiveType.getRepetition()));
    element.setType(getType(primitiveType.getPrimitiveTypeName()));
    if (primitiveType.getOriginalType() != null) {
        element.setConverted_type(getConvertedType(primitiveType.getOriginalType()));
        LogicalTypeAnnotation annotation = primitiveType.getLogicalTypeAnnotation();
        if (annotation instanceof LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) {
            // Default to MICROS, then override for NANOS/MILLIS based on the annotation's unit.
            TimeUnit unit = TimeUnit.MICROS(new MicroSeconds());
            switch (((LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) annotation).getUnit()) {
                case NANOS:
                    unit = TimeUnit.NANOS(new NanoSeconds());
                    break;
                case MILLIS:
                    unit = TimeUnit.MILLIS(new MilliSeconds());
                    break;
            }
            // Carry isAdjustedToUTC and the time unit into the thrift SchemaElement.
            element.setLogicalType(LogicalType.TIMESTAMP(
                    new TimestampType(((LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) annotation).isAdjustedToUTC(), unit)));
        }
    }
}
This could obviously be separated into another method if need be.
I think this will work, but wouldn't it be better to use the method Apache already provides? What if we run into a similar problem with another type in the future? As far as I can tell, we are maintaining a custom type converter that doesn't add any functionality beyond what the provided one offers.
"wouldn't it be better to use the method Apache already provides?"
Of course, that would be better.
        .setRequiredConfigurationProperties(ImmutableMap.of())
        .initialize();

this.icebergFileWriterFactory = injector.getInstance(IcebergFileWriterFactory.class);
I've been trying to figure out the best way to instantiate this class in the tests here but it has been a very convoluted process. Right now I have this injector implementation but it is still giving me some explicit bindings error for my hive metastore. Does anybody else know if there is a better way to go about this?
Maybe we can build IcebergFileWriterFactory directly. I think the following code could work; give it a try:
Session session = getQueryRunner().getDefaultSession();
ConnectorSession connectorSession = session.toConnectorSession(new ConnectorId("iceberg"));
TypeManager typeManager = new TypeManager()
{
    @Override
    public Type getType(TypeSignature signature)
    {
        for (Type type : getTypes()) {
            if (signature.getBase().equals(type.getTypeSignature().getBase())) {
                return type;
            }
        }
        return null;
    }

    @Override
    public Type getParameterizedType(String baseTypeName, List<TypeSignatureParameter> typeParameters)
    {
        return getType(new TypeSignature(baseTypeName, typeParameters));
    }

    @Override
    public boolean canCoerce(Type actualType, Type expectedType)
    {
        throw new UnsupportedOperationException();
    }

    private List<Type> getTypes()
    {
        return ImmutableList.of(BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, VARBINARY, TIMESTAMP, DATE, HYPER_LOG_LOG);
    }
};
HdfsContext hdfsContext = new HdfsContext(connectorSession);
HdfsEnvironment hdfsEnvironment = IcebergDistributedTestBase.getHdfsEnvironment();
IcebergFileWriterFactory icebergFileWriterFactory = new IcebergFileWriterFactory(hdfsEnvironment, typeManager,
        new FileFormatDataSourceStats(), new NodeVersion("test"), new OrcFileWriterConfig(), HiveDwrfEncryptionProvider.NO_ENCRYPTION);
Path path = new Path(createTempDir().getAbsolutePath());
Schema schema = toIcebergSchema(ImmutableList.of(new ColumnMetadata("a", VARCHAR), new ColumnMetadata("b", INTEGER), new ColumnMetadata("c", TIMESTAMP)));
IcebergFileWriter icebergFileWriter = icebergFileWriterFactory.createFileWriter(path, schema, new JobConf(), connectorSession,
        hdfsContext, FileFormat.PARQUET, MetricsConfig.getDefault());
List<Page> input = rowPagesBuilder(VARCHAR, BIGINT, TIMESTAMP)
        .addSequencePage(100, 0, 0, 123)
        .addSequencePage(100, 100, 100, 223)
        .addSequencePage(100, 200, 200, 323)
        .build();
for (Page page : input) {
    icebergFileWriter.appendRows(page);
}
icebergFileWriter.commit();
Thanks! I was able to add this to the test. It is currently failing, but I'm pretty sure that's because the reader isn't fixed yet; I'll work on that this afternoon.
TimestampType timestamp = type.getTIMESTAMP();
typeBuilder.as(LogicalTypeAnnotation.timestampType(timestamp.isAdjustedToUTC, convertTimeUnit(timestamp.unit)));
}
}
I only had to include these four because they are the ones whose logical types carry parameters that need to persist through the conversion process. This is an adaptation of the getLogicalTypeAnnotation method from the Parquet metadata converter; the actual method does not work here since its switch statement operates on the wrong type, or at least a type we can't use directly.
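For context, a sketch of how the parameterized cases might look on the reader side, adapted from the hunk above. The thrift type union, typeBuilder, and the convertTimeUnit helper are assumed from the surrounding MetadataReader code, and the choice of these four cases (TIMESTAMP, TIME, DECIMAL, INTEGER) is an assumption based on the description:

if (type.isSetTIMESTAMP()) {
    // Preserve isAdjustedToUTC and the time unit instead of defaulting them.
    TimestampType timestamp = type.getTIMESTAMP();
    typeBuilder.as(LogicalTypeAnnotation.timestampType(timestamp.isAdjustedToUTC, convertTimeUnit(timestamp.unit)));
}
else if (type.isSetTIME()) {
    TimeType time = type.getTIME();
    typeBuilder.as(LogicalTypeAnnotation.timeType(time.isAdjustedToUTC, convertTimeUnit(time.unit)));
}
else if (type.isSetDECIMAL()) {
    // Scale and precision are the parameters that would otherwise be lost.
    DecimalType decimal = type.getDECIMAL();
    typeBuilder.as(LogicalTypeAnnotation.decimalType(decimal.scale, decimal.precision));
}
else if (type.isSetINTEGER()) {
    // Bit width and signedness for the INT(8/16/32/64, signed/unsigned) annotations.
    IntType integer = type.getINTEGER();
    typeBuilder.as(LogicalTypeAnnotation.intType(integer.bitWidth, integer.isSigned));
}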
Can you explain this a bit more? I don't understand why we need this corresponding reader change.
This is because we were losing the logical type information when reading the parquet schema; the test will not pass without this logic in place. It's fine if certain databases/formats don't need the logical type info for storage, but within the parquet module it is important to at least read it so that choice can be made within the other connectors.
Do you feel that there are sufficient tests in place to ensure that the type conversion logic works with the Parquet built in converter?
Which converter are you referring to? I'm not sure what tests Apache has in place for their libraries...
Let's work on some of those tests, the scope of the change makes me nervous that the existing unit test coverage isn't sufficient.
I don't think we should special-case for only certain types when reading logical types. If a logical type is set, we should call a method which does the logical type conversion, and then sets the logical type in the descriptor. If the logical type isn't known during conversion, then throw an exception.
See here for an example: https://github.com/trinodb/trino/blob/0d5576a67a855bf95123acaf3ea836d1963861ef/lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java#L221-L227
I don't think this is necessary; the logical type annotations are already of the proper type. I just added the date type, which isn't included in the logic here, to testWriteParquetFileWithLogicalTypes, and it still passes. The only purpose of this logic is to make sure the parameters are correct, so it only needs to check the logical types that take parameters. Adding conditions for every logical type would be overkill, and confusing, since most of them wouldn't add any functionality.
That being said, I can extract it into a method named something like getLogicalTypeAnnotationWithParameters to make the intent clearer.
For completeness' sake, I do think that we should exhaustively convert all logical types so we don't have to touch this code afterwards. I agree it doesn't actually make much difference except for parameterized types or other logical types which are stored in the same physical format (e.g. long decimal and UUID). I would prefer to do the exhaustive conversion.
On another note, is this the error you encountered with the other approach?
java.lang.NoSuchMethodError: org.apache.parquet.format.LogicalType.getSetField()Lcom/facebook/presto/hive/$internal/parquet/org/apache/thrift/TFieldIdEnum;
at org.apache.parquet.format.converter.ParquetMetadataConverter.getLogicalTypeAnnotation(ParquetMetadataConverter.java:1084)
at org.apache.parquet.format.converter.ParquetMetadataConverter.buildChildren(ParquetMetadataConverter.java:1715)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetSchema(ParquetMetadataConverter.java:1670)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1526)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1490)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:591)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:654)
at org.apache.iceberg.parquet.ParquetUtil.fileMetrics(ParquetUtil.java:80)
at org.apache.iceberg.parquet.ParquetUtil.fileMetrics(ParquetUtil.java:75)
at com.facebook.presto.iceberg.IcebergParquetFileWriter.lambda$getMetrics$0(IcebergParquetFileWriter.java:77)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
Yes, I believe that was the error I got; the getSetField method does not exist.
I can update it to an exhaustive list.
Suggest revising the release note entry for consistency with the Release Notes Guidelines. Maybe something like this:
List<String> fileColumnNames = icebergSchema.columns().stream()
        .map(Types.NestedField::name)
        .collect(toImmutableList());
Please revert this whitespace change
Session session = getQueryRunner().getDefaultSession();
ConnectorId connectorId = new ConnectorId("iceberg");
this.connectorSession = session.toConnectorSession(connectorId);
TypeManager typeManager = new TypeManager()
Please move this into a separate class, like TestingTypeManager.
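For illustration, a sketch of what that separate class could look like, mirroring the anonymous TypeManager from the snippet earlier in this thread (the class name and the same static type-constant imports are assumed):

public class TestingTypeManager
        implements TypeManager
{
    @Override
    public Type getType(TypeSignature signature)
    {
        // Resolve by base name against a small fixed set of types used in the test.
        for (Type type : getTypes()) {
            if (signature.getBase().equals(type.getTypeSignature().getBase())) {
                return type;
            }
        }
        return null;
    }

    @Override
    public Type getParameterizedType(String baseTypeName, List<TypeSignatureParameter> typeParameters)
    {
        return getType(new TypeSignature(baseTypeName, typeParameters));
    }

    @Override
    public boolean canCoerce(Type actualType, Type expectedType)
    {
        throw new UnsupportedOperationException();
    }

    private static List<Type> getTypes()
    {
        return ImmutableList.of(BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, VARBINARY, TIMESTAMP, DATE, HYPER_LOG_LOG);
    }
}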
    MessageType originalSchema = convert(icebergSchema, "table");
    assertEquals(originalSchema, writtenSchema);
}
catch (IOException e) {
Just have the test declare throws Exception.
import static org.testng.Assert.fail;

public class TestIcebergFileWriter
        extends AbstractTestQueryFramework
Does this need a QueryRunner? I can't tell where we actually run a full query.
We aren't running a full query anywhere in the test, but the query runner is important for getting a valid connector session. There are a lot of variables that go into creating a connector session that actually works here. Maybe we don't need to extend the abstract class and can just instantiate the query runner in the @BeforeClass method.
Could we use TestingSession to help? Here's an example from the codebase:
public static final Session TEST_SESSION = testSessionBuilder()
        .setCatalog("tpch")
        .setSchema(TINY_SCHEMA_NAME)
        .build();
I tried that, and it causes the test to fail due to missing session properties (I used "iceberg" as the catalog). There is a method to add properties; I can look into how to manually build a good test session for this use case, but so far adding the same properties I set in the query runner's constructor hasn't worked.
This should work, let me know what failed and we can look into it
        fileMetaData.getCreated_by());

ParquetMetadataConverter metadataConverter = new ParquetMetadataConverter();
FileMetaData parquetMetaData = metadataConverter.toParquetMetadata(1, new ParquetMetadata(parquetMetaDataInput, ImmutableList.of()));
I believe this method is stateless. If so, can you make metadataConverter a constant?
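Something along these lines, assuming the converter really is stateless here (the constant name is illustrative):

// Hypothetical constant replacing the per-call instantiation in the hunk above.
private static final ParquetMetadataConverter PARQUET_METADATA_CONVERTER = new ParquetMetadataConverter();

// Call site, with parquetMetaDataInput as defined in the hunk above:
FileMetaData parquetMetaData = PARQUET_METADATA_CONVERTER.toParquetMetadata(1, new ParquetMetadata(parquetMetaDataInput, ImmutableList.of()));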
org.apache.parquet.hadoop.metadata.FileMetaData parquetMetaDataInput = new org.apache.parquet.hadoop.metadata.FileMetaData(
        messageType,
        ImmutableMap.of(),
        fileMetaData.getCreated_by());

ParquetMetadataConverter metadataConverter = new ParquetMetadataConverter();
FileMetaData parquetMetaData = metadataConverter.toParquetMetadata(1, new ParquetMetadata(parquetMetaDataInput, ImmutableList.of()));

fileMetaData.setSchema(parquetMetaData.getSchema());
Can you extract a helper method to encapsulate the conversion logic to get the Parquet schema?
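For example, a sketch of such a helper built from the hunk above (the method name and signature are assumptions):

// Converts a parquet-mr MessageType into the thrift schema element list by round-tripping
// through Apache's ParquetMetadataConverter, so logical type parameters are preserved.
private static List<SchemaElement> toParquetSchema(MessageType messageType, String createdBy)
{
    org.apache.parquet.hadoop.metadata.FileMetaData hadoopMetadata = new org.apache.parquet.hadoop.metadata.FileMetaData(
            messageType,
            ImmutableMap.of(),
            createdBy);
    ParquetMetadataConverter metadataConverter = new ParquetMetadataConverter();
    FileMetaData parquetMetaData = metadataConverter.toParquetMetadata(1, new ParquetMetadata(hadoopMetadata, ImmutableList.of()));
    return parquetMetaData.getSchema();
}

// Call site:
fileMetaData.setSchema(toParquetSchema(messageType, fileMetaData.getCreated_by()));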
}

@Test
public void testMetadataCreation()
@auden-woolfson can you help me understand how these tests run before and after your main code changes?
tdcmeehan
left a comment
LGTM % nit about visibility of LongDecimalType
import static io.airlift.slice.SizeOf.SIZE_OF_LONG;

-final class LongDecimalType
+public final class LongDecimalType
It's not anymore; I initially added it so I could check which decimal type we had in testMetadataCreation. Good catch.
Please squash commits and we'll be good to merge.
LGTM, but you'll need to squash the commits again. Glad we got those tests passing. You may want to look at the new pom changes @hantangwangd; these fixed the NoSuchMethodError the tests were throwing before.
hantangwangd
left a comment
Great, thanks for fixing this maven dependency issue @auden-woolfson. The whole change looks good to me.
Great, thanks for all the help and reviews @ZacBlanco @tdcmeehan @hantangwangd. Rebased and squashed.
The parquet writer, metadata reader, and schema converter were not properly maintaining logical type parameters when processing types. Explicit logic to do this had to be added to each of these modules.
Description
When reading/writing parquet files, logical type parameters are not being persisted properly (see issue #23361). This was noticed using the timestamp type with Iceberg, where isAdjustedToUTC was always set to true. The parquet reading and writing functionality was updated, and a test was added to the iceberg connector in order to fix this issue. Similar issues were noticed in the hive connector in #23454; these required some changes to the parquet schema converter.
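For illustration, one way the symptom can be checked against a file written by the connector (a sketch; the file path and the column name "c" are placeholders, and parquet-mr's footer reader is assumed to be available on the classpath):

// Read the footer of a written Parquet file and inspect the logical type of a
// timestamp-without-zone column; before this fix the footer reported isAdjustedToUTC=true,
// after the fix it should report false.
Configuration conf = new Configuration();
ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
LogicalTypeAnnotation annotation = footer.getFileMetaData()
        .getSchema()
        .getType("c")
        .getLogicalTypeAnnotation();
System.out.println(annotation);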