Conversation

@openinx (Member) commented Jul 13, 2020

This PR refactors GenericOrcWriter:

  1. Generate the ORC writer using OrcSchemaWithTypeVisitor#visit, so that the common data type writers can be abstracted into a separate class named GenericOrcWriters.
  2. The future FlinkOrcWriter will share the common writers from GenericOrcWriters (see the sketch below).
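
A rough sketch of the intended shape (not this PR's exact code; the visitor callback signatures and the GenericOrcWriters factory names below are assumptions):

private static class WriteBuilder extends OrcSchemaWithTypeVisitor<OrcValueWriter<?>> {
  @Override
  public OrcValueWriter<?> record(Types.StructType iStruct, TypeDescription record,
                                  List<String> names, List<OrcValueWriter<?>> fields) {
    // the struct writer stays local to GenericOrcWriter; it composes the per-field writers
    return new RecordWriter(fields);
  }

  @Override
  public OrcValueWriter<?> primitive(Type.PrimitiveType iPrimitive, TypeDescription primitive) {
    // primitive writers are shared through GenericOrcWriters (factory method names assumed)
    switch (primitive.getCategory()) {
      case INT:
        return GenericOrcWriters.ints();
      case STRING:
        return GenericOrcWriters.strings();
      default:
        throw new IllegalArgumentException("Unhandled type: " + primitive);
    }
  }
}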

@rdblue requested a review from rdsr, July 13, 2020 17:35

public static OrcValueWriter<Record> buildWriter(TypeDescription fileSchema) {
return new GenericOrcWriter(fileSchema);
private final GenericOrcWriters.Converter converter;
Contributor

Any benefit of using GenericOrcWriters.Converter<?>?

Member Author

For me, it doesn't seem to make much difference with or without the <?>, but I can change it to keep symmetry, as we've discussed above.

}
}

@SuppressWarnings("unchecked")
Contributor

I guess this unchecked suppression can be removed once we parameterize the converter with a wildcard, e.g. converter -> converter<?>.

Contributor

+1

We should avoid using types without parameters.

@openinx (Member Author), Jul 21, 2020

Here I did not change it to OrcValueWriter<?>, because if we do, we need the following to write each child:

for (int c = 0; c < writers.size(); ++c) {
  OrcValueWriter<?> child = writers.get(c);
  child.write(row, value.get(c, child.getJavaClass()), output.cols[c]);
}

The value is a StructLike, and get in StructLike is <T> T get(int pos, Class<T> javaClass), while child.getJavaClass() returns a Class<?> captured from the wildcard, so it causes a compile error:

Incompatible types. Required capture of ? but 'get' was inferred to T: no instance(s) of type variable(s) exist so that capture of ? conforms to capture of ?
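
For reference, one common way to satisfy the compiler here is a small generic helper that captures the wildcard; this is only a sketch, assuming the write(rowId, data, vector) shape used in the snippet above:

// Hypothetical helper: the type parameter T ties child.getJavaClass() to child.write(),
// so value.get(pos, child.getJavaClass()) compiles without an unchecked cast.
private static <T> void writeChild(OrcValueWriter<T> child, int rowId, StructLike value,
                                   int pos, ColumnVector vector) {
  child.write(rowId, value.get(pos, child.getJavaClass()), vector);
}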

@openinx (Member Author) commented Jul 14, 2020

@rdsr @shardulm94 Mind taking another look? I've updated the pull request. Thanks.

private static <D> OrcValueWriter<D> newOrcValueWriter(
TypeDescription schema, Function<TypeDescription, OrcValueWriter<?>> createWriterFunc) {
return (OrcValueWriter<D>) createWriterFunc.apply(schema);
private static <D> OrcRowWriter<D> newOrcValueWriter(
Contributor

Nit: change the method name to reflect the class name change.

Member Author

Thanks for the reminder, good point.

@openinx (Member Author) commented Jul 15, 2020

Ping @rdsr @shardulm94 @rdblue, any other concerns? Thanks.

}
}

private static class MapWriter implements OrcValueWriter<Map> {
Contributor

Missing parameter types.

Member Author

ditto.

}
}

private static class ListWriter implements OrcValueWriter<List> {
Contributor

Missing parameter type.

Member Author

I did not provide the explicit parameter type here, because getJavaClass() would need to return a List class with a generic type, which Java doesn't support. Please see here.

@rdblue (Contributor), Jul 21, 2020

We still need to add the parameterized types.

The FAQ entry you pointed to explains why there is no class literal, like List<String>.class. All variants of List use List.class because there is only one concrete type at runtime. But we still want to use type parameters to be explicit about what is passed around.

This class handles lists of some type, T. The class should be parameterized by T so that we can use type-safe operations to pass around T instances. The wrapped value writer should be OrcValueWriter<T> elementWriter. By doing this, the implementation of nonNullWrite will get a List<T> and will be able to pass those values to the elementWriter without casting.
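
A rough sketch of that suggestion (field names and the omitted write methods are illustrative, not this PR's code):

private static class ListWriter<T> implements OrcValueWriter<List<T>> {
  private final OrcValueWriter<T> elementWriter;

  ListWriter(OrcValueWriter<T> elementWriter) {
    this.elementWriter = elementWriter;
  }

  @Override
  @SuppressWarnings("unchecked")
  public Class<List<T>> getJavaClass() {
    // no List<T>.class literal exists, so the single runtime type List.class is cast once here
    return (Class<List<T>>) (Class<?>) List.class;
  }

  // nonNullWrite (signature assumed) would receive a List<T> and can pass each element to
  // elementWriter without casting:
  //   for (int e = 0; e < data.size(); e += 1) {
  //     elementWriter.write(offset + e, data.get(e), listVector.child);
  //   }
}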

* @param rowId the row in the ColumnVector
* @param column either the column number or element number
* @param data either an InternalRow or ArrayData
* @param data either an InternalRow or ArrayData
@rdblue (Contributor), Jul 15, 2020

Nit: unnecessary whitespace changes.

Member Author

OK, I guess I ran a code format on this file earlier.

@openinx (Member Author) commented Jul 21, 2020

@rdblue @rdsr I think this patch is ready to merge now; please merge it if there are no other concerns.

try (FileAppender<Record> appender = ORC.write(outFile)
.schema(FILE_SCHEMA)
.createWriterFunc(GenericOrcWriter::buildWriter)
.createWriterFunc(typeDesc -> GenericOrcWriter.buildWriter(FILE_SCHEMA, typeDesc))
Contributor

Since the schema is passed to the write builder, what about adding a createWriterFunc method that accepts BiFunction<Schema, TypeDescription>? Then this wouldn't need to change.

We do this in Avro: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/Avro.java#L199-L203

That would cut down on the number of files that need to change in this PR.
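
Something along these lines on the ORC write builder, mirroring the linked Avro builder (class, field, and return type names below are assumptions):

// Hypothetical overload: the builder already holds the Iceberg schema from .schema(...),
// so it can supply both arguments when it creates the row writer.
public WriteBuilder createWriterFunc(BiFunction<Schema, TypeDescription, OrcRowWriter<?>> newWriterFunction) {
  this.writerFunction = newWriterFunction;
  return this;
}

// Test call sites could then keep the method reference, now resolving to the
// two-argument GenericOrcWriter.buildWriter(Schema, TypeDescription):
//   .createWriterFunc(GenericOrcWriter::buildWriter)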

Member Author

OK, that sounds good to me.

Contributor

Would it make more sense to add a createWriterFunc method that accepts BiFunction<Schema, TypeDescription> instead of replacing the existing one? Replacing the existing createWriterFunc causes changes in these files:

TestSparkOrcReadMetadataColumns.java
TestSparkOrcReader.java
TestOrcWrite.java
SparkAppenderFactory.java

*/
public interface OrcValueWriter<T> {

Class<T> getJavaClass();
Contributor

What will this be used for? I don't see anything calling it in this commit.

Member Author

It's used for reading the field from a Record and casting the value to the target class; see here: https://github.com/apache/iceberg/pull/1197/files#diff-69c0f1e45966d2eb49a315fe32734cf5R125.

@openinx (Member Author) left a comment

Addressed all the comments (including the missing parameter types and the Function -> BiFunction change).

@rdsr (Contributor) commented Jul 22, 2020

Thanks @openinx! I will have another look today.

@rdsr (Contributor) left a comment

The changes look great to me! We just need to address very minor comments and a question on replacing vs adding a BiFunction.

@openinx (Member Author) left a comment

About this issue #1197 (comment), I think we'd better not introduce another createWriterFunc method here. That would be more confusing, and OrcFileAppender would need to choose the non-null function to create the OrcRowWriter. Either way, with a Function argument or a BiFunction argument, we need to change a few files, so we may as well just choose one.

I will push the next patch to address the other comments. Thanks for the patient reviews.

@openinx (Member Author) commented Jul 26, 2020

Ping @rdblue @rdsr, mind taking another look and helping to merge this PR? Thanks.

@rdsr (Contributor) commented Jul 26, 2020

@openinx
I felt adding a BiFunction was better than replacing it. And regarding

That would be more confusing, and OrcFileAppender would need to choose the non-null function to create the OrcRowWriter

Avro handles that well. See org.apache.iceberg.avro.Avro.ReadBuilder#build
But I'm OK with the patch in its current form. Since @rdblue initiated this comment, I'll wait for his final approval before merging.

@openinx (Member Author) commented Jul 27, 2020

@rdsr I've refactored the SparkOrcWriter using OrcSchemaWithTypeVisitor here; we can see that the constructor of SparkOrcWriter will also need two arguments: the Iceberg schema and the TypeDescription. So although we could add a Function method here, we wouldn't actually use it in the newly refactored Spark ORC writer; that's why I say we don't need to introduce the redundant Function method.
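
A minimal illustration of the point, with the SparkOrcWriter constructor arguments taken from the description above and the builder chain assumed:

// With the BiFunction form the writer factory can see both schemas; a plain
// Function<TypeDescription, ...> could not supply the Iceberg schema at all.
ORC.write(outFile)
    .schema(icebergSchema)
    .createWriterFunc((schema, typeDesc) -> new SparkOrcWriter(schema, typeDesc))
    .build();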

@rdsr (Contributor) commented Jul 27, 2020

@rdsr I've refactored the SparkOrcWriter using OrcSchemaWithTypeVisitor here; we can see that the constructor of SparkOrcWriter will also need two arguments: the Iceberg schema and the TypeDescription. So although we could add a Function method here, we wouldn't actually use it in the newly refactored Spark ORC writer; that's why I say we don't need to introduce the redundant Function method.

@openinx That makes sense to me. Thanks!

@openinx (Member Author) commented Jul 27, 2020

Thanks for the confirmation. If there are no other concerns, please help merge this PR so that we can move the follow-up Flink ORC reader/writer work forward. Thanks in advance.

@rdsr (Contributor) left a comment

LGTM! Giving it a day for others to have a look before we merge. Thanks @openinx!

@simonsssu (Contributor)

Can we abstract a BaseOrcWriter in the future, and then make GenericOrcWriter, FlinkOrcWriter, and SparkOrcWriter extend it?

@openinx (Member Author) commented Jul 29, 2020

@simon0806 we won't need to abstract a BaseOrcWriter in this patch, because we have discussed sharing the common data type writers across GenericOrcWriter, FlinkOrcWriter, and SparkOrcWriter. That helps decouple the code for the different compute engines. Thanks.

BTW, pinging @rdblue to merge this patch. Thanks.

@rdblue merged commit d2ceb5b into apache:master, Jul 29, 2020
@rdblue (Contributor) commented Jul 29, 2020

Looks good. Thanks, @openinx!

@openinx deleted the geneirc-orc-refactor branch, July 30, 2020 00:24
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020