Refactor the GenericOrcWriter by using OrcSchemaWithTypeVisitor#visit #1197
public static OrcValueWriter<Record> buildWriter(TypeDescription fileSchema) {
  return new GenericOrcWriter(fileSchema);
private final GenericOrcWriters.Converter converter;
Any benefit of using GenericOrcWriters.Converter<?>?
To me there doesn't seem to be much difference with or without <?>, but I can change it to keep symmetry, as we discussed above.
  }
}

@SuppressWarnings("unchecked")
I guess this @SuppressWarnings("unchecked") can be removed once we parameterize the converter with a wildcard, e.g. converter -> converter<?>.
+1
We should avoid using types without parameters.
Here I did not change to OrcValueWriter<?>, because if we do, we would have to write the children like this:
    for (int c = 0; c < writers.size(); ++c) {
      OrcValueWriter<?> child = writers.get(c);
      child.write(row, value.get(c, child.getJavaClass()), output.cols[c]);
    }
The value is a StructLike, and the get method in StructLike is <T> T get(int pos, Class<T> javaClass), while child.getJavaClass() on an OrcValueWriter<?> returns a Class of a wildcard capture, so it fails with the compile error:
Incompatible types. Required capture of ? but 'get' was inferred to T: no instance(s) of type variable(s) exist so that capture of ? conforms to capture of ?
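For reference, one common way around this kind of capture error is a generic helper method that gives the captured type a name, so that getJavaClass() and write() agree on it. The sketch below is only an illustration under assumptions: ValueWriter, Struct, and every method in WildcardCaptureDemo are hypothetical, simplified stand-ins (they return strings instead of filling ORC column vectors) and are not the Iceberg interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for OrcValueWriter<T>; returns text instead of filling a ColumnVector.
interface ValueWriter<T> {
  Class<T> getJavaClass();

  String write(T value);
}

// Hypothetical stand-in for StructLike, with the same shape of get method.
interface Struct {
  <T> T get(int pos, Class<T> javaClass);
}

final class WildcardCaptureDemo {
  private WildcardCaptureDemo() {
  }

  // Capture helper: inside this method the unknown type has the name T,
  // so the value read with getJavaClass() is accepted by write().
  static <T> String writeField(ValueWriter<T> writer, Struct row, int pos) {
    return writer.write(row.get(pos, writer.getJavaClass()));
  }

  static String writeAll(List<ValueWriter<?>> writers, Struct row) {
    StringBuilder out = new StringBuilder();
    for (int c = 0; c < writers.size(); ++c) {
      // Calling child.write(row.get(c, child.getJavaClass())) directly on a
      // ValueWriter<?> does not compile; delegating to the helper does.
      out.append(writeField(writers.get(c), row, c)).append(';');
    }
    return out.toString();
  }

  // Small factory used only for this demo.
  static <T> ValueWriter<T> writerFor(Class<T> type, String label) {
    return new ValueWriter<T>() {
      @Override
      public Class<T> getJavaClass() {
        return type;
      }

      @Override
      public String write(T value) {
        return label + "=" + value;
      }
    };
  }

  // Small array-backed Struct used only for this demo.
  static Struct structOf(Object... values) {
    return new Struct() {
      @Override
      public <T> T get(int pos, Class<T> javaClass) {
        return javaClass.cast(values[pos]);
      }
    };
  }

  public static void main(String[] args) {
    List<ValueWriter<?>> writers = new ArrayList<>();
    writers.add(writerFor(Integer.class, "id"));
    writers.add(writerFor(String.class, "name"));
    System.out.println(writeAll(writers, structOf(1, "a")));
  }
}
```

Whether such a helper is preferable to keeping the suppressed unchecked warning is a judgment call; it mainly moves the type relationship into a method signature where the compiler can check it.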
@rdsr @shardulm94 Would you mind taking another look? I've updated the pull request. Thanks.
private static <D> OrcValueWriter<D> newOrcValueWriter(
    TypeDescription schema, Function<TypeDescription, OrcValueWriter<?>> createWriterFunc) {
  return (OrcValueWriter<D>) createWriterFunc.apply(schema);
private static <D> OrcRowWriter<D> newOrcValueWriter(
Nit: Change method name to reflect class name change
Thanks for the reminder, good point.
Ping @rdsr @shardulm94 @rdblue, any other concerns? Thanks.
  }
}

private static class MapWriter implements OrcValueWriter<Map> {
Missing parameter types.
ditto.
  }
}

private static class ListWriter implements OrcValueWriter<List> {
Missing parameter type.
I did not provide an explicit parameter type here, because getJavaClass() would need to return a List class with a generic type, which Java does not support. Please see here.
We still need to add the parameterized types.
The FAQ entry you pointed to explains why there is no class literal, like List<String>.class. All variants of List use List.class because there is only one concrete type at runtime. But we still want to use type parameters to be explicit about what is passed around.
This class handles lists of some type, T. The class should be parameterized by T so that we can use type-safe operations to pass around T instances. The wrapped value writer should be OrcValueWriter<T> elementWriter. By doing this, the implementation of nonNullWrite will get a List<T> and will be able to pass those values to the elementWriter without casting.
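A minimal sketch of what this suggestion could look like, under stated assumptions: SimpleWriter, TextWriter, and this ListWriter are hypothetical, simplified stand-ins that append to a StringBuilder rather than an ORC ColumnVector. The point is that the class is parameterized by T even though the runtime class is still List.class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for OrcValueWriter<T>; appends text instead of filling a ColumnVector.
interface SimpleWriter<T> {
  Class<T> getJavaClass();

  void write(T value, StringBuilder output);
}

// Leaf writer for strings, used only for this demo.
final class TextWriter implements SimpleWriter<String> {
  @Override
  public Class<String> getJavaClass() {
    return String.class;
  }

  @Override
  public void write(String value, StringBuilder output) {
    output.append(value);
  }
}

// List writer parameterized by its element type T.
final class ListWriter<T> implements SimpleWriter<List<T>> {
  private final SimpleWriter<T> elementWriter;

  ListWriter(SimpleWriter<T> elementWriter) {
    this.elementWriter = elementWriter;
  }

  @Override
  @SuppressWarnings("unchecked")
  public Class<List<T>> getJavaClass() {
    // There is no List<T>.class literal; all List variants share List.class
    // at runtime, so a single unchecked cast is confined to this method.
    return (Class<List<T>>) (Class<?>) List.class;
  }

  @Override
  public void write(List<T> values, StringBuilder output) {
    output.append('[');
    for (int i = 0; i < values.size(); i++) {
      if (i > 0) {
        output.append(',');
      }
      // values.get(i) is already a T, so no cast is needed here.
      elementWriter.write(values.get(i), output);
    }
    output.append(']');
  }
}

final class ListWriterDemo {
  private ListWriterDemo() {
  }

  public static void main(String[] args) {
    ListWriter<String> writer = new ListWriter<>(new TextWriter());
    StringBuilder out = new StringBuilder();
    writer.write(Arrays.asList("a", "b"), out);
    System.out.println(out);
  }
}
```

The unchecked cast does not disappear, but it is limited to getJavaClass(), while the element-handling code stays type-checked.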
 * @param rowId the row in the ColumnVector
 * @param column either the column number or element number
 * @param data either an InternalRow or ArrayData
 * @param data either an InternalRow or ArrayData
Nit: unnecessary whitespace changes.
OK, I guess I ran a code formatter earlier.
try (FileAppender<Record> appender = ORC.write(outFile)
    .schema(FILE_SCHEMA)
    .createWriterFunc(GenericOrcWriter::buildWriter)
    .createWriterFunc(typeDesc -> GenericOrcWriter.buildWriter(FILE_SCHEMA, typeDesc))
Since the schema is passed to the write builder, what about adding a createWriterFunc method that accepts BiFunction<Schema, TypeDescription>? Then this wouldn't need to change.
We do this in Avro: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/Avro.java#L199-L203
That would cut down on the number of files that need to change in this PR.
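To make the suggestion concrete, here is a rough sketch of a builder that accepts either a Function or a BiFunction. Everything in it is an assumption made for illustration: WriteBuilderSketch is hypothetical, and plain String values stand in for the Iceberg Schema, the ORC TypeDescription, and the created writer. It is not the ORC.WriteBuilder or Avro.WriteBuilder code.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical builder; String stands in for Schema, TypeDescription, and the writer.
final class WriteBuilderSketch {
  private String schema;
  private BiFunction<String, String, String> createWriterFunc;

  WriteBuilderSketch schema(String newSchema) {
    this.schema = newSchema;
    return this;
  }

  // Existing style: the factory only needs the file type description.
  // It is adapted to the two-argument form so only one field is stored.
  WriteBuilderSketch createWriterFunc(Function<String, String> func) {
    this.createWriterFunc = (icebergSchema, typeDesc) -> func.apply(typeDesc);
    return this;
  }

  // Added overload: the factory also receives the schema given to the builder.
  WriteBuilderSketch createWriterFunc(BiFunction<String, String, String> func) {
    this.createWriterFunc = func;
    return this;
  }

  String build(String typeDesc) {
    return createWriterFunc.apply(schema, typeDesc);
  }

  public static void main(String[] args) {
    // One-argument factory: existing call sites would not need to change.
    System.out.println(new WriteBuilderSketch()
        .schema("iceberg-schema")
        .createWriterFunc(typeDesc -> "writer(" + typeDesc + ")")
        .build("orc-type"));

    // Two-argument factory: the schema comes from the builder.
    System.out.println(new WriteBuilderSketch()
        .schema("iceberg-schema")
        .createWriterFunc((icebergSchema, typeDesc) -> "writer(" + icebergSchema + "," + typeDesc + ")")
        .build("orc-type"));
  }
}
```

Adapting the one-argument function at registration time means the code that finally creates the writer only ever sees one function. Note that passing a method reference to overloaded methods like these can become ambiguous if the referenced method itself has overloads with both arities.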
OK, that sounds good to me.
Would it make more sense to add a createWriterFunc method that accepts BiFunction<Schema, TypeDescription> instead of replacing the existing one? Replacing the existing createWriterFunc causes changes in these files:
TestSparkOrcReadMetadataColumns.java
TestSparkOrcReader.java
TestOrcWrite.java
SparkAppenderFactory.java
 */
public interface OrcValueWriter<T> {

  Class<T> getJavaClass();
What will this be used for? I don't see anything calling it in this commit.
It's used for reading the field from the Record and casting the value to the target class; see here: https://github.com/apache/iceberg/pull/1197/files#diff-69c0f1e45966d2eb49a315fe32734cf5R125.
openinx
left a comment
Addressed all the comments (including the missing parameter types and Function -> BiFunction).
…ender#newOrcRowWriter.
Thanks @openinx! I will have another look today.
The changes look great to me! We just need to address very minor comments and a question on replacing vs adding a BiFunction.
openinx
left a comment
About #1197 (comment): I think we'd better not introduce another createWriterFunc method here. That would be more confusing, and the OrcFileAppender would need to choose the non-null function to create the OrcRowWriter. With either the Function argument or the BiFunction argument we need to change a few files, so we should just choose one.
I will push the next patch to address the other comments. Thanks for the patient review.
@openinx
@rdsr I've refactored the SparkOrcWriter to use OrcSchemaWithTypeVisitor in here; we can see that the constructor of SparkOrcWriter also needs the two arguments: the Iceberg schema and the TypeDescription. So actually, although we could add a
@openinx That makes sense to me. Thanks!
Thanks for the confirmation. If there are no other concerns, please help to merge this PR so that we can move the follow-up Flink ORC reader and writer work forward. Thanks in advance.
LGTM! Giving it a day for others to have a look before we merge. Thanks @openinx !
Can we abstract a BaseOrcWriter in the future? Then GenericOrcWriter, FlinkOrcWriter, and SparkOrcWriter could extend it.
@simon0806 we won't need to abstract the BTW, ping @rdblue to merge this patch, Thanks.
Looks good. Thanks, @openinx!
This PR refactors GenericOrcWriter to use OrcSchemaWithTypeVisitor#visit, so that the common data type writers can be abstracted into a separate class named GenericOrcWriters. FlinkOrcWriter will share the common writers from GenericOrcWriters.