
Conversation

@chenjunjiedada
Collaborator

@chenjunjiedada commented Jul 30, 2020

This is a sub-PR of #1237. I will rebase it once #1266 gets merged.

import org.apache.parquet.schema.Type;

public class FlinkParquetReaders extends BaseParquetReaders<Row> {
public class FlinkParquetReaders {
Member

This class doesn't seem to need to be public: only FlinkParquetReader will access those readers, and I don't think other classes need to access it either.

Collaborator Author

Makes sense to me.

@Override
public ParquetValueReader<?> struct(Types.StructType ignored, GroupType struct,
List<ParquetValueReader<?>> fieldReaders) {
// the expected struct is ignored because nested fields are never found when the
Member

nit: the comment looks incomplete?

Collaborator Author

fixed.

return new ParquetValueReaders.UnboxedReader<>(desc);
}
case TIME_MICROS:
return new TimeMillisReader(desc);
Member

Q: is there any problem here? The original type is TIME_MICROS, while the reader is named TimeMillisReader?

Collaborator Author

This is because Flink only supports milliseconds while Parquet stores microseconds, so the name expresses that it reads out milliseconds.

Contributor

I agree this is confusing. There are other places where we use a unit in the class name to indicate the unit being read. Instead, let's be more specific and use something like LossyMicrosToMillisTimeReader.
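For reference, a minimal sketch of what such a reader could look like, assuming Iceberg's ParquetValueReaders.PrimitiveReader base class whose column iterator exposes nextLong(); the class name follows the suggestion above and the body is illustrative, not the PR's final code:

// Sketch: read a Parquet TIME_MICROS column (long, microseconds) and truncate
// to milliseconds, since Flink's TIME type only keeps millisecond precision.
// The truncation is lossy, hence the name.
private static class LossyMicrosToMillisTimeReader
    extends ParquetValueReaders.PrimitiveReader<Integer> {
  LossyMicrosToMillisTimeReader(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public Integer read(Integer reuse) {
    // Flink represents TIME as an int holding milliseconds of the day.
    return (int) Math.floorDiv(column.nextLong(), 1000L);
  }
}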

}
}

protected MessageType type() {
Member

Will any subclass of ReadBuilder access the message type?

Collaborator Author

Previously, the FallbackReader used it. Now I think this could be removed, since the fallback reader defines its own type; we can't get the type from the passed-in builder.

};
}

private static Iterable<Record> generateIcebergGenerics(Schema schema, int numRecords,
Member

Seems it could share common code with RandomGenericData#generate? Maybe make RandomGenericData#generate return an Iterable?

Collaborator Author

You are right, let me refactor this.

Collaborator Author

This method accepts a Record supplier and then generates records. We should keep it for generating fallback records and dictionary-encoded records, but the generateRecords method can be updated to call RandomGenericData#generate directly.

}
}

private static class TimeMicrosWriter extends ParquetValueWriters.PrimitiveWriter<Integer> {
Member

The reader is named TimeMillisReader and the writer is TimeMicrosWriter; could they be symmetrical?

Collaborator Author

The naming reflects what we actually perform: on the reader side, we read out milliseconds for Flink; on the writer side, we write out microseconds for Parquet.

public void write(int repetitionLevel, DecimalData decimal) {
Preconditions.checkArgument(decimal.scale() == scale,
"Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
Preconditions.checkArgument(decimal.precision() <= precision,
Member

Seems the upper bound on the precision of IntegerDecimalWriter is 9? Could we add a precision <= 9 assertion?

Collaborator Author

Will use the latest DecimalUtil.

Collaborator Author

Seems DecimalUtil doesn't handle this. I fixed it in the new commit.

public void write(int repetitionLevel, DecimalData decimal) {
Preconditions.checkArgument(decimal.scale() == scale,
"Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
Preconditions.checkArgument(decimal.precision() <= precision,
Member

Also, could we add a precision <= 18 assertion?

Collaborator Author

How about adding this when allocating the writer? That seems like a suitable place, since that's where we check the Flink type.

Collaborator Author

@openinx, any comment?

Contributor

I think it would be better to do this in the constructor, like @chenjunjiedada suggests. That way we have a check that precision is not larger than the maximum allowed by the type, and that the correct writer is used for the type.
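A minimal sketch of that constructor-time check, assuming the writer extends ParquetValueWriters.PrimitiveWriter and the 9-digit bound for int-backed decimals from the Parquet format spec; the column.writeInteger call mirrors the other Iceberg primitive writers and is illustrative, not the PR's final code:

// Sketch: validate the declared precision once at construction time; the
// per-value checks then only compare against that precision.
private static class IntegerDecimalWriter extends ParquetValueWriters.PrimitiveWriter<DecimalData> {
  private final int precision;
  private final int scale;

  private IntegerDecimalWriter(ColumnDescriptor desc, int precision, int scale) {
    super(desc);
    Preconditions.checkArgument(precision <= 9,
        "Cannot write decimal with precision %s as an int: max precision is 9", precision);
    this.precision = precision;
    this.scale = scale;
  }

  @Override
  public void write(int repetitionLevel, DecimalData decimal) {
    Preconditions.checkArgument(decimal.scale() == scale,
        "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
    Preconditions.checkArgument(decimal.precision() <= precision,
        "Cannot write value as decimal(%s,%s), too large: %s", precision, scale, decimal);
    column.writeInteger(repetitionLevel, (int) decimal.toUnscaledLong());
  }
}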

@chenjunjiedada
Collaborator Author

@openinx, thanks a lot for your comments. Will rebase and update the PRs.

@Override
public FileAppender<Row> newAppender(OutputFile outputFile, FileFormat format) {
MetricsConfig metricsConfig = MetricsConfig.fromProperties(props);
LogicalType logicalType = FlinkSchemaUtil.convert(schema);
Member

This has a similar issue to the comment above, which will break the unit test. If we rebase onto master once #1320 gets merged, then it should be fine.

Member

BTW, we may also need to add Parquet to the parameterized unit tests, such as TestIcebergStreamWriter and TestTaskWriters.

Collaborator Author

Agreed, will take a look when these PRs get in.

};
}

private static Iterable<RowData> generateRowData(Schema schema, int numRecords,
Member

Collaborator Author

Will do.

public void write(int repetitionLevel, DecimalData decimal) {
Preconditions.checkArgument(decimal.scale() == scale,
"Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
Preconditions.checkArgument(decimal.precision() <= 9,
Member

Seems it should be decimal.precision() <= precision?

Collaborator Author

Seems like I misunderstood your comments; let me update this.

public void write(int repetitionLevel, DecimalData decimal) {
Preconditions.checkArgument(decimal.scale() == scale,
"Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
Preconditions.checkArgument(decimal.precision() <= 18,
Member

ditto

Comment on lines +346 to +380
private class ElementIterator<E> implements Iterator<E> {
private final int size;
private final ArrayData list;
private int index;

private ElementIterator(ArrayData list) {
this.list = list;
size = list.size();
index = 0;
}

@Override
public boolean hasNext() {
return index != size;
}

@Override
@SuppressWarnings("unchecked")
public E next() {
if (index >= size) {
throw new NoSuchElementException();
}

E element;
if (list.isNullAt(index)) {
element = null;
} else {
element = (E) ArrayData.createElementGetter(elementType).getElementOrNull(list, index);
}

index += 1;

return element;
}
}
Member

How about moving this ElementIterator to be a static class so the map's EntryIterator could share it? Seems we could do it; you can decide whether it's necessary.

Collaborator Author

I'm not sure how it can be shared with EntryIterator.

import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

public class ParquetWithFlinkSchemaVisitor<T> {
Member

TODO: we could share the Flink and Spark ParquetSchemaVisitor implementations in a common class; that can be a separate issue.

Collaborator Author

Agreed, I would prefer to do the refactoring in a separate PR.

Contributor

Yes, a WithPartner visitor like @JingsongLi added would be great.

@chenjunjiedada force-pushed the use-schema-visitor-for-flink-writer branch from 940e435 to 09c1b41 on August 12, 2020 11:31
@rdblue
Contributor

rdblue commented Aug 12, 2020

From other comments, it sounds like I should review #1320 first and then this will be rebased. I also reviewed the read side, which can be done in parallel.

@chenjunjiedada force-pushed the use-schema-visitor-for-flink-writer branch from 09c1b41 to a73a7d7 on August 20, 2020 06:18
@chenjunjiedada
Collaborator Author

@rdblue @openinx, just rebased this and also added the follow-ups from the reader-side review. Please take a look at your convenience.

@rdblue
Contributor

rdblue commented Aug 20, 2020

@chenjunjiedada, looks like this is conflicting again. Must have been one of the patches I merged this morning. Sorry about that!

I'll take a look at this one next, thanks for your patience with reviews. I've been running behind on reviews lately.

@chenjunjiedada
Collaborator Author

@rdblue, never mind, it was just a small conflict that's already fixed. Take your time.

}

@Override
public ParquetValueWriter<?> primitive(LogicalType sType, PrimitiveType primitive) {
Contributor

Nit: the s in sType indicates Spark. The equivalent here would be fType, or a better name.


@Override
public void write(int repetitionLevel, Integer value) {
long micros = Long.valueOf(value) * 1000;
Contributor

This conversion from Integer doesn't make much sense. Java exposes two Long.valueOf overloads with String arguments and one with a primitive long argument; the last is what is called here. That means this implicitly casts the Integer to long, boxes the result, and then multiplies to produce a primitive.

It would be better to use value.longValue() * 1000 instead.
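A sketch of the write method with that fix applied, assuming TimeMicrosWriter extends ParquetValueWriters.PrimitiveWriter<Integer> and uses its column.writeLong, as the other primitive writers do; illustrative, not the PR's final code:

// Sketch: convert Flink's millisecond-of-day int to Parquet's TIME_MICROS
// long without the redundant boxing of Long.valueOf(...).
@Override
public void write(int repetitionLevel, Integer value) {
  // Widen to long before scaling: 86,400,000 ms * 1000 would overflow an int.
  long micros = value.longValue() * 1000;
  column.writeLong(repetitionLevel, micros);
}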

if (list.isNullAt(index)) {
element = null;
} else {
element = (E) ArrayData.createElementGetter(elementType).getElementOrNull(list, index);
Contributor

This method is called in a tight loop, so for performance any preparation that can be done in advance should be.

That means this getter should be created in the constructor and stored as an instance field. Then it can be called here.

Also, if there is already a null check above, does this need to call getElementOrNull or should it just call a get variant that assumes the value is non-null?

Alternatively, you could replace the if here:

E element = (E) getter.getElementOrNull(list, index);

Member

That means this getter should be created in the constructor and stored as an instance field. Then it can be called here.

Yeah, that sounds good to me, great point.

does this need to call getElementOrNull or should it just call a get variant that assumes the value is non-null?

The getter in ArrayData doesn't have a plain get variant; it only has this interface:

	/**
	 * Accessor for getting the elements of an array during runtime.
	 *
	 * @see #createElementGetter(LogicalType)
	 */
	interface ElementGetter extends Serializable {
		@Nullable Object getElementOrNull(ArrayData array, int pos);
	}

Replacing the if-else with E element = (E) getter.getElementOrNull(list, index); sounds reasonable to me.
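Putting both points together, a sketch of the iterator with the getter cached in the constructor; this assumes the elementType field from the surrounding reader and the Flink ArrayData.ElementGetter API quoted above, and is illustrative rather than the PR's final code:

// Sketch: hoist the ElementGetter out of next() so the per-element loop does
// no repeated getter construction.
private class ElementIterator<E> implements Iterator<E> {
  private final ArrayData list;
  private final int size;
  private final ArrayData.ElementGetter getter;  // created once, reused per element
  private int index = 0;

  private ElementIterator(ArrayData list) {
    this.list = list;
    this.size = list.size();
    this.getter = ArrayData.createElementGetter(elementType);
  }

  @Override
  public boolean hasNext() {
    return index != size;
  }

  @Override
  @SuppressWarnings("unchecked")
  public E next() {
    if (index >= size) {
      throw new NoSuchElementException();
    }
    // getElementOrNull performs the null check internally, so the explicit
    // isNullAt branch is no longer needed.
    E element = (E) getter.getElementOrNull(list, index);
    index += 1;
    return element;
  }
}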

}

if (values.isNullAt(index)) {
entry.set((K) ArrayData.createElementGetter(keyType).getElementOrNull(keys, index), null);
Contributor

Same here. The getters for keys and values should be instance fields.

Contributor

Keys are not allowed to be null, so there should be no need to call getElementOrNull for the key.
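A sketch of how the entry iterator's next() could apply both comments, assuming the surrounding keys, values, entry, keyType, and valueType members from the original diff. Note that ArrayData.ElementGetter only exposes getElementOrNull, so the key getter still goes through it, just without any null branch:

// Sketch: key/value getters cached as fields instead of being rebuilt per entry.
private final ArrayData.ElementGetter keyGetter = ArrayData.createElementGetter(keyType);
private final ArrayData.ElementGetter valueGetter = ArrayData.createElementGetter(valueType);

@Override
@SuppressWarnings("unchecked")
public Map.Entry<K, V> next() {
  if (index >= size) {
    throw new NoSuchElementException();
  }
  // Map keys are never null by contract, so the key side needs no null handling.
  entry.set((K) keyGetter.getElementOrNull(keys, index),
      (V) valueGetter.getElementOrNull(values, index));
  index += 1;
  return entry;
}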

protected Object get(Row row, int index) {
return row.getField(index);
protected Object get(RowData struct, int index) {
return RowData.createFieldGetter(types.get(index), index).getFieldOrNull(struct);
Contributor

Each getter should be stored as a field in an array.
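A sketch of that change, assuming the wrapper holds the field LogicalTypes in a list; the class and constructor names are illustrative:

// Sketch: build every FieldGetter once up front, then index into the array on
// each access instead of creating a getter per call.
private final RowData.FieldGetter[] getters;

RowDataWrapper(List<LogicalType> types) {  // hypothetical constructor
  this.getters = new RowData.FieldGetter[types.size()];
  for (int i = 0; i < types.size(); i += 1) {
    getters[i] = RowData.createFieldGetter(types.get(i), i);
  }
}

protected Object get(RowData struct, int index) {
  return getters[index].getFieldOrNull(struct);
}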

final int fieldPos = i;
assertEquals(types.get(i), logicalType, expected,
() -> RowData.createFieldGetter(logicalType, fieldPos).getFieldOrNull(actualRowData));
RowData.createFieldGetter(logicalType, i).getFieldOrNull(actualRowData));
Contributor

Thanks for fixing these.

@rdblue
Contributor

rdblue commented Aug 25, 2020

@chenjunjiedada, I'm going to merge this. The remaining issues are minor or are not correctness problems. Just be sure to follow up and fix the getter problems or else this will be slower than it should be.

@rdblue merged commit 6a7c0db into apache:master on Aug 25, 2020
@rdblue
Contributor

rdblue commented Aug 25, 2020

Thanks, @chenjunjiedada for building this, and @openinx for reviewing!

@chenjunjiedada deleted the use-schema-visitor-for-flink-writer branch on August 25, 2020 02:46