Flink: add _row_id and _last_updated_sequence_number readers #14148
Conversation
Marking CI failure: https://github.com/apache/iceberg/actions/runs/17906247731/job/50907838966?pr=14148
Only some minor comments. @mxm: Could you please take a look?
OutputFile output = new InMemoryOutputFile();
try (FileAppender<Record> writer =
    Parquet.write(Files.localOutput(testFile))
What's the reason for moving from local file to in memory?
In #12836, Spark and Core changed their behavior, so I aligned this part with them.
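For reference, a minimal sketch of the in-memory write pattern discussed here; it is not the test code from this PR, the schema, record list, and class/method names are placeholders, and it assumes Iceberg's InMemoryOutputFile and the generic Parquet writer:

import java.io.IOException;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.inmemory.InMemoryOutputFile;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

class InMemoryWriteSketch {
  // Writes records to an in-memory output file instead of a local temp file,
  // so the test does not touch the local filesystem.
  static InputFile writeInMemory(Schema schema, List<Record> records) throws IOException {
    OutputFile output = new InMemoryOutputFile();
    try (FileAppender<Record> writer =
        Parquet.write(output)
            .schema(schema)
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .build()) {
      writer.addAll(records);
    }
    // The written bytes can be read back through the corresponding input file.
    return output.toInputFile();
  }
}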
Removed:

ParquetValueReader<?> reader = readersById.get(id);
if (idToConstant.containsKey(id)) {
  // containsKey is used because the constant may be null
  int fieldMaxDefinitionLevel =
      maxDefinitionLevelsById.getOrDefault(id, defaultMaxDefinitionLevel);
  reorderedFields.add(
      ParquetValueReaders.constant(idToConstant.get(id), fieldMaxDefinitionLevel));
} else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
  reorderedFields.add(ParquetValueReaders.position());
} else if (id == MetadataColumns.IS_DELETED.fieldId()) {
  reorderedFields.add(ParquetValueReaders.constant(false));
} else if (reader != null) {
  reorderedFields.add(reader);
} else if (field.initialDefault() != null) {
  reorderedFields.add(
      ParquetValueReaders.constant(
          RowDataUtil.convertConstant(field.type(), field.initialDefault()),
          maxDefinitionLevelsById.getOrDefault(id, defaultMaxDefinitionLevel)));
} else if (field.isOptional()) {
  reorderedFields.add(ParquetValueReaders.nulls());
} else {
  throw new IllegalArgumentException(
      String.format("Missing required field: %s", field.name()));
}

Replaced with:

ParquetValueReader<?> reader =
    ParquetValueReaders.replaceWithMetadataReader(
        id, readersById.get(id), idToConstant, constantDefinitionLevel);
reorderedFields.add(defaultReader(field, reader, constantDefinitionLevel));
Just curious, how did you decide to make this refactoring? It consolidates several branches of the if statement.
In #12836, the shared logic was moved into Core and used by Spark. I noticed the Flink code was the same, so I reused it here as well.
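For context, a rough sketch of how the consolidated helpers slot into the struct-reader loop, based on the new call site shown above; expectedFields and the surrounding builder state (readersById, idToConstant, constantDefinitionLevel, defaultReader) are assumed to come from the enclosing visitor and are not spelled out here:

// Sketch only: reorder the field readers for the expected schema, letting the
// shared helpers handle metadata columns, constants, defaults, and nulls.
List<ParquetValueReader<?>> reorderedFields = new ArrayList<>(expectedFields.size());
for (Types.NestedField field : expectedFields) {
  int id = field.fieldId();
  ParquetValueReader<?> reader =
      ParquetValueReaders.replaceWithMetadataReader(
          id, readersById.get(id), idToConstant, constantDefinitionLevel);
  reorderedFields.add(defaultReader(field, reader, constantDefinitionLevel));
}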
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.MapData;
import org.apache.flink.table.data.RawValueData;
import org.apache.flink.table.data.RowData;
The PR needs to go against the 2.1 target (currently: 2.0), as #13714 has been merged.
Can we focus on the 2.0.0 version for this PR first, and then backport it to 2.1.0 and 1.20 later?
I'm fine with both.
Marking CI failure: https://github.com/apache/iceberg/actions/runs/17911586761/job/50924149718?pr=14148
@mxm: Any comments?
mxm left a comment:
LGTM!
Merged to main.
This PR adds readers in Flink for _row_id and _last_updated_sequence_number to support row lineage in Flink.
The change mainly aligns with and references #12836.
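For illustration, a hypothetical read projection that would exercise these readers; it assumes MetadataColumns exposes ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER for v3 row lineage, and the "id" column and class/method names are placeholders, not part of this PR:

import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;

class LineageProjectionSketch {
  // Builds a read projection with one data column plus the two lineage metadata
  // columns, so the new Flink readers populate _row_id and
  // _last_updated_sequence_number in the emitted rows.
  static Schema lineageProjection(Table table) {
    return new Schema(
        table.schema().findField("id"),
        MetadataColumns.ROW_ID,
        MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER);
  }
}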