Implement MR v1 (mapred) input format #1104
Closed
Hello,
This is another attempt at implementing an MR v1 input format (mapred) for Iceberg. For context, when I started working on this PR, #933 had been inactive for about a month. There has been new activity since then, but since my work is finished I thought I'd still push this branch to offer an alternative.
In this PR, I've tried to address the main concerns raised in #933, mostly about reusing the input format, record reader, and split classes implemented for MR v2.
I've also modified the input format test suite so that it can run against both MR input formats.
A Hive input format can easily be built on top of this MR v1 input format. The IcebergSplit class can be wrapped into a FileSplit if necessary (see a comment from @massdosage on #933). I believe the nice test suite from #933 could also be reused.
The PR is split into multiple commits:
Refactor:
6f6f813: Move ConfigBuilder and InMemoryDataModel out of IcebergInputFormat
5608c1b: Move IcebergRecordReader and IcebergSplit out of IcebergInputFormat
c9a90d4: Refactor TestIcebergInputFormat, mostly factoring out duplicate code
Rename:
df5b5e6: Rename TestIcebergInputFormat class: TestIcebergInputFormat -> TestIcebergInputFormatS
Feature:
8e6ab97: Implement MR v1 (mapred) input format, wrapping the v2 classes and introducing a Container class to deal with the cumbersome MR v1 API (createValue)
@rdblue @rdsr @cmathiesen @massdosage
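
To illustrate why a Container helps with createValue: in the mapred API the framework asks the reader for a value object up front and then passes that same object to next(key, value) to be filled in, whereas an iterator-style reader hands back a fresh record each time. Below is a minimal sketch of that adaptation, not the code from 8e6ab97. V1RecordReader is a hypothetical stand-in for the mapred RecordReader contract (the real interface also declares getPos, getProgress and close), and WrappingReader, readAll and the Container shown here are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ContainerSketch {

  /** Hypothetical stand-in for the mapred RecordReader contract (no Hadoop dependency). */
  interface V1RecordReader<K, V> {
    K createKey();

    V createValue();

    /** Fills the caller-supplied value; returns false once the input is exhausted. */
    boolean next(K key, V value);
  }

  /** Mutable holder: allocated once by the caller, its content is swapped per record. */
  static final class Container<T> {
    private T value;

    T get() {
      return value;
    }

    void set(T newValue) {
      this.value = newValue;
    }
  }

  /** Adapts an iterator-style source to the v1 "fill this object" style. */
  static final class WrappingReader<T> implements V1RecordReader<Void, Container<T>> {
    private final Iterator<T> records;

    WrappingReader(Iterator<T> records) {
      this.records = records;
    }

    @Override
    public Void createKey() {
      return null; // no meaningful key in this sketch
    }

    @Override
    public Container<T> createValue() {
      return new Container<T>(); // empty holder, filled by next()
    }

    @Override
    public boolean next(Void key, Container<T> value) {
      if (!records.hasNext()) {
        return false;
      }
      value.set(records.next()); // swap the record into the reused holder
      return true;
    }
  }

  /** Drives the reader the way a v1-style caller would: one value object, reused. */
  static <T> List<T> readAll(V1RecordReader<Void, Container<T>> reader) {
    List<T> out = new ArrayList<T>();
    Void key = reader.createKey();
    Container<T> value = reader.createValue();
    while (reader.next(key, value)) {
      out.add(value.get());
    }
    return out;
  }

  public static void main(String[] args) {
    Iterator<String> rows = Arrays.asList("a", "b", "c").iterator();
    System.out.println(readAll(new WrappingReader<String>(rows)));
  }
}
```

The point of the holder is that the record type itself does not need a no-argument constructor or setters; only the Container has to be creatable ahead of time.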