Skip to content

Conversation

@guilload
Copy link
Contributor

@guilload guilload commented Jun 8, 2020

Hello,

This is another attempt at implementing a MR v1 input format (mapred) for Iceberg. For context, when I started working on this PR, #933 had been inactive for about a month. There's been new activity since then, but since I'm finished I thought I'd still push this branch to offer an alternative.

In this PR, I've tried to address the main concerns raised in #933, mostly about reusing the input format, record reader, and split classes implemented for MR v2.

I've also modified the input format test suite to be able to run against both MR input formats.

A Hive input format can easily be built on top of this MR v1 input format. The IcebergSplit class can be wrapped into a FileSplit if necessary (see a comment from @massdosage on #933). I believe the nice test suite from #933 could also be reused:

public class IcebergWritable extends Container<Record> {}

public class IcebergInputFormat extends MapredIcebergInputFormat<IcebergWritable, Record> implements CombineHiveInputFormat.AvoidSplitCombination {

    @Override
    public boolean shouldSkipCombine(Path path, Configuration conf) {
        return true;
    }

}

The PR is split into multiple commits:

Refactor:
6f6f813: Move ConfigBuilder and InMemoryDataModel out of IcebergInputFormat
5608c1b: Move IcebergRecordReader and IcebergSplit out of IcebergInputFormat
c9a90d4: Refactor TestIcebergInputFormat, mostly factoring out duplicate code

Rename:
df5b5e6: Rename TestIcebergInputFormat class: TestIcebergInputFormat -> TestIcebergInputFormatS

Feature:
8e6ab97: Implement MR v1 (mapred) input format, wrapping the v2 classes and introducing a Container class to deal with the cumbersome MR v1 API (createValue)

@rdblue @rdsr
@cmathiesen @massdosage

@guilload guilload force-pushed the guilload--mapred-input-format branch from 5492bbe to 301d056 Compare June 8, 2020 21:50
@guilload guilload force-pushed the guilload--mapred-input-format branch from 301d056 to 8e6ab97 Compare June 8, 2020 21:57
@guilload guilload changed the title Implement MR v1 input format (mapred) Implement MR v1 (mapred) input format Jun 8, 2020
@massdosage
Copy link
Contributor

Hey there, some context on why #933 was paused for a month - in developing it we noticed issues with Guava dependencies required by Hive so we focused on #935 and #1067 which have subsequently been merged and proven to work for #933 (and within a Hive client). In the meantime we continued development of additional features for the InputFormat as well as a SerDe and a StorageHandler for Hive, all of which are temporarily at https://github.com/ExpediaGroup/hiveberg. We are now in the process of trying to merge these changes into master via #933 and #1103. Then we'd like to add more of the integration tests from Hiveberg and ideally tidy up and refactor more of the code between the two input formats. I'd have a preference for merging any improvements that you have here into #933 or subsequent follow-up PRs so hopefully we end up with the best of both worlds and a really feature-rich, well-tested InputFormat. @rdblue @rdsr what are your thoughts?

@rdblue
Copy link
Contributor

rdblue commented Jun 10, 2020

I agree with @massdosage, now that #933 is unblocked, I'd like to focus on getting that in to unblock the other Hive work. I appreciate the work you've done here, @guilload, and I hope that isn't too disruptive for you. Let's try to get improvements you've made into #933 or to add them afterwards. Is that okay with you?

@guilload
Copy link
Contributor Author

Yeah, no problem.

@guilload guilload closed this Jun 10, 2020
This was referenced Jun 11, 2020
@guilload guilload deleted the guilload--mapred-input-format branch July 22, 2020 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants