Skip to content

Support native parquet writer in hive module and Misc fixes#3400

Merged
dain merged 12 commits intotrinodb:masterfrom
qqibrow:add_parquet_writer
Jun 24, 2020
Merged

Support native parquet writer in hive module and Misc fixes#3400
dain merged 12 commits intotrinodb:masterfrom
qqibrow:add_parquet_writer

Conversation

@qqibrow
Copy link
Copy Markdown
Contributor

@qqibrow qqibrow commented Apr 10, 2020

No description provided.

@cla-bot cla-bot bot added the cla-signed label Apr 10, 2020
@dain dain self-requested a review April 10, 2020 03:14
@qqibrow qqibrow changed the title Support native parquet writer in hive module [WIP] Support native parquet writer in hive module Apr 14, 2020
@qqibrow qqibrow changed the title [WIP] Support native parquet writer in hive module Support native parquet writer in hive module and Misc fixes Apr 19, 2020
Copy link
Copy Markdown
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Just some minor comments.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider caching this value if it is expensive to calculate. In the ORC writer, we update the cached value after each write operation is processed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I resolved this. Also, I don't quite understand why in OrcWriter:

    @Override
    public long getWrittenBytes()
    {
        return orcWriter.getWrittenBytes() + orcWriter.getBufferedBytes();
    }

why bufferedBytes is part of written bytes?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there statistics for non-primitive types?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This aims to fill the stats part for primitive columns https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L740 . There is no support for this on non-primitive types.

@qqibrow qqibrow force-pushed the add_parquet_writer branch from 673eead to 5e46b04 Compare May 24, 2020 04:12
qqibrow added 10 commits May 23, 2020 22:33
resetDictionary in reset() cleaned dictionary regardless of the fact
that dictionary page is null or not. resetDictionary should happen
only after get dictionary page and the page is not null.
Set both setting to smaller value could force writer to write multiple
row groups and multiple pages in each row group, which could better
simulate real world examples.
@qqibrow qqibrow force-pushed the add_parquet_writer branch from 5e46b04 to a5f7df3 Compare May 24, 2020 05:34
@qqibrow qqibrow requested a review from dain May 25, 2020 17:05
Copy link
Copy Markdown
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@dain dain merged commit f0f3e53 into trinodb:master Jun 24, 2020
@dain dain mentioned this pull request Jun 24, 2020
9 tasks
@findepi findepi added this to the 337 milestone Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants