Skip to content

Propagate selected compression codec to Avro file writer#12639

Merged
losipiuk merged 3 commits intotrinodb:masterfrom
losipiuk:lo/avro-compression
Jun 8, 2022
Merged

Propagate selected compression codec to Avro file writer#12639
losipiuk merged 3 commits intotrinodb:masterfrom
losipiuk:lo/avro-compression

Conversation

@losipiuk
Copy link
Copy Markdown
Member

@losipiuk losipiuk commented Jun 1, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Hive
* Propagate compression codec selected by `hive.compression-codec` config property or `compression_codec` session property to Avro file writer. 
  If unsupported codec is selected (GZIP, LZ4) then files will be written using DEFLATE compression which matches previous behavior ({issue}`12639`)

@cla-bot cla-bot bot added the cla-signed label Jun 1, 2022
@losipiuk losipiuk requested review from electrum and findepi June 1, 2022 18:28
@losipiuk losipiuk force-pushed the lo/avro-compression branch from 2a3bd2c to c5ed456 Compare June 2, 2022 19:11
@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Jun 2, 2022

@findepi reworked toward what we dicussed. So we throw if unsupported compression codec is selected for Avro.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why none for unknown format?

i think it's a behavioral change. if so, should be code-commented and highlited in the commit message

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - it is a behavioral change. I think it is fine though. This code path is used only for empty files. And compressing those does not make much sense anyway. I will add comment and change commit message.
Do you think it should make it to RN?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a bunch of BS comments here, which I deleted later on (e.g. this code path is not really empty-files only)
Current state is that I make it not change behaviour for now.
It may be considered to disable compression for empty files as a followup but it does not happen so far.

@losipiuk losipiuk force-pushed the lo/avro-compression branch from c5ed456 to 284774d Compare June 6, 2022 13:34
@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Jun 6, 2022

AC.
I also needed to make changes to DeltaConnector as apparently a session property which shares name in Hive and Delta connector has to have same underlying type.

@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Jun 6, 2022

PTAL @findepi

@losipiuk losipiuk requested a review from findepi June 6, 2022 13:35
@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 6, 2022

I also needed to make changes to DeltaConnector as apparently a session property which shares name in Hive and Delta connector has to have same underlying type.

when i introduced compression in Detla, the hive compression mechanics and semantics made sense for Delta (and Iceberg) too.

with logical compression in Hive, this is no longer the case, we should decouple the two.
What would it take to have separate enums in Hive and Delta?
(do you want to give it a try, or want me to do this?)

@losipiuk losipiuk force-pushed the lo/avro-compression branch from 284774d to 584dc1e Compare June 6, 2022 17:08
@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Jun 6, 2022

I also needed to make changes to DeltaConnector as apparently a session property which shares name in Hive and Delta connector has to have same underlying type.

when i introduced compression in Detla, the hive compression mechanics and semantics made sense for Delta (and Iceberg) too.

with logical compression in Hive, this is no longer the case, we should decouple the two. What would it take to have separate enums in Hive and Delta? (do you want to give it a try, or want me to do this?)

@findepi This was a test issue. The session based on HiveSessionProperties was used in delta tests. Changed to use one based on DeltaLakeSessionProperties and it looks fine now.

@losipiuk losipiuk force-pushed the lo/avro-compression branch 2 times, most recently from cf24199 to 1350aa0 Compare June 7, 2022 11:04
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(logically conflicts with #12687)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I know. This is why I am in rush.

losipiuk added 3 commits June 7, 2022 14:05
This will breaks the 1-1 mapping between compression codec used when
writing data files via Hive connector, and value passed as
hive.compression-codec configuration property or compression_codec
session property.
Now when choosing final compression codec to be used file format is also
taken into account.

Actual codec selection logic is not changed as part of this commit. New
mechanics will be exploited in a following ones.
@losipiuk losipiuk force-pushed the lo/avro-compression branch from 1350aa0 to a5d1de0 Compare June 7, 2022 12:06
@losipiuk losipiuk merged commit 98945e8 into trinodb:master Jun 8, 2022
@github-actions github-actions bot added this to the 385 milestone Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants