Skip to content

Add test for write.metadata.compression-codec property in Iceberg#13953

Merged
ebyhr merged 2 commits intotrinodb:masterfrom
ebyhr:ebi/iceberg-write-metadata-compression-codec
Sep 8, 2022
Merged

Add test for write.metadata.compression-codec property in Iceberg#13953
ebyhr merged 2 commits intotrinodb:masterfrom
ebyhr:ebi/iceberg-write-metadata-compression-codec

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Sep 1, 2022

Description

Add test for write.metadata.compression-codec property in Iceberg

Property Default Description
write.metadata.compression-codec none Metadata compression codec; none or gzip

https://iceberg.apache.org/docs/latest/configuration/

@cla-bot cla-bot bot added the cla-signed label Sep 1, 2022
@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch 2 times, most recently from 53d4d2d to fef36fa Compare September 1, 2022 07:38
@ebyhr ebyhr marked this pull request as ready for review September 1, 2022 11:30
@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from fef36fa to bac2472 Compare September 1, 2022 11:34
@findinpath
Copy link
Copy Markdown
Contributor

trino:default> create table test1 (x integer) WITH (metadata_compression_codec='gzip');

trino:default> show create table test1;
                                             Create Table                                             
------------------------------------------------------------------------------------------------------
 CREATE TABLE iceberg.default.test1 (                                                                 
    x integer                                                                                         
 )                                                                                                    
 WITH (                                                                                               
    format = 'PARQUET',                                                                               
    format_version = 2,                                                                               
    location = 'hdfs://hadoop-master:9000/user/hive/warehouse/test1-d1e1ae768e304003b0c322b9a5df5368' 
 )                                                                                                    
(1 row)

metadata_compression_codec property is missing

@findinpath
Copy link
Copy Markdown
Contributor

Spark Iceberg is able to change on the fly the write.metadata.compression-codec property and this gets reflected in the metadata files created :

➜  metadata ls
00000-20dad955-ef84-4b40-b9ae-2cfebcf02d8d.gz.metadata.json
00001-1502c408-3b4d-4358-8265-de5d8e64636e.gz.metadata.json
00002-d2bc1c57-62af-4238-a64c-7d298b53eeca.metadata.json
00003-2c2b5b36-5e37-44ee-8cf6-052fddb92951.metadata.json

Notice in the snippet above that there are both gz.metadata.json and metadata.json files.

Corresponding test scenario:

spark-sql> create table test1 (x integer) tblproperties ('write.metadata.compression-codec' = 'gzip');
Response code
Time taken: 2.634 seconds
spark-sql> insert into test1 values 1;
Response code
Time taken: 2.877 seconds
spark-sql> alter table test1 set tblproperties ('write.metadata.compression-codec' = 'none');
Response code
Time taken: 0.075 seconds
spark-sql> insert into test1 values 2;
spark-sql> select * from test1;
x
2
1
Time taken: 0.549 seconds, Fetched 2 row(s)

Should we allow changing this property as well in io.trino.plugin.iceberg.IcebergMetadata#setTableProperties ?

@alexjo2144
Copy link
Copy Markdown
Member

On the testing side, can we assert that the files are actually compressed? Also, there are a few property tests in TestIcebergV2 like testUpdatingAllTableProperties that could be updated

@alexjo2144
Copy link
Copy Markdown
Member

Optional: We could add ametadata_compression_codec_default to IcebergConfig so if you wanted to have gzip enabled by default you could.

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 2, 2022

metadata_compression_codec property is missing

I intentionally omitted it when the value is none. That is why IcebergUtil has if block.

Should we allow changing this property as well in io.trino.plugin.iceberg.IcebergMetadata#setTableProperties ?

This is also intentionally disallowed to start with small feature (= we can support it when needed).

@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from bac2472 to fda0c88 Compare September 2, 2022 02:12
@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from fda0c88 to 19afc40 Compare September 2, 2022 05:25
@ebyhr ebyhr requested a review from findepi September 2, 2022 09:31
Comment on lines 568 to 569
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances a user would want to set this?

cc @danielcweeks @rdblue @alexjo2144

@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from 19afc40 to 096c979 Compare September 2, 2022 12:01
@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from 096c979 to 2427891 Compare September 2, 2022 12:24
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 2, 2022

@findepi, I don't think that adding configuration for this is a good idea. This should be configured at the table level, not in processing engines.

@alexjo2144
Copy link
Copy Markdown
Member

@rdblue sorry, I'm not sure exactly what you're suggesting. The code here is really just an alias for setting the Iceberg TableProperties: write.metadata.compression-codec so the value is being set as a table property, to be used across engines.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 2, 2022

@alexjo2144, metadata files are typically handled within Iceberg, so I don't think Trino should need to do anything to make sure write.metadata.compression-codec is handled correctly. Trino should not need to worry about this property at all.

@alexjo2144
Copy link
Copy Markdown
Member

Trino doesn't have an arbitrary SET TBLPROPERTIES syntax that takes any string, property keys need to be explicitly enabled. So, any table properties we want users to be able to enable/disable from Trino need some minimal amount of handling.

Unless you're saying users shouldn't be able to enable this property from Trino at all?

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 2, 2022

I don't think that adding configuration for this is a good idea. This should be configured at the table level, not in processing engines.

This change doesn't add a configuration. IcebergTableProperties is for table property.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 2, 2022

@ebyhr, @alexjo2144, I didn't realize that there wasn't a way to set table configuration in Trino. I'm still not sure that the right option is to implement PRs like this for individual table properties. And, I'm a bit concerned about creating a custom name that doesn't match Iceberg. That just requires more documentation and fragments support across the ecosystem -- people won't be able to set metadata_compression_codec anywhere else. And it's also a bit strange to show the underlying table property.

Can you pass properties from WITH through if they aren't mapped to anything specific, like partitioning?

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 5, 2022

I didn't realize that there wasn't a way to set table configuration in Trino

Trino allows to set table properties that are declared by the connector.

Can you pass properties from WITH through if they aren't mapped to anything specific

Not currently. Trino table properties are strictly typed and so they need to be declared

like partitioning?

partitioning is handled explicitly by Trino

There are two aspects of this

  • technical side: how a user should be able to set the write.metadata.compression-codec table property
    • exposing as metadata_compression_codec is one way of doing it
    • allowing users to set arbitrary string-valued properties is another way of doing this
  • usefulness question: Under what circumstances a user would want to set this? (Add test for write.metadata.compression-codec property in Iceberg #13953 (comment))
    • if the answer is "never", we should not add a new table property
    • if the answer is "always", we should also not add a table property
    • if the answer is "depends on the metadata size", then I'd argue that maybe we don't need the table property either -- Trino knowns metadata size, so it could apply compression or not automatically.

cc @electrum @dain

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 5, 2022

if the answer is "always", we should also not add a table property

We always set it by default for new tables.

I think this is mostly an administrator option and not something that should be exposed to users. Users probably don't know or care what to do with this setting.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 5, 2022

An administrator/operator is also a user from the Trino project perspective.
When would an administrator want to set/change value of write.metadata.compression-codec Iceberg table property?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 5, 2022

@findepi, an administrator would set this for all tables in the catalog, not for individual tables.

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 5, 2022

@rdblue What's the situation an administrator sets gzip compression for metadata file in the catalog?

Also, your answer denies your past comment #13953 (comment)

This should be configured at the table level

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 6, 2022

@ebyhr, this is a table level setting because that's the granularity at which you can change it. We also don't expect users to interact with this. It is probably carried through from the catalog when a table gets created -- that's what we do. It's reasonable for an engine to allow changing table-level settings like this, but it isn't something that should be specific to an engine (with a special property that only works in Trino).

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 6, 2022

@rdblue Thanks, I hope your will answer to this question "What's the situation an administrator sets gzip compression for metadata file in the catalog?"

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 7, 2022

@ebyhr, this setting is most likely controlled by administrators that set table defaults. Users will probably never go into this level of detail.

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 7, 2022

@rdblue I'm not asking who sets the property. My question was "When administrators want to use gzip for metadata compression globally"? e.g. estimated file size > n

@ebyhr ebyhr force-pushed the ebi/iceberg-write-metadata-compression-codec branch from 2427891 to b37f1ee Compare September 8, 2022 00:10
@ebyhr ebyhr changed the title Add support for write.metadata.compression-codec property in Iceberg Add test for write.metadata.compression-codec property in Iceberg Sep 8, 2022
@ebyhr ebyhr added the no-release-notes This pull request does not require release notes entry label Sep 8, 2022
@ebyhr ebyhr requested a review from findepi September 8, 2022 02:40
@ebyhr ebyhr merged commit c4434ed into trinodb:master Sep 8, 2022
@ebyhr ebyhr deleted the ebi/iceberg-write-metadata-compression-codec branch September 8, 2022 13:07
@github-actions github-actions bot added this to the 396 milestone Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed docs no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

5 participants