Allow adding user defined metadata to ParquetSink #10224

Merged
6 commits merged on Apr 26, 2024

Contributor

@wiedld wiedld commented Apr 24, 2024

Which issue does this PR close?

Closes #10223.

Related to #9493

Rationale for this change

Restore the ability to set user-provided metadata in the writer properties of ParquetSink, and expose it as a configurable option in the SQL-level API.

What changes are included in this PR?

Broken up per commit:

  • chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions
    • this is code cleanup to make explicit the reported bug
  • refactor: restore the ability to add kv metadata into the generated file sink
  • test: demonstrate API contract for metadata TableParquetOptions
    • this adds a new SQL level API for adding metadata, including tests.

New feature added (UPDATED syntax):

COPY source_table TO 'sink' STORED AS PARQUET OPTIONS ('format.metadata::key' 'value')

Are these changes tested?

Yes. We have integration tests for the TableParquetOptions API as well as sqllogictests for the SQL-level API.

Are there any user-facing changes?

Yes. The TableParquetOptions API is extended, and there is a new SQL-level API for these options.

@github-actions github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Apr 24, 2024
@wiedld wiedld changed the title from "10223/provide metadata to parquet sink" to "fix(10223): provide metadata to parquet sink" Apr 24, 2024
@wiedld wiedld marked this pull request as ready for review April 24, 2024 23:56
@alamb alamb changed the title from "fix(10223): provide metadata to parquet sink" to "Allow adding user defined metadata to parquet sink" Apr 25, 2024
Contributor

@alamb alamb left a comment

Thank you @wiedld -- this looks great to me. I had some small documentation and test tweak suggestions, but I also think this PR could be merged as is and we could do them in a follow-on PR.

Just let me know what you prefer

cc @devinjdangelo and @metesynnada who I think have worked on this code recently

datafusion/core/src/datasource/file_format/parquet.rs (outdated, resolved)
datafusion/common/src/config.rs (outdated, resolved)
datafusion/common/src/file_options/parquet_writer.rs (outdated, resolved)
datafusion/sqllogictest/test_files/copy.slt (outdated, resolved)
datafusion/sqllogictest/test_files/copy.slt (outdated, resolved)
Contributor

@devinjdangelo devinjdangelo left a comment

Thank you @wiedld, the code looks good to me. The expected syntax of the metadata value isn't 100% clear to me, but that could just be because I've not worked with Parquet kv metadata before.

datafusion/common/src/config.rs (outdated, resolved)
Contributor

@alamb alamb left a comment

Looking close

TO 'test_files/scratch/copy/table_with_metadata/'
STORED AS PARQUET
OPTIONS (
'format.metadata' 'key1:value1 key2:value2'
Contributor

I missed this the first time around -- nice eyes, @devinjdangelo. @wiedld, could you please add documentation to TableParquetOptions that describes this behavior?

Specifically, I would be interested to know "what if you want to store metadata values that have spaces in them" (key1:my value with spaces)?

Contributor

BTW it would be fine if the answer is "you will get an error / is not supported yet" -- it might just be good to document that behavior

I could see wanting to support things like key1:"my awesome value" key2:"my other awesome value" (again not in this PR, but we should at least document how it works I think)
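
To illustrate the concern raised above, here is a minimal sketch of why a single space-delimited option value cannot carry values that contain spaces. `parse_space_delimited` is a hypothetical helper written for this note, not the parser from the PR. It assumes entries are separated by whitespace and that each key is split from its value at the first `:`.

```rust
use std::collections::HashMap;

/// Hypothetical parser for the earlier single-option syntax,
/// e.g. "key1:value1 key2:value2". Not DataFusion's actual code.
fn parse_space_delimited(raw: &str) -> Result<HashMap<String, String>, String> {
    let mut metadata = HashMap::new();
    // Whitespace separates entries, so a value can never contain a space.
    for entry in raw.split_whitespace() {
        // The first ':' separates the key from the value.
        let (key, value) = entry
            .split_once(':')
            .ok_or_else(|| format!("entry `{entry}` has no `:` delimiter"))?;
        metadata.insert(key.to_string(), value.to_string());
    }
    Ok(metadata)
}
```

Under these assumptions, `key1:value1 key2:value2` parses into two entries, while `key1:my value with spaces` is rejected because the token `value` has no `:` delimiter.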

Contributor

Perhaps we could leverage a syntax like:

(
'format.metadata.key1' 'val1',
'format.metadata.key2' 'val2 with space',
...
)

to support values with spaces

Contributor Author

😆 You beat me to it.

I started with the internal double-quotes approach and also considered escaped spaces; then I realized that both introduce lexical rules that did not feel SQL-appropriate. I landed on the approach suggested by @devinjdangelo; the commit will be up shortly.
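
A minimal sketch of the idea behind the `format.metadata::key` syntax: the metadata key is taken from the option name, so the option value can be stored verbatim and needs no quoting or escaping rules. `collect_metadata` and `METADATA_PREFIX` are hypothetical names used for illustration; this is not DataFusion's actual config parsing.

```rust
use std::collections::HashMap;

/// Assumed prefix that marks an option as user metadata (illustrative constant).
const METADATA_PREFIX: &str = "format.metadata::";

/// Hypothetical helper: collect user metadata from (option name, option value) pairs.
/// Everything after the prefix is the key; the value is kept exactly as given.
fn collect_metadata(options: &[(&str, &str)]) -> HashMap<String, String> {
    let mut metadata = HashMap::new();
    for (name, value) in options {
        // Options without the metadata prefix are ignored by this sketch.
        if let Some(key) = name.strip_prefix(METADATA_PREFIX) {
            metadata.insert(key.to_string(), value.to_string());
        }
    }
    metadata
}
```

Because only the option name is parsed, a value such as `value with special chars :: :` passes through unchanged.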

@wiedld wiedld force-pushed the 10223/provide-metadata-to-parquet-sink branch from 6d6e5ba to b7808f1 Compare April 25, 2024 20:34
… key parsing be a part of the format.metadata::key
@wiedld wiedld force-pushed the 10223/provide-metadata-to-parquet-sink branch from b7808f1 to 5b06bf3 Compare April 25, 2024 20:36
Comment on lines +315 to +323
# accepts multiple entries with the same key (will overwrite)
statement ok
COPY source_table
TO 'test_files/scratch/copy/table_with_metadata/'
STORED AS PARQUET
OPTIONS (
'format.metadata::key1' 'value',
'format.metadata::key1' 'value'
)
Contributor Author

Overwriting is common to all of the config OPTIONS(...). If we want to change it, I recommend a follow-up ticket.
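
The overwrite behavior can be sketched with a plain map: setting the same option name twice keeps only the last value. This assumes options end up in a map keyed by option name, as in the hypothetical `set_option` helper below; it is not DataFusion's config code.

```rust
use std::collections::HashMap;

/// Hypothetical helper: store an option and return the value it replaced, if any.
fn set_option(
    options: &mut HashMap<String, String>,
    name: &str,
    value: &str,
) -> Option<String> {
    // `HashMap::insert` overwrites an existing entry and returns the old value.
    options.insert(name.to_string(), value.to_string())
}
```

Setting `format.metadata::key1` twice leaves a single entry holding the second value, and the second call returns the value it replaced.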

Contributor

I agree there is no need to change it in this PR

Contributor

@alamb alamb left a comment

Very nice -- thank you for the changes @wiedld, and thanks for the comments and suggestions @devinjdangelo. I think this PR looks very nice.

/// Multiple entries are permitted
/// ```sql
/// OPTIONS (
/// 'format.metadata::key1' '',
Contributor

👍

.set("format.metadata::key3", "value with spaces ")
.unwrap();
table_config
.set("format.metadata::key4", "value with special chars :: :")
Contributor

nice

@alamb alamb changed the title from "Allow adding user defined metadata to parquet sink" to "Allow adding user defined metadata to ParquetSink" Apr 26, 2024
@alamb alamb merged commit 9c8873a into apache:main Apr 26, 2024
24 checks passed
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 26, 2024
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions

* refactor: restore the ability to add kv metadata into the generated file sink

* test: demomnstrate API contract for metadata TableParquetOptions

* chore: update code docs

* fix: parse on proper delimiter, and improve tests

* fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 30, 2024
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions

* refactor: restore the ability to add kv metadata into the generated file sink

* test: demomnstrate API contract for metadata TableParquetOptions

* chore: update code docs

* fix: parse on proper delimiter, and improve tests

* fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

ParquetSink: restore ability to provide additional user metadata into the encoded parquet file.
3 participants