Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make format prefix optional for format options in COPY #9723

Merged
merged 5 commits into from
Mar 22, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion datafusion/sql/src/statement.rs
Original file line number Diff line number Diff line change
Expand Up @@ -850,7 +850,16 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
return plan_err!("Unsupported Value in COPY statement {}", value);
}
};
options.insert(key.to_lowercase(), value_string.to_lowercase());
if !(&key.contains('.')) {
// If config does not belong to any namespace, assume it is
// a format option and apply the format prefix for backwards
// compatibility.

let renamed_key = format!("format.{}", key);
options.insert(renamed_key.to_lowercase(), value_string.to_lowercase());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tinfoil-knight @alamb, just would like to know the history of why we would make the value string lower case, I am encountering this case issue in another PR as described https://github.com/apache/datafusion/pull/10792/files#r1632877093. Thanks :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was originally added in this PR: #9382

It seems like the author of that PR was trying to consolidate case changes (using to_lowercase) in multiple places to just a single place.

For now, you could add an exclusion for the access token option that is case sensitive in your PR so you can continue your work.

Ideally, we should probably stop lower-casing the options and handle case-sensitivity in the specific reader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having keys lowercased makes sense for standardization purposes, but I'm not sure why we are also lowercasing option values. It may either be because (1) of the same standardization concerns, or (2) by mistake.

The most flexible way would be to make value (not key) standardization logic customizable, with the default being lowercase. If we had that, users like @xinlifoobar would be able to avoid standardization modifications. Other users who want other kinds of standardizations would be able to override it etc.

@tinfoil-knight, would you be interested in thinking about how we can do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened an issue here: #10853

Making the value standardization logic customizable makes sense. Integrations should receive the original value and be able to decide what to do with it.

I'll attempt a fix for the issue this weekend if it hasn't already been picked up by then.

} else {
options.insert(key.to_lowercase(), value_string.to_lowercase());
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for @devinjdangelo

In the previous similar PR, we were only renaming keys when the file type matches JSON, CSV or PARQUET.
See https://github.com/apache/arrow-datafusion/pull/9594/files#diff-bb98aeaec478d9576e33911f252745961f57e38d25f6e9d13803894dfbad25d5R876-R881

I decided to not do this now since it'd seem weird if some file types need a format prefix & others don't.

Let me know if you think differently.


Sidenote: Other file formats don't have any supported format options currently.


let file_type = if let Some(file_type) = statement.stored_as {
Expand Down
39 changes: 39 additions & 0 deletions datafusion/sqllogictest/test_files/copy.slt
Original file line number Diff line number Diff line change
Expand Up @@ -474,6 +474,45 @@ select * from validate_arrow;
1 Foo
2 Bar

# Format Options Support without the 'format.' prefix

# Copy with format options for Parquet without the 'format.' prefix
query IT
COPY source_table TO 'test_files/scratch/copy/format_table.parquet'
OPTIONS (
compression snappy,
'compression::col1' 'zstd(5)'
);
----
2

# Copy with format options for JSON without the 'format.' prefix
query IT
COPY source_table to 'test_files/scratch/copy/format_table'
STORED AS JSON OPTIONS (compression gzip);
----
2

# Copy with format options for CSV without the 'format.' prefix
query IT
COPY source_table to 'test_files/scratch/copy/format_table.csv'
OPTIONS (
has_header false,
compression xz,
datetime_format '%FT%H:%M:%S.%9f',
delimiter ';',
null_value 'NULLVAL'
);
----
2

# Copy with unknown format options without the 'format.' prefix to ensure error is sensible
query error DataFusion error: Invalid or Unsupported Configuration: Config value "unknown_option" not found on CsvOptions
COPY source_table to 'test_files/scratch/copy/format_table2.csv'
OPTIONS (
unknown_option false,
);


# Error cases:

Expand Down
Loading