Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support newlines_in_values CSV option #11533

Merged
merged 10 commits into from
Jul 21, 2024
Merged

Conversation

connec
Copy link
Contributor

@connec connec commented Jul 18, 2024

Which issue does this PR close?

Closes #11472.

Rationale for this change

This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can only be controlled by the session-level datafusion.optimizer.repartition_file_scans and datafusion.optimizer.repartition_file_min_size settings.

What changes are included in this PR?

This commit introduces a newlines_in_values option to CsvOptions and plumbs it through to CsvExec, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.

I've added newlines_in_values using similar conventions to has_header, with CsvOptions using an Option<bool> and a default value coming from datafusion::common::config::CatalogOptions.

For now, in the interests of being surgical, I've just added a new argument to CsvExec::new, which is now triggering clippy::too_many_arguments. Before going any further I wanted to see if this was overall a good approach, but I'm happy to refactor this into an options struct or similar.

Are these changes tested?

Yes – a new test has been added alongside the existing tests for file scan repartioning in datafusion/core/src/datasource/file_format/csv.rs.

Are there any user-facing changes?

  • Breaking: Add public datafusion::common::config::CatalogOptions::newlines_in_values: bool field, default: false.
  • Breaking: Add public datafusion::common::config::CsvOptions::newlines_in_values: Option<bool> field, default: None.
  • Breaking: Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: bool field, default: false.
  • Breaking: Add newlines_in_values: bool argument to datafusion::datasource::physical_plan::CsvExec::new.
  • Add public datafusion::common::config::CsvOptions::with_newlines_in_values method.
  • Add public datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values method.
  • Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values method.
  • Add public datafusion::datasource::physical_plan::CsvExec::newlines_in_values method.
  • Add newlines_in_values to relevant proto files.

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.
@github-actions github-actions bot added the core Core DataFusion crate label Jul 18, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @connec -- this PR looks very nice to me (and a very nice first contribution🏅)

I think the only thing missing is an end to end test showing that with this change DataFusion can actually read CSV files with newlines in them

Here are the instructions for such end to end tets https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Maybe you could add an example data file in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/data

And the write a test that defined the external table with the appropriate options and read from it

Ideally you should be able to follow the model of one of the existing test files, perhaps

# create_external_table_with_quote_escape
statement ok
CREATE EXTERNAL TABLE csv_with_quote (
c1 VARCHAR,
c2 VARCHAR
) STORED AS CSV
LOCATION '../core/tests/data/quote.csv'
OPTIONS ('format.quote' '~',
'format.delimiter' ',',
'format.has_header' 'true');

cc @2010YOUY01

@@ -1665,6 +1670,14 @@ impl CsvOptions {
self
}

/// Set true to ensure that newlines in (quoted) values are supported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Set true to ensure that newlines in (quoted) values are supported.
/// Set true to support newlines in (quoted) values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phrasing was a bit awkward initially as I was thinking of the flag as "ensuring support", rather than directly "supporting", since newlines in values are already supported if the file is below the datafusion.optimizer.repartition_file_min_size.

I'm not sure if it's worth conveying that detail through these docs, or else to document this as providing support and treat the fact that newlines in values will "just work" for smaller files as an implementation detail that might change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've expanded and normalised the documentation. I've gone for documenting the flag as enabling support for newlines in values, with the fact that newlines in values might work without it left as an implementation-defined detail.

@@ -233,6 +233,14 @@ impl CsvFormat {
self
}

/// Set true to ensure that newlines in (quoted) values are supported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Set true to ensure that newlines in (quoted) values are supported.
/// Set true to support newlines in (quoted) values.

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the new feature support! It is a great UX improvement given 'newlines in quotes' is defined in CSV standard

I also find this PR's design is good and very easy to read, I think it's good to go with extra integration tests for CSVs with newlines_in_values fields as @alamb mentioned.

Comment on lines 188 to 189
/// Default value for `format.newlines_in_values` for `CREATE EXTERNAL TABLE`
// if not specified explicitly in the statement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more descriptive comment help for users without context to understand this config option, which should include 1. It's CSV specific 2. Can be overridden by the same config field in CsvOptions 3. Its behavior as stated in CsvOptions's comment.

And update in https://github.com/apache/datafusion/blob/main/docs/source/user-guide/configs.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've expanded the documentation. I forgot about the user guide so I'll do that now as well.

@2010YOUY01
Copy link
Contributor

I'm trying to provide more context for the rationale for this PR's implementation:

Currently, DataFusion's parallel CSV scan will first split the file range evenly, and adjust the range border by probing \n forward.
It's super hard to support newliens_in_values in parallel scan in this way: executor have to probe backwards to determine if the current field is qutoed, and it don't know when to stop.

So this PR's approach (explicitly adding one CsvOption) is reasonable, maybe in the future we can parse the whole file during planning to find out the correct split points, if it can help performance in some use cases.

@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

Thanks both for the rapid feedback. I'll take a look and update the PR.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 19, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 19, 2024
@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

I've made some documentation updates and added/fixed sqllogictests.

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I would personally suggest a builder pattern like

let csv_exec = CsvExec::builder()
  .with_has_header(true)
  .with_newlines_in_values(true)
  ...
  .build()?

Following the model of https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExecBuilder.html

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @connec -- this looks good to me 👍

I think once we have the CI passing this PR is good to go from my perspective.

@@ -0,0 +1,13 @@
id,message
1,"hello
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

I think that should be the fixes needed for CI.

@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I filed #11565 to track improving the clippy lint

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @connec and @2010YOUY01 👌

@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

🤔 the slt test seems to be failing on windows (likely due to newline nonsense).

https://github.com/apache/datafusion/actions/runs/10014892577/job/27695257736

We should probably figure out a way to not run that test on windows. Gah!

Is this something you have time to look at @connec ? If not, I think we could comment out the slt test to get this PR in and then sort out the testing as a follow on PR.

Note to reduce the feedback loop cycle time, I merged this branch to main and pushed the commit (which I think means the CI should automatically run now on new commits). Once we have merged this PR I think your subsequent PRs will also have the CI run automatically.

@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

I'd guess it might be related to there being a \r on windows? I have some time to look so I'll see if I can resolve it.

Edit: ran out of time today, but I can continue tomorrow.

@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

Unfortunately it seems like the workflows still need approval @alamb.

I'm not entirely sure if .gitattributes change will fix the slts (I don't have a Windows system to test on) but 🤞 I've got ahold of a windows system and this hasn't worked. I'll do some more debugging when I can.

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

The issue is caused by the line endings in the newlines_in_values.csv file being converted to CRLF on windows. While either line ending is accepted as a CSV record terminator by default, newlines in values are not normalised at all. This means that the value returned from the query on Windows includes CRLF rather than LF. Since newlines in the .slt file itself are normalised, this then leads to the diff.

For now, I've added a .gitattributes entry to checkout the newlines_in_values.csv file with LF line endings always, which resolves the problem on my windows machine at least.

@alamb
Copy link
Contributor

alamb commented Jul 21, 2024

🚀
Thanks again @connec

@alamb alamb merged commit 2587df0 into apache:main Jul 21, 2024
25 checks passed
@connec connec deleted the csv-newlines-in-values branch July 22, 2024 07:23
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* feat!: support `newlines_in_values` CSV option

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.

* docs: normalise `newlines_in_values` documentation

* test: add/fix sqllogictests for `newlines_in_values`

* docs: document `datafusion.catalog.newlines_in_values`

* fix: typo in config.md

* chore: suppress lint on too many arguments for `CsvExec::new`

* fix: always checkout `*.slt` with LF line endings

This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.

* fix: always checkout `newlines_in_values.csv` with `LF` line endings

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.

---------

Co-authored-by: Andrew Lamb <[email protected]>
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* feat!: support `newlines_in_values` CSV option

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.

* docs: normalise `newlines_in_values` documentation

* test: add/fix sqllogictests for `newlines_in_values`

* docs: document `datafusion.catalog.newlines_in_values`

* fix: typo in config.md

* chore: suppress lint on too many arguments for `CsvExec::new`

* fix: always checkout `*.slt` with LF line endings

This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.

* fix: always checkout `newlines_in_values.csv` with `LF` line endings

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for newlines_in_values to CsvOptions
3 participants