Add support for `newlines_in_values` to `CsvOptions` #11472

connec · 2024-07-15T11:06:50Z

Is your feature request related to a problem or challenge?

I'm trying to read CSVs that include newlines in (quoted) values.

Describe the solution you'd like

Some googling revealed that this isn't supported currently by the arrow-csv crate, whereas that functionality does exist in the C++ (ParseOptions::newlines_in_values) and Python (ParseOptions.newlines_in_values) implementations.

Ideally, a newlines_in_values field could be added to datafusion::common::config::CsvOptions to support this functionality.

Note that the Python docs call out the performance implications of this:

Setting this to True reduces the performance of multi-threaded CSV reading.

I haven't dug into the implementation, but I imagine it becomes harder to find the right split point for multi-threaded reading (though, it seems not dissimilar to finding the prev/next linebreak, so perhaps not insurmountable...).

Describe alternatives you've considered

The only alternative I can see would be to preprocess the CSV before feeding it into DF. I haven't explored this option as I imagine it would take a lot of DF plumbing, and it seems valuable to have parity with other arrow CSV packages (C++ and Python, at least).

Additional context

I was originally planning to report this against the arrow-rs repository, but since my use-case is with datafusion I decided to report it here. Let me know if this issue would be more appropriate there and I can move/copy it.

The text was updated successfully, but these errors were encountered:

2010YOUY01 · 2024-07-16T13:00:46Z

I haven't dug into the implementation, but I imagine it becomes harder to find the right split point for multi-threaded reading

That is correct, DataFusion parallel CSV scan does:

For example it gets a 1000B CSV file and wants to read in 2 threads.
File is split into [0, 500), [500, 1000) two ranges, then probe the actual line boundary and adjust the range
Let arrow-rs handle reading a valid file byte range.

So I think the first step will be to add arrow support

connec · 2024-07-16T13:46:57Z

I see, so an approach could be to:

~~Add newlines_in_values: bool to arrow_csv::reader::Format. The implementation could use this to consume terminators within quotes.~~ arrow-csv operates in this way by default.
Add newlines_in_values: bool to datafusion::config::CsvOptions (and datafusion::datasource::file_format::csv::CsvFormat). The implementation could use this in two ways:
1. ~~Pass the newlines_in_values flag on to csv-arrow.~~
2. Disable parallelism in CSV scanning (I don't think it would be possible to reliably determine if you're inside a quoted string without seeing the whole file).~~

connec · 2024-07-18T08:51:25Z

Turns out arrow-csv does correctly parse newlines in quoted values, so the issue comes from reading CSVs in parallel. Limiting the target number of partitions to 1 solves the issue.

For example, if I add this to my code before I run ctx.sql("SELECT * FROM <csv including linebreaks>") then the query executed successfully (without it there are errors about wrong numbers of columns):

ctx.state_weak_ref()
    .upgrade()
    .unwrap()
    .write_arc()
    .config_mut()
    .options_mut()
    .execution
    .target_partitions = 1;

From a UX perspective, this setting is quite disconnected from the intent. It also required changing the session-level target_partitions setting, which (I assume, untested) would adversely affect reads of CSVs that might not be expected to contain linebreaks. A per-CSV newlines_in_values setting would help to signpost this situation and allow more efficient plans when only some executed CSVs have this requirement.

2010YOUY01 · 2024-07-18T10:30:07Z

Turns out arrow-csv does correctly parse newlines in quoted values, so the issue comes from reading CSVs in parallel. Limiting the target number of partitions to 1 solves the issue.

Thank you for the investigation @connec , if that's the case, setting datafusion.optimizer.repartition_file_scans to false might help: https://datafusion.apache.org/user-guide/configs.html
IIRC this option will only limit file scan to one thread, other parts of the plan will still be parallelized using target_partitions

connec · 2024-07-18T10:56:45Z

Thanks for the pointer! That would certainly be better than disabling parallelization throughout.

Do you think it would still be desirable to be able to turn off file-level partioning for CSVs only (and/or to vary it on a per-type basis)?

I'm not familiar enough with Parquet to understand if there would be a functional reason to disable it for that format?

connec · 2024-07-18T10:59:43Z

Furthermore, I think there could be value in controlling this on a per-file basis. E.g. if you're working with multiple CSVs, one of which you know contains newlines in values and others that you know do not (and so can benefit from parallelisation).

2010YOUY01 · 2024-07-18T12:07:32Z

Do you think it would still be desirable to be able to turn off file-level partioning for CSVs only (and/or to vary it on a per-type basis)?

I'm not familiar enough with Parquet to understand if there would be a functional reason to disable it for that format?

Yes, using the global option is only a quick fix.

I think it's a good idea to add an extra option field for individual files, and later the planner can make better decisions based on per file's newlines_in_values property (like for now disable parallel execution)

And it can be CSV-specific, parquet seems not relevant here.

connec · 2024-07-18T14:23:45Z

I've opened a PR with the most straightforward implementation of this I could think of. I'd be glad to receive any feedback on that approach.

connec added the enhancement New feature or request label Jul 15, 2024

connec mentioned this issue Jul 18, 2024

Support newlines_in_values CSV option #11533

Merged

alamb closed this as completed in #11533 Jul 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for `newlines_in_values` to `CsvOptions` #11472

Add support for `newlines_in_values` to `CsvOptions` #11472

connec commented Jul 15, 2024

2010YOUY01 commented Jul 16, 2024

connec commented Jul 16, 2024 •

edited

Loading

connec commented Jul 18, 2024

2010YOUY01 commented Jul 18, 2024 •

edited

Loading

connec commented Jul 18, 2024

connec commented Jul 18, 2024

2010YOUY01 commented Jul 18, 2024

connec commented Jul 18, 2024

Add support for newlines_in_values to CsvOptions #11472

Add support for newlines_in_values to CsvOptions #11472

Comments

connec commented Jul 15, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

2010YOUY01 commented Jul 16, 2024

connec commented Jul 16, 2024 • edited Loading

connec commented Jul 18, 2024

2010YOUY01 commented Jul 18, 2024 • edited Loading

connec commented Jul 18, 2024

connec commented Jul 18, 2024

2010YOUY01 commented Jul 18, 2024

connec commented Jul 18, 2024

Add support for `newlines_in_values` to `CsvOptions` #11472

Add support for `newlines_in_values` to `CsvOptions` #11472

connec commented Jul 16, 2024 •

edited

Loading

2010YOUY01 commented Jul 18, 2024 •

edited

Loading