Support newlines_in_values CSV option #11533

Merged
merged 10 commits into from
Jul 21, 2024

Commits on Jul 18, 2024

  1. feat!: support newlines_in_values CSV option

    This significantly simplifies the UX when dealing with large CSV files
    that must support newlines in (quoted) values. By default, large CSV
    files will be repartitioned into multiple parallel range scans. This is
    great for performance in the common case, but when large CSVs contain
    newlines in values the parallel scan fails because it splits on newlines
    within quotes rather than on actual line terminators.
    
    With the current implementation, this behaviour can be controlled by the
    session-level `datafusion.optimizer.repartition_file_scans` and
    `datafusion.optimizer.repartition_file_min_size` settings.
    
    This commit introduces a `newlines_in_values` option to `CsvOptions` and
    plumbs it through to `CsvExec`, which includes it in the test for whether
    parallel execution is supported. This provides a convenient and
    searchable way to disable file scan repartitioning on a per-CSV basis.
    
    BREAKING CHANGE: This adds new public fields to types with all public
    fields, which is a breaking change.
    connec committed Jul 18, 2024
    5321e25
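
The failure mode described in this commit message can be illustrated with a minimal, standard-library-only sketch. It is not DataFusion's implementation: `split_naive` and `split_quote_aware` are hypothetical helpers, and the quote handling is deliberately simplified. The first function treats every newline as a record boundary, as a scan looking for the next line terminator would. The second tracks quote state from the start of the data, which a reader starting at an arbitrary byte offset cannot do.

```rust
/// Naive split: treats every '\n' as a record boundary, ignoring quotes.
/// (Hypothetical helper for illustration only.)
fn split_naive(data: &str) -> Vec<&str> {
    data.split('\n').filter(|r| !r.is_empty()).collect()
}

/// Quote-aware split: a '\n' ends a record only when we are outside quotes.
/// Simplified: assumes '"' is the quote character and that the input
/// starts at a record boundary. An escaped quote ("") toggles the state
/// twice, so it leaves the state unchanged.
fn split_quote_aware(data: &str) -> Vec<&str> {
    let mut records = Vec::new();
    let mut in_quotes = false;
    let mut start = 0;
    for (i, b) in data.bytes().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => {
                // Skip empty records (e.g. a trailing newline).
                if i > start {
                    records.push(&data[start..i]);
                }
                start = i + 1;
            }
            _ => {}
        }
    }
    // Keep a final record that has no trailing newline.
    if start < data.len() {
        records.push(&data[start..]);
    }
    records
}
```

Given a header, one record whose quoted value contains a newline, and one plain record, `split_naive` yields four fragments while `split_quote_aware` yields the three real records. Because the quote state depends on everything before the current position, a range scan that begins mid-file cannot tell the two kinds of newline apart, which is why the option disables repartitioning instead of trying to repair the split.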

Commits on Jul 19, 2024

  1. e05ca0e
  2. 9ca9065
  3. 34dcdb0
  4. fix: typo in config.md

    connec committed Jul 19, 2024
    8c2d98d
  5. ed0075d

Commits on Jul 20, 2024

  1. 356f46b
  2. fix: always checkout *.slt with LF line endings

    This is a bit of a stab in the dark, but it might fix multiline tests on
    Windows.
    connec committed Jul 20, 2024
    b9cc96b
  3. 4d06432
  4. fix: always checkout newlines_in_values.csv with LF line endings

    The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
    connec committed Jul 20, 2024
    35198b6
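
The line-ending problem described in the last commit message can be sketched in the same spirit. This only illustrates why a converted checkout changes the data; it is not the fix itself, which the commit describes as forcing LF line endings on checkout. Both `checkout_with_crlf` and `quoted_value` are hypothetical helpers, and the quote extraction is simplified to the first quoted field.

```rust
/// Simulates a checkout that rewrites every LF in the file to CRLF,
/// including the ones inside quoted values. (Hypothetical helper.)
fn checkout_with_crlf(lf_content: &str) -> String {
    lf_content.replace('\n', "\r\n")
}

/// Returns the contents of the first quoted field, without the quotes.
/// Simplified: does not handle escaped quotes.
fn quoted_value(record: &str) -> Option<&str> {
    let start = record.find('"')? + 1;
    let end = start + record[start..].find('"')?;
    Some(&record[start..end])
}
```

With the LF file, the quoted value read back contains a bare `\n`. After the simulated conversion it contains `\r\n`, so the value itself has changed. If the expected output is normalised to LF, as the commit message assumes, the two no longer match.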