[SPARK-50616][SQL] Add File Extension Option to CSV DataSource Writer #49233

jabbaugh · 2024-12-18T18:42:45Z

What changes were proposed in this pull request?

The existing CSV DataSource allows one to set the delimiter/separator but does not allow changing the file extension. This means that a file can have values separated by tabs but be marked as a ".csv" file. This change allows one to change the file extension to match the delimiter/separator (e.g. ".tsv" for a tab-separated value file).

Why are the changes needed?

This PR adds an option, "fileExtension", to set the file extension. The end result is that when the separator is not a comma, the output file can have a file extension that matches the separator (e.g. file.tsv, file.psv, etc.).
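As a rough illustration of how a caller might pair the separator with a matching extension, here is a minimal sketch in plain Python. The helper `csv_writer_options` and the separator-to-extension table are hypothetical and not part of Spark. The option name is passed as a parameter and defaults to the "fileExtension" name used in this description, on the assumption that the final name is settled in review.

```python
# Hypothetical helper, not part of Spark: choose a file extension that
# matches the separator and build the options for the CSV writer.
SEPARATOR_EXTENSIONS = {",": "csv", "\t": "tsv", "|": "psv"}


def csv_writer_options(sep, option_name="fileExtension"):
    """Return writer options pairing `sep` with a matching extension.

    Separators without an entry in the table fall back to "csv",
    which this description states remains the default extension.
    """
    extension = SEPARATOR_EXTENSIONS.get(sep, "csv")
    return {"sep": sep, option_name: extension}


# Usage with an existing PySpark DataFrame `df` (assumed, not created here):
#   df.write.options(**csv_writer_options("\t")).csv("/tmp/out")
```

Keeping the mapping on the caller's side reflects the point made in this description: the extension is a user choice, and nothing forces it to agree with the separator.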

Notes on Previous Pull Request #17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason to not let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char, then why not also let the user set the file extension to indicate the separator that the file uses (e.g. tsv, psv, etc.)? This addition keeps the "csv" file extension as the default and has the benefit of allowing the file extension to match other separators.

Does this PR introduce any user-facing change?

Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?

One unit test was added to validate a file is written with the new extension.
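The kind of check such a test performs can be sketched as follows. This is not the unit test added in this PR, only a standalone Python illustration. The helper name `data_files_have_extension` is hypothetical, and it assumes the writer names its data files with a `part-` prefix, alongside marker files such as `_SUCCESS`.

```python
from pathlib import Path


def data_files_have_extension(output_dir, extension):
    """Return True if every data file in `output_dir` ends with `.extension`.

    Assumption: data files start with "part-"; anything else (for example
    a `_SUCCESS` marker or hidden checksum files) is ignored. An output
    directory with no data files counts as a failure.
    """
    part_files = [
        p for p in Path(output_dir).iterdir()
        if p.is_file() and p.name.startswith("part-")
    ]
    return bool(part_files) and all(
        p.suffix == f".{extension}" for p in part_files
    )
```

A test would write a small DataFrame with the option set, then call the helper on the output directory and assert that it returns True.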

Was this patch authored or co-authored using generative AI tooling?

No

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala

dongjoon-hyun

Shall we use extension instead of fileExtension?

dongjoon-hyun

+1, LGTM. Thank you, @jabbaugh .
Merged to master for Apache Spark 4.0.0.

dongjoon-hyun · 2025-01-10T16:46:27Z

Congratulations on your first commit.

I added you, James Baug, to the Apache Spark contributor group and assigned SPARK-50616 to you.

Welcome to the Apache Spark community, @jabbaugh !

github-actions bot added SQL DOCS labels Dec 18, 2024

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch 2 times, most recently from cede15e to acd7d3e Compare December 18, 2024 21:18

github-actions bot added STRUCTURED STREAMING DEPLOY BUILD CORE WINDOWS INFRA PYTHON CONNECT and removed STRUCTURED STREAMING DEPLOY BUILD CORE WINDOWS INFRA PYTHON CONNECT labels Dec 18, 2024

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from eee47c5 to de3b891 Compare December 18, 2024 21:50

jabbaugh changed the title ~~Add File Extension Option to CSV DataSource Writer~~ [SPARK-50616] Add File Extension Option to CSV DataSource Writer Dec 18, 2024

HyukjinKwon changed the title ~~[SPARK-50616] Add File Extension Option to CSV DataSource Writer~~ [SPARK-50616][SQL] Add File Extension Option to CSV DataSource Writer Dec 19, 2024

MaxGekk reviewed Dec 19, 2024

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from de3b891 to 2ff5f81 Compare December 20, 2024 17:31

github-actions bot added the CORE label Dec 20, 2024

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch 2 times, most recently from 3c63274 to ec7d8cc Compare December 20, 2024 22:11

github-actions bot removed the CORE label Dec 20, 2024

dongjoon-hyun reviewed Jan 7, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala

dongjoon-hyun reviewed Jan 7, 2025

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch 5 times, most recently from 42721f7 to f9b7b0a Compare January 9, 2025 23:10

jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from f9b7b0a to c9be504 Compare January 9, 2025 23:11

dongjoon-hyun approved these changes Jan 10, 2025

dongjoon-hyun closed this in f9cb80a Jan 10, 2025

jabbaugh deleted the jbaugh-add-csv-file-ext branch January 10, 2025 18:13
