[SPARK-24244][SQL] Passing only required columns to the CSV parser #21296
Closed
21 commits
9cffa0f Adding tests for select only requested columns (MaxGekk)
fdbcbe3 Select indexes of required columns only (MaxGekk)
578f47b Fix the case when number of parsed fields are not matched to required… (MaxGekk)
0f942c3 Using selectIndexes if required number of columns are less than its t… (MaxGekk)
c4b1160 Fix the test: force to read all columns (MaxGekk)
8cf6eab Fix merging conflicts (MaxGekk)
5b2f0b9 Benchmarks for many columns (MaxGekk)
6d1e902 Make size of requiredSchema equals to amount of selected columns (MaxGekk)
4525795 Removing selection of all columns (MaxGekk)
8809cec Updating benchmarks for select indexes (MaxGekk)
dc97ceb Addressing Herman's review comments (MaxGekk)
51b3148 Updated benchmark result for recent changes (MaxGekk)
e3958b1 Add ticket number to test title (MaxGekk)
a4a0a54 Removing unnecessary benchmark (MaxGekk)
fa86015 Updating the migration guide (MaxGekk)
15528d2 Moving some values back as it was. (MaxGekk)
f90daa7 Renaming the test title (MaxGekk)
4d9873d Improving of the migration guide (MaxGekk)
7dcfc7a Merge remote-tracking branch 'origin/master' into csv-column-pruning (MaxGekk)
f89eeb7 Fix example (MaxGekk)
6ff6d4f Adding spark.sql.csv.parser.columnPruning.enabled (MaxGekk)
BTW, I think we are already doing column pruning by avoiding the casting cost, which is relatively expensive compared to the parsing logic.
You are right, avoiding unnecessary casting speeds things up by more than 2 times. We can see that in this benchmark before my changes: without the changes, selecting only one string column takes 44.5 seconds, but selecting all columns takes ~80 seconds.
As the benchmark shows, we can achieve performance improvements in parsing too. With this PR, selecting only 1 out of 1000 columns takes 22.5 seconds, but without the PR it takes 44.5 seconds:
8809cec
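The two kinds of pruning discussed above can be illustrated with a small sketch. This is not Spark's or the underlying CSV parser's code; it is a toy Python model, and the names `parse_all`, `parse_selected`, `read_with_cast_pruning`, and `read_with_parser_pruning` are hypothetical. It assumes a simple comma-separated format with no quoting or escaping. The first reader tokenizes every field and only skips the casts for unused columns; the second also stops tokenizing after the last required index.

```python
from typing import Any, Callable, List, Sequence

Cast = Callable[[str], Any]


def parse_all(line: str, sep: str = ",") -> List[str]:
    """Tokenize every field of the line (no quoting support in this toy model)."""
    return line.split(sep)


def parse_selected(line: str, wanted: Sequence[int], sep: str = ",") -> List[str]:
    """Tokenize only up to the last wanted index and return the wanted fields."""
    if not wanted:
        return []
    last = max(wanted)
    # Limit the number of splits so the tail of the line is never tokenized.
    fields = line.split(sep, last + 1)
    return [fields[i] for i in wanted]


def read_with_cast_pruning(
    lines: Sequence[str], casts: Sequence[Cast], wanted: Sequence[int]
) -> List[List[Any]]:
    """Tokenize all columns, but cast only the required ones."""
    rows = []
    for line in lines:
        tokens = parse_all(line)
        rows.append([casts[i](tokens[i]) for i in wanted])
    return rows


def read_with_parser_pruning(
    lines: Sequence[str], casts: Sequence[Cast], wanted: Sequence[int]
) -> List[List[Any]]:
    """Hand the required indexes to the tokenizer, then cast what it returns."""
    rows = []
    for line in lines:
        tokens = parse_selected(line, wanted)
        rows.append([casts[i](tok) for i, tok in zip(wanted, tokens)])
    return rows


if __name__ == "__main__":
    data = ["1,foo,2.5", "2,bar,3.5"]
    schema = [int, str, float]
    print(read_with_cast_pruning(data, schema, [1]))
    print(read_with_parser_pruning(data, schema, [1]))
```

Both readers return the same rows; the difference is only in how much work is done per line, which is the effect the benchmark numbers in this thread measure for the real parser.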