Avoid checking isSplittable for files smaller than the split max size #14877
Will be more straightforward to add a test after #14866 is reviewed and merged. |
|
Marked as draft while discussions about max split size vs max initial split size are ongoing here |
For some input formats, the isSplittable check is non-trivial and can add a significant amount of time to split generation when handling a large number of very small files. This change lets files smaller than the max initial split size skip that check and treats them as unsplittable instead.
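A minimal sketch of the short-circuit described above, for illustration only. The class name, method names, and the counter are hypothetical and are not taken from the actual change; `expensiveIsSplittable` is a stand-in for whatever check a given input format performs.

```java
// Illustrative sketch only: all names here are hypothetical, not the PR's code.
final class SplittableCheckSketch {
    // Counts how often the stand-in "expensive" check actually runs.
    static int expensiveChecks = 0;

    // Stand-in for an input format's potentially costly isSplittable check.
    // Assumption for the demo: gzip files are treated as unsplittable.
    static boolean expensiveIsSplittable(String path) {
        expensiveChecks++;
        return !path.endsWith(".gz");
    }

    // Files smaller than the max initial split size are treated as
    // unsplittable without consulting the input format at all.
    static boolean isSplittable(String path, long fileSizeBytes, long maxInitialSplitSizeBytes) {
        if (fileSizeBytes < maxInitialSplitSizeBytes) {
            return false;
        }
        return expensiveIsSplittable(path);
    }

    public static void main(String[] args) {
        long maxInitialSplitSize = 32L * 1024 * 1024; // arbitrary value for the demo
        System.out.println(isSplittable("small.txt", 1024L, maxInitialSplitSize));
        System.out.println(isSplittable("large.txt", 64L * 1024 * 1024, maxInitialSplitSize));
        System.out.println("expensive checks run: " + expensiveChecks);
    }
}
```

With many tiny files, only the size comparison runs per file, so the cost of the format-specific check is paid only for files large enough to be worth splitting.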
5edac27 to 8c7b2e6
|
No longer marked as draft; we settled on using the initial split size instead of the max split size in the other PR. |
mbasmanova left a comment
Thank you. CC: @jainxrohit
|
@pettyjamesm @mbasmanova
The isSplittable call gets skipped for OrcInputFormat, so this check would be useful for other file formats. Can we clarify that in the release notes? |
The check is skipped regardless of input format when the file is smaller than the initial split size. |
Correct. I was trying to say the check was always skipped for HiveInputFormat. |
|
Got it, I've reworded the release notes section. Let me know if you'd like another phrasing. |
For some input formats, the isSplittable check is non-trivial and can add a significant amount of time to split generation. This change lets files smaller than the max split size skip that check and simply treats them as unsplittable, since they are already within the split target range.