Skip to content

Avoid checking isSplittable for files smaller than the split max size#14877

Merged
mbasmanova merged 1 commit intoprestodb:masterfrom
pettyjamesm:minimum-splittable-size
Jul 30, 2020
Merged

Avoid checking isSplittable for files smaller than the split max size#14877
mbasmanova merged 1 commit intoprestodb:masterfrom
pettyjamesm:minimum-splittable-size

Conversation

@pettyjamesm
Copy link
Contributor

@pettyjamesm pettyjamesm commented Jul 22, 2020

For some input formats, the isSplittable check is non-trivial and can add a significant amount of time to split generation. This change allows files smaller than the max split size to avoid that check and simply call them unsplittable since they're within the split target range already.

== RELEASE NOTES ==
Hive Changes
* Improves split generation by avoiding an unncessary splittable check when files are smaller than the initial split max size, regardless of their input format.

@pettyjamesm
Copy link
Contributor Author

Will be more straightforward to add a test after #14866 is reviewed and merged.

@pettyjamesm pettyjamesm marked this pull request as draft July 22, 2020 20:04
@pettyjamesm
Copy link
Contributor Author

Marked as draft while discussions about max split size vs max initial split size are ongoing here

For some input formats, the isSplittable check is non-trivial and can
add a significant amount of time to split generation when handling a
large number of very small files. This change allows files smaller
than the max initial split size to avoid that check and considers them
unsplittable instead.
@pettyjamesm pettyjamesm force-pushed the minimum-splittable-size branch from 5edac27 to 8c7b2e6 Compare July 27, 2020 19:32
@pettyjamesm pettyjamesm marked this pull request as ready for review July 27, 2020 19:33
@pettyjamesm
Copy link
Contributor Author

No longer marked as draft, settled on using initial split size instead of max split size in the other PR.

@pettyjamesm pettyjamesm requested a review from highker July 30, 2020 13:46
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. CC: @jainxrohit

@jainxrohit jainxrohit self-requested a review July 30, 2020 14:43
@mbasmanova mbasmanova merged commit 4110ec6 into prestodb:master Jul 30, 2020
@pettyjamesm pettyjamesm deleted the minimum-splittable-size branch July 30, 2020 15:27
@jainxrohit
Copy link
Contributor

@pettyjamesm @mbasmanova
On the release notes:

  • Improves split generation by avoiding an unncessary splittable check when files are smaller than the max split size

Is isSplittable call gets skipped for OrcInputFormat, so this check would be useful for other file formats. Can we clarify it in the release notes?

@pettyjamesm
Copy link
Contributor Author

@pettyjamesm @mbasmanova
On the release notes:

* Improves split generation by avoiding an unncessary splittable check when files are smaller than the max split size

Is isSplittable call gets skipped for OrcInputFormat, so this check would be useful for other file formats. Can we clarify it in the release notes?

The check is skipped regardless of input format when the file is smaller than the initial split size.

@jainxrohit
Copy link
Contributor

@pettyjamesm @mbasmanova
On the release notes:

* Improves split generation by avoiding an unncessary splittable check when files are smaller than the max split size

Is isSplittable call gets skipped for OrcInputFormat, so this check would be useful for other file formats. Can we clarify it in the release notes?

The check is skipped regardless of input format when the file is smaller than the initial split size.

Correct. I was trying to say the check was always skipped for HiveInputFormat.

@pettyjamesm
Copy link
Contributor Author

Got it, I've reworded the release notes section- let me know if you'd like another phrasing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants