@RussellSpitzer
Member


Adds a Spark 3 implementation for performing rewrites using the new
action API. Only Spark 3 is supported at the moment, and only with the BinPack strategy.
Breaks the execute method into multiple helper functions.
Adds option validation.
Splits execute into two code paths: complete progress and partial progress.
A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out
in relation to a column. For example, suppose the Sort strategy is used on a set of files to be ordered
by column x, and the original set contains File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60).
This strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40) and
File C' (x: 41 - 60).
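
The example above can be illustrated with a minimal sketch. This is not the Spark implementation from this PR: the class SortRewriteSketch and the method rewriteSorted are hypothetical names, each "file" is modeled as an in-memory list of x values, and the output split is a simple even division by row count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of the Iceberg API.
class SortRewriteSketch {

    // Merges the x values of all input files, sorts them globally, and
    // splits the result into `outputFiles` contiguous chunks so that the
    // value ranges of the output files no longer overlap.
    static List<List<Integer>> rewriteSorted(List<List<Integer>> inputFiles, int outputFiles) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> file : inputFiles) {
            all.addAll(file);
        }
        Collections.sort(all);

        List<List<Integer>> out = new ArrayList<>();
        int n = all.size();
        for (int i = 0; i < outputFiles; i++) {
            // Even split by row count (an assumption made for this sketch).
            int from = (int) ((long) n * i / outputFiles);
            int to = (int) ((long) n * (i + 1) / outputFiles);
            out.add(new ArrayList<>(all.subList(from, to)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> files = List.of(
            List.of(0, 25, 50),   // File A, x: 0 - 50
            List.of(10, 40),      // File B, x: 10 - 40
            List.of(30, 60));     // File C, x: 30 - 60
        System.out.println(rewriteSorted(files, 3));
    }
}
```

The input ranges overlap, while each output chunk covers a disjoint range of x, which is the layout the strategy aims for.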

Currently there is no clustering detection, and all files are rewritten if {@link SortStrategy#REWRITE_ALL}
is true (default). If this property is disabled, any files with an incorrect sort order, as well as any files
that would be chosen by {@link BinPackStrategy}, will be rewrite candidates.
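
The selection rule described above could be sketched roughly as follows. Everything here is a hypothetical stand-in rather than the actual SortStrategy code: FileInfo, selectFiles, and the hasTableSortOrder flag are invented for illustration, and the BinPackStrategy choice is assumed to be a simple "file size outside a min/max range" check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SortStrategy implementation.
class SortCandidateSketch {

    // Minimal stand-in for a data file's metadata.
    static final class FileInfo {
        final String path;
        final long sizeBytes;
        final boolean hasTableSortOrder;

        FileInfo(String path, long sizeBytes, boolean hasTableSortOrder) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.hasTableSortOrder = hasTableSortOrder;
        }
    }

    // If rewriteAll is true, every file is a candidate. Otherwise a file is a
    // candidate when its sort order is incorrect, or when the (assumed)
    // bin-pack size check would pick it.
    static List<FileInfo> selectFiles(List<FileInfo> files, boolean rewriteAll,
                                      long minSizeBytes, long maxSizeBytes) {
        if (rewriteAll) {
            return new ArrayList<>(files);
        }
        List<FileInfo> candidates = new ArrayList<>();
        for (FileInfo file : files) {
            boolean wrongOrder = !file.hasTableSortOrder;
            boolean binPackWouldPick = file.sizeBytes < minSizeBytes || file.sizeBytes > maxSizeBytes;
            if (wrongOrder || binPackWouldPick) {
                candidates.add(file);
            }
        }
        return candidates;
    }
}
```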

In the future other algorithms for determining files to rewrite will be provided.