Spark3 sort compaction #2829
Conversation
.format("iceberg")
.option(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID, groupID)
.option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, writeMaxFileSize())
.mode("append")
@aokolnychyi :)
Anton worried about this naming when we designed this special write task. It doesn't actually append to the table in this action; instead, it stages these files to be committed in the rewrite operation performed by the base action. So the "append" here really just means "write these files but don't add them to the table". This is triggered by the REWRITTEN_FILE_SCAN_TASK_SET_ID parameter, which causes us not to use the normal write path. Sorry it's pretty weird, but we are basically trying to hijack the existing Spark write mechanism while adding a side effect to track which files were written.
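Roughly, the flow described here looks like the sketch below. Only the options come from the diff above; the DataFrame variable, the save target, and the commit step are assumptions for illustration.
// 1) "Write" the sorted data, but stage the new files under a group ID instead of committing them:
sortedDf.write()
    .format("iceberg")
    .option(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID, groupID)  // switches to the staged write path
    .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, writeMaxFileSize())
    .mode("append")   // no snapshot is created here; the files are only tracked under groupID
    .save(tableName); // hypothetical target name

// 2) The base rewrite action later looks up the files staged for groupID and commits them,
//    together with the removal of the rewritten files, as a single rewrite operation.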
Force-pushed from 2297ba8 to 3a62d78
build.gradle
Outdated
apply plugin: 'scala'

sourceSets {
  main {
@rdblue never mind, this IS required if you want to do cross-compilation (have Java code depend on Scala code in the same module).
Otherwise the Java sources are compiled before the Scala sources, without the Scala classes available.
Do we need the Java plugin if we are using Scala here? I'm wondering why we were using both in the past. Maybe we only had Scala depending on Java code.
Yeah, the other direction is fine. We don't have the Java plugin set for this project, but the Scala plugin does extend the Java one.
if (task.file().specId() == table.spec().specId()) {
  return task.file().partition();
} else {
  return EmptyStruct.instance();
Tasks composed of data that is not partitioned according to the current spec must be treated as if they were not partitioned. We can probably ease this restriction for partitioning that satisfies the current partitioning (i.e. the table is set to partition by day but this is an hour partition), but this is the simplest approach for now.
@aokolnychyi + @rdblue Now that we've moved the distribution and sort utils, we can get this implementation in as well :) Please review when you have a chance
@Override
public <T> void set(int pos, T value) {
Should this throw an exception because you're trying to write to a slot that doesn't exist?
@Override
public <T> T get(int pos, Class<T> javaClass) {
  return null;
Should this throw an exception?
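A minimal sketch of what throwing here could look like (the exception type and messages are only suggestions):
static class EmptyStruct implements StructLike {
  @Override
  public int size() {
    return 0;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    // an empty struct has no positions, so any read is a bug in the caller
    throw new UnsupportedOperationException("Empty struct has no fields to get");
  }

  @Override
  public <T> void set(int pos, T value) {
    // likewise, writing to a slot that doesn't exist should fail loudly
    throw new UnsupportedOperationException("Empty struct has no fields to set");
  }
}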
private static EmptyStruct instance;

static EmptyStruct instance() {
For most singletons, we use public static EmptyStruct get() and do the initialization at class init time to avoid a race condition:
private static final EmptyStruct INSTANCE = new EmptyStruct();
static EmptyStruct get() {
return INSTANCE;
}

}

protected LogicalPlan sortPlan(Distribution distribution, SortOrder[] ordering, long numOutputFiles,
    LogicalPlan plan, SQLConf conf) {
We'd normally wrap the line to start at the same position as Distribution.
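For example, something like this, with the continuation indented to the same column as Distribution:
protected LogicalPlan sortPlan(Distribution distribution, SortOrder[] ordering, long numOutputFiles,
                               LogicalPlan plan, SQLConf conf) {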
if (task.file().specId() == table.spec().specId()) {
  return task.file().partition();
} else {
  return EmptyStruct.get();
What was the failure? Is our hash or equals function not correct for empty structs? Should this use a StructLikeMap instead?
The issue is for files with the old partitioning; if we have another way of making an empty struct here, that's fine. I can't remember why I chose to make a new class here since it was a while ago now.
The core issue is:
Say we originally have a table partitioned by Bucket(x, 5), meaning the original data files were written with values of x more or less randomly distributed across them. Then the table's partitioning is changed to something like Bucket(x, 10). The worst case is that when we rewrite, we end up having to make 10 files for every partition in the original bucketing. This approach just assumes all the files created with the old partitioning are best dealt with at the same time, rather than splitting them up using the old partitioning.
Okay, so all data where we're changing the partitioning goes into a separate task. That makes sense to me. Probably worth a comment here to explain it.
You may also want to change how this works slightly and use a StructLikeMap. Otherwise, EmptyStruct looks a lot like structs for the unpartitioned spec. It would or wouldn't be equal depending on the order of the comparison, which probably isn't what you want.
Instead, I would keep a StructLikeMap for the current partitioning scheme and then create a separate group using a List for the rest.
I'm not quite sure I understand what you are describing here as a fix.
Actually, I think I see now. Should have this in a bit.
StructLike makes no guarantees about equals/hashCode behavior, so using it as a map key is like using a CharSequence as a map key. Probably not a good idea because it will break if the underlying implementation changes or differs. I'd recommend using StructLikeMap that handles consistent behavior. But that requires using a specific struct type. Since you really only need the table spec's struct type, you can use that. Then keep any tasks that are not in the current table spec in a list, like this:
StructLikeMap<FileScanTask> filesByPartition = StructLikeMap.create(table.spec().partitionType());
List<FileScanTask> tasksFromOtherSpecs = Lists.newArrayList();
Streams.stream(fileScanTasks).forEach(task -> {
if (task.file().specId() != table.spec().specId()) {
tasksFromOtherSpecs.add(task);
} else {
filesByPartition.put(task.file().partition(), task);
}
});
The only issue I can think of here is that it doesn't fit well with the rest of the code, which assumes all FileGroups have an associated partition value. I think empty is the right thing to pass, since we are basically treating those files as unpartitioned.
Trying something like
StructLikeMap<List<FileScanTask>> filesByPartition = StructLikeMap.create(table.spec().partitionType());
StructLike emptyStruct = GenericRecord.create(table.schema());
fileScanTasks.forEach(task -> {
/*
If a task uses an incompatible partition spec the data inside could contain values which
belong to multiple partitions in the current spec. Treating all such files as un-partitioned and
grouping them together helps to minimize new files made.
*/
StructLike taskPartition = task.file().specId() == table.spec().specId() ?
task.file().partition() : emptyStruct;
List<FileScanTask> files = filesByPartition.get(taskPartition);
if (files == null) {
files = Lists.newArrayList();
}
files.add(task);
filesByPartition.put(taskPartition, files);
});
Yeah, you may need to pass an empty struct through the rest of the code. But it is probably a good idea not to use StructLike as a map key.
.sort()
.option(SortStrategy.REWRITE_ALL, "true")
.option(RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Integer.toString(fileSize * 2 + 100))
.option(BinPackStrategy.MIN_INPUT_FILES, "1")
Why use BinPackStrategy.MIN_INPUT_FILES with a sort strategy? Should this be moved or renamed?
Looks like the next test uses SortStrategy.MIN_INPUT_FILES.
I guess this isn't needed now that I added REWRITE_ALL. This can be removed now.
I forgot that I set REWRITE_ALL to only apply to filtering and not to the group limits; I should make it apply to both.
}

@Override
protected SortStrategy sortStrategy() {
This is a little strange to me. It feels more like a createSortStrategy(), but I think it's fine since it isn't exposed.
Yeah, this is just copying what we did for binPack above, but we can change them both if needed.
public class Spark3SortStrategy extends SortStrategy {

  public static final String SIZE_ESTIMATE_MULTIPLE = "size-estimate-multiple";
It isn't clear to me from the name what this does.
Yeah, let me work on that naming/documentation. This comes from a discussion with @aokolnychyi where we were trying to figure out how to deal with the shuffle size estimation and give users a way to bail out of the automatic sizing. Let me go back to our notes and fix the name here. This is meant to be an advanced-user escape valve.
For our in-house implementation of this compaction strategy, the advanced-user bailout approach used "rows per file". That was largely stable and mostly worked, but it broke down in the face of extensive schema changes (file size grew with column growth).
I added in a description; we can discuss alternatives with that context. The current difficulty is that at this point in time we only know input file sizes, and our knob for adjusting output size is the number of shuffle partitions we create.
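As a concrete illustration of that knob (a sketch with made-up numbers; only inputFileSize and sizeEstimateMultiple correspond to names used in the surrounding code):
// Illustrative sizing math: estimate output size from input size, then derive the
// number of output files and use that as the shuffle partition count.
long inputSize = 100L * 1024 * 1024 * 1024;      // ~100 GiB of input data files (made up)
double sizeEstimateMultiple = 1.0;               // escape valve for bad size estimates
long targetFileSize = 512L * 1024 * 1024;        // desired output file size (made up)

long estimatedOutputSize = (long) (inputSize * sizeEstimateMultiple);
long numOutputFiles = (estimatedOutputSize + targetFileSize - 1) / targetFileSize;  // rounds up to 200
long numShufflePartitions = numOutputFiles;      // one range partition per output file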
public static final String SIZE_ESTIMATE_MULTIPLE = "size-estimate-multiple";

public static final String SHUFFLE_TASKS_PER_FILE = "shuffle-tasks-per-file";
Per input file? Basically the default is to balance the data across the same number of files? Seems like we would prefer to try to output totalInputSize / tableTargetFileSize.
Removed this since we don't have the output task combiner code in this PR.
Should this constant also be removed then?
public RewriteStrategy options(Map<String, String> options) {
  sizeEstimateMultiple = PropertyUtil.propertyAsDouble(options,
      SIZE_ESTIMATE_MULTIPLE,
      1.0);
What about "compression factor" or something similar? That seems like what you mean by this.
    1.0);

Preconditions.checkArgument(sizeEstimateMultiple > 0,
    "Cannot use Spark3Sort Strategy without %s being positive, found %s",
I think this should be simpler. How about "Invalid size estimation factor: %s (not positive)".
boolean requiresRepartition = !filesToRewrite.get(0).spec().equals(table.spec());
SortOrder[] ordering;
Distribution distribution;
ordering = Spark3Util.convert(sortOrder());
Why not set the default in the variable declaration?
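i.e. something like this (a sketch of the suggestion):
SortOrder[] ordering = Spark3Util.convert(sortOrder());  // default set at declaration
Distribution distribution;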
    Arrays.stream(Spark3Util.buildRequiredOrdering(distribution, table())),
    Arrays.stream(ordering)).toArray(SortOrder[]::new);
} else {
  distribution = Distributions.ordered(ordering);
I don't quite understand the logic here. Won't an ordered distribution result in a repartition as well?
It's the difference between running a repartition with the partition columns included in the ordering, or just using the requested sort columns. We can probably combine the logic here.
Why would we not use the partition columns in the sort? It seems to me that we want to always go through the utility methods to build the distribution and sort.
I think that sounds good. I think the only reason I had this here is that I have another PR that allows compacting between different specs for BinPack as well, and it used basically this same branching logic. For sort I think you are correct and we are fine always including the partition columns ... I don't think this would affect the range partitioner even if all the values are the same.
Did you want to make this change, or leave it as is?
// Reset Shuffle Partitions for our sort
long numOutputFiles = numOutputFiles((long) (inputFileSize(filesToRewrite) * sizeEstimateMultiple));
long numShufflePartitions = numOutputFiles * shuffleTasksPerFile;
Why increase the number of tasks? Won't that cause more output files than numOutputFiles to be created?
Sorry, this is part of our internal implementation with an additional feature that @aokolnychyi cooked up. We add a special combiner that allows multiple shuffle tasks to end up writing to the same file; this helped us when users had an especially large target size but couldn't handle a shuffle with outputs that large.
Ah, something like keeping files open across tasks in your internal version? Do you plan to remove this from open source, or is this something we need to keep so you don't have too many deltas from master?
I think we want to bring the combiner into OSS as well, but I believe it requires some additional Catalyst extensions. I'll remove it for now, and @aokolnychyi and I can discuss whether to bring that feature into OSS. I think the only issue is that we would have to add more Spark-version-dependent Scala code.
Let's omit it for now then. No sense in adding it if it won't work, right?
protected LogicalPlan sortPlan(Distribution distribution, SortOrder[] ordering, LogicalPlan plan, SQLConf conf) {
  Seq<SortOrder> scalaOrder = JavaConverters.asScalaBuffer(Lists.newArrayList(ordering));
  return DistributionAndOrderingUtils$.MODULE$.prepareQuery(distribution, scalaOrder, plan, conf);
Is it possible to use a List or Array for order rather than creating a Seq and passing that? Using Seq in our APIs can cause compatibility problems across Scala versions. It would be better to pass everything as an Array.
We are using this as a Seq in our Scala code. I can change the API to always use an array, but I'm not sure that helps us if we have other library code using Seq?
We can pass Array into a method that requires a Seq in Scala, right?
I meant that our SparkExtensions will still have the Scala version incompatibility. I switched this to an array here though, just in case.
@rdblue + @adamkennedy Changes applied; if you have time, please take another look.
Force-pushed from 71e4c43 to f6691b7
Force-pushed from 20317b1 to 3003f38
@Override
public RewriteDataFiles sort(SortOrder sortOrder) {
  this.strategy = sortStrategy().sortOrder(sortOrder);
Should we add a check that this is only replacing the default binPack strategy? Right now, we would allow calling binPack().sort(SortOrder...build()).execute(), which would be strange.
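A rough sketch of such a guard (assuming Guava's Preconditions, and assuming the strategy field stays null until a strategy is explicitly chosen, with the binPack default applied later):
private RewriteStrategy strategy = null;  // null means "use the default binPack strategy"

@Override
public RewriteDataFiles sort(SortOrder sortOrder) {
  // reject mixing strategies like binPack().sort(...); only one explicit choice is allowed
  Preconditions.checkArgument(strategy == null,
      "Cannot set strategy to sort, it has already been set to %s", strategy);
  this.strategy = sortStrategy().sortOrder(sortOrder);
  return this;
}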
If a task uses an incompatible partition spec the data inside could contain values which
belong to multiple partitions in the current spec. Treating all such files as un-partitioned and
grouping them together helps to minimize new files made.
*/
Style: we typically use // even for multi-line comments.
}
}

static class EmptyStruct implements StructLike {
Looks like this is no longer needed?
if (requiresRepartition) {
  distribution = Spark3Util.buildRequiredDistribution(table);
  ordering = Stream.concat(
      Arrays.stream(Spark3Util.buildRequiredOrdering(distribution, table())),
This is going to use the table order if it is RANGE partitioned and has a table ordering. That will override the given ordering from the action config.
I think you should do something similar to Spark3Util.buildRequiredDistribution but pass around the action's sortOrder() rather than using the table's sort order. Also, the mode would always be range, so you can simplify it to just this:
ordering = Spark3Util.convert(SortOrderUtil.buildSortOrder(table.schema(), table.spec(), sortOrder()))
distribution = Distributions.ordered(ordering);
Good call, refactored this
Implements Spark3 sort-based compaction. Uses logic similar to the Spark3BinPackStrategy, but instead of doing a direct read then write, it issues a read, a sort, and then a write.
Force-pushed from ab9756c to 862c4c1
* Fix Sort Strategy when custom Sort Order is provided but table requires repartitioning
java.srcDirs = []
}
}
Why is this needed now? I thought that everything built just fine.
It only builds fine if you aren't accessing Scala code from Java code. It works fine in the other direction, see
#2829 (comment)
Okay, so this was working fine because we hadn't accessed Scala from Java yet. Thanks!
*/
public static final String COMPRESSION_FACTOR = "compression-factor";

public static final String SHUFFLE_TASKS_PER_FILE = "shuffle-tasks-per-file";
Looks like this isn't used. Can we remove it?
Yep, will remove it.
rdblue left a comment
Looks good overall, but I'm curious about the build change and the unused config property.
Thanks @rdblue for the review!
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR do?
Merges upstream/master to pull in recent bug fixes and optimizations.
## What does this MR change?
Key PRs of interest:
> Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: writing an empty dataset raised an error; just skip the write, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs in the UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: support reading and writing timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots option, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id to snapshots (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: update to 1.12.0, apache#2441
> Hive: vectorized ORC reads for Hive, apache#2613
> Spark: add an action to remove all referenced files, apache#2415
## How was this MR tested?
UT
WIP: based on other PRs that we need to get in first
#2820
#2828