Conversation

@RussellSpitzer commented Sep 3, 2020

This is the second of two WIPs for parallelizing Spark Read Job Planning

The other is located at #1420

To parallelize the creation of TableScanTasks, we use the
metadata tables to get a listing of DataFiles and do the filtering in
Spark before starting the scan job. Once the correct data files are
identified, scan tasks are created and returned.
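In rough strokes, the distributed path reads the table's entries metadata table as a Spark Dataset, filters it down to live data files, and only then builds scan tasks. A minimal sketch of that flow (illustrative only; the table and column names follow the Iceberg metadata tables and are not necessarily the exact code in this PR):

// Load the manifest-entries metadata table for the target table.
Dataset<Row> entries = spark.read()
    .format("iceberg")
    .load("db.table.entries");

// Keep only entries that are live and point at data files, so the pruning
// work runs on executors instead of the driver.
Dataset<Row> liveDataFiles = entries
    .filter("status != 2")               // 2 = DELETED manifest entry
    .filter("data_file.content = 0");    // 0 = data file (not a delete file)

// The matching rows are then turned back into DataFiles / FileScanTasks
// to produce the CombinedScanTasks that Spark reads.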

@RussellSpitzer force-pushed the SparkReadPlanParallelize branch from 569572d to 13296e4 on October 2, 2020 23:03
found = true;
fromProjectionPos[i] = j;
}
if (fields.get(i).fieldId() == ManifestFile.SPEC_ID.fieldId()) {
Member Author

These modifications allow BaseFile to translate into a Spark Row with the spec ID as a column.

private final SparkSession spark;
private final Snapshot snapshot;

private PlanScanAction(SparkSession spark, Table table, TableScan scan) {
Member Author

For type safety, the public entry points only accept valid scan types.
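A hypothetical sketch of that pattern (names are illustrative, not the PR's code): the constructor stays private and takes the general TableScan, while the public factory methods only accept scan types the action knows how to plan, such as a plain data table scan.

// Hypothetical factory method; only scans we can plan in a distributed fashion are accepted.
public static PlanScanAction create(SparkSession spark, Table table, DataTableScan scan) {
  return new PlanScanAction(spark, table, scan);
}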

public CloseableIterable<CombinedScanTask> execute() {
Map<String, String> options = ((BaseTableScan) scan).options();
TableMetadata meta = ((BaseTableScan) scan).tableOps().current(); // TODO maybe pass through metadata instead
long splitSize;
Member Author

This block is mostly copied directly from the existing configuration code elsewhere for setting up the task combiner.
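For reference, the setup it mirrors roughly boils down to reading the split properties and bin-packing the file tasks; a sketch using Iceberg's TableProperties, PropertyUtil, and TableScanUtil (the real block also lets scan options override the table properties):

long splitSize = PropertyUtil.propertyAsLong(meta.properties(),
    TableProperties.SPLIT_SIZE, TableProperties.SPLIT_SIZE_DEFAULT);
int lookback = PropertyUtil.propertyAsInt(meta.properties(),
    TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT);
long openFileCost = PropertyUtil.propertyAsLong(meta.properties(),
    TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);

// split individual file tasks at the target size, then bin-pack them into CombinedScanTasks
CloseableIterable<FileScanTask> splitTasks = TableScanUtil.splitFiles(fileTasks, splitSize);
CloseableIterable<CombinedScanTask> combinedTasks =
    TableScanUtil.planTasks(splitTasks, splitSize, lookback, openFileCost);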

Contributor

Would it make sense to add a comment there indicating that if changes are made there, they should consider whether this code needs to be updated too?

Member Author

I'm a little afraid of comment rot there. I'm hoping that because we run the tests for this code block with distributed planning both on and off, a proper test added for any change there will fail when distributed planning is on. That is probably the best defense we have, since the comment would have to be on a parent of the Scan class, and we also wouldn't be able to warn folks overriding the method in other Scan impls.

I think the tests are our best chance here.

Contributor

That makes sense 👍

}

private CloseableIterable<FileScanTask> planDataTableScanFiles() {
// TODO Currently this approach reads all manifests, no manifest filtering - Maybe through pushdowns or table read
Member Author

Currently the scan on ManifestEntries handles no pushdowns, which means every record must be serialized and every manifest must be read when doing distributed planning.

// Read entries that are not deleted and reference data files rather than delete files
Dataset<Row> dataFileEntries =
manifestEntries.filter(manifestEntries.col("data_file").getField(DataFile.CONTENT.name()).equalTo(0))
.filter(manifestEntries.col("status").notEqual(2)); // not deleted
Member Author

These numbers are both magic because ManifestEntry is protected and we can't access the constants from here; we may want to change this as well.
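One option until the constants can be shared: give the literals local names that mirror the spec values (a sketch; per the Iceberg spec the entry status codes are 0 = EXISTING, 1 = ADDED, 2 = DELETED, and the file content codes are 0 = DATA, 1 = POSITION_DELETES, 2 = EQUALITY_DELETES):

private static final int ENTRY_STATUS_DELETED = 2;
private static final int FILE_CONTENT_DATA = 0;

Dataset<Row> dataFileEntries = manifestEntries
    .filter(manifestEntries.col("data_file")
        .getField(DataFile.CONTENT.name()).equalTo(FILE_CONTENT_DATA))
    .filter(manifestEntries.col("status").notEqual(ENTRY_STATUS_DELETED));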

specIdPosition = positions.get("partition_spec_id");
}

private SparkDataFile(SparkDataFile other) {
Member Author

Copy constructor, since we need to actually serialize this representation back to Spark when we distribute CombinedScanTasks. This means we actually need a SparkDataFile object for every BaseFileScanTask.

if (fields.get(i).fieldId() == ManifestFile.SPEC_ID.fieldId()) {
found = true;
fromProjectionPos[i] = 14;
}
Contributor

This is not related to your PR but while we're here: once we find the projected value and found is true, do we need to iterate over the rest of the entries?

Member Author

We don't, but I don't think it's that much of a time sink
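If we ever do want to short-circuit it, the change is small; a sketch based on the snippet above (the inner loop variables are assumed from context, not copied from the PR):

for (int j = 0; j < projectedFields.size(); j += 1) {   // assumed inner loop over the projection
  if (fields.get(i).fieldId() == projectedFields.get(j).fieldId()) {
    found = true;
    fromProjectionPos[i] = j;
    break;   // no need to scan the remaining projected fields once matched
  }
}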


public Long toSnapshotId() {
return context.toSnapshotId();
}
Contributor

👍

@RussellSpitzer changed the title from "WIP - Add a Parallelized Spark Job Planning Path" to "Add a Parallelized Spark Job Planning Path" on Oct 22, 2020
.filter(dataFileEntries.col("status").equalTo(1)); // Added files only
} else if (context.snapshotId() != null) {
LOG.debug("Planning scan at snapshot {}", context.snapshotId());
return dataFileEntries
Contributor

I don't think this is correct. It is not really an incremental scan. It is time travelling. We have to read all files that were valid at that snapshot, not those that were created in that snapshot.

Contributor

If scanContext has set a specific snapshot id, we have to set it there.

Member Author

Let me add another test too then, because our current test coverage is not catching that behavior
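To make the distinction concrete, a sketch of the two cases (illustrative only; it assumes the entries metadata table can be read as of a specific snapshot, and if it can't, the snapshot's own manifest list would have to be used instead):

if (context.fromSnapshotId() != null) {
  // incremental append scan: only the files added in the snapshot range
  return dataFileEntries.filter(dataFileEntries.col("status").equalTo(1)); // 1 = ADDED
} else if (context.snapshotId() != null) {
  // time travel: re-read the entries as of that snapshot and keep every live data file,
  // rather than filtering the current entries by status
  Dataset<Row> entriesAtSnapshot = spark.read()
      .format("iceberg")
      .option("snapshot-id", String.valueOf(context.snapshotId())) // assumed read option
      .load(entriesTableName);                                     // "<table>.entries"
  return entriesAtSnapshot.filter(entriesAtSnapshot.col("status").notEqual(2));
}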

Planning for very large scans can take a considerable amount of time even for queries
that will not return a large amount of data. Part of the reason for this is that all
planning for Spark reads happens locally before any data is actually fetched by Spark.
We can speed this up by providing a distributed planning Action which distributes
the generation of the actual scan plan. This should allow us to scale planning with
compute and IO resources.
@RussellSpitzer force-pushed the SparkReadPlanParallelize branch from ea98e74 to 14943e4 on October 23, 2020 22:37
@RussellSpitzer deleted the SparkReadPlanParallelize branch on October 26, 2020 14:11
@RussellSpitzer restored the SparkReadPlanParallelize branch on October 26, 2020 14:11
result = Lists.newArrayList(Actions.forTable(table).planScan().withContext(scan.tableScanContext()).execute());
} catch (Exception e) {
if (SparkSession.active().conf().get(PlanScanAction.ICEBERG_TEST_PLAN_MODE).equals("true")) {
throw e;
Member Author

When we run tests, we just want to fail if distributed planning fails.
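Roughly, the surrounding fallback reads like this (a sketch; the local-planning fallback and the logging are assumed from context rather than copied from the PR):

List<CombinedScanTask> result;
try {
  result = Lists.newArrayList(
      Actions.forTable(table).planScan().withContext(scan.tableScanContext()).execute());
} catch (Exception e) {
  if (SparkSession.active().conf().get(PlanScanAction.ICEBERG_TEST_PLAN_MODE).equals("true")) {
    throw e; // in tests, surface distributed-planning failures instead of falling back
  }
  LOG.warn("Distributed planning failed, falling back to local planning", e);
  result = Lists.newArrayList(scan.planTasks()); // assumed local fallback path
}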

@holdenk
Contributor

holdenk commented Dec 16, 2020

I see this is out of date with the dev branch, are you still working on this?

@rymurr
Contributor

rymurr commented Jul 11, 2021

Hey @RussellSpitzer and @aokolnychyi what was the status of this PR? We could use some of the changes here to core/api for our planning. Is this close to merging once conflicts are fixed?

@RussellSpitzer
Member Author

I have to do some major updates so that it works with deletes. I plan on getting back to this soon, but if you let me know what you want extracted, maybe we can do some smaller PRs real fast?

@rymurr
Contributor

rymurr commented Jul 12, 2021

I have to do some major updates so that it works with deletes. I plan on getting back to this soon, but if you let me know what you want extracted, maybe we can do some smaller PRs real fast?

I really only need the visibility modifications and the util classes in core and api. We will likely follow a similar strategy to how you are doing it in Spark but those changes aren't needed for us.

I suspect our first iteration isn't focused on V2 delete files.

@rymurr
Contributor

rymurr commented Jul 12, 2021

Forgot to add the most important part: Shout if I can help!

cc @vvellanki @pravindra @snazy and @nastra
