
Depend on Spark-3.0.0-palantir.18 (cont'd) #1

Merged: yifeih merged 18 commits into master from yh/use-palantir-spark on Mar 13, 2019

Conversation

@yifeih (Owner) commented Mar 8, 2019

Working off Matt's work here: mccheah#5 (created a new PR since I can't push to Matt's fork)

@vinooganesh @mccheah @rdblue for SA

mccheah and others added 10 commits February 28, 2019 14:39
* add write encryption codepath

* add reader side

* remove unnecessary field

* add log warn

* fix check

* try a single iterator

* new reader

* addressing comments

* remove unused struct
}

@Override
public org.apache.spark.sql.sources.v2.Table getTable(DataSourceOptions options, StructType schema) {

Reviewer:

There should be no need to implement this method. Iceberg tables always have a schema, so it makes no sense to supply one. That's for cases like CSV, where a schema is required to interpret the data (column names and types). Otherwise, normal projection works just fine.

yifeih (Owner Author):

Don't you still need it for write? Sort of analogous to the previous getWriter() method? Or does DSv2 expect that the table will already be created, since the datasource layer is separate from the catalog layer, which includes table and schema information?

Reviewer:

In DSv2, the writer is passed the physical plan and schema, then calls TableCatalog.createTable with that schema. After it creates the table, it uses the instance returned by create.

These methods are called by the DataFrameReader and DataFrameWriter and assume that the table already exists.
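
(A rough sketch of the flow described above, for readers following along. The class and method names follow the DSv2 catalog API being referenced; the exact signatures in this Spark fork may differ.)

    // Illustrative only: a CTAS-style write derives the schema from the physical plan,
    // creates the table through the catalog, then writes through the returned instance.
    Table createThenWrite(TableCatalog catalog, Identifier ident, StructType planSchema,
                          Map<String, String> properties) {
      Table table = catalog.createTable(ident, planSchema, new Transform[0], properties);
      return table;  // the writer uses this instance, not a user-supplied schema
    }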

Reviewer:

Throwing UnsupportedOperationException is correct then, so we're good here.
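
(A minimal sketch of that resolution, using the method signature shown in this diff; the message text is illustrative.)

    @Override
    public org.apache.spark.sql.sources.v2.Table getTable(DataSourceOptions options, StructType schema) {
      // Iceberg tables always carry their own schema, so a user-supplied schema is rejected.
      throw new UnsupportedOperationException(
          "Iceberg does not support a user-specified schema; use getTable(options) instead.");
    }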

@yifeih (Owner Author) commented Mar 9, 2019

I'm not familiar with Parquet, so I'm struggling to debug the failing TestParquetAvroReader.testCorrectness test. Here's what I've narrowed it down to:

  • It happens on record 77001. In that record, it fails at this assertion: https://github.com/palantir/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L87
  • The assertion failed on the base writer for the winter.list.element.wheeze column.
  • Looking at the stacktrace, it fails while calling MessageColumnIO.startGroup() on the MessageGroupIO object associated with winter.list.element. At that point, the groupNullCache is set to 1 for winter.list.element. It's indeed true that record 77000 had a null {} inside the winter.list array as its last element, so this is expected.
  • In the previous record (number 77000), the ColumnWriterBase.writePage() method was called on the writer for winter.list.element.wheeze. That was probably what set the pageRowCount field to 0, causing the assertion to fail.

Currently, I think my confusion is caused by my lack of understanding of what repetitionLevel means and how it's set. I'll pick this up again next week.

@mccheah commented Mar 11, 2019

Let's make the pull from master a separate PR, so that this diff doesn't include things like integrating encryption everywhere.

@yifeih (Owner Author) commented Mar 11, 2019

Oh actually, I think this was just a bad merge with some code that we ended up deleting before the final merge of the encryption code. Let me clean that up.

public void testCorrectness() throws IOException {
Iterable<Record> records = RandomData.generate(COMPLEX_SCHEMA, 250_000, 34139);
// TODO (yifeih): Change the seed back to 34139 after merging https://github.com/apache/parquet-mr/pull/620
Iterable<Record> records = RandomData.generate(COMPLEX_SCHEMA, 250_000, 34138);

Reviewer:

@rdblue for SA - we hit another Parquet bug, I think, which is also fixed by apache/parquet-java#620. We don't think the actual upgrades we're doing in this patch are related to the bug. It's very specific: an array is written, the last value in the array is null, the page is written (without the cached null), and then an assertion in Parquet fails with pageRowCount being 0 but repetitionLevel being 1 when the nulls from the previous page are flushed.

We don't understand the root cause here, but since the problem goes away after applying the above patch, which flushes nulls eagerly, I think we're fine here.
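
(To make the failing check concrete, here is a rough paraphrase of the invariant described above; this is not the actual parquet-mr code.)

    // Paraphrase of the failing invariant -- not the real ColumnWriterBase implementation.
    private int pageRowCount;  // reset to 0 whenever a page is flushed

    void writeRepetitionLevel(int repetitionLevel) {
      // repetitionLevel == 0 marks the start of a new record, and every page is expected
      // to begin at the start of a record. A cached null from the *previous* record
      // arrives with repetitionLevel == 1 right after the page was flushed
      // (pageRowCount == 0), which is exactly the state that trips the assertion.
      assert pageRowCount > 0 || repetitionLevel == 0
          : "a page must start at repetition level 0";
      if (repetitionLevel == 0) {
        pageRowCount++;
      }
    }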

@mccheah left a comment:

Overall looks great; this is about what we would expect from migrating from the old DSv2 APIs to the newer ones. We'll have some interesting discussions when logical plan and catalog changes come down the line.

The merge conflicts we could foreseeably get don't look as intimidating as I thought they might be. The most code we moved was porting the Reader setup to the ScanBuilder side, so any changes made to that part of the Reader upstream would have to be mirrored in our ScanBuilder. Shouldn't be too problematic.

Schema readSchema) {
return Avro.read(location)
private CloseableIterable<InternalRow> newAvroIterable(InputFile inputFile,
FileScanTask task,

Reviewer:

The indentation doesn't have to change.

CloseableIterable<InternalRow> iter;
InputFile location = inputFiles.get(task.file().path().toString());
Preconditions.checkNotNull(location, "Could not find InputFile associated with FileScanTask");
CloseableIterable<InternalRow> iter;

Reviewer:

There was no need to move this.

import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters;
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownRequiredColumns;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;
import org.apache.spark.sql.sources.v2.reader.*;

Reviewer:

Don't wildcard import here.

private final FileIO fileIo;
private final Map<String, InputFile> inputFiles;

private final Iterator<FileScanTask> tasks;

Reviewer:

No need to move this. In general, let's try to avoid moving fields around here: since this is a fork of the upstream codebase, we really want to avoid merge conflicts as much as we can.


public class IcebergSource implements DataSourceV2, ReadSupport, WriteSupport, DataSourceRegister {
public class IcebergSource implements
TableProvider,

Reviewer:

These can be on the same line as public class IcebergSource implements

}

@Override
public WriteBuilder mode(SaveMode mode) {

Reviewer:

I think we should just say that we don't support SaveMode here, since we're specifically checking that only appending is allowed anyways. We'll discuss how to get Iceberg connected to table catalogs and the V2 logical plans when the time comes.

yifeih (Owner Author):

Ok yup, the reason I did that was because some tests relied on it, but I can update those tests too.
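
(For reference, the append-only check being relied on looks roughly like this; illustrative, not the exact code in the diff.)

    @Override
    public WriteBuilder mode(SaveMode mode) {
      // Only appends are supported; every other save mode is rejected up front.
      Preconditions.checkArgument(mode == SaveMode.Append,
          "Save mode %s is not supported", mode);
      return this;
    }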



private static class IcebergWriterBuilder implements WriteBuilder,
SupportsSaveMode {

Reviewer:

I think we should just say that we don't support SaveMode here, since we're specifically checking that only appending is allowed anyways. We'll discuss how to get Iceberg connected to table catalogs and the V2 logical plans when the time comes.

.toUpperCase(Locale.ENGLISH));
}

public void setFileFormat(String format) {

Reviewer:

Let's just make FileFormat final, and in the caller of the constructor, check the option and default to Parquet if it is not present.
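
(A minimal sketch of that suggestion; the "write-format" option key and the constructor shape are assumptions for illustration.)

    private final FileFormat format;

    Writer(Table table, DataSourceOptions options) {
      // Keep the format final; default to Parquet when no format option is supplied.
      this.format = options.get("write-format")
          .map(s -> FileFormat.valueOf(s.toUpperCase(Locale.ENGLISH)))
          .orElse(FileFormat.PARQUET);
    }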

import com.netflix.iceberg.types.TypeUtil;
import com.netflix.iceberg.types.Types;
import org.apache.commons.lang.SerializationUtils;
import org.apache.commons.lang3.SerializationUtils;

Reviewer:

Did we need to use commons-lang3? Not entirely opposed to it, just wondering if we can minimize the diff just a tiny bit.

yifeih (Owner Author):

Hmm the original import doesn't seem to be part of the natural dependency tree anymore :/

Reviewer:

That's fine then, actually we probably want to be exclusively using lang3 in upstream as well. We'll catch this if we introduce Baseline and linting has a rule for it.

import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import javax.annotation.processing.SupportedOptions;

Reviewer:

We don't use this import

import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters;
import org.apache.spark.sql.sources.v2.SupportsBatchRead;
import org.apache.spark.sql.sources.v2.TableProvider;
import org.apache.spark.sql.sources.v2.reader.*;

Reviewer:

Don't use wildcard import

@mccheah left a comment:

This is fine despite the SaveMode comment elsewhere - we can figure out how we deal with SaveMode when upstream does.

@mccheah commented Mar 12, 2019

But I can't merge PRs on your fork so feel free to merge yourself =P

@yifeih merged commit 71ead9d into master on Mar 13, 2019
yifeih pushed a commit that referenced this pull request on Apr 16, 2019
* Integrate encryption into datasource

* add write encryption codepath

* add reader side

* remove unnecessary field

* add log warn

* fix check

* try a single iterator

* new reader

* addressing comments

* remove unused struct

* Begin upgrading Spark

* Revert "Begin upgrading Spark"

This reverts commit f8ee9cd.

* Revert "Revert "Begin upgrading Spark""

This reverts commit 0cbf39b.

* writer implementation migrated

* read side

* simplify writer builder

* oops

* welp everything works except parquet avro correctness

* delete vestigial encryption code

* change seed

* delete vestigial encryption code

* try again

* some cleanups

* address comments and eliminate some diffs

* delete some more stuff