Allow writers to control size of files generated #432

xabriel · 2019-08-30T16:29:54Z

For big jobs where the generated Parquet files reach 10GB or more, we have seen read latency caused by reading the Parquet footer.

For our data and tech stack, reading the footer takes about 1 second per 10GB of file size:

Time to read footer of ~302 MB parquet file:
ms: 273
ms: 214
ms: 289
ms: 262

Time to read footer of ~17GB parquet file:
ms: 1907
ms: 1925
ms: 1933
ms: 1855

Time to read footer of ~67GB parquet file:
ms: 6073
ms: 5587
ms: 5293
ms: 5691
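
As a rough check on the "about 1 second per 10GB" figure, the samples above can be averaged and divided by the approximate file size. This is only a sketch: the class and method names are made up for illustration, and the file sizes are the approximate values quoted above. The averages come out to about 260 ms, 1905 ms and 5661 ms, which is roughly 112 ms/GB for the ~17GB file and 84 ms/GB for the ~67GB file.

```java
// Hypothetical helper, not part of Iceberg: averages the footer-read samples
// listed above and normalizes them by the approximate file size.
final class FooterLatencySketch {
  static double mean(long[] samplesMs) {
    long sum = 0;
    for (long sample : samplesMs) {
      sum += sample;
    }
    return (double) sum / samplesMs.length;
  }

  static double msPerGb(long[] samplesMs, double approxFileSizeGb) {
    return mean(samplesMs) / approxFileSizeGb;
  }

  public static void main(String[] args) {
    long[] file17Gb = {1907, 1925, 1933, 1855};
    long[] file67Gb = {6073, 5587, 5293, 5691};
    System.out.println("~17GB file: " + msPerGb(file17Gb, 17.0) + " ms/GB");
    System.out.println("~67GB file: " + msPerGb(file67Gb, 67.0) + " ms/GB");
  }
}
```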

To avoid this, this PR allows Iceberg writers to close the current file and open a new one once a target file size is reached. Each writer still has at most one file open at a time, and PartitionedWriter still fails if the data is not ordered.

With this PR, now we can do:

df
  .sort(...)
  .write
  .format("iceberg")
  .option("target-file-size", 1 * 1024 * 1024 * 1024) // target 1GB files
  .mode("append")
  .save("...")
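
For readers who want the idea without the Spark plumbing, here is a minimal sketch of the rolling behavior, not the code in this PR. `InMemoryAppender` and `RollingWriterSketch` are hypothetical stand-ins (no real files are written), and the choice to check the size only every 1000 rows is an assumption taken from the later review discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a file appender; it only counts bytes and rows.
final class InMemoryAppender {
  private long bytes = 0;
  private long rows = 0;

  void add(String row) {
    bytes += row.length();
    rows++;
  }

  long length() {
    return bytes;
  }

  long rows() {
    return rows;
  }
}

final class RollingWriterSketch {
  // Assumed interval: only look at the file size every 1000 rows.
  private static final int ROWS_DIVISOR = 1000;

  private final long targetFileSizeBytes;
  private final List<InMemoryAppender> closedFiles = new ArrayList<>();
  private InMemoryAppender current = new InMemoryAppender();

  RollingWriterSketch(long targetFileSizeBytes) {
    this.targetFileSizeBytes = targetFileSizeBytes;
  }

  void write(String row) {
    current.add(row);
    // At most one file is open: the current one is closed before the next is opened.
    if (current.rows() % ROWS_DIVISOR == 0 && current.length() >= targetFileSizeBytes) {
      closedFiles.add(current);
      current = new InMemoryAppender();
    }
  }

  // Closes the last open file (if it has any rows) and returns all files written.
  List<InMemoryAppender> close() {
    if (current.rows() > 0) {
      closedFiles.add(current);
    }
    current = null;
    return closedFiles;
  }
}
```

Because the size is only checked at row intervals, files can overshoot the target somewhat; the value is a target, not a hard limit.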

rdblue · 2019-08-31T00:33:45Z

@xabriel, do you know how large the min/max values are for each row group in the file? I'm curious whether this is caused by the number of row groups or by the stats in the row group metadata.

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

xabriel · 2019-09-03T15:58:40Z

@rdblue:

do you know how large the min/max values are for each row group in the file?

We do have lots and lots of String types, and sometimes the max has >= 100 characters. I just learned about #173, so we will give that a try.

rdblue · 2019-09-03T16:05:14Z

#173 works for Iceberg metadata, but that won't help if the Parquet footer is too large because it stores a lot of large stats values. We might need to get something like #254 done in Parquet.

xabriel · 2019-09-03T18:59:02Z

Test failures seem unrelated, all around Hive Metastore:

org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveManifestsTable FAILED
    java.lang.RuntimeException at TestIcebergSourceHiveTables.java:466
        Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:466
            Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:466

rdblue · 2019-09-03T19:56:11Z

@xabriel, I pushed a fix for the flaky Java test this morning. And we're looking into the python test.

xabriel · 2019-09-03T22:52:50Z

(Rebased to pick up Java test fixes.)

xabriel · 2019-09-04T00:06:01Z

Travis CI still unhappy with the following unrelated test:

org.apache.iceberg.hive.HiveTableTest > testConcurrentConnections FAILED
    java.lang.AssertionError at HiveTableTest.java:330

xabriel · 2019-09-04T18:29:10Z

Fix in #450 indeed takes care of the test failure. Thanks @rdblue.

@rdblue, @aokolnychyi: This PR is now ready for further consideration.

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

xabriel · 2019-09-05T20:30:42Z

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java


    @Override
    public WriterCommitMessage commit() throws IOException {
-      Preconditions.checkArgument(currentAppender != null, "Commit called on a closed writer: %s", this);


These precondition checks on currentAppender, which were only done in UnpartitionedWriter, are no longer done in BaseWriter since they make PartitionedWriter fail.

site/docs/configuration.md

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

aokolnychyi

I did a quick pass, LGTM.

rdblue · 2019-09-14T00:43:57Z

Sorry that I haven't reviewed this yet! I was out for ApacheCon this week and I'm going to be out until Wednesday next week. I'll have a look when I'm back.

rdblue · 2019-09-19T22:00:20Z

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

+        this.fileCount = 0;
+      }
+
+      private synchronized String generateFilename() {


I don't think this should be synchronized unless fileCount is volatile. This doesn't need to be synchronized anyway because each write is single-threaded. I would just remove this to make it a bit simpler.
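
A minimal sketch of what the unsynchronized version could look like, assuming each writer instance is only ever used by one thread. The class name, constructor arguments, and file name pattern are hypothetical and are not taken from Writer.java.

```java
import java.util.Locale;

// Hypothetical sketch: a per-writer file name generator with a plain counter.
final class FilenameGeneratorSketch {
  private final int partitionId;
  private final long taskId;
  private final String uuid;
  // Plain field: only the single thread driving this writer touches it.
  private int fileCount = 0;

  FilenameGeneratorSketch(int partitionId, long taskId, String uuid) {
    this.partitionId = partitionId;
    this.taskId = taskId;
    this.uuid = uuid;
  }

  // No synchronization: each call returns a new name and bumps the counter.
  String generateFilename() {
    String name = String.format(
        Locale.ROOT, "%05d-%d-%s-%05d.parquet", partitionId, taskId, uuid, fileCount);
    fileCount++;
    return name;
  }
}
```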

rdblue · 2019-09-19T22:12:55Z

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

+    protected PartitionKey currentKey = null;
+    protected FileAppender<InternalRow> currentAppender = null;
+    protected EncryptedOutputFile currentFile = null;
+    protected long currentRows;


Nit: all fields should be initialized.

It is also strange that this is only used in child classes.

Can this class provide a writeInternal method that updates currentRows, writes to the appender, and checks when to open and close the current appender? That would be cleaner and would no longer require all the protected fields. Does performance really degrade when the subclasses are a bit more separated from this base class?

I'd still need the currentKey field to be accessible, since it is used in openCurrent() and closeCurrent() in the base class, and also in PartitionedWriter's write().

So I can either:

- Only make currentKey protected, or
- Add a getter/setter for this particular field.

WDYT?

Does performance really degrade when the subclasses are a bit more separated from this base class?

A method call is more expensive than a field access, although I admit that the JIT compiler should pick this up right away. So will fix.

I'd say to use the getter/setter.

I think this commit is almost ready, except for the encapsulation in this class. If we think the JIT compiler will handle this, then let's go ahead and make this base class handle the important parts.

rdblue · 2019-09-19T22:14:50Z

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

+  private abstract static class BaseWriter implements DataWriter<InternalRow> {
+    protected static final int ROWS_DIVISOR = 1000;
+
+    protected final Set<PartitionKey> completedPartitions = Sets.newHashSet();


This is only used in the partitioned case. Can this be a field of the partitioned writer instead?
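
One possible shape for this, sketched under simplifying assumptions: partition keys are plain strings rather than PartitionKey, no files are actually written, and the class and method names are illustrative only. The base class keeps the current key behind a getter/setter, while the set of completed partitions lives in the partitioned writer, which fails when a partition reappears after its file was closed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical base class: owns the current key and the open/close bookkeeping.
abstract class BaseWriterSketch {
  private String currentKey = null; // stand-in for PartitionKey
  private final List<String> closedKeys = new ArrayList<>();

  protected String getCurrentKey() {
    return currentKey;
  }

  protected void setCurrentKey(String key) {
    this.currentKey = key;
  }

  // Records that the file for the current key is finished (no real I/O here).
  protected void closeCurrent() {
    if (currentKey != null) {
      closedKeys.add(currentKey);
    }
  }

  List<String> closedKeys() {
    return closedKeys;
  }

  abstract void write(String key, String row);
}

final class PartitionedWriterSketch extends BaseWriterSketch {
  // Only the partitioned writer needs this, so it is a field here, not in the base class.
  private final Set<String> completedPartitions = new HashSet<>();

  @Override
  void write(String key, String row) {
    if (!key.equals(getCurrentKey())) {
      closeCurrent();
      if (getCurrentKey() != null) {
        completedPartitions.add(getCurrentKey());
      }
      // Input must be clustered by partition: a key seen again after its file closed is an error.
      if (completedPartitions.contains(key)) {
        throw new IllegalStateException("Already closed files for partition: " + key);
      }
      setCurrentKey(key);
    }
    // A real writer would append the row to the currently open file here.
  }
}
```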

rdblue · 2019-09-19T22:20:07Z

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

      }
    }
+
+    private class EncryptedOutputFileFactory implements OutputFileFactory<EncryptedOutputFile> {


I don't think OutputFileFactory is needed. This is the only implementation of it. This could also be named OutputFileFactory and not mention encryption because the files may not actually be encrypted if the plaintext encryption manager is used.

xabriel · 2019-09-23T20:33:01Z

(Test failures around TestIcebergSourceHiveTables, thus unrelated to changes.)

rdblue · 2019-09-24T23:24:30Z

Thanks for fixing this, @xabriel! I'll merge it.

rdblue reviewed Aug 31, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

xabriel added 3 commits September 3, 2019 13:57

Allow writers to control size of files genenrated.

b5546d4

Add table property 'write.target-file-size'

dcc777f

Update docs

560230a

xabriel force-pushed the target-file-size-for-writers branch from 31419ad to 560230a Compare September 3, 2019 22:51

rdblue mentioned this pull request Sep 4, 2019

Move Hive concurrency tests and run one at a time #450

Merged

rdblue reviewed Sep 4, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

rdblue reviewed Sep 4, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

rdblue reviewed Sep 4, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

rdblue reviewed Sep 4, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

xabriel added 2 commits September 5, 2019 11:19

Address review comments on file name and checking for target size every 1000 rows.

a5aff93

Refactor Partitioned and UnpartitionedWriter to share common code in BaseWriter

62dc280

xabriel commented Sep 5, 2019

aokolnychyi reviewed Sep 9, 2019

site/docs/configuration.md

aokolnychyi reviewed Sep 9, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

aokolnychyi reviewed Sep 9, 2019

spark/src/main/java/org/apache/iceberg/spark/source/Writer.java

Rename flag to target-file-size-bytes. Use PropertyUtil.

f45a317

xabriel requested a review from aokolnychyi September 10, 2019 00:25

aokolnychyi reviewed Sep 10, 2019

Fix indentation of BaseWriter to match WriterFactory

87c7907

Fix ParquetWriter#length() to account for currently buffered row group.

d275322

rdblue reviewed Sep 19, 2019

Address latest review comments on Writer.java

cce0d6a

xabriel requested a review from rdblue September 23, 2019 20:33

rdblue merged commit c2435d6 into apache:master Sep 24, 2019
