
Contributor

xabriel commented Aug 30, 2019

For big jobs where the generated Parquet files reach 10GB or more, we have found that reads incur noticeable latency from reading the Parquet footer.

For our data and tech stack, we observe that reading the footer takes about 1 second per 10GB of file size:

Time to read footer of ~302 MB parquet file:
ms: 273
ms: 214
ms: 289
ms: 262

Time to read footer of ~17GB parquet file:
ms: 1907
ms: 1925
ms: 1933
ms: 1855

Time to read footer of ~67GB parquet file:
ms: 6073
ms: 5587
ms: 5293
ms: 5691
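
As a rough check of the "about 1 second per 10GB" estimate, the timings above can be averaged and normalized by file size. This is only a sketch: the class and method names are hypothetical, and the only data used are the measurements quoted in this comment.

```java
import java.util.Arrays;

// Hypothetical helper; the inputs are the footer timings quoted above.
class FooterLatency {
  // Mean of the measured footer read times, in milliseconds.
  static double meanMs(long... samplesMs) {
    return Arrays.stream(samplesMs).average().orElse(Double.NaN);
  }

  // Footer read time normalized to a 10GB file, in milliseconds.
  static double msPer10Gb(double meanMs, double fileSizeGb) {
    return meanMs / fileSizeGb * 10.0;
  }

  public static void main(String[] args) {
    double mean17 = meanMs(1907, 1925, 1933, 1855);
    double mean67 = meanMs(6073, 5587, 5293, 5691);
    System.out.printf("17GB: mean %.0f ms, %.0f ms per 10GB%n", mean17, msPer10Gb(mean17, 17));
    System.out.printf("67GB: mean %.0f ms, %.0f ms per 10GB%n", mean67, msPer10Gb(mean67, 67));
  }
}
```

The two large files come out at roughly 1.1 seconds and 0.85 seconds per 10GB, which is consistent with the estimate.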

To avoid this, we propose this PR, which allows Iceberg writers to close the current file and open a new one when a target file size is reached. The semantics of having at most one file open per writer are unchanged, and for PartitionedWriter, the semantics of failing if the data is not ordered are kept as well.

With this PR, we can now do:

df
  .sort(...)
  .write
  .format("iceberg")
  .option("target-file-size", 1L * 1024 * 1024 * 1024) // target 1GB files; Long arithmetic avoids Int overflow for larger targets
  .mode("append")
  .save("...")
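
The rollover behavior described above can be sketched as follows. This is a simplified illustration, not the code in this PR: `SizedAppender`, `InMemoryAppender`, and `RollingWriter` are hypothetical names, rows are plain strings, and the check interval of 1000 rows is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a file appender; this is not Iceberg's FileAppender API.
interface SizedAppender {
  void add(String row);
  long length(); // bytes written so far
  void close();
}

// In-memory appender used only to make the sketch runnable.
class InMemoryAppender implements SizedAppender {
  private long bytes = 0;
  private boolean closed = false;

  @Override
  public void add(String row) {
    if (closed) {
      throw new IllegalStateException("Appender is closed");
    }
    bytes += row.length();
  }

  @Override
  public long length() {
    return bytes;
  }

  @Override
  public void close() {
    closed = true;
  }
}

// Keeps at most one appender open and rolls to a new one at the target size.
class RollingWriter {
  // Only check the file size every N rows (assumed value) to keep the per-row cost low.
  private static final int ROWS_DIVISOR = 1000;

  private final long targetFileSize;
  private final Supplier<SizedAppender> appenderFactory;
  private final List<Long> completedFileSizes = new ArrayList<>();
  private SizedAppender current;
  private long currentRows = 0;

  RollingWriter(long targetFileSize, Supplier<SizedAppender> appenderFactory) {
    this.targetFileSize = targetFileSize;
    this.appenderFactory = appenderFactory;
    this.current = appenderFactory.get();
  }

  void write(String row) {
    current.add(row);
    currentRows++;
    if (currentRows % ROWS_DIVISOR == 0 && current.length() >= targetFileSize) {
      closeCurrent();
      current = appenderFactory.get();
    }
  }

  // Closes the open appender and returns the sizes of all non-empty files written.
  List<Long> close() {
    closeCurrent();
    return completedFileSizes;
  }

  private void closeCurrent() {
    if (current == null) {
      return;
    }
    current.close();
    if (currentRows > 0) {
      completedFileSizes.add(current.length());
    }
    current = null;
    currentRows = 0;
  }
}
```

Because the size is only checked every `ROWS_DIVISOR` rows, a file can overshoot the target by up to that many rows, so the target behaves as a soft limit in this sketch.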

Contributor

rdblue commented Aug 31, 2019

@xabriel, do you know how large the min/max values are for each row group in the file? I'm curious whether this is caused by the number of row groups or by the stats in the row group metadata.

Contributor Author

xabriel commented Sep 3, 2019

@rdblue:

> do you know how large the min/max values are for each row group in the file?

We do have lots and lots of String types, and sometimes the max value has 100 or more characters. I just learned about #173, so we will give that a try.

Contributor

rdblue commented Sep 3, 2019

#173 works for Iceberg metadata, but that won't help if the Parquet footer is too large because it stores a lot of large stats values. We might need to get something like #254 done in Parquet.

Contributor Author

xabriel commented Sep 3, 2019

Test failures seem unrelated; they are all around the Hive Metastore:

org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveManifestsTable FAILED
    java.lang.RuntimeException at TestIcebergSourceHiveTables.java:466
        Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:466
            Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:466

Contributor

rdblue commented Sep 3, 2019

@xabriel, I pushed a fix for the flaky Java test this morning, and we're looking into the Python test.

@xabriel xabriel force-pushed the target-file-size-for-writers branch from 31419ad to 560230a Compare September 3, 2019 22:51
Contributor Author

xabriel commented Sep 3, 2019

(Rebased to pick up the Java test fixes.)

Contributor Author

xabriel commented Sep 4, 2019

Travis CI is still unhappy with the following unrelated test:

org.apache.iceberg.hive.HiveTableTest > testConcurrentConnections FAILED
    java.lang.AssertionError at HiveTableTest.java:330

Contributor Author

xabriel commented Sep 4, 2019

The fix in #450 does take care of the test failure. Thanks @rdblue.

@rdblue, @aokolnychyi: This PR is now ready for further consideration.


@Override
public WriterCommitMessage commit() throws IOException {
  Preconditions.checkArgument(currentAppender != null, "Commit called on a closed writer: %s", this);
Contributor Author

These precondition checks on currentAppender, which were only done in UnpartitionedWriter, are no longer done in BaseWriter, since they make PartitionedWriter fail.

@xabriel xabriel requested a review from aokolnychyi September 10, 2019 00:25
Contributor

aokolnychyi left a comment

I did a quick pass, LGTM.

Contributor

rdblue commented Sep 14, 2019

Sorry that I haven't reviewed this yet! I was out for ApacheCon this week and I'm going to be out until Wednesday next week. I'll have a look when I'm back.

this.fileCount = 0;
}

private synchronized String generateFilename() {
Contributor

I don't think this should be synchronized unless fileCount is volatile. This doesn't need to be synchronized anyway because each write is single-threaded. I would just remove this to make it a bit simpler.

Contributor Author

Agreed.
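
For illustration, an unsynchronized generator along the lines discussed in this thread might look like the sketch below. The class name, constructor parameters, and file name pattern are assumptions and are not taken from this PR; the point is only that a plain counter is enough when each writer is used by a single thread.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical sketch of a per-task output file name generator.
class FileNameGenerator {
  private final int partitionId;
  private final long taskId;
  private final String uuid = UUID.randomUUID().toString();
  // Plain field: each writer, and so each generator, is used by a single thread.
  private int fileCount = 0;

  FileNameGenerator(int partitionId, long taskId) {
    this.partitionId = partitionId;
    this.taskId = taskId;
  }

  // Not synchronized: only the owning writer thread calls this method.
  String generateFilename() {
    String name = String.format(
        Locale.ROOT, "%05d-%d-%s-%05d.parquet", partitionId, taskId, uuid, fileCount);
    fileCount++;
    return name;
  }
}
```

The trailing counter keeps names unique when one task rolls over to several files, and the UUID keeps names from different writers apart.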

protected PartitionKey currentKey = null;
protected FileAppender<InternalRow> currentAppender = null;
protected EncryptedOutputFile currentFile = null;
protected long currentRows;
Contributor

Nit: all fields should be initialized.

Contributor

It is also strange that this is only used in child classes.

Can this class provide a writeInternal method that updates currentRows, writes to the appender, and checks when to open and close the current appender? That would be cleaner and would no longer require all the protected fields. Does performance really degrade when the subclasses are a bit more separated from this base class?

Contributor Author

I'd still need the currentKey field to be accessible, since it is used in openCurrent() and closeCurrent() in the base class, and also in PartitionedWriter's write().

So I can:

  1. Only make currentKey protected.
    or
  2. Add getter/setter for this particular field.

WDYT?

> Does performance really degrade when the subclasses are a bit more separated from this base class?

A method call is more expensive than a field access, although I admit that the JIT compiler should inline it right away. So I will fix this.

Contributor

I'd say to use the getter/setter.

I think this commit is almost ready, except for the encapsulation in this class. If we think the JIT compiler will handle this, then let's go ahead and make this base class handle the important parts.

private abstract static class BaseWriter implements DataWriter<InternalRow> {
  protected static final int ROWS_DIVISOR = 1000;

  protected final Set<PartitionKey> completedPartitions = Sets.newHashSet();
Contributor

This is only used in the partitioned case. Can this be a field of the partitioned writer instead?

Contributor Author

Will fix.
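
The encapsulation suggested in these review threads could look roughly like the sketch below: the base class keeps its state private and exposes a writeInternal method plus a getter/setter for the current key, while the partitioned writer owns completedPartitions. All names ending in `Sketch` are hypothetical, partition keys and rows are plain strings, and file handling is reduced to a counter, so this is not the code that was merged.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified base writer: state is private, subclasses use methods.
abstract class BaseWriterSketch {
  private String currentKey = null; // partition of the open file, null if none
  private long currentRows = 0;
  private int filesOpened = 0;

  protected String getCurrentKey() {
    return currentKey;
  }

  protected void setCurrentKey(String key) {
    this.currentKey = key;
  }

  // Single place that writes a row and updates the counters.
  protected void writeInternal(String value) {
    if (currentRows == 0) {
      filesOpened++; // stands in for lazily opening a new appender
    }
    currentRows++;
  }

  // Stands in for closing the current appender.
  protected void closeCurrent() {
    currentRows = 0;
  }

  int filesOpened() {
    return filesOpened;
  }

  abstract void write(String key, String value);
}

class PartitionedWriterSketch extends BaseWriterSketch {
  // Only the partitioned writer needs to track completed partitions.
  private final Set<String> completedPartitions = new HashSet<>();

  @Override
  void write(String key, String value) {
    if (!key.equals(getCurrentKey())) {
      if (getCurrentKey() != null) {
        closeCurrent();
        completedPartitions.add(getCurrentKey());
      }
      // Input must be grouped by partition: revisiting a closed partition fails.
      if (completedPartitions.contains(key)) {
        throw new IllegalStateException("Already closed files for partition: " + key);
      }
      setCurrentKey(key);
    }
    writeInternal(value);
  }
}
```

Writing rows for partitions a, a, b opens two files in this sketch, and a further row for partition a throws, which mirrors the requirement that input to the partitioned writer be ordered.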

}
}

private class EncryptedOutputFileFactory implements OutputFileFactory<EncryptedOutputFile> {
Contributor

I don't think OutputFileFactory is needed. This is the only implementation of it. This could also be named OutputFileFactory and not mention encryption because the files may not actually be encrypted if the plaintext encryption manager is used.

Contributor Author

Agreed.

Contributor Author

xabriel commented Sep 23, 2019

(The test failures are around TestIcebergSourceHiveTables and thus unrelated to these changes.)

@xabriel xabriel requested a review from rdblue September 23, 2019 20:33
@rdblue rdblue merged commit c2435d6 into apache:master Sep 24, 2019
Contributor

rdblue commented Sep 24, 2019

Thanks for fixing this, @xabriel! I'll merge it.
