Abstract the generic task writers for sharing the common codes between spark and flink #1213
Conversation
I'm assuming that this one should go in before #1145 and will review this one next. If that's not the case, please let me know!

Yeah, you're right. Please help review this patch first if you have time, sir.
core/src/main/java/org/apache/iceberg/taskio/FileAppenderFactory.java (resolved)
core/src/main/java/org/apache/iceberg/taskio/BaseTaskWriter.java (resolved)
}

@Override
public List<DataFile> pollCompleteFiles() {
I don't think it's a good idea to have a poll method like this one because it leaks critical state (completedFiles) and creates an opportunity for threading issues between write and pollCompleteFiles.
Instead, I think the base implementation should use a push model, where each file is released as it is closed.
/**
 * Called when a data file is completed and no longer needed by the writer.
 */
protected abstract void completedFile(DataFile file);

Then closeCurrent would call completedFile(dataFile) and the implementation of completedFile would handle it from there.
I read the BaseWriter code again and see the difference now. For the Spark streaming writer, once we do a commit we create a new streaming writer for future records, so there is no need for a method like pollCompleteFiles() to poll newly added DataFiles continuously. In the current Iceberg Flink writer implementation, I keep using the same TaskWriter to write records even after a checkpoint happens, so I designed pollCompleteFiles to fetch all completed data files incrementally. I think it's a design difference; the state-leak and threading issues you mentioned are not a problem in the current version, but I agree it would be easy to run into them if others don't handle this carefully. I can align with the current Spark design.
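For illustration only, here is a minimal sketch (class and method names are hypothetical, not from this PR) of how the push model could still serve the Flink case described above: completed files are pushed into a buffer as they close and drained at each checkpoint, without exposing the writer's internal state.

import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.DataFile;

abstract class CheckpointAwareWriter<T> {
  private final List<DataFile> pendingFiles = new ArrayList<>();

  // Invoked by the base writer each time a data file is completed (the push model).
  protected void completedFile(DataFile file) {
    pendingFiles.add(file);
  }

  // Drained by the Flink sink when a checkpoint barrier arrives.
  protected List<DataFile> drainCompletedFiles() {
    List<DataFile> drained = new ArrayList<>(pendingFiles);
    pendingFiles.clear();
    return drained;
  }
}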
/**
 * Create a new {@link FileAppender}.
 *
 * @param outputFile indicate the file location to write.
Minor: I think the Javadoc for arguments should describe the argument's purpose, like an OutputFile used to create an output stream. If the purpose is clear from the expected type, then keeping it simple is fine, like an OutputFile.
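As a concrete rendering of that suggestion, the Javadoc might read something like this (the method signature shown here is only illustrative):

/**
 * Create a new {@link FileAppender}.
 *
 * @param outputFile an OutputFile used to create an output stream
 */
FileAppender<T> newAppender(OutputFile outputFile);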
 */

package org.apache.iceberg.spark.source;
package org.apache.iceberg.taskio;
Why not just use the existing io package? That, or maybe a tasks package.
return new WrappedFileAppender(partitionKey, outputFile, appender);
}

class WrappedFileAppender {
I don't see much value in this class. Its primary use is to keep track of whether a file is large enough to release, but it doesn't actually have any of the logic to do that. As a consequence, the code is now split across multiple places.
This also has the logic for closing an appender and converting it to a DataFile, but that could just as easily be done in a DataFile closeAppender(FileAppender appender) method.
It would make sense to keep this class if it completely encapsulated the logic of rolling new files. That would require some refactoring so that it could create new files using the file and appender factories. It would also require passing a Consumer<DataFile> so that it can release closed files. Otherwise, I think we should remove this class.
I created this class because the fanout writer will have several open writers, and when building the DataFile we need all of the information for the given FileAppender, such as the partitionKey, EncryptedOutputFile, etc. The previous Spark implementation didn't need this class because all of that context is maintained inside the PartitionedWriter (the currentXXX fields), which doesn't work for a fanout writer. It's better to have a class that holds this information for building the DataFile.
It would make sense to keep this class if it completely encapsulated the logic of rolling new files
Good point. I'll make the WrappedFileAppender handle all of the rolling logic; let me refactor this.
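A simplified, self-contained sketch of that shape, using placeholder types instead of the real Iceberg API (FileAppender, DataFile, OutputFileFactory); all names and signatures here are illustrative only. The wrapper owns the appender life cycle, rolls when the target size is reached, and pushes each completed file to a Consumer so callers never see partially written state.

import java.io.Closeable;
import java.io.IOException;
import java.util.function.Consumer;
import java.util.function.Supplier;

class RollingWriterSketch<T> implements Closeable {

  // Stand-in for Iceberg's FileAppender: writes records and reports bytes written so far.
  interface Appender<R> extends Closeable {
    void add(R record);
    long length();
  }

  // Stand-in for the closed-file descriptor (a DataFile in Iceberg).
  static final class CompletedFile {
    final long sizeInBytes;

    CompletedFile(long sizeInBytes) {
      this.sizeInBytes = sizeInBytes;
    }
  }

  private final Supplier<Appender<T>> appenderFactory;  // opens a new appender per file
  private final Consumer<CompletedFile> onComplete;     // releases each closed file
  private final long targetFileSize;

  private Appender<T> current = null;

  RollingWriterSketch(Supplier<Appender<T>> appenderFactory,
                      Consumer<CompletedFile> onComplete,
                      long targetFileSize) {
    this.appenderFactory = appenderFactory;
    this.onComplete = onComplete;
    this.targetFileSize = targetFileSize;
  }

  void write(T record) throws IOException {
    if (current == null) {
      current = appenderFactory.get();  // open lazily, only when a record actually arrives
    }

    current.add(record);

    if (current.length() >= targetFileSize) {
      closeCurrent();  // roll: close, release the completed file, reopen on the next write
    }
  }

  private void closeCurrent() throws IOException {
    if (current != null) {
      long size = current.length();
      current.close();
      onComplete.accept(new CompletedFile(size));
      current = null;
    }
  }

  @Override
  public void close() throws IOException {
    closeCurrent();
  }
}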
}

boolean shouldRollToNewFile() {
  // TODO: ORC does not yet support reporting the target file size before the file is closed
We should consider changing the ORC appender to simply return 0 if the file isn't finished. That way this check is still valid, but the file will never be rolled.
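A rough sketch of what that could mean inside the ORC appender's length() method (the field names here are assumed, not the actual OrcFileAppender internals):

@Override
public long length() {
  // Assumption: while the file is still open, ORC cannot cheaply report an accurate size,
  // so return 0; the roll-over size check above stays valid but never triggers for ORC.
  if (!closed) {
    return 0;
  }
  return finalFileLength;  // recorded when the writer is closed
}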
Could we address the ORC issue you described in a separate issue? I think we should focus on the writer refactor here.
import org.apache.iceberg.relocated.com.google.common.collect.Maps;

public class PartitionedFanoutWriter<T> extends BaseTaskWriter<T> {
  private final Function<T, PartitionKey> keyGetter;
Instead of passing a function, I think this should be an abstract method:

/**
 * Create a PartitionKey from the values in row.
 * <p>
 * Any PartitionKey returned by this method can be reused by the implementation.
 *
 * @param row a data row
 */
protected abstract PartitionKey partition(T row);

Passing a function is good if we need to inject behavior that might need to be customized, but here the only customization that would be required is to partition the objects that this class is already parameterized by. So it will be easier just to add a method for subclasses to implement. And that puts the responsibility on the implementation instead of on the code that constructs the writer.
super(spec, format, appenderFactory, fileFactory, io, targetFileSize);
this.key = new PartitionKey(spec, writeSchema);
this.wrapper = new InternalRowWrapper(SparkSchemaUtil.convert(writeSchema));
this.keyGetter = keyGetter;
Like the other partitioned writer, I think this should use an abstract method to be implemented by subclasses.
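Given the fields shown in the diff context above (key and wrapper), the Spark-side override could look roughly like this; a sketch only, not the final merged code:

@Override
protected PartitionKey partition(InternalRow row) {
  // reuse the single PartitionKey instance, as the proposed Javadoc allows
  key.partition(wrapper.wrap(row));
  return key;
}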
Ping @rdblue, I think this issue is currently the biggest blocker to moving the Flink sink connector forward. Please take a look if you have time, thanks.
// NOTICE: we need to copy a new partition key here, in case of messing up the keys in writers.
PartitionKey copiedKey = partitionKey.copy();
writer = new RollingFileAppender(copiedKey);
writers.put(copiedKey, writer);
This code here is handling the case where we've not seen this partition key yet. This is especially likely to happen when users did not keyBy or otherwise pre-shuffle the data according to the partition key.
Is pre-shuffling something that the users should be doing before writing to the table (either keyBy or ORDER BY in Flink SQL)? I understand that this is specifically a PartitionedFanoutWriter, and so it makes sense that keys might not always come together (and even in the case where users did keyBy the partition key, if the number of TaskManager slots that are writing does not equal the cardinality of the partition key you'll still wind up with multiple RollingFileAppenders in a single Flink writing task and thus fanout). However, for long running streaming queries, it's possible that this TaskManager doesn't see this partition key again for days or even weeks (especially at a high enough volume to emit a complete file of the given target file size).
I guess my concern is that users wind up with a very high cardinality of keys on a single TaskManager. Either because they didn't pre-shuffle their data or perhaps they have an imbalance between the cardinality on the partition key and the parallelism at the write stage such that records might not naturally group together enough to emit an entire file. Or, as another edge case, one partition key value is simply not common enough to emit an entire file from this PartitionedFanoutWriter.
IIUC, if the PartitionedFanoutWriter does not see this partition key enough times in this TaskManager again to emit a full file for quite some time, a file containing this data won't be written until close is called. For very long running streaming jobs, this could be days or even weeks in my experience. This could also lead to small files upon close. Is this a concern that Iceberg should take into consideration or is this left to the users in their Flink query to determine when tuning their queries?
I imagine with S3, data locality of a file written much later than its timestamp of when the data was received is not a major concern, as the manifest file will tell whatever query engine reads this table which keys in their S3 bucket to grab and the locality issue is relatively abstracted away from the user, but what about if the user is using HDFS? Could this lead to performance issues (or even correctness issues) on read if records with relatively similar timestamps at their RollingFileAppender are scattered across a potentially large number of files?
I suppose this amounts to three concerns (and forgive me if these are non-issues as I am still new to the project, but not new to Flink so partially this is for helping me understand, as well as reviewing my concerns when reading this code):
- Should we be concerned that a writer won't emit a file until a streaming query is closed due to the previously mentioned case? Possibly tracking the time that each writer has existed and then emitting a file if it has been far too long (however that could be determined).
- If a record comes in at some time, and then the file containing that record isn't written for a much greater period of time (on the order of days or weeks), could this lead to correctness problems or very large performance problems when any query engine reads this table?
- Would it be beneficial to at least emit a warning or info level log to the user that it might be beneficial to pre-partition their data according to the partition key spec if perhaps the number of unique RollingFileAppender writers gets too high for one given Flink writer slot / TaskManager? Admittedly, it might be difficult to determine a heuristic of when this might be a problem vs just the natural difference in the parallelism of writing task slots vs the cardinality of the partition key.
Should we be concerned that a writer won't emit a file until a streaming query is closed due to the previously mentioned case?
I think that the intent is to close and emit all of the files at each checkpoint, rather than keeping them open. That is required to achieve exactly-once writes because the data needs to be committed to the table.
I think that also takes care of your second question because data is constantly added to the table.
Would it be beneficial to at least emit a warning or info level log to the user that it might be beneficial to pre-partition their data according to the partition key spec . . .
I think a reasonable thing to do is to limit the number of writers that are kept open, to limit the resources that are held. Then you can either fail if you go over that limit, or close and release files with an LRU policy. Failing brings the problem to the user's attention immediately and is similar to what we do on the Spark side, which doesn't allow writing new data to a partition after it is finished. That ensures that data is either clustered for the write, or the job fails.
The long-term plan for Spark is to be able to influence the logical plan that is writing to a table. That would be the equivalent of adding an automatic keyBy or rough orderBy for Flink. I think we would eventually want to do this for Flink as well, but I'm not sure what data clustering and sorting operations are supported currently.
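For illustration, a minimal, self-contained sketch of the LRU idea (the class name, the maxOpenWriters cap, and the generic Closeable writer type are all assumptions, not part of this PR): an access-ordered LinkedHashMap closes and evicts the least recently used writer once the cap is exceeded.

import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class WriterLruCache<K, W extends Closeable> extends LinkedHashMap<K, W> {
  private final int maxOpenWriters;

  WriterLruCache(int maxOpenWriters) {
    super(16, 0.75f, true);  // accessOrder = true gives least-recently-used iteration order
    this.maxOpenWriters = maxOpenWriters;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, W> eldest) {
    if (size() > maxOpenWriters) {
      try {
        eldest.getValue().close();  // closing the writer releases its completed file(s)
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return true;  // evict the closed writer from the map
    }
    return false;
  }
}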
Ah ok. I hadn't realized that was the plan.
I wrote a parquet writer for flink way back when flink did not support it and outputting files on checkpoint was the only real solution that I could come up with.
It also involved forking the base parquet library, so we wound up abandoning it, as we don't really have the engineering head count to constantly update and maintain something like that. Even though Flink can now support writing Parquet files, this is why I'm interested in this project, along with the numerous other additions to the data lake that the project supports.
Thanks for the info @rdblue!
During scan planning, IIUC, an inclusive projection could possibly match a very large number of rows that might fall outside of the predicate range if the RollingFileAppender for this rarely observed predicate at this Task Manager buffers its data for a very long time before writing (say days or even weeks in a longer running streaming query).
Do you mean that the Flink streaming reader won't see buffered data that has not yet been committed to the Iceberg table? Actually, that's exactly the expected behavior. Say we have a data pipeline:
(flink-streaming-sink-job-A) -> (iceberg table) -> (flink-streaming-reader-job-B).
The upstream flink-streaming-sink-job-A appends records to the Iceberg table continuously and commits when a checkpoint happens. We need to guarantee transactional semantics, so the downstream Flink streaming reader can only see committed Iceberg data; the delta between two contiguous snapshots is the incremental data that the streaming reader should consume.
}
}

class RollingFileAppender implements Closeable {
Minor: This doesn't implement FileAppender, so maybe RollingFileWriter would make more sense?
Well, sounds great.
}

if (currentAppender == null) {
  currentAppender = new RollingFileAppender(currentKey);
It would be nice to not change the logic for opening an appender. Before, this was part of the flow of changing partitions and I don't see any value in moving it.
We've now changed to maintain the partitionKey inside the RollingFileWriter (as we discussed before, this is because the fanout writer may have multiple writers appending records), so creating the RollingFileAppender is effectively what sets the partition key. I did not open the appender here because we only need to open one when there's a real record to write (to avoid opening an appender without writing a record); all of that logic is hidden inside the RollingFileAppender.
PartitionedWriter(PartitionSpec spec, FileFormat format, SparkAppenderFactory appenderFactory,
                  OutputFileFactory fileFactory, FileIO io, long targetFileSize, Schema writeSchema) {
private PartitionKey currentKey = null;
private RollingFileAppender currentAppender = null;
Now that the current key is null, we will need a check before adding it to completedPartitions in the write method:

if (!key.equals(currentKey)) {
  closeCurrent();

  if (currentKey != null) {
    // if the key is null, there was no previous current key
    completedPartitions.add(currentKey);
  }
  ...
}
Nice catch.
@Override
public void write(T record) throws IOException {
  if (currentAppender == null) {
    currentAppender = new RollingFileAppender(null);
Why not initialize currentAppender in the constructor? Then we don't need an additional null check in write, which is called in a tight loop.
I refactored this part because we don't need to initialize a real writer if no records come in. Before this patch, it would open a real file writer even if there was no record to write, and in the end we would need to close that useless writer and clean up its file.
spark/src/main/java/org/apache/iceberg/spark/source/RowDataRewriter.java (resolved)
this.close();

List<DataFile> dataFiles = complete();
return new TaskCommit(new TaskResult(dataFiles));
If complete doesn't produce TaskResult, then I'm not sure that we need it at all anymore. Could we just construct TaskCommit directly?
OK
while (iterator.hasNext()) {
  iterator.next().close();
  // Remove from the writers after closed.
  iterator.remove();
Many iterator classes don't implement remove. What about iterating over the key set separately instead?

if (!writers.isEmpty()) {
  for (PartitionKey key : writers.keySet()) {
    RollingFileAppender writer = writers.remove(key);
    writer.close();
  }
}
OK, sounds good.
private final Map<PartitionKey, RollingFileAppender> writers = Maps.newHashMap();

public PartitionedFanoutWriter(PartitionSpec spec, FileFormat format, FileAppenderFactory<T> appenderFactory,
                               OutputFileFactory fileFactory, FileIO io, long targetFileSize) {
I think this is fine, but you might want to move this into Flink and combine it with the Flink-specific writer. There are a lot of concerns that might need to change for this class, like using a LRU cache for writers, incrementally releasing files, etc. Since this is only used by Flink, we might just want to iterate on it there instead of trying to maintain this as an independent class. We can always bring it back out when we have an additional use case.
Thanks, @openinx! The
Ping @rdblue, mind taking another look? Thanks.
<h3>Class org.apache.iceberg.spark.source.SparkBatchWrite.TaskCommit extends org.apache.iceberg.spark.source.TaskResult implements Serializable</h3>
<h3>Class org.apache.iceberg.spark.source.SparkBatchWrite.TaskCommit implements Serializable</h3>
The Javadoc for a release should not be modified. I think this is probably a search and replace error.
Yes, you are right. We shouldn't change the 0.9.0 Javadoc; let's revert it.
You are right, we should not change the Javadoc of the 0.9.0 release.
}

private void openCurrent() {
  if (spec.fields().size() == 0) {
Unpartitioned writers pass a null partition key. Would it make more sense to use that instead of using spec?
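A rough sketch of the null-key variant (assuming the file factory offers both an unpartitioned and a partitioned newOutputFile variant; exact signatures may differ):

private void openCurrent() {
  if (partitionKey == null) {
    // unpartitioned writers pass a null partition key, so no partition path component is needed
    this.currentFile = fileFactory.newOutputFile();
  } else {
    this.currentFile = fileFactory.newOutputFile(partitionKey);
  }
  // then open the appender for the new file, as before
}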
Yeah, it makes sense. Thanks.
Yeah, it makes sense.
spark2/src/main/java/org/apache/iceberg/spark/source/Writer.java (resolved)
@Override
public WriterCommitMessage commit() throws IOException {
  this.close();
No need to use the prefix this for close calls, is there?
TaskCommit(TaskResult result) {
  super(result.files());
public static class TaskCommit implements WriterCommitMessage {
  private final List<DataFile> taskFiles;
Same here, this class should use an Array of data files.
closeCurrent();
}

private void closeCurrent() throws IOException {
Is this method needed? Why not merge it with close?
Yeah, its logic could just be moved to close().
core/src/main/java/org/apache/iceberg/io/UnpartitionedWriter.java (resolved)
@rdblue I've addressed all of the latest comments, thanks.
Thanks, @openinx! I fixed the minor problem that caused tests to fail and merged this.
Thanks for the fix.
When I implemented PR #1145, I found that the Flink TaskWriter shares most of its code with Spark. So I did some abstraction to move the common logic into the iceberg-core module, so that both of them can share it. FYI @rdblue.