
Conversation

@HeartSaVioR (Contributor) commented Jun 23, 2020:

What changes were proposed in this pull request?

Many operations on CompactibleFileStreamLog read a metadata log file and materialize all of its entries into memory. Due to the nature of the compact operation, CompactibleFileStreamLog can end up with a huge compact log file containing a large number of entries, and for now the number of entries is monotonically increasing, which means the amount of memory needed to materialize them grows with it. This puts pressure on the GC.

This patch proposes to stream entries through the logic of the file stream source and sink wherever possible to avoid the memory issue. To make this possible we have to break the existing behavior of excluding entries: today the compactLogs method is called with all entries at once, which forces us to materialize everything in memory. This should have no effect on end users, because only the file stream sink has a condition for excluding entries, and that condition has never been true. (DELETE_ACTION has never been set.)
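
To sketch the idea (illustrative only; the helpers readBatch and foreachEntryInBatch are hypothetical placeholders, while shouldRetain and serializeEntry come from this patch), the compact path moves from "materialize, then filter" to "filter while streaming":

    import java.io.OutputStream

    // Before: load every entry of every batch into memory, then filter once.
    def compactOld(batchIds: Seq[Long], out: OutputStream): Unit = {
      val allEntries = batchIds.flatMap(id => readBatch(id)) // whole log resident in memory
      compactLogs(allEntries).foreach(entry => serializeEntry(entry, out))
    }

    // After: stream each entry through a per-entry predicate while writing,
    // so only one entry needs to be resident at a time.
    def compactNew(batchIds: Seq[Long], out: OutputStream): Unit = {
      batchIds.foreach { id =>
        foreachEntryInBatch(id) { entry =>
          if (shouldRetain(entry)) serializeEntry(entry, out)
        }
      }
    }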

Based on this observation, the patch also adjusts an existing UT that simulates a situation where an "A" file is added and a later batch marks the "A" file as deleted. That situation simply doesn't work with this change, but as mentioned above it has never happened in practice. (I'm not sure the UT was derived from an actual run; I guess not.)

Why are the changes needed?

The memory issue (OOME) has been reported in both the JIRA issue and the user mailing list.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Existing UTs
  • Manual test done

The manual test leverages a simple app which continuously writes to the file stream sink metadata log.

HeartSaVioR/spark-delegation-token-experiment@bea7680

The test is configured so that a batch metadata log file is about 1.9M (10,000 entries), while the rest of the Spark configuration is left at the defaults (compact interval = 10). The app runs as the driver, and the driver heap is set to 3g.
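
(For reference, a minimal sketch of such a setup; the app name below is illustrative, not from the linked repo, and the driver heap is passed separately, e.g. via --driver-memory 3g:)

    import org.apache.spark.sql.SparkSession

    // Leave the compact interval at Spark's default of 10 batches.
    val spark = SparkSession.builder()
      .appName("file-sink-metadata-stress")
      .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
      .getOrCreate()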

before the patch

[Screenshot: driver memory/GC behavior before the patch, 2020-06-23]

It ran for only 40 minutes, with the latest compact batch file at 1.3G. The process struggled with GC and, after some struggling, threw an OOME.

after the patch

[Screenshot: driver memory/GC behavior after the patch, 2020-06-23]

It sustained a 2-hour run (stopped manually; it was expected to run longer), with the latest compact batch file at 2.2G. Actual memory usage never even reached 1.2G and was cleaned up quickly, without outstanding GC activity.

 * Returns all files except the deleted ones.
 */
- def allFiles(): Array[T] = {
+ def allFiles(predicate: T => Boolean = _ => true): Array[T] = {
@HeartSaVioR (Contributor, Author):

We could also have a streamlined version of this method to avoid materializing all entries when initializing FileStreamSource, though I think there's another problem we should solve first (the file stream source log shouldn't contain this many entries in the first place - think about other data sources), and once we fix that issue this won't matter at all.
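
(For example, with the new default predicate existing call sites stay source-compatible, while a caller can now push a filter down instead of filtering the materialized array; sinkLog and the size check here are hypothetical:)

    // Before: materialize all entries, then filter.
    val retainedOld = sinkLog.allFiles().filter(f => f.size > 0)
    // After: push the predicate down so filtering happens while reading.
    val retainedNew = sinkLog.allFiles(f => f.size > 0)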

  val logs =
    getAllValidBatches(latestId, compactInterval).flatMap { id =>
-     super.get(id).getOrElse {
+     filterInBatch(id)(shouldRetain).getOrElse {
@HeartSaVioR (Contributor, Author):

This would help when we introduce a new condition on exclusion of entries.
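
(A minimal sketch of what filterInBatch can look like on top of the stream-based reader; the actual implementation in this patch may differ:)

    def filterInBatch(batchId: Long)(predicate: T => Boolean): Option[Array[T]] = {
      try {
        Some(applyFnToBatchByStream(batchId) { input =>
          scala.io.Source.fromInputStream(input, "UTF-8").getLines()
            .drop(1)                      // skip the version header line, e.g. "v1"
            .map(deserializeEntry)
            .filter(predicate)
            .toArray
        })
      } catch {
        case _: java.io.FileNotFoundException => None // the batch may not exist yet
      }
    }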

  // Otherwise, this is a real IO issue and we should throw it.
- latestId = nextCompactionBatchId(latestId, compactInterval)
- super.get(latestId).getOrElse {
+ val expectedMinLatestId = nextCompactionBatchId(latestId, compactInterval)
@HeartSaVioR (Contributor, Author) commented Jun 23, 2020:

This new approach avoids reading the next compact log file, which would materialize all of its entries in memory. It should be an extreme case, so it's also OK to keep the previous code if someone strongly thinks it's better.

- logs.filter(f => !deletedFiles.contains(f.path))
- }
+ override def shouldRetain(log: SinkFileStatus): Boolean = {
+   log.action != FileStreamSinkLog.DELETE_ACTION
@HeartSaVioR (Contributor, Author):

While I've kept this for now, I think we should just remove it. As the TODO below notes, it has never been used; it exists only hypothetically.

@HeartSaVioR (Contributor, Author):

Just thinking out loud: DELETE_ACTION is not needed even if we consider data compaction, as long as we do the data compaction in the compact batch. Nevertheless, let's focus on the current state for now. DELETE_ACTION can be dropped.

@HeartSaVioR (Contributor, Author):

I'll just remove this condition. I believe we shouldn't be concerned about DELETE_ACTION, but even if we are, it would work better to leave it to the reader side, as FileStreamSinkLog is no longer able to match ADD_ACTION/DELETE_ACTION pairs and remove the entries.

Again, DELETE_ACTION should just be dropped, unless we have a plan to implement it properly.

object FileStreamSinkLog {
val VERSION = 1
// TODO: This action hasn't been used from the introduction. We should just remove this.
// TODO: We can remove the field "action" as well, ignoring "action" in existing metadata log.
@HeartSaVioR (Contributor, Author):

Note that this would also help reduce the size of each entry. It's OK to leave only ADD_ACTION if the JSON serializer/deserializer complains.

@HeartSaVioR (Contributor, Author):

Let's consider this only when we are open to introducing another version. I don't want to deal with compatibility issues without bumping the version.

@HeartSaVioR (Contributor, Author):

Also cc @xuanyuanking @uncleGen, as they have reviewed sibling PRs.


SparkQA commented Jun 23, 2020

Test build #124408 has finished for PR 28904 at commit 7721fff.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 23, 2020

Test build #124403 has finished for PR 28904 at commit b00e262.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 23, 2020

Test build #124410 has finished for PR 28904 at commit a2daf01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

Just to be sure, the test app with the patch has now written version 1589; the log file size is 2.9G, with RES at 1.025g.

@HeartSaVioR (Contributor, Author) commented Jun 23, 2020:

Combining the efforts together:

10x speedup, reduced log file size (I missed batch 919, but 1369.compact took only 678M), and consistent memory usage (under 1.2G).

@HeartSaVioR (Contributor, Author) commented Jun 27, 2020:

I've been running sustained tests (still running) and here are some observations:

  • SPARK-30946 + SPARK-30462 just wrote compact batch 7309 (73,100,000 entries), which is 3.5G in size, and RES has stayed below 2G. (It increases a bit; it may also be better to run with a smaller heap.) It took 169517 ms (2.8 mins) to build compact batch 7309, which is similar to the time to build compact batch 479 (4,800,000 entries) without SPARK-30946, i.e. roughly 15x the entries processed in the same time.
  • SPARK-30462 by itself just wrote compact batch 1739 (17,390,000 entries), which is 3.1G in size, and RES has stayed below 2G.

@xuanyuanking (Member):

Very impressive! I'll review this in 2 days.

@HeartSaVioR (Contributor, Author):

UPDATE: SPARK-30946 + SPARK-30462 has now written batch 11879, and RES is still below 2G (around 1.7G). I'll stop the sustained test with ample heap and run another sustained test with a smaller heap (1.5G).

@HeartSaVioR (Contributor, Author):

UPDATE: SPARK-30946 + SPARK-30462 with driver memory lowered to 1.5G has now written batch 9039, and RES is around 1.3g. I guess the process uses up whatever memory is available, but doesn't run into OOME.

* NOTE: This no longer fails early on corruption. The caller should handle the exception
* properly and make sure the logic is not affected by failing in the middle.
*/
def applyFnToBatchByStream[RET](batchId: Long)(fn: InputStream => RET): RET = {
Member:

nit: duplicated code with get.
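
(One way to remove that duplication, as a sketch rather than the code in this PR, is to express get in terms of the stream-based helper and keep the materializing behavior only at this call site:)

    def get(batchId: Long): Option[Array[T]] = {
      try {
        // Reuse the streaming reader; materialize only for callers that need it.
        Some(applyFnToBatchByStream(batchId)(input => deserialize(input)))
      } catch {
        case _: java.io.FileNotFoundException => None
      }
    }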


SparkQA commented Jul 1, 2020

Test build #124716 has finished for PR 28904 at commit 849e421.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

This test seems to be failing frequently. I'll check the other build result being run as 124719.


SparkQA commented Jul 1, 2020

Test build #124719 has finished for PR 28904 at commit 11cb693.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

retest this, please


SparkQA commented Jul 2, 2020

Test build #124870 has finished for PR 28904 at commit 11cb693.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

retest this, please


SparkQA commented Jul 2, 2020

Test build #124877 has finished for PR 28904 at commit 11cb693.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

retest this, please


SparkQA commented Jul 5, 2020

Test build #124945 has finished for PR 28904 at commit 11cb693.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

retest this, please


SparkQA commented Jul 6, 2020

Test build #124973 has finished for PR 28904 at commit 11cb693.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1,3 +1,2 @@
v1
{"path":"/a/b/8","size":800,"isDir":false,"modificationTime":800,"blockReplication":1,"blockSize":100,"action":"add"}
{"path":"/a/b/0","size":100,"isDir":false,"modificationTime":100,"blockReplication":1,"blockSize":100,"action":"delete"}
@HeartSaVioR (Contributor, Author):

This should be removed: it gave the false impression that we previously supported DELETE_ACTION, when in reality it was never implemented.

@xuanyuanking (Member):

> Well, I guess I already explained why compactLogs is the culprit of the memory issue, right? (#28904 (comment))

Yep, that's right. I'm also looking at the code in detail, trying to find a way to both keep this API and get the improvement. If that's hard to achieve, of course the improvement has higher priority. I'll take a closer look today.

@HeartSaVioR (Contributor, Author):

Technically it's a private API, not even tagged as a developer API; that said, changing it doesn't break anything from Spark's perspective. If there's confusion about the availability of the org.apache.spark.sql.execution package outside of Spark, I'd rather say we should reconsider adding private[execution] everywhere in the package.


SparkQA commented Jul 16, 2020

Test build #125933 has finished for PR 28904 at commit 247a0a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

retest this, please


SparkQA commented Jul 16, 2020

Test build #125956 has finished for PR 28904 at commit 247a0a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author):

cc @tdas @zsxwing @jose-torres, as it's technically not possible to merge with only @xuanyuanking's review.


itsvikramagr commented Jul 30, 2020

@HeartSaVioR - This is a much-needed fix. Thanks for it.

I have an orthogonal question: why do we need to worry about file sink metadata files at all? I can think of the following reasons:

  • Downstream read operations can read the compacted metadata file to list all committed files, avoiding the listing cost and improving performance.
  • It helps with exactly-once semantics: on task failures, we don't have to worry about deleting any files already written.

If the compacted metadata file size runs into GBs, the number of valid files would be in the millions. In practice, the end user will treat this sink path as a staging location and run another job to compact these small files into a final destination.

For exactly-once semantics, we can make changes in ManifestFileCommitter to delete files in the abort function, or we can come up with some other alternative.

In short: if we provide an option to keep only the last few commits in the sink metadata (so SS itself is not impacted), and change the various readers not to read via the metadata log files, won't that help ensure the reliability of the streaming job?

@HeartSaVioR (Contributor, Author) commented Jul 30, 2020:

> For exactly-once semantics, we can make changes in ManifestFileCommitter to delete files in the abort function, or we can come up with some other alternative.

I already provided that change (in Spark 3.0.0, if I remember correctly), but it is only best-effort; it cannot deal with crash scenarios.

Given that we write files directly to the final path, it's worth noting that reading output files from the directory directly doesn't only mean you may read duplicated outputs (at-least-once); it also means you may read incomplete/corrupted files if a crash occurred during a write.

Stepping back to explain the rationale and the goal: providing a holistic solution is not the goal.

There are already lots of efforts to provide a holistic solution, though they are happening "outside" of the Spark codebase: Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just be reinventing the wheel if I tried to address the entire problem, and I can't persuade anyone to give me enough time to work on that.
(Please refer to the comment with the feedback I got - #27694 (comment) - and my reply - #27694 (comment).)

My goal for the overall improvements in the file stream source/sink is to let end users run their queries a bit longer without weird issues (like OOM). I had to take an easier (and limited) approach to solve the issue. E.g., regarding the growing-entries issue: while the alternatives support data compaction to reduce the overall number of files without losing anything, I proposed a retention policy on output files, so that older files can be excluded from the metadata log at the cost of no longer being accessible. End users need to pick one of the alternatives once they can't live with these limitations.

@HeartSaVioR (Contributor, Author):

cc @tdas @zsxwing @jose-torres - a kind reminder.

@dongjoon-hyun (Member):

Retest this please.

@dongjoon-hyun (Member) left a comment:

For the following notable change, I also agree with @HeartSaVioR's assessment (#28904 (comment)), but we didn't receive any strong feedback here.

- def compactLogs(logs: Seq[T]): Seq[T]
+ def shouldRetain(log: T): Boolean

Please let us know if you disagree, @tdas, @zsxwing, @jose-torres. If there is no written feedback, this is considered OK by the community, since two committers (@HeartSaVioR and I) already agree in the written discussion.

For the remaining part, we will continue to discuss.


SparkQA commented Aug 9, 2020

Test build #127241 has finished for PR 28904 at commit 247a0a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


object FileStreamSinkLog {
val VERSION = 1
// TODO: This action hasn't been used from the introduction. We should just remove this.
@dongjoon-hyun (Member) commented Aug 10, 2020:

Could you file a JIRA and make this an ID'd TODO, @HeartSaVioR?
That would improve the visibility of the DELETE_ACTION-related issues and lead to a follow-up discussion on DELETE_ACTION after this PR.

Member:

Add a JIRA number in the comment?

@HeartSaVioR (Contributor, Author):

Thanks for the suggestion. I'll file an issue and add the JIRA number to the code comment.

def foreachInBatch(batchId: Long)(fn: T => Unit): Unit = applyFnInBatch(batchId)(_.foreach(fn))

/**
* Apply filter on all entries in the specific batch.
Member:

ditto. Shall we mention IllegalStateException?

@HeartSaVioR (Contributor, Author):

Ditto. Would we want to mention all possible exceptions per method? IllegalStateException is the explicit one, but there are many other possible exceptions, starting with IOException.

 */
- def compactLogs(logs: Seq[T]): Seq[T]
+ /** Determine whether the log should be retained or not. */
+ def shouldRetain(log: T): Boolean
@dongjoon-hyun (Member) commented Aug 10, 2020:

@HeartSaVioR, can we have def shouldRetain(log: T): Boolean = true to simplify?
true is a reasonable default value for a log, isn't it? And in this PR, all derived classes override it with true only. I believe the special derived classes will override it anyway according to their semantics.

@HeartSaVioR (Contributor, Author):

We'll enable the condition on the file stream source and sink via #28363 and #28422, so eventually the majority of derived classes will override the method. But I agree that true is a reasonable default value. I'll update this. Thanks!

@dongjoon-hyun (Member) left a comment:

Thank you for this improvement. I left a few minor comments, @HeartSaVioR.

* FileNotFoundException if the metadata log file doesn't exist.
*
* NOTE: This doesn't fail early on corruption. The caller should handle the exception
* properly and make sure the logic is not affected by failing in the middle.
Member:

If the caller does not handle the exception properly, what is the consequence? Do we have a test case to cover it?

Member:

How to ensure all the callers handle it properly? Do we need to introduce a test to cover it?

@HeartSaVioR (Contributor, Author):

I'm not sure how this class can ensure that callers follow the guidance. Did you mean we should test this behavior with the derived classes (the file stream source/sink logs), or with a test-purpose implementation of CompactibleFileStreamLog?

}
}

def serializeEntry(entry: T, out: OutputStream): Unit = {
Member:

private?

out.write(Serialization.write(entry).getBytes(UTF_8))
}

def deserializeEntry(line: String): T = Serialization.read[T](line)
Member:

private?


/**
* Store the metadata for the specified batchId and return `true` if successful. This method
* fills the content of metadata via executing function. If the function throws exception,
Member:

Nit: exception -> an exception

@HeartSaVioR (Contributor, Author):

Thanks for the review comments. Please allow a few more days (probably next week) for me to address them, as I'm on vacation this week and have limited time to look at my laptop. :) Thanks again!

@dongjoon-hyun (Member):

Sure. Have a nice vacation and take care, @HeartSaVioR .


SparkQA commented Aug 18, 2020

Test build #127530 has finished for PR 28904 at commit e16ebe4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @HeartSaVioR and all.
Merged to master for Apache Spark 3.1.0 in December 2020.

@HeartSaVioR (Contributor, Author):

Thanks all for reviewing and merging!
