
Conversation

@HeartSaVioR (Contributor) commented Nov 19, 2019

What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to FileStreamSink. This is needed because the output of FileStreamSink can be read by multiple Spark queries, and those queries read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only handles the case where the source path, without a glob pattern, refers to the output directory of FileStreamSink, by checking whether FileStreamSource leverages the metadata directory to list the source files.
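The detection described above amounts to probing for FileStreamSink's metadata directory. A minimal sketch of that idea (not the merged code; it uses java.nio on the local filesystem as a stand-in for Hadoop's FileSystem, and the helper name is hypothetical — `_spark_metadata` is the directory name FileStreamSink writes):

```scala
import java.nio.file.{Files, Path}

// Sketch only: a directory written by FileStreamSink contains a
// `_spark_metadata` subdirectory; its presence marks the path as sink output,
// in which case cleanup must be skipped.
def looksLikeFileStreamSinkOutput(dir: Path): Boolean = {
  val metadataDir = dir.resolve("_spark_metadata")
  Files.isDirectory(metadataDir)
}
```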

Why are the changes needed?

Without this patch, if end users turn on the cleanup option with a path that is the output of FileStreamSink, the metadata and the available files may go out of sync, which may break other queries reading the path.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT.

@HeartSaVioR (Contributor, Author) commented Nov 19, 2019

cc. @zsxwing

Please also review the assumption here:

To simplify the condition, this patch assumes that if the source files belong to FileStreamSink, the matched source path is the root of the FileStreamSink output directory. For example, suppose we provide the glob path /a/b/c/*/* and FileStreamSource processes the file /a/b/c/d/e/f/g/file. Then we only check /a/b/c/d/e to see whether there's a FileStreamSink metadata log available.

I can address the case where the metadata log is placed under a subdirectory of the glob path (like /a/b/c/d/e/f/_spark_metadata), or even under an ancestor of the glob path (like /a/b/_spark_metadata). I haven't addressed it yet because it would bring overhead, so I'd like to decide on a boundary (upper and lower) first and apply it afterwards.
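The assumption can be written down as a small helper (hypothetical name, not from the patch): the candidate sink root for a processed file is the file-path prefix with as many segments as the glob has.

```scala
// Sketch: for the glob /a/b/c/*/* (5 segments) and the file
// /a/b/c/d/e/f/g/file, the candidate root checked for a metadata
// log is /a/b/c/d/e — the prefix at the glob's depth.
def sinkRootCandidate(globPath: String, filePath: String): String = {
  val depth = globPath.split("/").count(_.nonEmpty)
  "/" + filePath.split("/").filter(_.nonEmpty).take(depth).mkString("/")
}
```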

    val fileSystem: FileSystem,
    val sourcePath: Path) extends Logging {

  private val srcPathToContainFileStreamSinkMetadata = new mutable.HashMap[Path, Boolean]
@HeartSaVioR (Contributor, Author):

This is a cache storing the result of checking whether the dir contains the metadata dir or not, as we may not want to do the check per batch. This is based on the assumption that a directory won't change from having metadata to not having it, or vice versa; please let me know if the assumption doesn't sound safe, and I'll remove the cache and check per batch.
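The caching idea described here can be sketched as a memoized check (illustrative names, not the patch's): the filesystem is probed once per path rather than once per batch.

```scala
import scala.collection.mutable

// Hedged sketch: memoize the per-directory metadata check. `hasMetadataDir`
// stands in for the real filesystem probe for `_spark_metadata`.
class MetadataDirCache(hasMetadataDir: String => Boolean) {
  private val cache = new mutable.HashMap[String, Boolean]

  // Returns the cached answer, computing (and caching) it on first use.
  def containsSinkMetadata(dir: String): Boolean =
    cache.getOrElseUpdate(dir, hasMetadataDir(dir))
}
```

The trade-off discussed below is exactly the weakness of this memoization: once an answer is cached, a later change to the directory (e.g. a sink query starting to write there) is never observed.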

@SparkQA commented Nov 19, 2019

Test build #114057 has finished for PR 26590 at commit 51ef7e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2019

Test build #114065 has finished for PR 26590 at commit 82b6c18.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

Retest this, please

@SparkQA commented Nov 19, 2019

Test build #114092 has finished for PR 26590 at commit 82b6c18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

To simplify the condition, this patch assumes that if the source files belong to FileStreamSink, the matched source path is the root of the FileStreamSink output directory. For example, suppose we provide the glob path /a/b/c/*/* and FileStreamSource processes the file /a/b/c/d/e/f/g/file. Then we only check /a/b/c/d/e to see whether there's a FileStreamSink metadata log available.

As a user I may have a directory structure where /a/b/_spark_metadata exists (or even /a/_spark_metadata). In such a case I would be grateful if Spark protected me from accidental file deletes/moves.


srcPathToEntries.filterKeys { srcPath =>
  srcPathToContainFileStreamSinkMetadata.get(srcPath) match {
    case Some(v) => !v
Contributor:

+1 on caching, since it spares quite some time.
I'm not yet sure whether it's forbidden to set an existing directory as a sink (I haven't found any explicit statement).
If it's allowed, the cache could contain false because the file /a/b/c/d was found but no _spark_metadata; then a sink query suddenly started on /a/b/c would make the cached value invalid.

Contributor Author:

Yeah, that's the reason I asked for more voices: the value of the cache can become invalid at any time, but without the cache we'd have to check every time, which is resource-inefficient. I might even be OK with not using a cache, given we'll do it in the background, but I want to check whether it's only me.

@HeartSaVioR (Contributor, Author) commented Nov 21, 2019

As a user I may have a directory structure where /a/b/_spark_metadata exists (or even /a/_spark_metadata). In such a case I would be grateful if Spark protected me from accidental file deletes/moves.

Yeah, that would be ideal, but then we'd have to check all subdirectories, and given that their status (whether they have metadata or not) can change, we would end up checking all subdirectories of the source files per batch. We might optimize the logic to only check each directory once per batch (regardless of the number of source files), but I'm still not 100% sure that's lightweight enough.

@zsxwing (Member) commented Nov 21, 2019

@HeartSaVioR I think we can simply detect whether we are using MetadataLogFileIndex here:

new MetadataLogFileIndex(sparkSession, qualifiedBasePath,

We don't need such a complicated check, because for the cases you are checking, we won't go through MetadataLogFileIndex, so the result would be incorrect anyway and the user should not use such a path.

@HeartSaVioR (Contributor, Author) commented Nov 21, 2019

@zsxwing
Ah OK, got it. That's a good point: reading files in a FileStreamSink output directory without the metadata information is unsafe anyway.

Btw, @gaborgsomogyi and I considered edge cases where the query reads sub-directories, or an ancestor (with the recursive option), of the FileStreamSink output directory, because the actual impact here is a side effect that affects other queries. It might be less problematic if the query merely read the directory incorrectly and produced incorrect output. The thing is, the query will also mess up the output directory, since processed files will be cleaned up; that lets the files and metadata go out of sync and makes other queries fail as well.

So I feel we still have to make a decision with the possible side effects in mind: 1) try our best to prevent all known cases at (high?) cost, or 2) consider these edge cases bad input and not handle them at all (maybe documenting it instead). What do you think?

@gaborgsomogyi (Contributor)

@HeartSaVioR Checking all the files in all the directories in each micro-batch is definitely overkill.
Considering the metadata, we can have the following cases:

  1. Metadata doesn't exist => the files were created outside of Spark, and starting a new Spark query intersecting with this directory should be considered an error.
  2. Metadata exists in the root => Spark created it, so we must use it, and we can rely on it not being deleted.
  3. Metadata exists but not in the root => Spark created part or all of the files, and in such a case delete/archive can break the metadata <=> files consistency.

Only the last case is questionable. Considering the complexity of the possible solution (globbing through the whole tree to find the metadata), we can document this as a configuration error. Of course, if there were a relatively simple way to detect it, it would be a good idea to stop the query in advance (but at first glance I can't find such an easy way).
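For reference, the cost in case 3 comes from having to consider every ancestor of each processed file as a possible metadata host. A string-based sketch of just the candidate enumeration (hypothetical helper, not in the patch; the real check would additionally probe the filesystem at each candidate, which is what makes it expensive):

```scala
// Enumerate every ancestor directory of a file path as a possible
// `_spark_metadata` location, from the shallowest to the deepest.
def ancestorMetadataCandidates(filePath: String): Seq[String] = {
  val dirs = filePath.split("/").filter(_.nonEmpty).dropRight(1) // drop the file name
  (1 to dirs.length).map(n => "/" + dirs.take(n).mkString("/") + "/_spark_metadata")
}
```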

@zsxwing (Member) commented Nov 21, 2019

Checking all the files in all the directories in each micro-batch is definitely overkill.

+1.

I think the fundamental issue is that the FileIndex interface doesn't work for complicated cases. There are multiple issues here. Another example: if a user is using a glob path in FileStreamSource, we always go to InMemoryFileIndex, even if some matched paths were created by FileStreamSink. InMemoryFileIndex knows nothing about MetadataLogFileIndex and uses its own logic to list files.

Ideally, the defensive code should be added when doing the file listing if we want to prevent such cases, because that would also prevent reading incorrect files. However, I think that's a pretty large change and probably not worth it (I haven't figured out how to make Hadoop's glob pattern code understand MetadataLogFileIndex; it may be impossible).

Hence I suggest we just block the cleanSource option when listing files using MetadataLogFileIndex.

…f the source path refers to the output dir of FileStreamSink
validFileEntities.foreach(cleaner.clean)

case _ =>
  logWarning("Ignoring 'cleanSource' option since Spark hasn't figured out whether " +
Contributor Author:

I just put logWarning here. I was about to throw IllegalStateException, since it doesn't sound feasible to have some files from commit() while FileStreamSource still cannot decide, but there might be some edge case, so I avoided being aggressive here.

Member:

How about throwing an UnsupportedOperationException here:

new MetadataLogFileIndex(sparkSession, qualifiedBasePath,

@HeartSaVioR (Contributor, Author) commented Nov 26, 2019

The only "odd" case I can imagine reaching here is:

  1. The query ran, wrote the commit log for the last batch, and stopped before writing the offset for the next batch.
  2. The query is restarted, and constructNextBatch is called.
  3. Somehow the source files are all deleted between 1) and 2); hence FileStreamSource doesn't see any file and cannot decide when fetchAllFiles is called.
  4. constructNextBatch will call commit for the previous batch the query executed before.

It's obviously a very odd case, as the content of the source directory is modified (maybe manually), which we don't support (so throwing an exception would be OK), but I'm not fully sure there are no other edge cases.

Btw, where do you recommend adding the exception: L287, or L205? If you're suggesting L205, I'm not sure I follow; if I'm understanding correctly, the case where the logic reaches case _ won't reach L205.

Contributor:

I also don't see yet which place the suggestion refers to.

L205, I'm not sure I follow

+1

L287: As I see it, this is more or less the should-never-happen case. The question is whether there are edge cases which may hit it. If we miss a valid case and throw an exception here, we may block a query from starting.

Member:

3. somehow the source files are all deleted between 1) and 2)

This should be a user error.

My general point is that we should make sure the data files and the metadata in _spark_metadata are consistent, and we should prevent cleaning up data files that are still tracked. Logging a warning without actually deleting files is a solution; however, most users won't notice this warning in their logs. Hence we should detect this earlier. There is already a variable sourceHasMetadata tracking whether the source is reading from a file stream sink or not. We can check the options and throw an exception when flipping it. What do you think?
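The fail-fast check suggested here could look roughly like the following (signature, message, and option values are assumptions for illustration, not the merged code; `cleanSource` is the FileStreamSource option being discussed, with "off" meaning no cleanup):

```scala
// Sketch: abort as soon as the source turns out to be reading FileStreamSink
// output while a cleanup mode is configured, instead of warning per batch.
def assertCleanSourceAllowed(sourceHasMetadata: Option[Boolean], cleanSource: String): Unit = {
  if (sourceHasMetadata.contains(true) && cleanSource != "off") {
    throw new UnsupportedOperationException(
      "Cleaning up source files is not supported when reading from the " +
        "output directory of FileStreamSink.")
  }
}
```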

Contributor Author:

Ah OK, I guess I got your point now. I'm also in favor of being "fail-fast", and the suggestion fits that. Thanks! Just updated.

@HeartSaVioR (Contributor, Author)

Thanks for the feedback. I changed the logic to check whether the source leverages the metadata or not. Please take a look again.

@SparkQA commented Nov 22, 2019

Test build #114262 has finished for PR 26590 at commit f9dc1a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2019

Test build #114263 has finished for PR 26590 at commit 8d6d08b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
validFileEntities.foreach(cleaner.clean)
sourceHasMetadata match {
  case Some(true) if !warnedIgnoringCleanSourceOption =>
Contributor:

Is it possible that this is called more than once? In that case, once the flag is set, case _ => will win.

Contributor Author:

Ah yes, I missed that. Nice catch.
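The bug pointed out above comes from putting the "already warned" flag in the pattern guard: on the second call, Some(true) no longer matches and control falls through to case _, which is the branch that performs cleanup. A sketch of the fix (illustrative names; counters stand in for logWarning so the behavior is observable):

```scala
var warnedIgnoringCleanSourceOption = false
var warningsEmitted = 0

// Keep the guard inside the branch, so repeated calls with Some(true)
// still take the warning branch instead of falling through to `case _`.
def cleanupOrWarn(sourceHasMetadata: Option[Boolean]): Unit =
  sourceHasMetadata match {
    case Some(true) =>
      if (!warnedIgnoringCleanSourceOption) {
        warningsEmitted += 1 // stands in for logWarning("Ignoring 'cleanSource' ...")
        warnedIgnoringCleanSourceOption = true
      }
    case _ =>
      () // proceed with cleaning up completed files
  }
```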

@SparkQA commented Nov 25, 2019

Test build #114406 has finished for PR 26590 at commit d1ec200.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

@zsxwing @gaborgsomogyi I think I've addressed all the review comments. Please take another round of review. Thanks in advance!

@SparkQA commented Dec 3, 2019

Test build #114741 has finished for PR 26590 at commit d7ded93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114742 has finished for PR 26590 at commit fcdb9e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

Bump.

@zsxwing (Member) commented Dec 5, 2019

LGTM.

retest this please. Triggering another test since the last run was 3 days ago.

@HeartSaVioR (Contributor, Author)

retest this, please

@SparkQA commented Dec 6, 2019

Test build #114918 has finished for PR 26590 at commit fcdb9e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Dec 6, 2019

Thanks! Merging to master.

@asfgit asfgit closed this in 25431d7 Dec 6, 2019
@HeartSaVioR (Contributor, Author)

Thanks all for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-29953 branch December 6, 2019 05:50
@HeartSaVioR (Contributor, Author)

@zsxwing
Btw, could you please revisit the comment in #22952 when you have time, so that we can fix it in time? It could be missed, so I feel it's better to address it sooner rather than later, at least before starting the 3.0.0 RC1 vote. Thanks in advance!
#22952 (comment)

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
… the files belong to the output of FileStreamSink

### What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to FileStreamSink. This is needed because the output of FileStreamSink can be read by multiple Spark queries, and those queries read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only handles the case where the source path, without a glob pattern, refers to the output directory of FileStreamSink, by checking whether FileStreamSource leverages the metadata directory to list the source files.

### Why are the changes needed?

Without this patch, if end users turn on the cleanup option with a path that is the output of FileStreamSink, the metadata and the available files may go out of sync, which may break other queries reading the path.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT.

Closes apache#26590 from HeartSaVioR/SPARK-29953.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
