
Conversation

@HeartSaVioR (Contributor) commented Nov 19, 2019

What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to FileStreamSink. This is needed because the output of FileStreamSink can be read by multiple Spark queries, and those queries read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only handles the case where the source path, without a glob pattern, refers to the output directory of FileStreamSink, by checking whether FileStreamSource leverages the metadata directory to list the source files.
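The detection described above amounts to probing for FileStreamSink's metadata directory. A minimal sketch of that idea (not the merged code; it uses java.nio on the local filesystem as a stand-in for Hadoop's FileSystem, and the helper name is hypothetical — `_spark_metadata` is the directory name FileStreamSink writes):

```scala
import java.nio.file.{Files, Path}

// Sketch only: a directory written by FileStreamSink contains a
// `_spark_metadata` subdirectory; its presence marks the path as sink output,
// in which case cleanup must be skipped.
def looksLikeFileStreamSinkOutput(dir: Path): Boolean = {
  val metadataDir = dir.resolve("_spark_metadata")
  Files.isDirectory(metadataDir)
}
```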

Why are the changes needed?

Without this patch, if end users turn on the cleanup option with a path that is the output of FileStreamSink, the metadata and the available files may go out of sync, which may break other queries reading the path.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT.

@HeartSaVioR (Contributor, Author) commented Nov 19, 2019

cc. @zsxwing

Please also review the assumption here:

To simplify the condition, this patch assumes that if the source files belong to FileStreamSink, the matched source path is the root of the FileStreamSink output directory. For example, suppose we provide the glob path /a/b/c/*/* and FileStreamSource processes the file /a/b/c/d/e/f/g/file. Then we only check /a/b/c/d/e to see whether there's a FileStreamSink metadata log available.

I can address the case where the metadata log is placed under a subdirectory of the glob path (like /a/b/c/d/e/f/_spark_metadata), or even under an ancestor of the glob path (like /a/b/_spark_metadata). I haven't addressed it yet because it would bring overhead, so I'd like to decide on a boundary (upper and lower) first and apply it afterwards.
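The assumption can be written down as a small helper (hypothetical name, not from the patch): the candidate sink root for a processed file is the file-path prefix with as many segments as the glob has.

```scala
// Sketch: for the glob /a/b/c/*/* (5 segments) and the file
// /a/b/c/d/e/f/g/file, the candidate root checked for a metadata
// log is /a/b/c/d/e — the prefix at the glob's depth.
def sinkRootCandidate(globPath: String, filePath: String): String = {
  val depth = globPath.split("/").count(_.nonEmpty)
  "/" + filePath.split("/").filter(_.nonEmpty).take(depth).mkString("/")
}
```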

    val fileSystem: FileSystem,
    val sourcePath: Path) extends Logging {

  private val srcPathToContainFileStreamSinkMetadata = new mutable.HashMap[Path, Boolean]
@HeartSaVioR (Contributor, Author):

This is a cache storing the result of checking whether the dir contains the metadata dir or not, as we may not want to do the check per batch. This is based on the assumption that a directory won't change from having metadata to not having it, or vice versa; please let me know if the assumption doesn't sound safe, and I'll remove the cache and check per batch.
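The caching idea described here can be sketched as a memoized check (illustrative names, not the patch's): the filesystem is probed once per path rather than once per batch.

```scala
import scala.collection.mutable

// Hedged sketch: memoize the per-directory metadata check. `hasMetadataDir`
// stands in for the real filesystem probe for `_spark_metadata`.
class MetadataDirCache(hasMetadataDir: String => Boolean) {
  private val cache = new mutable.HashMap[String, Boolean]

  // Returns the cached answer, computing (and caching) it on first use.
  def containsSinkMetadata(dir: String): Boolean =
    cache.getOrElseUpdate(dir, hasMetadataDir(dir))
}
```

The trade-off discussed below is exactly the weakness of this memoization: once an answer is cached, a later change to the directory (e.g. a sink query starting to write there) is never observed.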

@SparkQA commented Nov 19, 2019

Test build #114057 has finished for PR 26590 at commit 51ef7e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2019

Test build #114065 has finished for PR 26590 at commit 82b6c18.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

Retest this, please

@SparkQA commented Nov 19, 2019

Test build #114092 has finished for PR 26590 at commit 82b6c18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

To simplify the condition, this patch assumes that if the source files belong to FileStreamSink, the matched source path is the root of the FileStreamSink output directory. For example, suppose we provide the glob path /a/b/c/*/* and FileStreamSource processes the file /a/b/c/d/e/f/g/file. Then we only check /a/b/c/d/e to see whether there's a FileStreamSink metadata log available.

As a user I may have a directory structure where /a/b/_spark_metadata exists (or even /a/_spark_metadata). In such a case I would be grateful if Spark protected me from accidental file deletes/moves.


srcPathToEntries.filterKeys { srcPath =>
  srcPathToContainFileStreamSinkMetadata.get(srcPath) match {
    case Some(v) => !v
Contributor:

+1 on caching, since it spares quite some time.
I'm not yet sure whether it's forbidden to set an existing directory as a sink (I haven't found any explicit statement).
If it's allowed, the cache could contain false because the file /a/b/c/d was found but no _spark_metadata; then a sink query suddenly started on /a/b/c would make the cached value invalid.

Contributor Author:

Yeah, that's the reason I asked for more voices: the value of the cache can become invalid at any time, but without the cache we'd have to check every time, which is resource-inefficient. I might even be OK with not using a cache, given we'll do it in the background, but I want to check whether it's only me.

@HeartSaVioR (Contributor, Author) commented Nov 21, 2019

As a user I may have a directory structure where /a/b/_spark_metadata exists (or even /a/_spark_metadata). In such a case I would be grateful if Spark protected me from accidental file deletes/moves.

Yeah, that would be ideal, but then we'd have to check all subdirectories, and given that their status (whether they have metadata or not) can change, we would end up checking all subdirectories of the source files per batch. We might optimize the logic to only check each directory once per batch (regardless of the number of source files), but I'm still not 100% sure that's lightweight enough.

@zsxwing (Member) commented Nov 21, 2019

@HeartSaVioR I think we can simply detect whether we are using MetadataLogFileIndex here:

new MetadataLogFileIndex(sparkSession, qualifiedBasePath,

We don't need such a complicated check, because for the cases you are checking, we won't go through MetadataLogFileIndex, so the result would be incorrect anyway and the user should not use such a path.

@HeartSaVioR (Contributor, Author) commented Nov 21, 2019

@zsxwing
Ah OK, got it. That's a good point: reading files in a FileStreamSink output directory without the metadata information is unsafe anyway.

Btw, @gaborgsomogyi and I considered edge cases where the query reads sub-directories, or an ancestor (with the recursive option), of the FileStreamSink output directory, because the actual impact here is a side effect that affects other queries. It might be less problematic if the query merely read the directory incorrectly and produced incorrect output. The thing is, the query will also mess up the output directory, since processed files will be cleaned up; that lets the files and metadata go out of sync and makes other queries fail as well.

So I feel we still have to make a decision with the possible side effects in mind: 1) try our best to prevent all known cases at (high?) cost, or 2) consider these edge cases bad input and not handle them at all (maybe documenting it instead). What do you think?

@gaborgsomogyi (Contributor)

@HeartSaVioR Checking all the files in all the directories in each micro-batch is definitely overkill.
Considering the metadata, we can have the following cases:

  1. Metadata doesn't exist => the files were created outside of Spark, and starting a new Spark query intersecting with this directory should be considered an error.
  2. Metadata exists in the root => Spark created it, so we must use it, and we can rely on it not being deleted.
  3. Metadata exists but not in the root => Spark created part or all of the files, and in such a case delete/archive can break the metadata <=> files consistency.

Only the last case is questionable. Considering the complexity of the possible solution (globbing through the whole tree to find the metadata), we can document this as a configuration error. Of course, if there were a relatively simple way to detect it, it would be a good idea to stop the query in advance (but at first glance I can't find such an easy way).
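For reference, the cost in case 3 comes from having to consider every ancestor of each processed file as a possible metadata host. A string-based sketch of just the candidate enumeration (hypothetical helper, not in the patch; the real check would additionally probe the filesystem at each candidate, which is what makes it expensive):

```scala
// Enumerate every ancestor directory of a file path as a possible
// `_spark_metadata` location, from the shallowest to the deepest.
def ancestorMetadataCandidates(filePath: String): Seq[String] = {
  val dirs = filePath.split("/").filter(_.nonEmpty).dropRight(1) // drop the file name
  (1 to dirs.length).map(n => "/" + dirs.take(n).mkString("/") + "/_spark_metadata")
}
```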

@zsxwing (Member) commented Nov 21, 2019

Checking all the files in all the directories in each micro-batch is definitely overkill.

+1.

I think the fundamental issue is that the FileIndex interface doesn't work for complicated cases. There are multiple issues here. Another example: if a user is using a glob path in FileStreamSource, we always go to InMemoryFileIndex, even if some matched paths were created by FileStreamSink. InMemoryFileIndex knows nothing about MetadataLogFileIndex and uses its own logic to list files.

Ideally, the defensive code should be added when doing the file listing if we want to prevent such cases, because that would also prevent reading incorrect files. However, I think that's a pretty large change and probably not worth it (I haven't figured out how to make Hadoop's glob pattern code understand MetadataLogFileIndex; it may be impossible).

Hence I suggest we just block the cleanSource option when listing files using MetadataLogFileIndex.

…f the source path refers to the output dir of FileStreamSink
validFileEntities.foreach(cleaner.clean)

case _ =>
  logWarning("Ignoring 'cleanSource' option since Spark hasn't figured out whether " +
Contributor Author:

I just put logWarning here. I was about to throw IllegalStateException, since it doesn't sound feasible to have some files from commit() while FileStreamSource still cannot decide, but there might be some edge case, so I avoided being aggressive here.

Member:

How about throwing an UnsupportedOperationException here:

new MetadataLogFileIndex(sparkSession, qualifiedBasePath,

@HeartSaVioR (Contributor, Author) commented Nov 26, 2019

The only "odd" case I can imagine reaching here is:

  1. The query ran, wrote the commit log for the last batch, and stopped before writing the offset for the next batch.
  2. The query is restarted, and constructNextBatch is called.
  3. Somehow the source files are all deleted between 1) and 2); hence FileStreamSource doesn't see any file and cannot decide when fetchAllFiles is called.
  4. constructNextBatch will call commit for the previous batch the query executed before.

It's obviously a very odd case, as the content of the source directory is modified (maybe manually), which we don't support (so throwing an exception would be OK), but I'm not fully sure there are no other edge cases.

Btw, where do you recommend adding the exception: L287, or L205? If you're suggesting L205, I'm not sure I follow; if I'm understanding correctly, the case where the logic reaches case _ won't reach L205.

Contributor:

I also don't see yet which place the suggestion refers to.

L205, I'm not sure I follow

+1

L287: As I see it, this is more or less the should-never-happen case. The question is whether there are edge cases which may hit it. If we miss a valid case and throw an exception here, we may block a query from starting.

Member:

3. somehow the source files are all deleted between 1) and 2)

This should be a user error.

My general point is that we should make sure the data files and the metadata in _spark_metadata are consistent, and we should prevent cleaning up data files that are still tracked. Logging a warning without actually deleting files is a solution; however, most users won't notice this warning in their logs. Hence we should detect this earlier. There is already a variable sourceHasMetadata tracking whether the source is reading from a file stream sink or not. We can check the options and throw an exception when flipping it. What do you think?
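The fail-fast check suggested here could look roughly like the following (signature, message, and option values are assumptions for illustration, not the merged code; `cleanSource` is the FileStreamSource option being discussed, with "off" meaning no cleanup):

```scala
// Sketch: abort as soon as the source turns out to be reading FileStreamSink
// output while a cleanup mode is configured, instead of warning per batch.
def assertCleanSourceAllowed(sourceHasMetadata: Option[Boolean], cleanSource: String): Unit = {
  if (sourceHasMetadata.contains(true) && cleanSource != "off") {
    throw new UnsupportedOperationException(
      "Cleaning up source files is not supported when reading from the " +
        "output directory of FileStreamSink.")
  }
}
```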

Contributor Author:

Ah OK, I guess I got your point now. I'm also in favor of being "fail-fast", and the suggestion fits that. Thanks! Just updated.

@HeartSaVioR (Contributor, Author)

Thanks for the feedback. I changed the logic to check whether the source leverages the metadata or not. Please take a look again.

@SparkQA commented Nov 22, 2019

Test build #114262 has finished for PR 26590 at commit f9dc1a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2019

Test build #114263 has finished for PR 26590 at commit 8d6d08b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
validFileEntities.foreach(cleaner.clean)
sourceHasMetadata match {
  case Some(true) if !warnedIgnoringCleanSourceOption =>
Contributor:

Is it possible that this is called more than once? In that case, once the flag is set, case _ => will win.

Contributor Author:

Ah yes, I missed that. Nice catch.
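The bug pointed out above comes from putting the "already warned" flag in the pattern guard: on the second call, Some(true) no longer matches and control falls through to case _, which is the branch that performs cleanup. A sketch of the fix (illustrative names; counters stand in for logWarning so the behavior is observable):

```scala
var warnedIgnoringCleanSourceOption = false
var warningsEmitted = 0

// Keep the guard inside the branch, so repeated calls with Some(true)
// still take the warning branch instead of falling through to `case _`.
def cleanupOrWarn(sourceHasMetadata: Option[Boolean]): Unit =
  sourceHasMetadata match {
    case Some(true) =>
      if (!warnedIgnoringCleanSourceOption) {
        warningsEmitted += 1 // stands in for logWarning("Ignoring 'cleanSource' ...")
        warnedIgnoringCleanSourceOption = true
      }
    case _ =>
      () // proceed with cleaning up completed files
  }
```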

@SparkQA commented Nov 25, 2019

Test build #114406 has finished for PR 26590 at commit d1ec200.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

@zsxwing @gaborgsomogyi I think I've addressed all the review comments. Please take another round of review. Thanks in advance!

@SparkQA commented Dec 3, 2019

Test build #114741 has finished for PR 26590 at commit d7ded93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114742 has finished for PR 26590 at commit fcdb9e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

Bump.

@zsxwing (Member) commented Dec 5, 2019

LGTM.

retest this please. Triggering another test since the last run was 3 days ago.

@HeartSaVioR (Contributor, Author)

retest this, please

@SparkQA commented Dec 6, 2019

Test build #114918 has finished for PR 26590 at commit fcdb9e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Dec 6, 2019

Thanks! Merging to master.

@asfgit asfgit closed this in 25431d7 Dec 6, 2019
@HeartSaVioR (Contributor, Author)

Thanks all for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-29953 branch December 6, 2019 05:50
@HeartSaVioR (Contributor, Author)

@zsxwing
Btw, could you please revisit the comment in #22952 when you have time, so that we can fix it in time? It could be missed, so I feel it's better to address it sooner rather than later, at least before starting the 3.0.0 RC1 vote. Thanks in advance!
#22952 (comment)

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
… the files belong to the output of FileStreamSink

### What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to FileStreamSink. This is needed because the output of FileStreamSink can be read by multiple Spark queries, and those queries read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only handles the case where the source path, without a glob pattern, refers to the output directory of FileStreamSink, by checking whether FileStreamSource leverages the metadata directory to list the source files.

### Why are the changes needed?

Without this patch, if end users turn on the cleanup option with a path that is the output of FileStreamSink, the metadata and the available files may go out of sync, which may break other queries reading the path.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT.

Closes apache#26590 from HeartSaVioR/SPARK-29953.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
