[SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink #26590
@@ -259,6 +259,8 @@ class FileStreamSource(

   override def toString: String = s"FileStreamSource[$qualifiedBasePath]"

+  private var warnedIgnoringCleanSourceOption: Boolean = false
+
   /**
    * Informs the source that Spark has completed processing all data for offsets less than or
    * equal to `end` and will only request offsets greater than `end` in the future.

@@ -267,10 +269,22 @@ class FileStreamSource(
     val logOffset = FileStreamSourceOffset(end).logOffset

     sourceCleaner.foreach { cleaner =>
-      val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)
-      val validFileEntities = files.filter(_.batchId == logOffset)
-      logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
-      validFileEntities.foreach(cleaner.clean)
+      sourceHasMetadata match {
+        case Some(true) if !warnedIgnoringCleanSourceOption =>
+          logWarning("Ignoring 'cleanSource' option since source path refers to the output" +
+            " directory of FileStreamSink.")
+          warnedIgnoringCleanSourceOption = true
+
+        case Some(false) =>
+          val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)
+          val validFileEntities = files.filter(_.batchId == logOffset)
+          logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
+          validFileEntities.foreach(cleaner.clean)
+
+        case _ =>
+          logWarning("Ignoring 'cleanSource' option since Spark hasn't figured out whether " +
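For context, here is a minimal sketch of the configuration this change guards against: one query writes through a FileStreamSink (which tracks its committed files in a `_spark_metadata` directory under the output path), and a second query reads that same directory as a file source with `cleanSource` enabled. The paths, schema, and the rate-source upstream below are invented for illustration; only the `cleanSource` option name is taken from the diff above.

```scala
import org.apache.spark.sql.SparkSession

object CleanSourceOnSinkOutputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("clean-source-sketch").getOrCreate()

    // Upstream query: a FileStreamSink writing parquet to /tmp/out and recording the
    // committed files in /tmp/out/_spark_metadata.
    val upstream = spark.readStream.format("rate").load()
      .writeStream
      .format("parquet")
      .option("path", "/tmp/out")
      .option("checkpointLocation", "/tmp/ckpt-upstream")
      .start()

    // Downstream query: a FileStreamSource reading the sink's output directory with
    // 'cleanSource' set. With this patch the option is ignored (and a warning is logged
    // once), because deleting or archiving those files would break the sink's metadata log.
    val downstream = spark.readStream
      .format("parquet")
      .schema("timestamp TIMESTAMP, value LONG")
      .option("cleanSource", "delete")
      .load("/tmp/out")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ckpt-downstream")
      .start()

    downstream.awaitTermination()
  }
}
```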
Review thread on the line `new MetadataLogFileIndex(sparkSession, qualifiedBasePath,`:
The only "odd" case I can imagine to reach here is,
- the query executed which wrote the commit log of the last batch and stopped before writing the offset for next batch.
- the query is restarted, and
constructNextBatchis called. - somehow the source files are all deleted between 1) and 2), hence FileStreamSource doesn't see any file and cannot decide when
fetchAllFilesis called. constructNextBatchwill callcommitfor previous batch the query executed before.
It's obviously very odd case as the content of source directory are modified (maybe) manually which we don't support the case (so throwing exception would be OK), but I'm not fully sure there's no another edge-cases.
Btw, where do you recommend to add the exception? L287, or L205? If you're suggesting to add the exception in L205, I'm not sure I follow. If I'm understanding correctly, the case if the logic reaches case _ won't reach L205.
I also don't yet see which place the suggestion refers to.

> L205, I'm not sure I follow

+1

L287: As I see it, this is more or less the "should never happen" case. The question is whether we can think of edge cases that may hit it. If we miss a valid case and throw an exception here, we may block a query from starting.
> 3. somehow the source files are all deleted between 1) and 2)

This should be a user error.

My general point is that we should make sure the data files and the metadata in `_spark_metadata` stay consistent, and we should prevent cleaning up data files that are still tracked. Logging a warning without actually deleting the files is a solution; however, most users won't notice this warning in their logs, so we should detect the problem earlier. There is already a variable `sourceHasMetadata` tracking whether the source is reading from a file stream sink or not. We can check the options and throw an exception when flipping it. What do you think?
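A minimal, self-contained sketch of what that fail-fast check might look like; `assertCleanSourceAllowed`, the `CleanSourceMode` stand-ins, and the error message are hypothetical illustrations, not the actual FileStreamSource code:

```scala
object CleanSourceCheckSketch {
  // Hypothetical stand-ins for the real 'cleanSource' option values.
  sealed trait CleanSourceMode
  case object NoOp extends CleanSourceMode
  case object Archive extends CleanSourceMode
  case object Delete extends CleanSourceMode

  // Once sourceHasMetadata flips to Some(true), the source is reading a FileStreamSink's
  // output; cleaning those files would invalidate the sink's _spark_metadata log, so
  // reject the option up front instead of only warning at commit time.
  def assertCleanSourceAllowed(
      sourceHasMetadata: Option[Boolean],
      cleanSourceMode: CleanSourceMode): Unit = {
    if (sourceHasMetadata.contains(true) && cleanSourceMode != NoOp) {
      throw new IllegalArgumentException(
        "'cleanSource' cannot be used when the source path is the output of a FileStreamSink")
    }
  }

  def main(args: Array[String]): Unit = {
    assertCleanSourceAllowed(Some(false), Delete) // fine: plain input directory
    assertCleanSourceAllowed(Some(true), Delete)  // throws: fail fast as suggested above
  }
}
```

In the real code the check would presumably live wherever `sourceHasMetadata` is set, so a misconfigured query fails when the flag flips rather than silently skipping cleanup.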
Ah OK, I think I got your point now. I'm also in favor of being "fail-fast", and the suggestion fits that. Thanks! Just updated.
Is it possible that it's called more than once? In such a case `case _ =>` will win.
Ah yes, I missed that. Nice catch.