[SPARK-20568][SS] Provide option to clean up completed files in streaming query #22952
Conversation
I feel the patch is simple enough to skip manual verification against HDFS, but I'll try to spin up an HDFS cluster and test it manually. EDIT: also verified with an HDFS cluster.
Test build #98491 has finished for PR 22952 at commit
Test build #98493 has finished for PR 22952 at commit
cc. @zsxwing
Hi, @HeartSaVioR.
Renaming is expensive in S3, isn't it? I don't worry about HDFS, but do you know if there are potential side effects, such as performance degradation, in cloud environments, especially with continuous processing mode?
Hi @dongjoon-hyun, thanks for raising a good point! I was only concerned about the filesystem/HDFS case and am not familiar with cloud environments.
I guess we have some possible options here:
1. Rename in a background thread (a rough sketch follows after this list).
For option 1, we may want to restrict the maximum number of files to enqueue, and when it reaches the max we may handle some of them synchronously. We also may need to postpone JVM shutdown until all enqueued files are renamed.
2. Provide an additional option: delete (the two options - 'rename' / 'delete' - would be mutually exclusive).
Actually the actions end users are expected to take are 1. moving files to an archive directory (compressed or not), 2. deleting them periodically. If moving/renaming carries a non-trivial cost, end users may want to just delete files directly without backing them up.
3. Document the overhead in the option's description.
While we cannot say clearly how large the cost is, we can explain that the cleanup operation may affect the processing of a batch.
The provided options are not mutually exclusive.
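A rough sketch of option 1 (hypothetical helper; the class and names below are illustrative, not part of this patch): completed files are queued and a daemon thread performs the renames, so the rename cost stays off the micro-batch's critical path and the queue capacity bounds the pending work.

import java.util.concurrent.LinkedBlockingQueue
import org.apache.hadoop.fs.{FileSystem, Path}

class BackgroundFileArchiver(fs: FileSystem, archiveDir: Path, maxPending: Int = 1000) {
  private val pending = new LinkedBlockingQueue[Path](maxPending)

  // Returns false when the queue is full; the caller can then archive synchronously
  // or drop the request and log it, depending on which guarantee we decide to provide.
  def submit(file: Path): Boolean = pending.offer(file)

  private val worker = new Thread(() => {
    try {
      while (true) {
        val file = pending.take()
        val target = new Path(archiveDir, file.getName)
        if (!fs.rename(file, target)) {
          // best effort: log and continue rather than failing the query
          System.err.println(s"Failed to archive $file to $target")
        }
      }
    } catch {
      case _: InterruptedException => // stop on shutdown
    }
  }, "file-source-archiver")
  worker.setDaemon(true)
  worker.start()
}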
cc. @steveloughran - I think you're an expert on cloud storage: could you share your thoughts on this?
also cc. @zsxwing in case this was missed.
@HeartSaVioR Does Flink or Storm have this feature, or are there JIRA issues for it? I'm wondering whether this is common among streaming engines and how they handle it in cloud environments.
@dongjoon-hyun
For Storm, it renames the input file twice: 1. when it is in process, 2. when it is completed (actually it is not a rename but a move to an archive directory). The HDFS spout was created in 2015, so I don't expect there was deep consideration of cloud storage.
For Flink I have no idea; I'll explore how they handle it.
I think the feature is essential in ETL situations: a comment in the JIRA clearly shows why it is needed.
https://issues.apache.org/jira/browse/SPARK-20568?focusedCommentId=16356929&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16356929
An essential feature should not be slow. If we don't have a clear, written warning, users will complain again and again about the performance regression. Frequently, users don't say they changed this kind of setting; instead, they say Spark suddenly shows regressions in their environment.
For example, http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
Since schema merging is a relatively expensive operation,
and is not a necessity in most cases, we turned it off
by default starting from 1.5.0.
Totally agreed, and that matches option 3 I've proposed. Option 1 would not affect the critical path of a batch much, since rename operations would be enqueued and a background thread would take care of them.
For option 1, providing a guarantee makes things complicated. If we are OK with NOT guaranteeing that all processed files are renamed, we can do the renaming in the background (as in option 1) without handling backpressure, and simply drop queued requests with logging if the queue size is beyond the threshold or the JVM is shutting down.
S3 rename is O(data), whereas for real filesystems it is O(1). Azure is usually O(1) unless some cross-shard move takes place, then it drops to O(data)...much rarer though.
+1 for this approach. The file listing cost is huge when the directory has a lot of files. I think one of the goals of this feature is reducing the file listing cost, so it's better not to rename the files into the same directory. Either deleting the files or moving them to a different directory should be fine. Also could you try to make one simple option for In addition, it would be great if we could document that whenever this option is used, the same directory should not be used by multiple queries.
I'd drop s3n & s3 refs as they have gone from deprecated to deceased
This looks beyond the scope of this PR; we can address it in a separate PR. Could you raise another one?
@zsxwing @dongjoon-hyun @steveloughran While I covered the new feature with new UTs, I have yet to test this manually with HDFS. I'll find time to do a manual test next week. For cloud storage, TBH, it's not easy for me to test manually against it, so I'd like to lean on reviewers' eyes and experience.
Test build #98917 has finished for PR 22952 at commit
Test build #98919 has finished for PR 22952 at commit
Test build #98918 has finished for PR 22952 at commit
gaborgsomogyi left a comment
Good feature in general, thanks for the efforts!
My general idea is that the whole logic can happen in a different thread which could bring the following advantages:
- The delete/move time will not count in the processing time
- If files temporarily cannot be deleted/moved for some reason, an instant retry will take place
- The load can be split into smaller chunks (like max 50 files per round or something) -> this one is more of a theoretical option
This can be simplified to something like:
matchedModeOpt match {
  case None =>
    throw new IllegalArgumentException(s"Invalid mode for clean source option $modeStrOption." +
      s" Must be one of ${CleanSourceMode.values.mkString(",")}")
  case Some(matchedMode) =>
    if (matchedMode == CleanSourceMode.ARCHIVE && sourceArchiveDir.isEmpty) {
      throw new IllegalArgumentException("Archive mode must be used with 'sourceArchiveDir' " +
        "option.")
    }
    matchedMode
}
Will address.
These fs operations can also throw exceptions. Why not cover these with the try as well?
Nice finding. Will address.
This can be enforced.
Yeah, I guess you're right. I'll add logic to check this during initialization of FileStreamSource.
@gaborgsomogyi
@gaborgsomogyi
Test build #99180 has finished for PR 22952 at commit
@HeartSaVioR I'm fine with this; on the other hand, if you're focusing on different things I'm happy to create a JIRA + PR for the separate-thread work to speed up processing.
@gaborgsomogyi
@HeartSaVioR OK, feel free to ping me if a review is needed.
zsxwing left a comment
Thanks for your PR. This is awesome. I did one round of review, but I'm still thinking about how to detect path overlap and how to derive the new path.
nit: could you create a method on CleanSourceMode to convert a string to CleanSourceMode.Value?
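For illustration only, one possible shape of such a helper, assuming the enumeration values implied by the option values in this PR (the actual implementation may differ):

object CleanSourceMode extends Enumeration {
  val ARCHIVE, DELETE, OFF = Value

  // Maps the user-provided option string to an enum value; absent means OFF.
  def fromString(option: Option[String]): Value = option.map(_.toLowerCase) match {
    case Some("archive") => ARCHIVE
    case Some("delete") => DELETE
    case Some("off") | None => OFF
    case Some(other) =>
      throw new IllegalArgumentException(s"Invalid mode for clean source option $other." +
        s" Must be one of ${values.mkString(",")}")
  }
}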
OK will address.
NOTE 2: The source path should not be used from multiple sources or queries when enabling this option, because source files will be moved or deleted, which may impact the other sources and queries.
NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query.
Nice finding. I missed the case where multiple sources in the same query refer to the same file directory. Will address.
Could you do this check only when CleanSourceMode is ARCHIVE?
Ah yes missed it. Will address.
It's better to also check the return value of rename. A user may reuse a source archive dir and cause path conflicts. We should also log this.
Yeah, I guess the patch prevents that case if it works as I expect, but I'm also in favor of defensive programming, and logging would be better for end users. Will address.
rename()'s return of true/false is pretty meaningless, as in "it says that it fails, but doesn't provide any explanation as to why". See HADOOP-11452 for discussion on making rename/3 public - this one does throw useful exceptions on failures. Happy for anyone to take up work on that...
you can use val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2) to use the underlying cache in FileStreamSourceLog.
Ah, I didn't know about that. Thanks for letting me know! Will address.
Use val curPath = new Path(new URI(entry.path)) so the path is escaped/unescaped properly; entry.path was created from Path.toUri.toString. Could you also add a unit test for special paths such as /a/b/a b%.txt?
Yeah... actually I was somewhat confused about whether I had to escape/unescape the path. Thanks for the suggestion. Will address and add a new unit test for it.
I just modified the existing UT to include a space and % in the directory name as well as the file name.
ditto
Will address.
We need to use fs.makeQualified to turn all user-provided paths into absolute paths, as the user may pass a relative path.
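As an illustration (the object and variable names are hypothetical, not this PR's code), qualifying a possibly relative user-provided path could look like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object QualifyPathExample {
  def main(args: Array[String]): Unit = {
    val rawArchiveDir = "archived/here" // may be relative, exactly as supplied by the user
    val fs = new Path(rawArchiveDir).getFileSystem(new Configuration())
    // makeQualified resolves the path against the filesystem's default URI and
    // working directory, yielding a scheme, authority and absolute path.
    val qualified = fs.makeQualified(new Path(rawArchiveDir))
    println(qualified) // e.g. file:/home/someone/archived/here on the local filesystem
  }
}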
Nice finding. Will address.
@zsxwing Thanks for the detailed review! Addressed review comments.
@zsxwing Btw, what do you think about addressing background move/deletion (which I had considered, and @gaborgsomogyi also suggested) in a separate issue? I guess adding more features would make you spend more time on review.
Test build #99450 has finished for PR 22952 at commit
Yeah, this can be done in a separate ticket. I was playing with The worst case, we can check the overlap when parsing the options for a normal path. For a glob path, we can use
@zsxwing For example, suppose the source path is This is not an elegant approach, and it has false positives, ending up restricting archive paths which don't actually overlap (too restrictive), but it would guarantee the two paths never overlap (so there is no need to re-check when renaming a file). I guess the approach might be reasonable because, in practice, end users would rather not think about complicated overlap cases and would just isolate the two paths. What do you think about this approach? cc. @gaborgsomogyi Could you also help validate my approach?
Test build #99567 has finished for PR 22952 at commit
@HeartSaVioR
This has thrown
Checking the glob part...
@HeartSaVioR This should throw
Hadoop FS glob filtering is pathologically bad on object stores. I think there you'd want a threshold as to how many path elements up you'd switch from an ls dir + match into the full deep listFiles(recursive) scan. Haven't looked at it for ages. If someone does want to play there, welcome to take it up.
That's an old patch; I don't know of any active dev there.
retest this, please
Test build #112711 has finished for PR 22952 at commit
vanzin left a comment
Looks ok, a few nits remaining.
| "s3a://a/b/c/dataset.txt"<br/> | ||
| <code>cleanSource</code>: option to clean up completed files after processing.<br/> | ||
| Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/> | ||
| When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. /archived/here This will ensure archived files are never included as new source files.<br/> |
<code></code> around directory path, period before This.
| <code>cleanSource</code>: option to clean up completed files after processing.<br/> | ||
| Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/> | ||
| When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. /archived/here This will ensure archived files are never included as new source files.<br/> | ||
| Spark will move source files respecting its own path. For example, if the path of source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt"<br/> |
s/its/their
require(baseArchiveFileSystem.isDefined == baseArchivePath.isDefined)

baseArchiveFileSystem.foreach { fs =>
  require(fileSystem.getUri == fs.getUri, "Base archive path is located to the different" +
"Base archive path is located on a different file system than the source files."
Test build #112793 has finished for PR 22952 at commit
retest this, please
Test build #112800 has finished for PR 22952 at commit
retest this, please
Test build #112824 has finished for PR 22952 at commit
retest this please
Test build #112855 has finished for PR 22952 at commit
require(baseArchivePath.isDefined)

val curPath = new Path(new URI(entry.path))
val newPath = new Path(baseArchivePath.get, curPath.toUri.getPath.stripPrefix("/"))
Can we try to not use this API? It has a known issue. See https://issues.apache.org/jira/browse/SPARK-28841 and https://issues.apache.org/jira/browse/HDFS-14762
I'm revisiting the two issues and am not sure there's a viable workaround. It looks like the issue pointed out is that ":" isn't a valid character for HDFS but might be valid for other filesystems, so the Path API doesn't restrict it, which leads to the problem. Even HDFS-14762 is closed as "Won't Fix".
Does this only occur with Path(parent, child), and is Path(pathStr) safe? Would it work if we manually concatenated the two paths as a string and passed that to Path's constructor?
It only occurs in Path(parent, child). I think we can manually concat two paths.
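A rough sketch of that workaround (helper and variable names are assumed, not the PR's actual code): build the destination as a single string and hand it to the one-argument Path constructor, so a ':' in a file name is not mis-parsed the way Path(parent, child) does.

import java.net.URI
import org.apache.hadoop.fs.Path

object ArchivePathExample {
  // entryPath is the escaped string form produced by Path.toUri.toString
  def archiveDestination(baseArchivePath: Path, entryPath: String): Path = {
    val curPath = new Path(new URI(entryPath))
    // Concatenate manually instead of using new Path(parent, child), which can throw
    // when the child contains ':' (see HDFS-14762).
    new Path(baseArchivePath.toString.stripSuffix("/") + curPath.toUri.getPath)
  }
}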
Thanks for the quick feedback! I'll apply it.
If it makes people any happier, know that "." isn't allowed as the last character in an ABFS filename. Every store has surprises.
Test build #112867 has finished for PR 22952 at commit
Could we go through the next round of review? I guess it's close to being ready to merge. Thanks in advance!
 * than 2, as neither file nor parent directory of destination path can be matched with
 * source path.
 */
require(path.depth() > 2, "Base archive path must have a depth of at least 2 " +
The check says "greater than 2" but the error says "at least 2". Which one is right?
So the explanation talks about 2 "subdirectories", not a "depth" of 2; / counts as its own depth. I don't think depth is a term end users are familiar with - I'll remove the "a depth of" part.
}

baseArchivePath.foreach { path =>
remove
fs.makeQualified(new Path(path)) // can contain glob patterns
}

private val sourceCleaner: FileStreamSourceCleaner = {
I wasn't going to ask, but since I have more comments... I think it's better if this were an Option and set to None if the cleaner is off.
Similarly, below, you'll resolve the sourceArchiveDir even if the cleaner is not set to archive, which is not necessary.
(I'm almost suggesting that there should be a separate implementation for delete and for archive to make this, and the code calling it, a bit cleaner.)
Yeah, that's a good suggestion. Will address.
Test build #113055 has finished for PR 22952 at commit
Test build #113069 has finished for PR 22952 at commit
Merging to master.
Thanks for reviewing and merging! I'll file the follow-up issue on cleaning from a background thread.
Great news, awesome job! Now that we can delete files after processing them, it would be good to have an option to re-process a file if it was re-uploaded to the same folder with the same name. It was proposed in #23782 and I think it's a good complementary feature to this one. Any chance it gets a round of review, @vanzin @HeartSaVioR @gaborgsomogyi?
zsxwing left a comment
@HeartSaVioR Thanks for adding this great feature. Sorry that I was busy and didn't have time to finish my review. I just did a post-hoc review and found one issue: https://issues.apache.org/jira/browse/SPARK-29953. Could you help fix it? Thanks! I also left one question regarding the depth check.
/**
 * FileStreamSource reads the files which one of below conditions is met:
 * 1) file itself is matched with source path
 * 2) parent directory is matched with source path
@HeartSaVioR Could you clarify this? I think there are some cases where we still read files but they don't meet these conditions:
- Partitioned paths, such as "/foo/bar/p1=a/p2=b" when the source path is "/foo/bar".
- recursiveFileLookup is set to true (see https://issues.apache.org/jira/browse/SPARK-27990).
@zsxwing
Thanks for taking the time to revisit this! The condition is based on the test suite for FileStreamSource, but for partitioned paths, yes, that case was missed. Nice finding. I need to update the condition, or just remove the documented condition entirely.
For recursiveFileLookup, it came later than this patch and I missed it. The condition was written early this year, and recursiveFileLookup seems to have come in the middle of the year.
Adding these two cases, FileStreamSource can read any files under the source path, which invalidates the depth check. There are three options to deal with this:
- No pattern check; just try to rename, and log it if the rename fails. (Caution: this doesn't prevent an archived file from being picked up as a source file again in a different directory.)
- Disallow any path from being used as the base archive path if it matches the source path (glob) - here "disallow" means fail the query. After that we don't need to check the pattern. If end users provide a complicated glob path as the source path, they may be puzzled about how to avoid a match, but I'm not sure they would really want such a complicated path in production.
- Do the pattern check before renaming, though that requires checking the pattern per file. We may optimize this a bit by grouping files per directory and checking the pattern against the directory instead of individual files. It doesn't fail the query, so end users need to check whether files were left uncleaned due to the pattern check.
Which one (or which combination) would be the preferred approach?
FYI, I just filed https://issues.apache.org/jira/browse/SPARK-30281 and raised a patch picking option 2: #26920
What changes were proposed in this pull request?
This patch adds an option to clean up files which were completed in the previous batch.
cleanSource -> "archive" / "delete" / "off". The default value is "off", with which Spark does nothing.
If "delete" is specified, Spark will simply delete input files. If "archive" is specified, Spark requires the additional config sourceArchiveDir, which input files will be moved into.
When archiving (via move), the path of the input file is retained in the archived path as a sub-path. Note that this only applies to "micro-batch", since for batch all input files must be kept to get the same result across multiple query executions.
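A hedged usage sketch of the new options (paths and application name are illustrative): reading a directory as a text stream and archiving completed files under a separate directory.

import org.apache.spark.sql.SparkSession

object CleanSourceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("clean-source-example").getOrCreate()

    // Completed input files are later moved under /archived/here, keeping their own path,
    // e.g. /data/incoming/part-0.txt ends up at /archived/here/data/incoming/part-0.txt.
    val input = spark.readStream
      .format("text")
      .option("cleanSource", "archive")
      .option("sourceArchiveDir", "/archived/here")
      .load("/data/incoming")

    input.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/clean-source-checkpoint") // illustrative
      .start()
      .awaitTermination()
  }
}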
How was this patch tested?
Added UTs. Manual tests against local disk as well as HDFS.