Conversation

@HeartSaVioR (Contributor) commented Nov 5, 2018

What changes were proposed in this pull request?

This patch adds an option to clean up files that were completed in the previous batch.

cleanSource -> "archive" / "delete" / "off"

The default value is "off", in which case Spark does nothing.

If "delete" is specified, Spark will simply delete input files. If "archive" is specified, Spark will require additional config sourceArchiveDir which will be used to move input files to there. When archiving (via move) the path of input files are retained to the archived paths as sub-path.

Note that this only applies to micro-batch execution, since in batch mode all input files must be kept to get the same result across multiple query executions.
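
For illustration, a minimal sketch of how a query might enable the proposed options (the option names follow the description above; the format, schema, and paths are made up for the example):

    // Hypothetical usage: archive completed input files after each micro-batch.
    val stream = spark.readStream
      .format("text")
      .option("cleanSource", "archive")            // "archive" | "delete" | "off" (default)
      .option("sourceArchiveDir", "/data/archive") // required only when cleanSource = "archive"
      .load("/data/incoming")
    // With "archive", a completed file /data/incoming/2018/11/log.txt would be moved to
    // /data/archive/data/incoming/2018/11/log.txt (the source path is retained as a sub-path).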

How was this patch tested?

Added UT. Manual test against local disk as well as HDFS.

@HeartSaVioR (Contributor, Author) commented Nov 5, 2018

I feel the patch is simple enough to skip manual verification against HDFS, but I'll try to spin up an HDFS cluster and test it manually anyway.

EDIT: also verified with an HDFS cluster.

@SparkQA commented Nov 6, 2018

Test build #98491 has finished for PR 22952 at commit 8a1d2e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 6, 2018

Test build #98493 has finished for PR 22952 at commit fb01c60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

cc. @zsxwing

@dongjoon-hyun (Member) commented Nov 7, 2018

Hi, @HeartSaVioR .
Renaming is expensive in S3, isn't it? I'm not worried about HDFS, but do you know if there are potential side effects like performance degradation in cloud environments, especially with continuous processing mode?

@HeartSaVioR (Contributor, Author) commented Nov 7, 2018

Hi @dongjoon-hyun, thanks for raising a good point! I was only considering the local filesystem/HDFS case and am not familiar with cloud environments.

I guess we have a few possible options here:

  1. Rename in a background thread.

For option 1, we may want to restrict the max number of files to enqueue, and when it reaches the max we could handle some of them synchronously. We may also need to postpone JVM shutdown until all enqueued files are renamed. (A rough sketch of this idea follows after this list.)

  2. Provide an additional option: delete (the two options - 'rename' / 'delete' - are mutually exclusive).

Actually the actions end users are expected to take are 1. move to an archive directory (compressed or not), or 2. delete periodically. If moving/renaming carries a non-trivial cost, end users may want to just delete files directly without backing them up.

  3. Document the overhead in the option's description.

While we cannot say clearly how large the cost is, we can explain that the cleanup operation may affect the processing of a batch.

The proposed options are not mutually exclusive with each other.
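
For reference, a rough sketch of what option 1 could look like under the assumptions above (bounded queue, best-effort handling); the class and its names are hypothetical and not part of this patch:

    import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical sketch: rename completed source files from a background thread,
    // bounding the number of pending requests so the queue cannot grow without limit.
    class BackgroundFileArchiver(fs: FileSystem, archiveDir: Path, maxPending: Int) {
      private val queue = new LinkedBlockingQueue[Path](maxPending)
      private val executor = Executors.newSingleThreadExecutor()

      executor.submit(new Runnable {
        override def run(): Unit = {
          while (!Thread.currentThread().isInterrupted) {
            val src = queue.poll(100, TimeUnit.MILLISECONDS)
            if (src != null) {
              // Keep the source path as a sub-path under the archive directory.
              val dest = new Path(archiveDir, src.toUri.getPath.stripPrefix("/"))
              fs.mkdirs(dest.getParent)
              if (!fs.rename(src, dest)) {
                // Best effort: log and move on instead of failing the query.
                System.err.println(s"Failed to archive $src to $dest")
              }
            }
          }
        }
      })

      // Returns false when the queue is full; the caller may then archive synchronously
      // or simply log and drop the request.
      def submit(completedFile: Path): Boolean = queue.offer(completedFile)

      def shutdown(): Unit = executor.shutdownNow()
    }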

cc. to @steveloughran - I think you're expert on cloud storage: could you provide your thought on this?
also cc. to @zsxwing in case of missing.

Member

@HeartSaVioR, does Flink/Storm have this feature? Or are there JIRA issues? I'm wondering whether this is popular among streaming engines and how they handle it in cloud environments.

@HeartSaVioR (Contributor, Author)

@dongjoon-hyun
For Storm, the HDFS spout renames the input file twice: 1. when it is in process, 2. when it is completed (the second is actually not a rename but a move to an archive directory). The HDFS spout was created in 2015, so I don't expect there was deep consideration of cloud storage there.
For Flink I have no idea; I'll explore how they handle it.

I think the feature is simply essential in ETL situations: a comment in the JIRA clearly shows why it is needed.
https://issues.apache.org/jira/browse/SPARK-20568?focusedCommentId=16356929&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16356929

@dongjoon-hyun (Member) commented Nov 7, 2018

An essential thing should not be slow. If we don't have a clear, written warning, users will complain again and again about the performance regression. Frequently, users don't mention that they changed this kind of setting; instead, they say Spark suddenly shows regressions in their environment.

Member

For example, http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

Since schema merging is a relatively expensive operation, 
and is not a necessity in most cases, we turned it off
by default starting from 1.5.0.

@HeartSaVioR (Contributor, Author)

Totally agreed, and that matches option 3 I proposed. Option 1 would not affect the critical path of a batch much, since rename operations would be enqueued and a background thread would take care of them.

For option 1, it is the guarantee that makes things complicated. If we are OK with NOT guaranteeing that all processed files are renamed, we can do the renaming in the background (as in option 1) without handling backpressure, and simply drop (and log) the requests in the queue if its size exceeds the threshold or the JVM is shutting down.

Contributor

S3 rename is O(data), whereas for real filesystems it is O(1). Azure is usually O(1) unless some cross-shard move takes place, then it drops to O(data)...much rarer though.

@zsxwing (Member) commented Nov 12, 2018

Provide an additional option: delete (the two options - 'rename' / 'delete' - are mutually exclusive).

Actually the actions end users are expected to take are 1. move to an archive directory (compressed or not), or 2. delete periodically. If moving/renaming carries a non-trivial cost, end users may want to just delete files directly without backing them up.

+1 for this approach. The file listing cost is huge when the directory has a lot of files. I think one of the goals of this feature is reducing the file listing cost, so it's better not to rename the files into the same directory. Either deleting the files or moving them to a different directory should be fine. Also, could you try to make one simple option for rename/delete, such as cleanSource -> (none, rename or delete)? When the user picks rename, they should be able to set the archive directory using another option.

In addition, it would be great if we documented that whenever this option is used, the same directory should not be used by multiple queries.

Contributor

I'd drop s3n & s3 refs as they have gone from deprecated to deceased

@HeartSaVioR (Contributor, Author)

This looks beyond the scope of this PR; we can address it in a separate PR. Could you raise another one?

@HeartSaVioR (Contributor, Author)

@zsxwing @dongjoon-hyun @steveloughran
Thanks all for the valuable feedback! I applied the review comments.

While I covered the new feature with new UTs, I have yet to test this manually with HDFS; I'll find time to do a manual test next week. For cloud storage, TBH, it's not easy for me to test against it manually, so I'd like to lean on the reviewers' eyes and experience.

@HeartSaVioR changed the title from "[SPARK-20568][SS] Rename files which are completed in previous batch" to "[SPARK-20568][SS] Provide option to clean up completed files in streaming query" on Nov 16, 2018
@SparkQA commented Nov 16, 2018

Test build #98917 has finished for PR 22952 at commit 3f6b5fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2018

Test build #98919 has finished for PR 22952 at commit ca26b41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2018

Test build #98918 has finished for PR 22952 at commit 33c5681.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor) left a comment

Good feature in general, thanks for the efforts!

My general idea is that the whole logic could happen in a different thread, which could bring the following advantages:

  • The delete/move time would not count toward the processing time
  • If files temporarily could not be deleted/moved for some reason, an instant retry would take place
  • The load could be split into smaller chunks (like max 50 files per round or something) -> this one is more of a theoretical option

Contributor

This can be simplified to something like:

    matchedModeOpt match {
      case None =>
        throw new IllegalArgumentException(s"Invalid mode for clean source option $modeStrOption." +
          s" Must be one of ${CleanSourceMode.values.mkString(",")}")
      case Some(matchedMode) =>
        if (matchedMode == CleanSourceMode.ARCHIVE && sourceArchiveDir.isEmpty) {
          throw new IllegalArgumentException("Archive mode must be used with 'sourceArchiveDir' " +
            "option.")
        }
        matchedMode
    }

@HeartSaVioR (Contributor, Author)

Will address.

@gaborgsomogyi (Contributor) commented Nov 21, 2018

These fs operations can also throw exceptions. Why not cover them with the try as well?

@HeartSaVioR (Contributor, Author)

Nice finding. Will address.

Contributor

This can be enforced.

@HeartSaVioR (Contributor, Author)

Yeah, I guess you're right. I'll add logic to check this during initialization of FileStreamSource.

@HeartSaVioR (Contributor, Author)

@gaborgsomogyi
Yeah, I also thought about that idea (commented above) but lost track of it while focusing on other tasks. Given that a smaller patch is easier to review and the current patch works well (except for the overhead of cleaning in the same thread), could we split this out and address it in another issue?

@HeartSaVioR (Contributor, Author)

@gaborgsomogyi
Thanks for reviewing! I addressed your review comments except for asynchronous cleanup, which could be broken out into a separate issue.

@SparkQA commented Nov 22, 2018

Test build #99180 has finished for PR 22952 at commit 007f5d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

@HeartSaVioR I'm fine with this; on the other hand, if you're focusing on different things, I'm happy to create a JIRA + PR for the separate-thread work to speed up processing.

@HeartSaVioR (Contributor, Author)

@gaborgsomogyi
Thanks for offering, but I think I can manage it. I'll ask for help if I can't get back to this one.
The latest change hasn't gotten any feedback from committers yet, so let's not rush this and wait for it.

@gaborgsomogyi (Contributor)

@HeartSaVioR OK, feel free to ping me if a review is needed.

@zsxwing (Member) left a comment

Thanks for your PR. This is awesome. I did one round of review, but I'm still thinking about how to detect path overlap and how to compute the new path.

Member

nit: could you add a method to CleanSourceMode that converts a string to a CleanSourceMode.Value?

@HeartSaVioR (Contributor, Author)

OK will address.

Member

NOTE 2: The source path should not be used by multiple sources or queries when this option is enabled, because source files will be moved or deleted, which may impact the other sources and queries.

Member

NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query.

@HeartSaVioR (Contributor, Author)

Nice finding. I missed the case where multiple sources in the same query refer to the same file directory. Will address.

Member

Could you do this check only when CleanSourceMode is ARCHIVE?

@HeartSaVioR (Contributor, Author)

Ah yes missed it. Will address.

Member

It's better to also check the return value of rename. A user may reuse a source archive dir and cause path conflicts. We should also log this.

@HeartSaVioR (Contributor, Author)

Yeah, I believe the patch prevents that case if it works as I expect, but I'm also in favor of defensive programming, and logging would be better for end users. Will address.
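
For illustration, roughly the kind of defensive check being agreed on here (the real code would use Spark's logWarning; the method and variable names are illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Archiving is best effort: a failed rename (e.g. a path conflict in a reused
    // archive dir) is logged rather than failing the streaming query.
    def archiveCompletedFile(fs: FileSystem, curPath: Path, newPath: Path): Unit = {
      fs.mkdirs(newPath.getParent)
      if (!fs.rename(curPath, newPath)) {
        System.err.println(s"Failed to move completed file $curPath to $newPath; " +
          "the destination may already exist.")
      }
    }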

Contributor

rename()'s true/false return value is pretty meaningless, as in "it says that it failed, but doesn't provide any explanation as to why". See HADOOP-11452 for discussion on making rename/3 public - that one does throw useful exceptions on failures. Happy for anyone to take up work on that...

Member

you can use val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2) to use the underlying cache in FileStreamSourceLog.

@HeartSaVioR (Contributor, Author)

Ah, I wasn't aware of that. Thanks for letting me know! Will address.

Member

Use val curPath = new Path(new URI(entry.path)) so the path is escaped/unescaped properly; entry.path was created from Path.toUri.toString. Could you also add a unit test for special paths such as /a/b/a b%.txt?

@HeartSaVioR (Contributor, Author)

Yeah... actually I was somewhat confused about whether I had to escape/unescape the path. Thanks for the suggestion. I will address it and add a new unit test for this.

@HeartSaVioR (Contributor, Author)

I just modified an existing UT to include a space and % in the directory name as well as the file name.
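
For reference, a small sketch of the round trip being discussed (the exact escaped form is illustrative):

    import java.net.URI
    import org.apache.hadoop.fs.Path

    // entry.path is stored as Path.toUri.toString, i.e. in escaped form (space -> %20),
    // so it has to be turned back into a Path via a URI, not via new Path(entry.path).
    val original = new Path("/a/b/a b%.txt")
    val stored = original.toUri.toString
    val restored = new Path(new URI(stored))
    assert(restored.toUri.getPath == "/a/b/a b%.txt")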

Member

ditto

@HeartSaVioR (Contributor, Author)

Will address.

Member

We need to use fs.makeQualified to turn all user-provided paths into absolute paths, as the user may pass a relative path.

@HeartSaVioR (Contributor, Author)

Nice finding. Will address.
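
A minimal illustration of the behavior being relied on here (the resulting URI is just an example and depends on the configured default file system and working directory):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // makeQualified resolves a relative, user-provided path against the file system's
    // URI and working directory, e.g. "archived/here" may become
    // "hdfs://namenode:8020/user/spark/archived/here".
    val fs = FileSystem.get(new Configuration())
    val qualified = fs.makeQualified(new Path("archived/here"))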

@HeartSaVioR (Contributor, Author)

@zsxwing Thanks for the detailed review! Addressed review comments.

@HeartSaVioR (Contributor, Author)

@zsxwing Btw, what do you think about addressing the background move/deletion (which I had considered and @gaborgsomogyi also suggested) in a separate issue? I guess adding more to this patch would require you to spend more time on review.

@SparkQA commented Nov 29, 2018

Test build #99450 has finished for PR 22952 at commit f59c35a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Nov 29, 2018

@zsxwing Btw, what do you think about addressing the background move/deletion (which I had considered and

Yeah, this can be done in a separate ticket.

I was playing with org.apache.hadoop.fs.GlobFilter to see how to detect the overlap. But one major issue is that, before getting the target path, we don't know whether a path will match the glob pattern or not.

In the worst case, we can check the overlap when parsing the options for a normal path. For a glob path, we can use GlobFilter/GlobPattern to check before doing the rename, in which case we can just use GlobPattern to check the target path.
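
For illustration, one way the pre-rename check could use GlobFilter, matching the target path against the source glob component by component (the helper names are hypothetical):

    import org.apache.hadoop.fs.{GlobFilter, Path}

    // GlobFilter matches a single path component (the file name), so the candidate
    // target path is compared against the source glob one component at a time.
    def matchesSourceGlob(sourceGlob: Path, target: Path): Boolean = {
      def components(p: Path): Seq[String] =
        p.toUri.getPath.stripPrefix("/").split("/").toSeq
      val globParts = components(sourceGlob)
      val targetParts = components(target)
      globParts.length == targetParts.length &&
        globParts.zip(targetParts).forall { case (glob, name) =>
          new GlobFilter(glob).accept(new Path(name))
        }
    }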

@HeartSaVioR (Contributor, Author)

@zsxwing
Yeah, ideally we would enforce that archivePath has no possibility of matching the source path (glob). My approach was to find the base directory of the source path that has no glob in any ancestor, and to require that archive path + base directory of the source path does not fall under a sub-directory of that base directory.

For example, suppose the source path is /a/b/c/*/ef?/*/g/h/*/i; then the base directory of the source path would be /a/b/c, and archive path + base directory of the source path should not fall under a sub-directory of /a/b/c.
(My code has a bug in finding the directory, so I need to fix it.)

This is not an elegant approach and it has false positives, rejecting archive paths that don't actually overlap (it is too restrictive), but it would guarantee that the two paths never overlap. (So there is no need to re-check when renaming a file.)

I guess the approach might be reasonable because, in practice, end users would avoid having to think about complicated overlap cases and would just isolate the two paths.
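
A sketch of the idea described above, i.e. deriving the glob-free base directory of the source pattern and conservatively rejecting archive paths under it (the helper names are hypothetical, and the check is intentionally stricter than necessary):

    import org.apache.hadoop.fs.Path

    // Deepest ancestor of the source pattern that contains no glob characters.
    def globFreeBase(sourcePattern: String): Path = {
      val globChars = Set('*', '?', '[', '{')
      val parts = sourcePattern.stripPrefix("/").split("/")
      val base = parts.takeWhile(part => !part.exists(globChars.contains))
      new Path("/" + base.mkString("/"))
    }

    // Conservative overlap check: reject any archive dir that falls under the base.
    def archiveMayOverlap(sourcePattern: String, archiveDir: Path): Boolean = {
      val base = globFreeBase(sourcePattern).toString
      (archiveDir.toUri.getPath + "/").startsWith(base + "/")
    }

    // globFreeBase("/a/b/c/*/ef?/*/g/h/*/i") == new Path("/a/b/c")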

What do you think about this approach?

cc. @gaborgsomogyi Could you also help validate my approach?

@SparkQA commented Dec 2, 2018

Test build #99567 has finished for PR 22952 at commit 79fa3e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

@HeartSaVioR
I've taken a deeper look at the overlap check and found the following.

  • I added a test which produced an odd result:
...
      val sourcePath = "/hello/worl"
      val archivePath = "/hello/world/spark"
...

This throws IllegalArgumentException even though sourcePath is different from archivePath.
This happens without any glob magic.

  • This approach may not work if symlinks are involved (fs.makeQualified does not resolve links).

All in all, this part is fine now.

Checking the glob part...

@gaborgsomogyi (Contributor)

@HeartSaVioR
Regarding the glob part, @zsxwing pointed out an important problem: glob patterns are much more than * and ?; see the link above. For simplicity, take this test:

...
      val sourcePath = "/hello/worl{d}"
      val archivePath = "/hello/world/spark"
...

This should throw IllegalArgumentException but proceeds without an exception.
It would be good to use a glob parser here.

@steveloughran (Contributor)

Hadoop FS glob filtering is pathologically bad on object stores.
I have tried in the past to do an ~O(1) implementation for S3 (HADOOP-13371). While I could produce one that was efficient for the test cases, it would suffer in the "selective pattern match at the top of a very wide tree" use case, where you really do want to filter down aggressively for the topmost directory/directories.

I think there you'd want a threshold for how many path elements up you'd switch from an 'ls dir + match' into the full deep listFiles(recursive) scan.

I haven't looked at it for ages. If someone does want to play there, they're welcome to take it up.

@steveloughran (Contributor)

HDFS does not support it yet, though it is on the way; see https://issues.apache.org/jira/browse/HADOOP-10019

That's an old patch; I don't know of any active development there.

@HeartSaVioR (Contributor, Author)

retest this, please

@SparkQA commented Oct 26, 2019

Test build #112711 has finished for PR 22952 at commit b67778a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment

Looks ok, a few nits remaining.

"s3a://a/b/c/dataset.txt"<br/>
<code>cleanSource</code>: option to clean up completed files after processing.<br/>
Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/>
When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. /archived/here This will ensure archived files are never included as new source files.<br/>
Contributor

<code></code> around directory path, period before This.

<code>cleanSource</code>: option to clean up completed files after processing.<br/>
Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/>
When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. /archived/here This will ensure archived files are never included as new source files.<br/>
Spark will move source files respecting its own path. For example, if the path of source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt"<br/>
Contributor

s/its/their

require(baseArchiveFileSystem.isDefined == baseArchivePath.isDefined)

baseArchiveFileSystem.foreach { fs =>
require(fileSystem.getUri == fs.getUri, "Base archive path is located to the different" +
Contributor

"Base archive path is located on a different file system than the source files."

@SparkQA commented Oct 28, 2019

Test build #112793 has finished for PR 22952 at commit dd9d4ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

retest this, please

@SparkQA commented Oct 28, 2019

Test build #112800 has finished for PR 22952 at commit dd9d4ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

retest this, please

@SparkQA commented Oct 29, 2019

Test build #112824 has finished for PR 22952 at commit dd9d4ad.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Oct 29, 2019

retest this please

@SparkQA commented Oct 29, 2019

Test build #112855 has finished for PR 22952 at commit dd9d4ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

require(baseArchivePath.isDefined)

val curPath = new Path(new URI(entry.path))
val newPath = new Path(baseArchivePath.get, curPath.toUri.getPath.stripPrefix("/"))

@HeartSaVioR (Contributor, Author)

I'm revisiting the two issues and I'm not sure there's a viable workaround. It looks like the issue pointed out is that ":" isn't a valid character for HDFS but might be valid for other filesystems, so the Path API doesn't restrict it, which leads to problems. Even HDFS-14762 was closed as "Won't Fix".

Does this only occur with Path(parent, child), while Path(pathStr) is safe? Would it work if we manually concatenated the two paths as a string and passed that to Path's constructor?

Member

It only occurs with Path(parent, child). I think we can manually concatenate the two paths.
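
For illustration (the file name is made up), the difference being discussed:

    import org.apache.hadoop.fs.Path

    val parent = new Path("/archived/here")
    // The two-argument constructor parses the child as a URI, so a ':' in the child
    // fails with IllegalArgumentException:
    //   new Path(parent, "2019-10-29T10:00:00.txt")
    // Concatenating the full path as one string avoids the problem, because the ':'
    // appears after a '/' and is no longer taken for a URI scheme separator:
    val ok = new Path(parent.toUri.getPath + "/" + "2019-10-29T10:00:00.txt")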

@HeartSaVioR (Contributor, Author)

Thanks for the quick feedback! I'll apply it.

Contributor

If it makes people any happier, know that "." isn't allowed as the last character in an ABFS filename. Every store has surprises.

@SparkQA commented Oct 30, 2019

Test build #112867 has finished for PR 22952 at commit 178d2f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor, Author)

Could we go through the next round of review? I think it's close to being ready to merge. Thanks in advance!

* than 2, as neither file nor parent directory of destination path can be matched with
* source path.
*/
require(path.depth() > 2, "Base archive path must have a depth of at least 2 " +
Contributor

The check says "greater than 2" but the error says "at least 2". Which one is right?

@HeartSaVioR (Contributor, Author)

The explanation talks about 2 "subdirectories", not a depth of 2; / counts as its own depth. I don't think depth is a term end users are familiar with, so I'll remove the "a depth of" part.

}

baseArchivePath.foreach { path =>

Contributor

remove

fs.makeQualified(new Path(path)) // can contain glob patterns
}

private val sourceCleaner: FileStreamSourceCleaner = {
Contributor

I wasn't going to ask, but since I have more comments... I think it's better if this were an Option and set to None if the cleaner is off.

Similarly, below, you'll resolve the sourceArchiveDir even if the cleaner is not set to archive, which is not necessary.

(I'm almost suggesting that there should be a separate implementation for delete and for archive to make this, and the code calling it, a bit cleaner.)
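
Roughly the shape being suggested, as a hypothetical sketch (only FileStreamSourceCleaner is a name from the patch; the other names and details are made up):

    import org.apache.hadoop.fs.{FileSystem, Path}

    sealed trait FileStreamSourceCleaner {
      def clean(entry: Path): Unit
    }

    // "delete" mode: remove the completed source file.
    class SourceFileRemover(fs: FileSystem) extends FileStreamSourceCleaner {
      override def clean(entry: Path): Unit = fs.delete(entry, false)
    }

    // "archive" mode: move the completed source file under the archive directory,
    // keeping its original path as a sub-path.
    class SourceFileArchiver(fs: FileSystem, baseArchiveDir: Path) extends FileStreamSourceCleaner {
      override def clean(entry: Path): Unit = {
        val dest = new Path(baseArchiveDir.toUri.getPath + entry.toUri.getPath)
        fs.mkdirs(dest.getParent)
        fs.rename(entry, dest)
      }
    }

    // None when cleanSource is "off", so sourceArchiveDir is never resolved unnecessarily.
    // val sourceCleaner: Option[FileStreamSourceCleaner] = ...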

@HeartSaVioR (Contributor, Author)

Yeah, that's a good suggestion. Will address.

@SparkQA commented Nov 1, 2019

Test build #113055 has finished for PR 22952 at commit 21c71c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 1, 2019

Test build #113069 has finished for PR 22952 at commit 01f5750.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Nov 4, 2019

Merging to master.

@vanzin closed this in ba2bc4b Nov 4, 2019
@HeartSaVioR (Contributor, Author)

Thanks for reviewing and merging! I'll file the follow-up issue on cleaning from a background thread.
cc. @gaborgsomogyi If you're still interested in doing the cleanup in the background, you can file it and take it up, as I'm currently focusing on event-log work. Please let me know if you don't have a free slot to deal with it right now. Thanks in advance!

@HeartSaVioR deleted the SPARK-20568 branch November 4, 2019 23:35
@mikedias commented Nov 5, 2019

Great news, awesome job!

Now that we can delete files after processing them, it would be good to have an option to re-process a file if it is re-uploaded to the same folder with the same name. It was proposed in #23782 and I think it's a good complementary feature to this one.

Any chance it could get a round of review, @vanzin @HeartSaVioR @gaborgsomogyi?

@zsxwing (Member) left a comment

@HeartSaVioR Thanks for adding this great feature. Sorry that I was busy and didn't have time to finish my review. I just did a post-hoc review and found one issue: https://issues.apache.org/jira/browse/SPARK-29953. Could you help fix it? Thanks! I also left one question regarding the depth check.

/**
* FileStreamSource reads the files which one of below conditions is met:
* 1) file itself is matched with source path
* 2) parent directory is matched with source path
Member

@HeartSaVioR Could you clarify this? I think there are some cases where we still read files even though they don't meet these conditions:

@HeartSaVioR (Contributor, Author) commented Nov 19, 2019

@zsxwing
Thanks for taking the time to revisit this! The condition is based on the test suite for FileStreamSource, but for partitioned paths, yes, that case is missed. Nice finding. I need to update the condition, or just remove the documented condition entirely.

For recursiveFileLookup, it came later than this patch and I missed it. The condition was written early this year, and recursiveFileLookup seems to have come in the middle of this year.

Adding these two cases, FileStreamSource can read any file under the source path, which invalidates the depth check. There are three options to deal with this:

  1. No pattern check; just try to rename, and log if the rename fails. (Caution! This doesn't prevent an archived file from being picked up again as a source file in a different directory.)
  2. Disallow any path as the base archive path if it matches the source path (glob) - here "disallow" means fail the query. After that we don't need to check the pattern. If end users provide a complicated glob path as the source path, they may also be puzzled about how to avoid matching it, but I'm not sure they would really want such a complicated path in production.
  3. Do the pattern check before renaming, though it requires checking the pattern per file. We may optimize this a bit by grouping files per directory and checking the pattern against the directory instead of against individual files. This doesn't fail the query, so end users would need to check whether files were left uncleaned due to the pattern check.

Which one (or which combination) would be the preferred approach?

@HeartSaVioR (Contributor, Author) commented Dec 17, 2019

FYI, I just filed https://issues.apache.org/jira/browse/SPARK-30281 and raised a patch picking option 2: #26920
