Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Dec 17, 2019

What changes were proposed in this pull request?

This patch renews the verification logic of the archive path for FileStreamSource, as we found the existing logic doesn't take the partitioned/recursive options into account.

Before the patch, it only required the archive path to have a depth greater than 2 (two subdirectories from the root), leveraging the fact that FileStreamSource normally reads files whose parent directory matches the pattern, or which match the pattern themselves. Given that the 'archive' operation moves files to the base archive path while retaining their full path, an archive path tends to be safe if its depth is greater than 2, meaning FileStreamSource doesn't re-read archived files as new source files.
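For illustration, a minimal Python sketch of how the 'archive' operation derives the destination path (the helper name is hypothetical, not Spark's API):

```python
def archived_path(archive_dir: str, source_file: str) -> str:
    """Hypothetical helper: 'archive' prepends the base archive dir
    to the file's full source path, retaining the original path."""
    return archive_dir.rstrip("/") + "/" + source_file.lstrip("/")

# Archiving /data/in/part-0 under /archived/here lands it at
# /archived/here/data/in/part-0: the archive dir's depth is added on
# top of the file's own path, which is why a depth greater than 2 was
# previously considered safe for non-recursive reads.
print(archived_path("/archived/here", "/data/in/part-0"))
```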

With the partitioned/recursive options, that assumption no longer holds, as FileStreamSource can read files at any depth of subdirectories under the source pattern. To handle this correctly, we have to renew the verification logic; it may not be intuitive or simple, but it works for all cases.

The new verification logic prevents both cases:

  1. The archive path matches the source pattern as a "prefix" (the depth of the archive path > the depth of the source pattern)

e.g.

  • source pattern: /hello*/spar?
  • archive path: /hello/spark/structured/streaming

Any file under the archive path will match the source pattern when the recursive option is enabled.

  2. The source pattern matches the archive path as a "prefix" (the depth of the source pattern > the depth of the archive path)

e.g.

  • source pattern: /hello*/spar?/structured/hello2*
  • archive path: /hello/spark/structured

Some archived files will not match the source pattern, e.g. file path /hello/spark/structured/hello2 gives the final archived path /hello/spark/structured/hello/spark/structured/hello2.

But other archived files will still match the source pattern, e.g. file path /hello2/spark/structured/hello2 gives the final archived path /hello/spark/structured/hello2/spark/structured/hello2, which matches the source pattern when recursive is enabled.

Implicitly, it also prevents the archive path from fully matching the source pattern (same depth).

We want to prevent any archived file from being picked up again as a new source file, so the patch takes the most restrictive approach to rule out all of these cases.
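To make the combined rule concrete, here is a minimal Python sketch of the depth-prefix check; it mirrors the idea described above, not Spark's actual implementation, and `fnmatch` only approximates Hadoop glob semantics:

```python
from fnmatch import fnmatchcase

def archive_dir_conflicts(source_pattern: str, archive_dir: str) -> bool:
    """Reject the archive dir when, compared segment by segment up to
    the shallower of the two depths, every archive segment matches the
    corresponding pattern segment -- covering prefix matches in either
    direction as well as the equal-depth full match."""
    pattern = [p for p in source_pattern.split("/") if p]
    archive = [p for p in archive_dir.split("/") if p]
    depth = min(len(pattern), len(archive))
    return all(fnmatchcase(a, p)
               for p, a in zip(pattern[:depth], archive[:depth]))

# Case 1 above: archive path deeper than the pattern -- rejected.
print(archive_dir_conflicts("/hello*/spar?",
                            "/hello/spark/structured/streaming"))  # True
# Case 2 above: pattern deeper than the archive path -- rejected.
print(archive_dir_conflicts("/hello*/spar?/structured/hello2*",
                            "/hello/spark/structured"))            # True
# A non-matching archive dir is accepted.
print(archive_dir_conflicts("/hello*/spar?", "/archived/here"))    # False
```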

Why are the changes needed?

Without this patch, archived files can be picked up as new source files when the partitioned/recursive option is enabled, since the current condition doesn't take these options into account.

Does this PR introduce any user-facing change?

Only for Spark 3.0.0-preview (only preview 1 for now, but possibly preview 2 as well): end users are required to provide an archive path satisfying a somewhat more complicated condition, instead of simply a depth greater than 2.

How was this patch tested?

New unit tests.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115432 has finished for PR 26920 at commit 4b99e61.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115433 has finished for PR 26920 at commit e779edc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115442 has finished for PR 26920 at commit e779edc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

cc. @zsxwing @vanzin @gaborgsomogyi

Contributor

@vanzin vanzin left a comment


Code looks ok. This is basically the check I had suggested in the original PR...

<code>cleanSource</code>: option to clean up completed files after processing.<br/>
Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/>
When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. <code>/archived/here</code>. This will ensure archived files are never included as new source files.<br/>
When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" should ensure some condition to guarantee archived files are never included as new source files:
Contributor


Code looks ok but the documentation is kinda hard to follow.

First, the whole "should ensure some condition" part is redundant since there is a single condition. Just replace it with the following sentence.

The following sentence can be reworded a bit to be clearer, too:

The value of <code>sourceArchiveDir</code> must not match the source pattern, when considering just the prefix of the paths that match in subdirectory depth. Otherwise archived files would be considered new source files.

(It's kinda hard to explain the depth thing with words in documentation. It always sounds a bit confusing. An example would be much clearer.)

Contributor Author

@HeartSaVioR HeartSaVioR Dec 17, 2019


Actually, that made me want to stick with the simple condition as it is now (I also felt the rule might not be easy for end users to follow), but unfortunately we found cases where we can no longer do that.

I tried to follow the reworded sentence, but it seems to lead to confusion because:

  1. "Otherwise archived files would be considered new source files." reads to me as if violating the rule is allowed and this is merely the consequence, whereas the goal is that violating the rule simply isn't allowed.

  2. The point of the condition is that we check the match at the same depth, taking the minimum, for the reasons explained in the PR description. While we'd like to skip elaborating on why, I think we still need to clarify it in the doc; I'm not sure only mentioning prefix/subdirectory conveys the point.

I'll try to add an example after the original sentence.

@SparkQA

SparkQA commented Dec 18, 2019

Test build #115472 has finished for PR 26920 at commit be988df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Kind reminder.

@vanzin
Contributor

vanzin commented Jan 7, 2020

retest this please

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116265 has finished for PR 26920 at commit be988df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 8, 2020

Merging to master.

@vanzin vanzin closed this in bd7510b Jan 8, 2020
@HeartSaVioR
Contributor Author

Thanks for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-30281 branch January 8, 2020 23:23