[SPARK-26875][SS] Add an option on FileStreamSource to include modified files #23782
Conversation
First of all, I agree this would be one of the valid use cases. I'm just thinking out loud about an edge case (maybe that's why Spark restricts this): when the timestamp of a file is modified for any reason (contents being appended, some unintended modification, etc.), all of the file's contents are reprocessed (as the UT in this patch demonstrates), which is not only breaking …
Ah, the title of the PR says it is about processing a "modified file" rather than a new file being written over the same file name. Then what I stated is not an edge case but exactly the case this PR addresses.
Yes, that's why it's important to keep this option disabled by default. I like "modified file" because it's exactly what it is. If I changed it to "overwritten files", it might be misunderstood as a content comparison, which is harder to address.
The patch looks simple and clear: it seems to be just a matter of policy - whether to allow possibly unsafe behavior via an option. Personally I'd be a bit worried that there are cases where the last-modified timestamp of a file is changed unintentionally, and then things get messed up. Even when end users enable this option deliberately, they might complain when a file gets reprocessed - and the semantics break - for an unintended reason. I thought briefly about ways to mitigate this - mostly around tracking a file offset - but a newly overwritten file could have the same file length, and we would also need to store that offset, so it doesn't seem to be a good option as of now. Maybe I'm worrying too much, so I'd like to hear other voices as well. (I'm just one of the contributors and the decision will be made by committers, so this is just my 2 cents.)
Users always have to understand their use cases before enabling non-default options. Same here: we just have to be clear in the docs about what will happen when the option is enabled.
The question is why some producer generates the same file again. From a data source perspective I see mainly two approaches actually implemented: …
+1 to @HeartSaVioR - I'm worried about this patch as well. This change may encourage that behaviour. All in all, with my current understanding, I would change the producer.
Some producers do not care much about the uniqueness of file names, which leads to frequent file overwriting. The motivation for this patch is exactly the case where you can't change the producer's behavior 😄 In my view, this option is a good complement to #22952, where we will be able to archive/delete processed files. Without this option, if we upload a file with the same name as a previously processed and deleted one, it won't get processed, which is a non-intuitive behavior. Addressing your concerns: …
… and … is contradictory to me.
Maybe it is not clear that the patch does not change any processing behavior; it only adds an option to the check "is this file new and should it be included in the micro-batch?" (sketched below). This is the current …
Answering your question: …
Regarding eventual race conditions, nothing is changed. Spark will deal with them using the …
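A minimal sketch of the "is this file new and should it be included in the micro-batch?" check being described, assuming the source keeps a map from file path to the modification timestamp recorded at processing time (class and method names here are illustrative, not Spark's actual internals):

```scala
import scala.collection.mutable

// Tracks which files were processed and when; a file is "new" if its path was
// never seen, or (when the option is on) if its timestamp is newer than recorded.
class SeenFiles(includeModified: Boolean) {
  private val seen = mutable.Map.empty[String, Long]

  def isNewFile(path: String, modTimeMs: Long): Boolean = seen.get(path) match {
    case None         => true                                  // brand-new path: always process
    case Some(lastTs) => includeModified && modTimeMs > lastTs // re-include only if opted in
  }

  def markProcessed(path: String, modTimeMs: Long): Unit = seen(path) = modTimeMs
}
```

With `includeModified = false` the second branch is always false, which matches the existing name-only behavior the discussion refers to.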
Maybe @gaborgsomogyi is considering more than what exists as of now. Currently #22952 is implemented as a synchronous approach, but @gaborgsomogyi has voiced making the archiving/deletion asynchronous in #22952 (we talked about dealing with it in a follow-up TODO), and that is based on the assumption that we never process a file again, even if the same file is added as new. When this option is turned on, that assumption breaks and things change very differently - including the race condition @gaborgsomogyi mentioned. If we are hit by the race condition, it would produce unintended results (perhaps the worst scenario is archiving/deleting a new file before it is processed, when end users expect it to be processed). A race condition between Spark and the file system still occurs even without this feature (just with #22952), but that is not something Spark can control; Spark just has to do its best to archive/delete files before others overwrite/delete them. And even when we are hit by that race condition, it doesn't break the query semantics or output.
I think the archive/delete race condition can be addressed by checking the file timestamp before archiving/deleting: if it is the same as the processed one, proceed; if not, skip (see the sketch below). This extra step can be enabled only when the new option is on. Talking about end users' expectations: if they upload a file and it gets deleted/archived, they probably expect a new file with the same name to be processed as well when uploaded again. Not processing the file is unintuitive, and it is also hard to debug which file names were processed in the past. "Why is my file not getting processed?" could become a frequently asked question. I totally understand the implications of files being unintentionally modified, as pointed out by @HeartSaVioR, and that's why the option is disabled by default.
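A minimal sketch of the guard suggested above, assuming the source records each file's modification time when it processes it (the helper and its signature are illustrative, not part of this PR):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Only clean up a file whose current modification time still matches the
// timestamp recorded when it was processed; otherwise it was overwritten
// after processing and the new content must not be lost.
def cleanupIfUnchanged(fs: FileSystem, file: Path, processedModTimeMs: Long)
                      (cleanup: Path => Unit): Unit = {
  val currentModTimeMs = fs.getFileStatus(file).getModificationTime
  if (currentModTimeMs == processedModTimeMs) cleanup(file)
  // else: skip; the overwritten file can be picked up again as "modified"
}
```

Note this check-then-act is itself racy on stores without atomic metadata reads, so it narrows the window rather than eliminating it.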
Can one of the admins verify this patch?
Thanks a lot for your contribution. However, I think overwriting an existing file is an anti-pattern. Most storage systems cannot handle this properly: …
Generally, the file stream source requires that files appear in the directory atomically, so that we don't need to handle the case where Spark reads a file that is still being written. Overwriting an existing file breaks this assumption.
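For context, a common producer-side pattern that satisfies this atomicity requirement (a sketch assuming an HDFS-like store; paths and contents are hypothetical) is to write into a staging directory and then rename into the watched directory:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val staging = new Path("/data/_staging/part-0001.json") // written here first
val target  = new Path("/data/input/part-0001.json")    // directory Spark watches

// Write the full file outside the watched directory...
val out = fs.create(staging)
try out.write("""{"id": 1}""".getBytes("UTF-8")) finally out.close()

// ...then move it in with a single rename, so the source never sees a partial file.
fs.rename(staging, target)
```

On HDFS the rename is atomic; object stores such as S3 have no atomic rename, which is part of why overwriting in place is risky there.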
Hi @zsxwing, thanks for taking the time to share your thoughts. The idea of this configuration is to add another condition for deciding whether a file should be processed or not. It does not make any assumption about concurrently modified files or anything else; everything remains the same. The scenario that I'm trying to solve here is: …
What this PR simply proposes is: if enabled, instead of only checking the file name, also check whether the file's timestamp is newer than the last time it was processed.
@mikedias Thanks for describing the scenario. Yep, I understand this could be helpful. However, I would like to understand more about the use case. How does a user upload …
As I mentioned previously, overwriting a file makes everything complicated. The user has to think about when it is safe to overwrite a file. Adding this option may make users think Spark can handle file overwriting correctly; however, we don't and cannot handle such race conditions internally in Spark.
I don't think it is safe to delete the file manually outside of Spark, as it cannot be known whether the file has been processed and added to the commit log. Checking whether the file was processed is not sufficient - it should be ensured that the file will never be accessed again. Assuming the file is deleted/archived by Spark, I can see the possible confusion from the user's side as well, since it's no longer an "overwrite" if the file has been processed and cleaned up. So that seems to be a valid use case, though we may need to think about how to let Spark differentiate "overwrite" from "add a new file which was deleted or moved". Overwriting a file still shouldn't be allowed, as there are plenty of comments here explaining why it's harmful.
Thank you @zsxwing @HeartSaVioR for taking the time to review this PR. Glad to see activity here :) @zsxwing The files are uploaded directly to the source folder; I don't see a need for an intermediate step that moves files around. @HeartSaVioR A common use case for deleting files outside Spark is removing the old files sitting in the source folder, which hurt the performance of the ListObjects operation. We use S3 lifecycle policies to delete the files after 15 days (giving Spark plenty of time to process them). I agree that having an option with …
I agree that some manual operations have to be taken in such a case, but since Spark has no idea about the current status of files if they can be modified, unfortunately that has to be at your own risk. There's an option … SPARK-20568 provides an official way to remove/archive processed files safely, so I would agree there's a valid case if end users look at the folder and confirm the file doesn't exist (so they are NOT overwriting an existing file) and put the file there, while another file with the same path was actually processed and removed/archived. I guess that might be considered if we all agree this is a valid use case.
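For reference, this is how the SPARK-20568 capability eventually surfaced in Spark 3.0 (option names are from the released feature, not from this PR; paths and schema are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("clean-source-example").getOrCreate()

// The source itself archives (or deletes) input files after they are committed,
// which avoids racing an external cleanup job against the running query.
val input = spark.readStream
  .format("json")
  .schema("id INT, name STRING")
  .option("cleanSource", "archive")            // "archive", "delete", or "off" (default)
  .option("sourceArchiveDir", "/data/archive") // required when archiving
  .load("/data/input")
```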
In fact, there's the option …
Ignoring the input file is not what we want even if we support overwriting files, right? And there's another possible unintended behavior: if the partial file happens to be readable, the query will not read the content written afterwards. That's not really an option reserved for overwriting files.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@mikedias Is there any reason this PR wasn't merged and was instead closed for inactivity? Were there any issues discovered in testing, or was it just a lack of time to follow through? Could we have this re-opened, reviewed again, and merged, since it solves an interesting problem for us too? Thanks again for the solution here.
See above comments - I think we already explained why it is not a good idea.
@HeartSaVioR If you don't mind, can you summarize the main concern here? I read all the comments, but it was hard to follow them and understand why this wasn't approved.
This comment explains everything. Also, I do not agree that spark.sql.files.ignoreCorruptFiles is a rescue, as I commented above. If you ever require Spark to provide at-least-once fault tolerance, there must never be a change to the source on replay. If the input file is somehow overwritten between a batch failure and the reprocessing of the same batch, fault tolerance is broken. It's a hard problem, not a trivial one.
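Concretely (a hypothetical timeline, with an illustrative file name): batch 7 reads `events.json` containing version 1 of the data, and the job fails before the batch commits. The producer then overwrites `events.json` with version 2. On restart, Spark replays batch 7 from the same path and now reads version 2, so the recovered output differs from what the failed attempt would have produced, and the guarantee over the original content is lost.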
What changes were proposed in this pull request?
The current behavior checks only the file name to determine whether a file should be processed. I propose adding an option that also tests whether the file's timestamp is greater than the last time it was processed, as an indication that it has been modified and has different content.
This is useful when the source producer occasionally overwrites files with new content (see the sketch below).
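A minimal sketch of how a query might opt in (the option name below is illustrative only; see the diff for the name this PR actually introduces):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("modified-files-example").getOrCreate()

// With the (hypothetical) option enabled, a file whose modification timestamp
// is newer than when it was last processed is treated as a new file again.
val input = spark.readStream
  .format("text")
  .option("includeModifiedFiles", "true") // hypothetical option name
  .load("/data/input")
```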
How was this patch tested?
Added unit tests.