[SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink #32702
Kubernetes integration test starting

Kubernetes integration test status success

Test build #139083 has finished for PR 32702 at commit
This was already proposed before as part of #31638, though I'm not sure you were aware of it. Quoting my comment #31638 (comment):

Oh, this comes from an internal customer request. It seems hard (or troublesome) to work around, so I basically think it makes sense to support such a use-case. I wasn't aware of the previous PR including it. I'm okay if you think an option is better than a config.
And one more: I think letting the file stream sink ignore the metadata directory on read, while still writing to the metadata directory, is odd and error-prone. The metadata is no longer valid once Spark starts to write new metadata to the same directory, and the option must be set to true for such a directory to be read properly even though Spark keeps writing the metadata. There's no indication of this, and end users have to memorize it. The ideal approach would be to write metadata into the directory indicating whether it is set up for at-least-once (multi-writes) or exactly-once (single-write) semantics when the directory is written for the first time, and to rely on that all the time instead of changing behavior depending on the query's config/option. That would bring consistency to the directory. Btw, I've made more improvements on the file stream source and file stream sink, but I had to agree that the efforts are largely duplicated with data lake solutions. (See the discussions in #27694.) Once you start to address the issues one by one, you realize these are exactly what data lake solutions have already fixed. That's why I stopped working on the file stream source and sink, though I guess ETL into data lake solutions is still a valid use-case, and then the long-running issue on the file stream source should be fixed - #28422

I don't know if it is a typo, but this doesn't let the file stream sink ignore the metadata directory; it actually lets the file stream source (and the batch read path) ignore the metadata directory when reading the output of the file stream sink. It doesn't change how the file stream sink reads or writes the metadata directory. Is it possible we are talking about two different things?
Looks like your code change doesn't address it, but your PR description mentions it. What's the solution for this? Doesn't it mean you want to make the directory writable from multiple queries (including the same query with a different checkpoint)?
The use-case looks like this: the user wants to write to the same output directory after changing the query. But once they change something in the query, the previous checkpoint cannot be used anymore, so they need to use a new checkpoint directory (and a new metadata directory, otherwise duplicate batches won't be written). They don't write to the output from multiple queries at the same time.
Which steps do end users need to take to resolve such a case with your PR? Deleting the metadata directory and letting the read path ignore the metadata? I know this is a valid workaround to unblock end users who would otherwise be stuck reusing a directory, but they should be quite cautious, as they must remember the state of the directory: the metadata won't cover some parts of the output, which is easy to forget. If they forget that fact and also forget to set the flag on the read query, only parts of the output will be read, and they will complain about the result of the read query without mentioning what they did. So just allowing end users to ignore the metadata is simple, but the risks of turning on the flag are not. Let's take responsibility for explaining what ignoring the metadata means and spell out the possible risks.
Currently in the use-case, what the users do is: when they change the query and the checkpoint no longer works, they clean up the metadata directory and run the changed query with a new checkpoint. They have another Spark app reading from the streaming query output. But as Spark respects the metadata, that other Spark app can only read the files written by the changed streaming query (i.e. the files recorded in the metadata). The files written before the streaming query was changed are now ignored by Spark.
I agree. That is why this config is internal-only so far. I should also add more cautionary wording to the config doc. I have discussed this with the users; it seems to me they know what they are asking for and are cautious about the effect of this config.
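To make the risk discussed above concrete, here is a small plain-Python sketch (not Spark code; the file names and the filtering function are simplified stand-ins) simulating how a reader that respects the sink metadata sees only the files recorded in the recreated log, silently dropping the output written before the query was changed:

```python
# Simplified simulation of FileStreamSink metadata filtering (not actual Spark code).
# When the metadata is recreated for a new checkpoint, it only records files written
# by the changed query; a metadata-respecting reader loses the older output.

def list_visible_files(all_files, metadata_log, ignore_metadata):
    """Return the files a reader would see.

    all_files: every data file physically present in the output directory.
    metadata_log: files recorded by the (possibly recreated) metadata log.
    ignore_metadata: simulates the proposed ignore-metadata flag.
    """
    if ignore_metadata:
        return sorted(all_files)  # plain file listing
    return sorted(f for f in all_files if f in metadata_log)

# Files written by the original query, then by the changed query with a fresh
# checkpoint and freshly recreated metadata recording only the new files.
old_output = {"part-00000", "part-00001"}
new_output = {"part-00002"}
all_files = old_output | new_output
new_metadata = set(new_output)

print(list_visible_files(all_files, new_metadata, ignore_metadata=False))
# -> ['part-00002']  (old output silently dropped)
print(list_visible_files(all_files, new_metadata, ignore_metadata=True))
# -> ['part-00000', 'part-00001', 'part-00002']
```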
@HeartSaVioR Does this sound okay to you? If so, do you still prefer an option over a config? If you do, please let me know and I'll change it to use an option.
Now I think it should be a source option. Given the impact, users should know what they are doing in their code, not via configuration, which can come from multiple places, even cluster-level config.
Okay, sounds good. Let me change it to a source option.
@viirya But I'd like to emphasize that the root issue won't be resolved, and eventually users will want to try data lake solutions instead.
Oh, no need to apologize. I haven't updated this yet. :) This is a SQL config now. Please help review if you find some time. Thanks!
Agree. Adding a config also has the benefit that existing applications can avoid changing code. But we should still avoid adding more and more configs later, as @HeartSaVioR suggested. There are indeed too many configs controlling behaviors 😂
@HeartSaVioR @xuanyuanking Can we move forward with this?
Ack, I'll review this and compare it with my original PR tomorrow (Beijing time).
xuanyuanking left a comment
Generally LGTM; I checked, and the major idea is actually the same as 83dd27a. Just left some small improvements.
```scala
caseInsensitiveOptions.get("path").toSeq ++ paths,
newHadoopConfiguration(),
sparkSession.sessionState.conf) =>

if !sparkSession.sessionState.conf.fileStreamSinkMetadataIgnored &&
```
Instead of checking the config on the caller side in three places, maybe we can check the config directly in FileStreamSink.hasMetadata? The two approaches should be equivalent, while the latter changes only a single code segment.
Either is fine for me.
```scala
    .createWithDefault(true)

val FILESTREAM_SINK_METADATA_IGNORED =
  buildConf("spark.sql.streaming.fileStreamSink.metadata.ignored")
```
Following the guideline for naming configurations, maybe the config could be named something like spark.sql.streaming.fileStreamSink.ignoreMetadata or spark.sql.streaming.fileStreamSink.formatCheck.enabled, or any other good name :)
Personally, spark.sql.streaming.fileStreamSink.ignoreMetadata sounds better. I couldn't intuitively tell what formatCheck means.
spark.sql.streaming.fileStreamSink.ignoreMetadata sounds good.
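For reference, with the name agreed above, end users would enable the config roughly like this (a sketch based on the name in this thread; the exact surface in the merged change may differ):

```
# spark-submit / spark-shell, session- or cluster-level
--conf spark.sql.streaming.fileStreamSink.ignoreMetadata=true

-- or in a Spark SQL session
SET spark.sql.streaming.fileStreamSink.ignoreMetadata=true;
```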
HeartSaVioR left a comment
Looks OK. Let's resolve two comments and then we are good to go.
Thanks @HeartSaVioR @xuanyuanking! I've updated the change.
Kubernetes integration test starting

Kubernetes integration test status success

Test build #139999 has finished for PR 32702 at commit
HeartSaVioR left a comment
+1
Thanks! Merging to master.
Thanks @viirya for the contribution! I merged it into master.

Thanks @viirya!
[SPARK-35565][SS] Add config for ignoring metadata directory of FileStreamSink

This patch proposes to add an internal config for ignoring the metadata of `FileStreamSink` when reading the output path.

`FileStreamSink` produces a metadata directory which logs the output files per micro-batch. When we read from the output path, Spark looks at the metadata and ignores any files not in the log. Normally this works well, but for some use-cases we need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot use the previous metadata. If we create new metadata as well, then when we later read the output path, Spark only reads the files listed in the new metadata; the files written before we switched to the new checkpoint and metadata are ignored. We could write to a different output directory every time, but that is a bad idea, as it would produce many directories unnecessarily. We need a config for ignoring the metadata of `FileStreamSink` when reading the output path.

Added a config for ignoring the metadata of `FileStreamSink` when reading the output.

Tested with unit tests.

Closes apache#32702 from viirya/ignore-metadata.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
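As a rough illustration of the mechanism described above (a plain-Python, local-filesystem sketch, not Spark's implementation; only the `_spark_metadata` directory name matches the real sink, and the helper functions here are hypothetical), the read path essentially checks for the sink's metadata directory and switches between a log-driven listing and a plain listing:

```python
import json
import os
import tempfile

METADATA_DIR = "_spark_metadata"  # the directory name FileStreamSink actually uses

def has_metadata(output_path):
    # FileStreamSink.hasMetadata-style check, reduced to a local-filesystem sketch.
    return os.path.isdir(os.path.join(output_path, METADATA_DIR))

def write_batch(output_path, batch_id, files):
    # Simulate the sink: write data files, then log them for this micro-batch.
    meta = os.path.join(output_path, METADATA_DIR)
    os.makedirs(meta, exist_ok=True)
    for name in files:
        open(os.path.join(output_path, name), "w").close()
    with open(os.path.join(meta, str(batch_id)), "w") as log:
        json.dump(sorted(files), log)

def read_files(output_path, ignore_metadata=False):
    # Respect the metadata log unless the ignore flag is set.
    if has_metadata(output_path) and not ignore_metadata:
        meta = os.path.join(output_path, METADATA_DIR)
        recorded = []
        for log_name in sorted(os.listdir(meta)):
            with open(os.path.join(meta, log_name)) as log:
                recorded.extend(json.load(log))
        return sorted(recorded)
    return sorted(f for f in os.listdir(output_path) if f != METADATA_DIR)

out = tempfile.mkdtemp()
open(os.path.join(out, "orphan-part"), "w").close()  # file not tracked by any log
write_batch(out, 0, ["part-00000"])
print(read_files(out))                        # -> ['part-00000']
print(read_files(out, ignore_metadata=True))  # -> ['orphan-part', 'part-00000']
```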
Sorry to jump on a very old thread. Using the config