[SPARK-44924][SS] Add config for FileStreamSource cached files #42623
What changes were proposed in this pull request?
This change adds configuration options for the streaming file source: `maxCachedFiles` and `discardCachedFilesRatio`. These values were originally introduced in #27620 but were hardcoded to 10,000 and 0.2, respectively.
Why are the changes needed?
Under certain workloads with large `maxFilesPerTrigger` settings, capping the cache of input files at 10,000 can leave a cluster underutilized and make jobs take longer to finish when each batch is slow. For example, a job with `maxFilesPerTrigger` set to 100,000 would process all 100k files in batch 1 but only 10k in batch 2, yet both batches could take just as long because some files cause skewed processing times. The result is a cluster spending nearly the same amount of time while processing only 1/10 of the files it could have.
Does this PR introduce any user-facing change?
Updated the documentation for structured streaming sources to describe the new configuration options.
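To make the motivating scenario concrete, here is a toy model of the batch sizes described above. This is plain Python, not Spark code: the cache mechanics are simplified from the PR description (leftover listed files beyond `maxFilesPerTrigger` are cached, capped at `maxCachedFiles`, and feed the next batch without relisting), and `discardCachedFilesRatio` is not modeled.

```python
def batch_sizes(total_files, max_files_per_trigger, max_cached_files):
    """Yield files processed per micro-batch under a simplified cached-listing
    scheme: a full directory listing feeds batch 1; leftovers beyond
    maxFilesPerTrigger are cached (capped at maxCachedFiles) and feed batch 2."""
    first = min(total_files, max_files_per_trigger)
    yield first
    leftover = total_files - first
    if leftover > 0:
        yield min(leftover, max_cached_files)

# Old hardcoded cap: batch 2 shrinks to 10,000 files even though the trigger
# allows 100,000, so the cluster runs a near-empty batch.
old = list(batch_sizes(200_000, max_files_per_trigger=100_000,
                       max_cached_files=10_000))

# With the cap made configurable, batch 2 can stay full-sized.
new = list(batch_sizes(200_000, max_files_per_trigger=100_000,
                       max_cached_files=100_000))
```

If both batches take roughly the same wall time due to skewed per-file processing times, the old configuration moves 110k files where the new one moves 200k.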
How was this patch tested?
New and existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No