@ragnarok56
Contributor

What changes were proposed in this pull request?

This change adds two configuration options to the streaming file source: maxCachedFiles and discardCachedFilesRatio. Both values were introduced in #27620, where they were hardcoded to 10,000 and 0.2, respectively.

Why are the changes needed?

Under workloads with a large maxFilesPerTrigger setting, capping the cached input file list at 10,000 can leave a cluster underutilized and make jobs take longer to finish when each batch is slow. For example, a job with maxFilesPerTrigger set to 100,000 processes all 100k files in batch 1 but only the 10k cached files in batch 2, yet both batches can take about as long because a few files with skewed processing times dominate the batch duration. The cluster then spends nearly the same amount of time on batch 2 while processing only 1/10 of the files it could have.
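
To make the batch sizes in that example concrete, here is a rough Python sketch of the caching behavior as I understand it. This is a simplified model, not the actual FileStreamSource code: the function name `simulate_batches` and its parameters are made up for illustration, and it assumes no new files arrive between batches.

```python
def simulate_batches(total_files, max_files_per_trigger,
                     max_cached_files=10_000, discard_ratio=0.2,
                     num_batches=3):
    """Return the number of files read in each micro-batch.

    Simplified model (an assumption, not Spark's real implementation):
    - with an empty cache, the source lists every unprocessed file;
    - with a non-empty cache, the batch reads only from the cache;
    - leftover files are cached, capped at max_cached_files, unless the
      leftover is smaller than max_files_per_trigger * discard_ratio.
    """
    remaining = total_files  # files not yet processed
    cached = 0               # unread files carried over from the last listing
    sizes = []
    for _ in range(num_batches):
        if remaining == 0:
            break
        available = cached if cached > 0 else remaining
        batch = min(available, max_files_per_trigger)
        leftover = available - batch
        if leftover < max_files_per_trigger * discard_ratio:
            cached = 0  # too few leftovers: drop them and re-list next time
        else:
            cached = min(leftover, max_cached_files)
        remaining -= batch
        sizes.append(batch)
    return sizes


# Hardcoded cache cap of 10,000: the second batch is starved.
print(simulate_batches(1_000_000, 100_000))
# Cache cap raised to match maxFilesPerTrigger: every batch is full.
print(simulate_batches(1_000_000, 100_000, max_cached_files=100_000))
```

With the hardcoded cap, the model gives batches of 100,000, 10,000 and 100,000 files. Raising the cap to 100,000 gives three full batches.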

Does this PR introduce any user-facing change?

Updated the documentation for structured streaming sources to describe the new configuration options.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman
Contributor

ping @HeartSaVioR since you added the caching back in the day

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 25, 2024
@github-actions github-actions bot closed this Feb 26, 2024