InferDependenciesRequest should support/allow batching #15197

Open
thejcannon opened this issue Apr 20, 2022 · 10 comments

Comments

@thejcannon
Member

thejcannon commented Apr 20, 2022

Is your feature request related to a problem? Please describe.
One process per source file has far too much overhead, given how simple inferring dependencies usually is.

Describe the solution you'd like
Much like fmt/lint requests, InferDependenciesRequest would benefit greatly from stably partitioned batches.
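
To make this concrete, here is a rough sketch of what a batched request might look like. It is written as plain dataclasses rather than the real Pants plugin API, and every name in it (`InferDependenciesBatchRequest`, `InferredBatchDependencies`, `infer_python_deps_batched`) is a hypothetical illustration, not an existing Pants type:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InferDependenciesBatchRequest:
    """Hypothetical batched counterpart to InferDependenciesRequest.

    Instead of one request (and potentially one process) per source file,
    a plugin would receive a stable partition of source files and infer
    dependencies for all of them in a single pass.
    """

    sources: tuple[str, ...]  # paths of the source files in this batch


@dataclass(frozen=True)
class InferredBatchDependencies:
    """Hypothetical result: inferred dependency addresses keyed by source file."""

    dependencies: dict[str, tuple[str, ...]]


def infer_python_deps_batched(
    request: InferDependenciesBatchRequest,
) -> InferredBatchDependencies:
    # In real Pants this would be an @rule registered via a UnionRule; the point
    # here is only that one call handles the whole batch rather than N calls
    # handling one file each.
    return InferredBatchDependencies(
        dependencies={path: () for path in request.sources}
    )
```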

Describe alternatives you've considered
N/A

Additional context
N/A

@thejcannon
Member Author

I'm leaning towards this just being an entirely new union, to maintain backwards compatibility.
OTOH, making this a breaking change would force clients to think about batching, which potentially has great perf benefits.

@thejcannon
Member Author

@stuhood thoughts?

@stuhood
Member

stuhood commented Apr 23, 2022

Potentially... I think that unless the partial caching idea that you and @benjyw have been pushing lands, this would reduce cache hit rates by a fair amount. I also haven't really heard any complaints about inference runtime (I think because the cache hit rate tends to be so high), but maybe you've heard differently.

IMO, by far the most important case to optimize for inference is "I've edited one file and am re-running my test": assuming that the batch created in that case contained only one file, it wouldn't have a negative impact at least.

@benjyw
Contributor

benjyw commented Apr 24, 2022

Inference time is probably a drag on CI runtimes at least, where caching isn't always available, and we have to re-infer the entire repo every time.

More generally, I will continue to push on needing "fact caching" of some kind, so that we can decouple the data we're trying to obtain from the process(es) needed to obtain it. There are so many examples of where the process invocation overhead is prohibitive or otherwise a problem. We need to free ourselves of the need to run processes at the desired cache granularity.

Another example is pytest: when tests have to do non-trivial setup in pytest hooks (such as Django tests setting up dbs and applying migrations), running each test file in its own process is very expensive, much slower than just running pytest manually, and the ~only reason we do this is that we want to cache the test results at that granularity.
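
For illustration, the kind of per-session setup described above might look roughly like this (a sketch assuming a Django project driven by a plain pytest session fixture; the fixture name and settings wiring are assumptions, not taken from any particular repo):

```python
# conftest.py - sketch of non-trivial per-session setup of the kind described above.
# Assumes a Django project whose settings are reachable via DJANGO_SETTINGS_MODULE.
import django
import pytest
from django.core.management import call_command


@pytest.fixture(scope="session", autouse=True)
def test_database():
    # "session" scope means this runs once per pytest *process*. If each test
    # file runs in its own process (to cache results per file), the database
    # setup and migrations are repeated for every file instead of once per run,
    # which is where the overhead comes from.
    django.setup()
    call_command("migrate")
    yield
```

Batching test files into a single process would run that fixture once per run instead of once per file, which is the tradeoff against per-file caching and isolation discussed in the following comments.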

@stuhood
Member

stuhood commented Apr 25, 2022

> Inference time is probably a drag on CI runtimes at least, where caching isn't always available, and we have to re-infer the entire repo every time.

This hasn't been my experience... CI environments are the spots where folks invest the most time in caching, in order to be able to skip test runs.

> Another example is pytest: when tests have to do non-trivial setup in pytest hooks (such as Django tests setting up dbs and applying migrations), running each test file in its own process is very expensive, much slower than just running pytest manually, and the ~only reason we do this is that we want to cache the test results at that granularity.

While dependency inference code is code that we have written and control, end user code isn't, and the cucumber use case described in #14941 would be very likely to break if it were cached this way, because we wouldn't be accounting for the interplay between tests. The other important reason we run them in isolation is hermeticity and reproducibility.

So: could we do this by default for dependency inference? Probably. Could we do it by default for pytest? I don't think so.

@benjyw
Contributor

benjyw commented Apr 25, 2022

> So: could we do this by default for dependency inference? Probably. Could we do it by default for pytest? I don't think so.

By default, maybe not, but we could let users control it via an option. People may prefer performance to hermeticity, at least on desktop if not in CI, and we should give them that slider.

@benjyw
Contributor

benjyw commented Apr 25, 2022

BTW, another use case for caching that is not directly tied to the process as it ran: it would let us eliminate the PATH from the cache key, so that desktop runs could enjoy more cache hits.

@benjyw
Contributor

benjyw commented Apr 26, 2022

Also, re pytest: #14941

Some users clearly want this, and it's not at all clear to me that one or the other is more philosophically correct. It's more about tradeoffs.

@stuhood
Member

stuhood commented Apr 26, 2022

> Also, re pytest: #14941
>
> Some users clearly want this, and it's not at all clear to me that one or the other is more philosophically correct. It's more about tradeoffs.

I responded about that above in #15197 (comment)

If you added fact caching and batched tests, and then cucumber wrote a shared JSON file... it would cause a reproducibility bug. The second comment on that ticket is actually exactly the case where we could not cache the entries independently.


We've hijacked this ticket. Discussion of "fact caching" should probably go on another ticket.

@thejcannon
Member Author

Coming back on topic 😄 ...

> Potentially... I think that unless the partial caching idea that you and @benjyw have been pushing lands, this would reduce cache hit rates by a fair amount. I also haven't really heard any complaints about inference runtime (I think because the cache hit rate tends to be so high), but maybe you've heard differently.

Funnily enough, what spurred this was me being annoyed at the UI when we're scheduling and running hundreds of processes. Then I thought, "why isn't this batched?"

So I think really, this ticket can and should only be addressed when we have something that looks like per-file caching (whatever it is). And at some point we can also think about UI for Pants more generally and holistically (with this "issue" being one focus).
