InferDependenciesRequest should support/allow batching #15197
I'm leaning towards this just being an entirely new union, to maintain backwards compatibility.
@stuhood thoughts?
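For concreteness, a minimal sketch of what a separate batched union could look like. Everything named `Batched*` below is hypothetical, not an existing Pants API; it only illustrates the "new union alongside the old one" idea:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import ClassVar, Type

from pants.engine.target import InferredDependencies, SourcesField
from pants.engine.unions import union


# Hypothetical new union: one request carries many sources fields, so a
# single rule invocation (and ideally a single process) can infer
# dependencies for a whole batch. The existing per-field
# InferDependenciesRequest stays untouched for backwards compatibility.
@union
@dataclass(frozen=True)
class BatchedInferDependenciesRequest:
    sources_fields: tuple[SourcesField, ...]
    infer_from: ClassVar[Type[SourcesField]]


# Hypothetical result type: one InferredDependencies per input field,
# positionally aligned with `sources_fields` so results can be mapped
# back to their owning targets.
@dataclass(frozen=True)
class BatchedInferredDependencies:
    results: tuple[InferredDependencies, ...]
```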
Potentially... I think that unless the partial caching idea that you and @benjyw have been pushing lands, this would reduce cache hit rates by a fair amount. I also haven't really heard any complaints about inference runtime (I think because the cache hit rate tends to be so high), but maybe you've heard differently. IMO, by far the most important case to optimize inference for is "I've edited one file and am re-running my test": assuming the batch created in that case contains only one file, batching wouldn't have a negative impact there at least.
Inference time is probably a drag on CI runtimes at least, where caching isn't always available and we have to re-infer the entire repo every time.

More generally, I will continue to push for "fact caching" of some kind, so that we can decouple the data we're trying to obtain from the process(es) needed to obtain it. There are many examples where the process invocation overhead is prohibitive or otherwise a problem; we need to free ourselves of the need to run processes at the desired cache granularity. Another example is pytest: when tests have to do non-trivial setup in pytest hooks (such as Django tests setting up databases and applying migrations), running each test file in its own process is very expensive, much slower than just running pytest manually, and the ~only reason we do it is so we can cache the test results at that granularity.
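To make the "fact caching" idea concrete, here is a rough sketch of per-file caching decoupled from the batch process. All names here are hypothetical (`run_process` stands in for whatever actually invokes the tool); the point is only that the cache key is derived from each file's inputs, not from the process that happened to produce the result:

```python
from __future__ import annotations

import hashlib
import json
from typing import Callable


def fact_cache_key(path: str, content_digest: str, tool_version: str) -> str:
    # Per-file key: depends only on the file's content and the analyzing
    # tool's version, never on which batch process ran.
    payload = json.dumps(
        {"path": path, "digest": content_digest, "tool": tool_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def infer_batch(
    files: dict[str, str],  # path -> content digest
    cache: dict[str, list[str]],  # fact cache: key -> inferred deps
    run_process: Callable[[list[str]], dict[str, list[str]]],
    tool_version: str = "tool-v1",
) -> dict[str, list[str]]:
    # Run one process for all cache misses, then store each file's deps
    # under its own key, so a later run that batches the files
    # differently still hits the cache file-by-file.
    misses = [p for p, d in files.items() if fact_cache_key(p, d, tool_version) not in cache]
    if misses:
        for path, deps in run_process(misses).items():
            cache[fact_cache_key(path, files[path], tool_version)] = deps
    return {p: cache[fact_cache_key(p, d, tool_version)] for p, d in files.items()}
```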
This hasn't been my experience... CI environments are the spots where folks invest the most time in caching, in order to be able to skip test runs.
While dependency inference code is code that we have written and control, end user code isn't, and the cucumber use case described in #14941 would very likely break if it were cached this way, because we wouldn't be accounting for the interplay between tests. The other important reason we run them in isolation is hermeticity and reproducibility. So: could we do this by default for dependency inference? Probably. Could we do it by default for pytest? I don't think so.
By default maybe not, but we could let you control it via an option. People may prefer performance to hermeticity, at least on desktop if not in CI, and we should give them that slider.
BTW, another use case for caching that is not directly tied to the process as it ran: it would let us eliminate the PATH from the cache key, so that desktop runs could enjoy more cache hits.
Also, re pytest: see #14941. Some users clearly want this, and it's not at all clear to me that one or the other approach is more philosophically correct. It's more about tradeoffs.
I responded to that above in #15197 (comment). If you added fact caching and batched tests, and then cucumber wrote a shared JSON file... it would cause a reproducibility bug. The second comment on that ticket is actually exactly the case where we could not cache the entries independently. We've hijacked this ticket, though; discussion of "fact caching" should probably go on another ticket.
Coming back on topic 😄 ...
Funnily enough, what spurred this was me being annoyed at the UI when we are scheduling and running hundreds of processes. Then I thought, "why isn't this batched?" So really, I think this ticket can and should only be addressed once we have something that looks like per-file caching (whatever form it takes). And at some point we can also think about UI for Pants more generally and holistically (with this "issue" being one focus).
Is your feature request related to a problem? Please describe.
One process per source file has way too much overhead for how simple inferring dependencies usually is.
Describe the solution you'd like
Much like fmt/lint requests, InferDependenciesRequest would benefit greatly from stably partitioned batches.

Describe alternatives you've considered
N/A

Additional context
N/A
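For reference, this is roughly the shape of the per-file API in question, simplified from the Python backend (inference logic elided): each backend subclasses `InferDependenciesRequest` with a single sources field, so the engine schedules one rule invocation, and typically one process, per source file.

```python
from pants.backend.python.target_types import PythonSourceField
from pants.engine.rules import collect_rules, rule
from pants.engine.target import InferDependenciesRequest, InferredDependencies
from pants.engine.unions import UnionRule


class InferPythonImportDependencies(InferDependenciesRequest):
    infer_from = PythonSourceField


@rule
async def infer_python_dependencies(
    request: InferPythonImportDependencies,
) -> InferredDependencies:
    # `request.sources_field` is a single field: inference (and often a
    # parsing subprocess) runs once per source file.
    ...


def rules():
    return [
        *collect_rules(),
        UnionRule(InferDependenciesRequest, InferPythonImportDependencies),
    ]
```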