InferDependenciesRequest should support/allow batching #15197

Open
thejcannon opened this issue Apr 20, 2022 · 10 comments

Comments

@thejcannon
Member

thejcannon commented Apr 20, 2022

Is your feature request related to a problem? Please describe.
One process per source file has far too much overhead, given how simple inferring dependencies usually is.

Describe the solution you'd like
Much like fmt/lint requests, InferDependenciesRequest would benefit greatly from stably partitioned batches.
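
To make this concrete, here is a rough sketch of what a batched request might look like. It is written as plain dataclasses rather than the real Pants plugin API, and every name in it (`InferDependenciesBatchRequest`, `InferredBatchDependencies`, `infer_python_deps_batched`) is a hypothetical illustration, not an existing Pants type:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InferDependenciesBatchRequest:
    """Hypothetical batched counterpart to InferDependenciesRequest.

    Instead of one request (and potentially one process) per source file,
    a plugin would receive a stable partition of source files and infer
    dependencies for all of them in a single pass.
    """

    sources: tuple[str, ...]  # paths of the source files in this batch


@dataclass(frozen=True)
class InferredBatchDependencies:
    """Hypothetical result: inferred dependency addresses keyed by source file."""

    dependencies: dict[str, tuple[str, ...]]


def infer_python_deps_batched(
    request: InferDependenciesBatchRequest,
) -> InferredBatchDependencies:
    # In real Pants this would be an @rule registered via a UnionRule; the point
    # here is only that one call handles the whole batch rather than N calls
    # handling one file each.
    return InferredBatchDependencies(
        dependencies={path: () for path in request.sources}
    )
```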

Describe alternatives you've considered
N/A

Additional context
N/A

@thejcannon
Member Author

I'm leaning towards this just being an entirely new union, to maintain backwards compatibility.
OTOH, making this a breaking change would force clients to think about batching, which potentially has great perf benefits.

@thejcannon
Member Author

@stuhood thoughts?

@stuhood
Member

stuhood commented Apr 23, 2022

Potentially... I think that unless the partial caching idea that you and @benjyw have been pushing lands, this would reduce cache hit rates by a fair amount. I also haven't really heard any complaints about inference runtime (I think because the cache hit rate tends to be so high), but maybe you've heard differently.

IMO, by far the most important case to optimize for inference is "I've edited one file and am re-running my test": assuming that the batch created in that case contained only one file, it wouldn't have a negative impact at least.

@benjyw
Contributor

benjyw commented Apr 24, 2022

Inference time is probably a drag on CI runtimes at least, where caching isn't always available, and we have to re-infer the entire repo every time.

More generally, I will continue to push on needing "fact caching" of some kind, so that we can decouple the data we're trying to obtain from the process(es) needed to obtain it. There are so many examples of where the process invocation overhead is prohibitive or otherwise a problem. We need to free ourselves of the need to run processes at the desired cache granularity.

Another example is pytest: when tests have to do non-trivial setup in pytest hooks (such as Django tests setting up dbs and applying migrations), running each test file in its own process is very expensive, much slower than just running pytest manually, and the ~only reason we do this is that we want to cache the test results at that granularity.
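
For illustration, the kind of per-session setup described above might look roughly like this (a sketch assuming a Django project driven by a plain pytest session fixture; the fixture name and settings wiring are assumptions, not taken from any particular repo):

```python
# conftest.py - sketch of non-trivial per-session setup of the kind described above.
# Assumes a Django project whose settings are reachable via DJANGO_SETTINGS_MODULE.
import django
import pytest
from django.core.management import call_command


@pytest.fixture(scope="session", autouse=True)
def test_database():
    # "session" scope means this runs once per pytest *process*. If each test
    # file runs in its own process (to cache results per file), the database
    # setup and migrations are repeated for every file instead of once per run,
    # which is where the overhead comes from.
    django.setup()
    call_command("migrate")
    yield
```

Batching test files into a single process would run that fixture once per run instead of once per file, which is the tradeoff against per-file caching and isolation discussed in the following comments.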

@stuhood
Member

stuhood commented Apr 25, 2022

> Inference time is probably a drag on CI runtimes at least, where caching isn't always available, and we have to re-infer the entire repo every time.

This hasn't been my experience... CI environments are the spots where folks invest the most time in caching, in order to be able to skip test runs.

> Another example is pytest: when tests have to do non-trivial setup in pytest hooks (such as Django tests setting up dbs and applying migrations), running each test file in its own process is very expensive, much slower than just running pytest manually, and the ~only reason we do this is that we want to cache the test results at that granularity.

While dependency inference code is code that we have written and control, end user code isn't, and the cucumber use case described in #14941 would be very likely to break if it were cached this way, because we wouldn't be accounting for the interplay between tests. The other important reason we run them in isolation is hermeticity and reproducibility.

So: could we do this by default for dependency inference? Probably. Could we do it by default for pytest? I don't think so.

@benjyw
Contributor

benjyw commented Apr 25, 2022

> So: could we do this by default for dependency inference? Probably. Could we do it by default for pytest? I don't think so.

By default, maybe not, but we could let users control it via an option. People may prefer performance to hermeticity, at least on desktop if not in CI, and we should give them that slider.

@benjyw
Contributor

benjyw commented Apr 25, 2022

BTW, another use case for caching that is not directly tied to the process as it ran: it would let us eliminate the PATH from the cache key, so that desktop runs could enjoy more cache hits.

@benjyw
Contributor

benjyw commented Apr 26, 2022

Also, re pytest: #14941

Some users clearly want this, and it's not at all clear to me that one or the other is more philosophically correct. It's more about tradeoffs.

@stuhood
Member

stuhood commented Apr 26, 2022

> Also, re pytest: #14941
>
> Some users clearly want this, and it's not at all clear to me that one or the other is more philosophically correct. It's more about tradeoffs.

I responded about that above in #15197 (comment)

If you added fact caching and batched tests, and then cucumber wrote a shared JSON file... it would cause a reproducibility bug. The second comment on that ticket is actually exactly the case where we could not cache the entries independently.


We've hijacked this ticket. Discussion of "fact caching" should probably go on another ticket.

@thejcannon
Member Author

Coming back on topic 😄 ...

> Potentially... I think that unless the partial caching idea that you and @benjyw have been pushing lands, this would reduce cache hit rates by a fair amount. I also haven't really heard any complaints about inference runtime (I think because the cache hit rate tends to be so high), but maybe you've heard differently.

Funnily enough, what spurred this was me being annoyed at the UI when we're scheduling and running hundreds of processes. Then I thought, "why isn't this batched?"

So I think really, this ticket can and should only be addressed when we have something that looks like per-file caching (whatever it is). And at some point we can also think about UI for Pants more generally and holistically (with this "issue" being one focus).
