Should we always set min_batch_size == max_batch_size? #495

Open
inahga opened this issue Sep 13, 2023 · 2 comments
Comments

@inahga
Contributor

inahga commented Sep 13, 2023

From #286 (comment).

For fixed-size queries, we prompt the user to specify a max batch size. We'll probably always recommend that the user set min and max equal to each other, which would make this field meaningless.

The DAP specification also raises the possibility of removing this ability entirely: https://www.ietf.org/archive/id/draft-ietf-ppm-dap-06.html#section-4.1.2-6

@branlwyd
Member

I think we want to either:

  • Always set min_batch_size == max_batch_size. This is a simple approach that will work pretty well, and always provide equally-sized batches, which the Collector might appreciate.

  • Set max_batch_size to a small multiple of min_batch_size, perhaps 110% of the min_batch_size. This would allow a small fraction of reports inside the batch to fail aggregation without requiring an extra "tiny" aggregation job after all of the other aggregation jobs. That would decrease the number of aggregation jobs that our system needs to handle and would (slightly) reduce latency-to-aggregate-availability in the case that reports do fail to aggregate.

In previous discussions, I was leaning towards the former, but I now think we should take the max_batch_size = 110% of min_batch_size approach. (I'm curious if any collectors would care about receiving same-sized batches -- I think the answer is yes for at least some collectors, since this was one of the reasons that fixed-size was invented in the first place, but I do not know how common this requirement will be.)

In any case, I think we don't need to expose max_batch_size as a user-controllable parameter.
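
To make the trade-off between the two options concrete, here is a minimal Python sketch. The helper names (`derive_max_batch_size`, `needs_followup_job`) and the integer `headroom_percent` parameter are hypothetical and not taken from any existing codebase. The model is also a deliberate simplification: it assumes a batch is filled up front and that reports which fail aggregation simply drop out of it.

```python
def derive_max_batch_size(min_batch_size: int, headroom_percent: int = 10) -> int:
    """Hypothetical helper: max_batch_size as a small multiple of min_batch_size.

    headroom_percent=0 gives min_batch_size == max_batch_size (the first option);
    headroom_percent=10 gives the "110% of min_batch_size" option.
    """
    if min_batch_size < 1:
        raise ValueError("min_batch_size must be at least 1")
    if headroom_percent < 0:
        raise ValueError("headroom_percent must not be negative")
    # Integer arithmetic avoids floating-point surprises such as 100 * 1.1 != 110.
    return min_batch_size * (100 + headroom_percent) // 100


def needs_followup_job(assigned: int, failed: int, min_batch_size: int) -> bool:
    """True if the surviving reports no longer reach min_batch_size (simplified model)."""
    return assigned - failed < min_batch_size


min_batch_size = 100
failed = 3  # assumed number of reports that fail aggregation

for percent in (0, 10):
    max_batch_size = derive_max_batch_size(min_batch_size, percent)
    followup = needs_followup_job(max_batch_size, failed, min_batch_size)
    print(percent, max_batch_size, followup)
```

With these assumed numbers, the equal-size configuration needs a follow-up job to replace the three failed reports, while the 110% configuration can absorb up to ten failures and still reach min_batch_size.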

@divergentdave
Contributor

In situations where Clients add DP noise as a preprocessing step before report sharding, I think people may want min_batch_size == max_batch_size. That will make calibrating the per-Client noise easier.

This may be a good case for providing one-size-fits-all behavior via the web console, and allowing more sophisticated choices for tasks created via API.
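
The calibration point above can be illustrated with a short Python sketch. It assumes each Client adds independent, additive noise with a finite variance, so that the variances add up across the batch; it does not model any particular DP mechanism. The function names and the example numbers are hypothetical, and calibrating against min_batch_size is an assumption made here for illustration.

```python
from fractions import Fraction


def per_client_variance(target_aggregate_variance: Fraction, assumed_batch_size: int) -> Fraction:
    """Noise variance each Client adds so that a batch of the assumed size hits the target."""
    return Fraction(target_aggregate_variance) / assumed_batch_size


def aggregate_variance(client_variance: Fraction, actual_batch_size: int) -> Fraction:
    """Variance of the summed noise; independent noise terms add their variances."""
    return client_variance * actual_batch_size


target = Fraction(4)  # assumed target variance for the aggregate noise
min_batch_size, max_batch_size = 100, 110

# Assumption: Clients calibrate against min_batch_size.
client_var = per_client_variance(target, min_batch_size)

for actual in (min_batch_size, max_batch_size):
    print(actual, aggregate_variance(client_var, actual))
```

Under these assumptions, a batch of exactly min_batch_size reports carries the targeted noise, while a batch filled to max_batch_size carries max_batch_size / min_batch_size times as much. Setting min_batch_size == max_batch_size removes that spread, which is one way to read "easier to calibrate".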
