Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

schomatis · 2021-07-30T15:31:34Z

Continuation and expansion of #94 (comment). Discussing in its own issue as I think this might be too long for a review thread.

The final advice there was to remove the timeout and block on the fetch. I'm not challenging the given advice but giving the full context to make sure we're aligned here since this can have a potentially negative UX effect.

Brief background on the unsharding algorithm:

The user modifies the HAMT directory (adds/removes an entry).
The size has potentially changed. We need to evaluate if it has fallen below the threshold to switch to a basic directory. We do this in the same add/remove operation; we do not defer it to a later stage like the extraction of the CID in GetNode.
We don't want to fetch all the entries so we don't compute the total/exact size and instead just try to fetch as little as possible to prove we are still above the threshold.
If some shards are unavailable in the network (high delay, missing/removed, etc.) we might not be able to retrieve enough shards to make sure we are above the threshold but we can't also claim we are below it.
Since the priority is to have a deterministic relation between directory size and CID in case of doubt we fail the operation, even if we were able to perform the original add/remove.
We can never really be sure when a shard is missing or actually has a long delay so we use a timeout to decide when to fail (for the reason explained above).

Here I'd say: the user has asked us to do an operation, we should keep trying until they cancel.

The user asked to do an add/remove operation but it may not be aware that it entails a potentially big fetch of shards well beyond the ones that were needed for the original operation.

Following this advice we would then remove the timeout and depend entirely on the client when to stop the operation the same way it had to, even before the sharding feature, as an add/remove operation can always be blocking if we don't find the necessary shards to operate on.
The advantage is the reduction of complexity as explained by Steven in the original comment.
The potential disadvantage is depending on a user-defined timeout that was designed for specific add/remove operations and not for ~~massive~~ these "extra" fetches that might exceed the ones needed originally by a noticeable degree.

If we're comfortable going this way a simple 'go for it' is enough; no need to keep expanding on the motivation to remove the timeout.

The text was updated successfully, but these errors were encountered:

aschmahmann · 2021-07-30T16:44:21Z

and not for massive fetches

Why do you think the fetches need to be massive? Is it because the default number of elements per shard is much smaller than the amount of data that could fit into a single directory meaning that one directory will become a bunch of shards?

schomatis · 2021-07-30T23:33:08Z

Sorry, massive might have been an exaggeration. But yes, the intention was to emphasize that the amount of fetches might exceed by a non-negligible degree the number needed to perform the original add/remove operation.

aschmahmann · 2021-08-04T15:50:52Z

The options proposed are:

Fail to do and add/remove because we hit a timeout
Don't have the timeout and allow the caller to cancel it upstream if the call takes too long

The latter seems preferable to me since it gives the caller more choice as to what to do.

schomatis · 2021-08-04T21:36:07Z

Don't have the timeout and allow the caller to cancel it upstream if the call takes too long

Going with this then, thanks.

schomatis added the need/triage Needs initial labeling and prioritization label Jul 30, 2021

schomatis self-assigned this Jul 30, 2021

schomatis mentioned this issue Jul 30, 2021

Size-based unsharding #94

Merged

4 tasks

schomatis closed this as completed Aug 4, 2021

schomatis mentioned this issue Aug 6, 2021

Remove tiemout and refactor sizeBelowThreshold #102

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

schomatis commented Jul 30, 2021 •

edited

Loading

aschmahmann commented Jul 30, 2021

schomatis commented Jul 30, 2021

aschmahmann commented Aug 4, 2021

schomatis commented Aug 4, 2021

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

Comments

schomatis commented Jul 30, 2021 • edited Loading

aschmahmann commented Jul 30, 2021

schomatis commented Jul 30, 2021

aschmahmann commented Aug 4, 2021

schomatis commented Aug 4, 2021

schomatis commented Jul 30, 2021 •

edited

Loading