Skip to content
This repository has been archived by the owner on Jun 27, 2023. It is now read-only.

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm #100

Closed
schomatis opened this issue Jul 30, 2021 · 4 comments
Assignees
Labels
need/triage Needs initial labeling and prioritization

Comments

@schomatis
Copy link
Contributor

schomatis commented Jul 30, 2021

Continuation and expansion of #94 (comment). Discussing in its own issue as I think this might be too long for a review thread.

The final advice there was to remove the timeout and block on the fetch. I'm not challenging the given advice but giving the full context to make sure we're aligned here since this can have a potentially negative UX effect.

Brief background on the unsharding algorithm:

  1. The user modifies the HAMT directory (adds/removes an entry).
  2. The size has potentially changed. We need to evaluate if it has fallen below the threshold to switch to a basic directory. We do this in the same add/remove operation; we do not defer it to a later stage like the extraction of the CID in GetNode.
  3. We don't want to fetch all the entries so we don't compute the total/exact size and instead just try to fetch as little as possible to prove we are still above the threshold.
  4. If some shards are unavailable in the network (high delay, missing/removed, etc.) we might not be able to retrieve enough shards to make sure we are above the threshold but we can't also claim we are below it.
  5. Since the priority is to have a deterministic relation between directory size and CID in case of doubt we fail the operation, even if we were able to perform the original add/remove.
  6. We can never really be sure when a shard is missing or actually has a long delay so we use a timeout to decide when to fail (for the reason explained above).

Here I'd say: the user has asked us to do an operation, we should keep trying until they cancel.

The user asked to do an add/remove operation but it may not be aware that it entails a potentially big fetch of shards well beyond the ones that were needed for the original operation.

  • Following this advice we would then remove the timeout and depend entirely on the client when to stop the operation the same way it had to, even before the sharding feature, as an add/remove operation can always be blocking if we don't find the necessary shards to operate on.

  • The advantage is the reduction of complexity as explained by Steven in the original comment.

  • The potential disadvantage is depending on a user-defined timeout that was designed for specific add/remove operations and not for massive these "extra" fetches that might exceed the ones needed originally by a noticeable degree.

If we're comfortable going this way a simple 'go for it' is enough; no need to keep expanding on the motivation to remove the timeout.

@schomatis schomatis added the need/triage Needs initial labeling and prioritization label Jul 30, 2021
@schomatis schomatis self-assigned this Jul 30, 2021
@schomatis schomatis mentioned this issue Jul 30, 2021
4 tasks
@aschmahmann
Copy link
Contributor

and not for massive fetches

Why do you think the fetches need to be massive? Is it because the default number of elements per shard is much smaller than the amount of data that could fit into a single directory meaning that one directory will become a bunch of shards?

@schomatis
Copy link
Contributor Author

Sorry, massive might have been an exaggeration. But yes, the intention was to emphasize that the amount of fetches might exceed by a non-negligible degree the number needed to perform the original add/remove operation.

@aschmahmann
Copy link
Contributor

The options proposed are:

  1. Fail to do and add/remove because we hit a timeout
  2. Don't have the timeout and allow the caller to cancel it upstream if the call takes too long

The latter seems preferable to me since it gives the caller more choice as to what to do.

@schomatis
Copy link
Contributor Author

Don't have the timeout and allow the caller to cancel it upstream if the call takes too long

Going with this then, thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
need/triage Needs initial labeling and prioritization
Projects
None yet
Development

No branches or pull requests

2 participants