Skip to content

improve the deduplication logic#2014

Merged
garypen merged 5 commits intodevfrom
garypen/1984-timeout
Oct 27, 2022
Merged

improve the deduplication logic#2014
garypen merged 5 commits intodevfrom
garypen/1984-timeout

Conversation

@garypen
Copy link
Contributor

@garypen garypen commented Oct 27, 2022

Close examination revealed that some judicious tweaking of the locking and spawning would improve the behaviour.

In performance testing this performs the same as current (dev) or using the deduplicate crate. I can't reproduce the hanging problem with dev with this set of changes.

The rationale for the changes are as follows:

  • We need to spawn the sentinel before we insert the waiter entry or failure after inserting, but before spawning a sentinel, will cause issues.
  • Notification of waiters from the "first" entry must be synchronised under the "wait_map" lock, or else that becomes racy and subscribers may be missed. That notification must be immediately preceded by the removal of the wait_map entry.

fixes: #1984

Gary Pennington added 2 commits October 27, 2022 13:36
Close examination revealed that some judicious tweaking of the locking
and spawning would improve the behaviour.

In performance testing this performs the same as current (dev) or using
the deduplicate crate. I can't reproduce the hanging problem with dev
with this set of changes.
@garypen garypen requested review from Geal and bnjjj October 27, 2022 12:44
@garypen garypen self-assigned this Oct 27, 2022
@github-actions

This comment has been minimized.

@garypen garypen changed the title Garypen/1984 timeout fix the deduplication logic Oct 27, 2022
@garypen garypen changed the title fix the deduplication logic improve the deduplication logic Oct 27, 2022
Copy link
Contributor

@Geal Geal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds very reasonable. So the issue you were seeing would have been that getting data from the storage at line 65 waits indefinitely or the task stops, and so the sender would remain?
Or was it the race between removing the entry from the waitmap and sending the value?

@garypen
Copy link
Contributor Author

garypen commented Oct 27, 2022

that sounds very reasonable. So the issue you were seeing would have been that getting data from the storage at line 65 waits indefinitely or the task stops, and so the sender would remain? Or was it the race between removing the entry from the waitmap and sending the value?

Moving the sentinel before the insert guarantees that we can't have an entry in the entrymap that lives longer than we desire. That's the really important change, since otherwise we might have inserted an entry that will never be removed. With such a thing in our wait_map, then new clients just subscribe and sit there waiting for nothing.

We absolutely must remove the waiters (under lock) before send, but we could perhaps drop the lock just before the send and still be safe. I didn't want to take that risk, since performance is already very good and we know this area is complex to work with.

UPDATE: I tried dropping the lock before the send and it appears to work fine. Performance isn't any better than when we hold the lock over the send, so I'll leave things as they are for now.

@garypen garypen merged commit 688f92b into dev Oct 27, 2022
@garypen garypen deleted the garypen/1984-timeout branch October 27, 2022 16:10
@abernix abernix mentioned this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

router needs a timeout for delegated fetch operations

2 participants