Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Conversation

@eskimor
Copy link
Member

@eskimor eskimor commented Mar 21, 2021

This should fix the bug that streams won't get flushed on a timely
manner in Polkadot.

This should fix the bug that streams won't get flushed on a timely
manner.
@eskimor eskimor added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Mar 21, 2021
@eskimor eskimor requested review from bkchr and romanb March 21, 2021 11:36
@tomaka
Copy link
Contributor

tomaka commented Mar 21, 2021

What exactly is the problem that you're observing and that you're trying to fix?

@eskimor
Copy link
Member Author

eskimor commented Mar 21, 2021

What exactly is the problem that you're observing and that you're trying to fix?

Flushing of data seems to be delayed quite a lot (seconds) in Polkadot. Looking at the code, the current triggering of flushing looks imperfect as a call to poll not necessarily results in a poll_flush at all, depending on what else is going on in other substreams and events_queue. It looks like some amount of load could easily delay flushing of a particular stream a lot, which would explain what we are seeing.

We still need to test on rococo, whether this actually fixes the issue.

@eskimor
Copy link
Member Author

eskimor commented Mar 21, 2021

What exactly is the problem that you're observing and that you're trying to fix?

Also see #8405 for an earlier (and flawed) approach.

@tomaka
Copy link
Contributor

tomaka commented Mar 21, 2021

It looks like some amount of load could easily delay flushing of a particular stream a lot

This can only happen if all cores are 100% busy and the task is never scheduled. Is that what we observe?

The code is far from being perfect, but I don't think it's a good idea to add weird-looking code if we're not sure that it solves a problem.

@eskimor
Copy link
Member Author

eskimor commented Mar 21, 2021

This can only happen if all cores are 100% busy and the task is never scheduled. Is that what we observe?

If waker.wake() is called multiple times, will then the task also be invoked multiple times - or do get duplicates get filtered in the executor? I mean we will be invoked even if we returned Ready already, so we don't behave like a standard future ... So I guess the question becomes, under what circumstances does poll get called here actually?

@tomaka
Copy link
Contributor

tomaka commented Mar 21, 2021

It is no different than a regular future: poll is called when the upper layer is ready to accept an event.

If waker.wake() is called multiple times, poll is only called once.
Returning Poll::Ready doesn't strictly guarantee that poll is called immediately after. After Ready is returned, the upper layer is aware that there might be more to do, but calls poll again only when it's ready to accept another event.

@bkchr
Copy link
Member

bkchr commented Mar 21, 2021

@tomaka there is some similar bug with flushing as you fixed in November on some upper layer? As in November we see that a notification is written, but sometimes it takes 3-6 seconds to reach the other end. The size of the notification is only 160 bytes, so that should not be related to taking too being to slow to transfer.

@tomaka
Copy link
Contributor

tomaka commented Mar 21, 2021

there is some similar bug with flushing as you fixed in November on some upper layer?

#7594
It introduces the code that #8405 wanted to remove.

The reason why, at the time, I added a small hack instead of fixing it properly is the legacy substream, which is now gone from the code base.

@bkchr
Copy link
Member

bkchr commented Mar 21, 2021

Edit: I'm dumb.

I know that you are on vacation, could you give us a hint on what needs to be done for a Proper fix?

@tomaka
Copy link
Contributor

tomaka commented Mar 21, 2021

I don't know what to fix since I don't know what the bug is. The code looks functional to me as it is in master. Not great in quality, but also not buggy.
It's indeed possible that some busy substream starves others, but clearly not for 3 to 6 seconds.

@eskimor
Copy link
Member Author

eskimor commented Mar 21, 2021

Still, if the initial fix of calling wake was supposed to fix anything, then I don't quite understand how that would be guaranteed, e.g. in the following scenario:

  • We call waker.wake
  • poll gets called
  • some input stream is ready - so it won't call wake.
  • we return and burned our wake call - it had no effect on producing an earlier flush.

The previous fix would only work, if we were like completely idle, but a simple ready input stream would ruin it. This PR is meant to fix this. What am I missing?

@eskimor
Copy link
Member Author

eskimor commented Mar 22, 2021

Ok, after a good night sleep, it is rather obvious: We are not worried about getting called in a timely manner when we return Ready, but only on Pending - that makes sense.

@eskimor
Copy link
Member Author

eskimor commented Mar 22, 2021

Closing in favour of #8422

@eskimor eskimor closed this Mar 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants