Http2: Hyper client gets stuck if too many requests are spawned #2419

staticgc · 2021-02-05T11:27:50Z

I have a Go HTTPS server to which a Rust client sends a lot of parallel requests.
After a few hundred requests, the client stops and there appears to be no further TCP communication.

The REST API accepts a byte buffer in the request body and responds with its length as JSON.

To reproduce, the entire code is at: https://github.com/staticgc/hyper-stuck
It contains a Go server, a Rust client, and a Go client.
The build instructions are straightforward.

The default maximum number of spawned futures is 400.
This is apparently more than the maximum number of HTTP/2 streams at the server (which is 250).

If the count is reduced to 200, it works.

Other observations:

The Go client works even with a high goroutine count.
Increasing the number of connections in the Rust client from 1 to, say, 3 lets more requests complete, but it still gets stuck.
With 400 futures/requests and 10 connections (chosen arbitrarily), it worked OK.
When HTEST_BUF_SIZE is reduced to, say, 1 KB, it works even with 1 connection and a high future count.
When the Rust client stops, it does so after a predictable number of requests.

"Number of connections" here refers to HTTP/2 negotiated connections, of which hyper (I think) creates only one per Client instance.
So changing HTEST_CONN_COUNT changes the number of Client instances that are created.

Also, to prevent initial flooding at the server, the Rust client makes a single HTTP/2 request on each Client instance before issuing the parallel requests.

for i in 0..conn_count {
     send_req_https(client_vec[i].clone(), bufsz, url.as_str()).await?;
}

Reproducible on macOS and CentOS Linux.

Edit: Apologies for some typos in build instructions in the repo above. I have fixed those. Let me know here if anything remains.

staticgc · 2021-02-09T19:04:30Z

I think I might have found the issue.

When the number of active streams exceeds 250 (the default limit), the future polled here returns Poll::Pending. The parent future, i.e. ClientTask, is then not woken up again when the stream count drops below the threshold.

So the poll here is not called. In my program there is an upper limit on the number of futures (controlled by a semaphore), and it waits for existing streams to finish. All the wake-ups delivered through the unbounded channel req_tx have already been consumed, so nothing is left to wake the future.

@seanmonstar

staticgc · 2021-02-11T15:15:21Z

Here is a work-around:

Wrap hyper::client::Client and guard the request method with a semaphore.

If more parallel requests are needed, create multiple CustomClient instances.

Edit: This workaround is only for HTTP/2

use std::sync::Arc;

use hyper::{client::HttpConnector, Body, Client, Request, Response};
use hyper_openssl::HttpsConnector;
use tokio::sync::Semaphore;

// `Error` and `CustomClient::new_hyper_client()` are not shown in this comment;
// they are assumed to be defined elsewhere in the application.

#[derive(Clone)]
struct CustomClient {
    client: Client<HttpsConnector<HttpConnector>>,
    guard: Arc<Semaphore>,
}

impl CustomClient {
    fn new() -> Result<Self, Error> {
        let client = CustomClient::new_hyper_client()?;
        // Hardcoded limit just below the threshold of 250
        let guard = Arc::new(Semaphore::new(240));

        Ok(CustomClient {
            client,
            guard,
        })
    }

    async fn request(&self, req: Request<Body>) -> Result<Response<Body>, Error> {
        // Is a request allowed right now? If not, wait for a permit.
        // The permit is released when `_permit` is dropped at the end of this function.
        let _permit = self.guard.acquire().await?;

        let rsp = self.client.request(req).await?;
        Ok(rsp)
    }
}
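
The idea behind the workaround (never let more than a fixed number of requests be in flight at once) can be shown without hyper or tokio. The following is only a minimal sketch using std threads and a hand-rolled blocking semaphore; `Permits` and `peak_in_flight` are hypothetical names introduced for illustration, and `thread::yield_now()` stands in for the actual request.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal blocking counting semaphore built from std primitives.
struct Permits {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Permits {
    fn new(n: usize) -> Self {
        Permits { free: Mutex::new(n), cv: Condvar::new() }
    }

    /// Blocks until a permit is available, then takes it.
    fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Returns a permit and wakes one blocked waiter, if any.
    fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Runs `tasks` workers with at most `limit` holding a permit at the same time
/// and returns the highest number of concurrent holders that was observed.
fn peak_in_flight(tasks: usize, limit: usize) -> usize {
    let permits = Arc::new(Permits::new(limit));
    let stats = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)

    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let permits = Arc::clone(&permits);
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                permits.acquire();
                {
                    let mut s = stats.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                thread::yield_now(); // stand-in for sending the request
                stats.lock().unwrap().0 -= 1;
                permits.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let peak = stats.lock().unwrap().1;
    peak
}
```

With `limit` chosen below the server's stream limit (240 versus 250 in the workaround), the peak can never exceed the limit, whatever the total number of tasks.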

cynecx · 2021-03-20T04:00:14Z

Ping. Somehow I came across this issue trying out some stuff between go and tokio/hyper/h2. Since it seems like an issue with h2, it would make sense to move the issue there?

staticgc · 2021-03-20T05:40:56Z

> Ping. Somehow I came across this issue trying out some stuff between go and tokio/hyper/h2. Since it seems like an issue with h2, it would make sense to move the issue there?

@seanmonstar can u pls advise?

silence-coding · 2021-11-01T04:25:40Z

How's this going? Is someone working on it?

fasterthanlime · 2022-10-20T20:25:57Z

Here's a nice self-contained reproduction for this bug: https://github.com/fasterthanlime/h2-repro

jfourie1 · 2022-10-21T19:34:44Z

FWIW, I observed that an existing waker is silently dropped here. In other words, the following assert would fail if added to the top of the function: assert!(self.send_task.is_none())

jfourie1 · 2022-10-21T19:52:03Z

As a quick test/hack, I added the following to the start of the function mentioned above:

        if let Some(task) = self.send_task.take() {
            task.wake();
        }

This seems to fix the issue for me, but I am still trying to figure out what is going on. I wouldn't recommend blindly applying this patch.

jfourie1 · 2022-10-21T21:30:06Z

Some further analysis:

Looking at this loop's first iteration, the following seems to be happening for the request that gets stuck:
The stream is not marked as pending yet, so this returns Ok(()).
The stream is marked pending and added to the pending queue via the call to send_request() here.
This poll() here calls this, storing a waker in self.send_task.
For the next iteration of the loop, the stream is marked as pending, so this will return Poll::Pending and immediately overwrite the above waker.
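
The overwrite described above can be reproduced in isolation with a slot that holds a single `Waker`. This is only a minimal sketch using the standard library; `SingleSlot`, `CountingWaker` and `overwrite_demo` are hypothetical names introduced for illustration and are not h2 code. Only the field name `send_task` is borrowed from the comment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

/// Test waker that counts how many times it has been woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical component that remembers exactly one waiting task.
struct SingleSlot {
    send_task: Option<Waker>,
}

impl SingleSlot {
    /// Stores the waker, silently dropping any waker stored earlier.
    fn register(&mut self, waker: &Waker) {
        self.send_task = Some(waker.clone());
    }

    /// Wakes whichever task is currently stored, if any.
    fn notify(&mut self) {
        if let Some(task) = self.send_task.take() {
            task.wake();
        }
    }
}

/// Registers two waiters, notifies once, and returns how often each was woken
/// as (first, second).
fn overwrite_demo() -> (usize, usize) {
    let first = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let second = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let mut slot = SingleSlot { send_task: None };

    slot.register(&Waker::from(first.clone()));
    slot.register(&Waker::from(second.clone())); // replaces `first`
    slot.notify();

    (first.0.load(Ordering::SeqCst), second.0.load(Ordering::SeqCst))
}
```

The first waiter is never woken: its waker was dropped when the second one was stored, which is the kind of lost wake-up that leaves a task pending forever.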

seanmonstar · 2022-10-21T22:16:27Z

Thank you so much for debugging this. Once you feel comfortable the patch is no longer blind, I'd be happy to merge a PR. (Mega bonus points if it becomes clear how to trigger this condition in a unit test, but not required.)

jfourie1 · 2022-10-25T17:03:03Z

I spent some more time on this today and would definitely not recommend applying the previous "patch". With that patch applied, the request that would have gotten stuck is effectively busy-polled until it completes, which is not ideal.

The real problem/bug is that this function, specifically the call to self.inner.poll_pending_open(), does not allow more than one waiter. In addition, this single waiter is shared with this and this. So the current model of calling poll_ready() for multiple SendRequest clones before calling send_request() will cause this problem. I'll try to add some unit tests to illustrate the problem.

It seems that what is needed is some mechanism to allow multiple concurrent tasks to wait for send stream capacity. One thought I had was to use a semaphore and change poll_ready() to poll this semaphore. The semaphore would be released once the request is done, and its number of permits would be set to max_send_streams. Does this make sense? Any other ideas?
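
To make the "multiple waiters" idea concrete, here is a minimal sketch of a capacity tracker that queues every waiting task instead of keeping only the latest one. It is an illustration of the proposal in this comment, not the change that was eventually merged; `StreamSlots`, `CountingWaker` and `multi_waiter_demo` are hypothetical names, and only `max_send_streams` is taken from the text.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

/// Test waker that counts how many times it has been woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical tracker for stream slots that keeps all waiters in a queue.
struct StreamSlots {
    max_send_streams: usize,
    open: usize,
    waiters: VecDeque<Waker>,
}

impl StreamSlots {
    fn new(max_send_streams: usize) -> Self {
        StreamSlots { max_send_streams, open: 0, waiters: VecDeque::new() }
    }

    /// Takes a slot if one is free; otherwise queues the waker and returns false.
    fn try_open(&mut self, waker: &Waker) -> bool {
        if self.open < self.max_send_streams {
            self.open += 1;
            true
        } else {
            self.waiters.push_back(waker.clone());
            false
        }
    }

    /// Frees a slot and wakes the longest-waiting task so it can retry.
    fn close(&mut self) {
        self.open = self.open.saturating_sub(1);
        if let Some(waiter) = self.waiters.pop_front() {
            waiter.wake();
        }
    }
}

/// One slot, one holder, two waiters: returns the wake counts of both waiters.
fn multi_waiter_demo() -> (usize, usize) {
    let a = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let b = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let holder = Waker::from(Arc::new(CountingWaker(AtomicUsize::new(0))));
    let mut slots = StreamSlots::new(1);

    let _took_slot = slots.try_open(&holder); // takes the only slot
    let _a_queued = !slots.try_open(&Waker::from(a.clone()));
    let _b_queued = !slots.try_open(&Waker::from(b.clone()));

    slots.close(); // holder finishes, `a` is woken
    let _a_retry = slots.try_open(&Waker::from(a.clone())); // `a` takes the slot
    slots.close(); // `a` finishes, `b` is woken

    (a.0.load(Ordering::SeqCst), b.0.load(Ordering::SeqCst))
}
```

Each waiter is woken exactly once when a slot frees up, so no task is left behind and none is busy-polled.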

jfourie1 · 2022-11-02T21:31:29Z

I created a pull request for this issue, please see here. The actual issue turned out to be slightly different than what I originally thought the problem was. Please see the commit log in the PR for further detail. The actual issue is a race condition in ClientTask::poll() and not a problem in the h2 crate.

There exists a race condition in ClientTask::poll() when the request that is sent via h2::client::send_request() is pending open. A task will be spawned to wait for send capacity on the sendstream. Because this same stream is also stored in the pending member of h2::client::SendRequest the next iteration of the poll() loop can call poll_ready() and call wait_send() on the same stream passed into the spawned task. Fix this by always calling poll_ready() after send_request(). If this call to poll_ready() returns Pending save the necessary context in ClientTask and only spawn the task that will eventually resolve to the response after poll_ready() returns Ok. Closes #2419
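
The control flow described in this commit message can be sketched with a toy model. This is not the hyper or h2 code: `SendSide` and `ClientTaskModel` are hypothetical stand-ins, wakers are left out entirely, and "spawning" a response task is reduced to pushing the request id onto a list. The sketch only shows the ordering: send, then check readiness, and park the request instead of spawning while its stream is still pending open.

```rust
use std::collections::VecDeque;

/// Toy stand-in for the HTTP/2 send side: counts open streams and remembers
/// whether one stream is still waiting to be opened.
struct SendSide {
    max_streams: usize,
    open: usize,
    pending_open: bool,
}

impl SendSide {
    /// Opens the stream at once if a slot is free, otherwise marks it pending.
    fn send_request(&mut self) {
        if self.open < self.max_streams {
            self.open += 1;
        } else {
            self.pending_open = true;
        }
    }

    /// Returns true once no stream is left waiting to be opened.
    fn poll_ready(&mut self) -> bool {
        if self.pending_open && self.open < self.max_streams {
            self.open += 1;
            self.pending_open = false;
        }
        !self.pending_open
    }

    /// Marks one open stream as finished, freeing its slot.
    fn finish_stream(&mut self) {
        self.open -= 1;
    }
}

struct ClientTaskModel {
    send: SendSide,
    queue: VecDeque<u32>, // requests not yet sent
    parked: Option<u32>,  // request sent, stream still pending open
    spawned: Vec<u32>,    // requests whose response task was spawned
}

impl ClientTaskModel {
    fn poll(&mut self) {
        loop {
            // Resume a parked request first; stay parked while it is pending.
            if let Some(id) = self.parked {
                if !self.send.poll_ready() {
                    return;
                }
                self.parked = None;
                self.spawned.push(id);
            }
            let id = match self.queue.pop_front() {
                Some(id) => id,
                None => return,
            };
            self.send.send_request();
            // Always check readiness right after sending.
            if self.send.poll_ready() {
                self.spawned.push(id);
            } else {
                self.parked = Some(id); // save context, spawn later
                return;
            }
        }
    }
}

/// Four requests, two stream slots: returns the spawned ids after the first
/// poll and after one stream has finished and the task is polled again.
fn run_model() -> (Vec<u32>, Vec<u32>) {
    let mut task = ClientTaskModel {
        send: SendSide { max_streams: 2, open: 0, pending_open: false },
        queue: (0..4).collect(),
        parked: None,
        spawned: Vec::new(),
    };
    task.poll();
    let before = task.spawned.clone();
    task.send.finish_stream();
    task.poll();
    (before, task.spawned)
}
```

In this model the third request is sent but parked until a slot frees up, so at no point do two parties wait on the same pending stream.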

ibotty · 2023-01-03T21:57:10Z

I don't quite understand. AFAICT #3041 is merged. Doesn't this fix the issue?

jeromegn · 2023-01-04T14:25:17Z

I believe it can be closed.

kundu-subhajit · 2023-10-03T13:25:01Z

Hi @jfourie1, @seanmonstar, we want to reopen this issue. We tried the same code with a semaphore value of 400, as in the repository, and we are able to hit the issue even with https://github.com/hyperium/hyper/releases/tag/v0.14.27. Note that we have the server and client running on different machines.

We acknowledge that this same issue isn't reproducible with a semaphore count below 250. However, in our local setup the server's maximum number of HTTP/2 streams is 100, and even with a semaphore count of 90, network communication through hyper gets stuck.

Possibly these issues are connected and share the same root cause. We need your help to understand this further.

kundu-subhajit · 2023-10-05T09:28:46Z

@jeromegn, @seanmonstar, @jfourie1: Please advise us on how to reopen the issue.

seanmonstar · 2023-10-05T10:42:04Z

I'd recommend opening a new issue, with whatever details you can provide. Let's leave this one alone.

cynecx mentioned this issue Aug 1, 2021

Potential issue with max concurrent streams hyperium/h2#550

Closed

silence-coding mentioned this issue Nov 1, 2021

manany request is pending hyperium/h2#573

Open

Gelbpunkt mentioned this issue Dec 24, 2021

Update dependencies, allow disabling HTTP2, don't depend on twilight-http twilight-rs/http-proxy#44

Merged

jeromegn added a commit to jeromegn/h2 that referenced this issue Oct 23, 2022

apply patch from hyperium/hyper#2419 (comment)

ece7d55

seanmonstar closed this as completed Jan 4, 2023

ylow mentioned this issue Apr 6, 2023

Http2: Hyper client stuck requests #3197

Closed

akoshelev mentioned this issue May 16, 2023

IPA integration tests hang sometimes private-attribution/ipa#650

Closed

akoshelev mentioned this issue Jun 8, 2023

Malicious IPA hangs in real world setup with 3 helpers private-attribution/ipa#685

Closed

akoshelev added a commit to akoshelev/raw-ipa that referenced this issue Jun 9, 2023

Upgrade hyper dependency

9851441

0.14.20 has a bug hyperium/hyper#2419 that affects IPA

kundu-subhajit mentioned this issue Oct 10, 2023

HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

Closed
