Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON-RPC messages arrive out of order #2992

Closed
2 tasks done
josepot opened this issue Jan 19, 2024 · 18 comments · Fixed by #3230
Closed
2 tasks done

JSON-RPC messages arrive out of order #2992

josepot opened this issue Jan 19, 2024 · 18 comments · Fixed by #3230
Assignees
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@josepot
Copy link

josepot commented Jan 19, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Very often, when using the chainHead group of functions of new JSON-RPC API against a Polkadot-SDK node, there are "subscription" messages that come out of order. E.g:

image

What happens is that the message that announces the "opeartionId" that the client should be paying attention to, comes after the notifications for that "operationId". Since the client does not recognize those messages, then the messages get ignored.

I first experienced this against different public endpoints, and then I tried it against a local development node and I experienced the same issue. In order to further confirm that this was a problem with the implementation on the node, I tried running the same requests against smoldot while using a chainspec that had my local development node as the only bootnode, trying the same requests through smoldot (with the chainspec of a local development node) works flawlessly.

Steps to reproduce

I have setup the following nodejs repo to reproduce the issue. I recommend cloning the repo and trying it out in your local environment. However, you can also try it out using stackblitz.

In order to reproduce/debug what's happening:

  1. clone the repo
  2. run pnpm i
  3. run pnpm start

Every time that the script detects a message out of order it will print these kinds of messages on the terminal:
image

The script also generates a file named wire-logs.txt, where you will find all the send/received messages that were seen over the WebSocket connection in the order that they happened:
image

If you want to try against a different endpoint, then just edit the following line.

I hope that this repo helps to investigate the issue, because it's taken me a while to set it up. 🤞

@josepot josepot added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Jan 19, 2024
@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/papi-is-coming-showcasing-the-validator-selector-tool-dapp/5756/1

@niklasad1
Copy link
Member

niklasad1 commented Jan 19, 2024

All right, these seems to be caused by chainHead_unstable_storage and from the logs sometimes operationStorageEvent are emitted before chainHead_unstable_storage call is finished.

To give an example what is happening the following is extracted from the logs:

# tx storage call
>> {"jsonrpc":"2.0","id":675,"method":"chainHead_unstable_storage","params":["yImvwio6XDT9EiEF","0x885b921e752c2eb3627e14188c73b9285474da5f6c933024c7371320ee099772",[{"key":"0x2aeddc77fe58c98d50bd37f1b90840f9cd7f37317cd20b61e9bd46fab8704714b4a5da87e533f5dff58927b296a23cbf25794b7cbb8fe31c5e68f2bcd2c2483e9c9cd0712216f232","type":"value"}],null]}

# rx subscription event 1
<< {"jsonrpc":"2.0","method":"chainHead_unstable_followEvent","params":{"subscription":"yImvwio6XDT9EiEF","result":{"event":"operationStorageItems","operationId":"373","items":[{"key":"0x2aeddc77fe58c98d50bd37f1b90840f9cd7f37317cd20b61e9bd46fab8704714b4a5da87e533f5dff58927b296a23cbf25794b7cbb8fe31c5e68f2bcd2c2483e9c9cd0712216f232","value":"0x040100000002007db52a2f0000000000000000000000000f3125202d2047545354414b494e47001768747470733a2f2f67747374616b696e672e636f6d2f14406761757468387a3a6d61747269782e6f7267166761757468387a4067747374616b696e672e636f6d00000a40475374616b696e67"}]}}}

# rx subscription event 2
<< {"jsonrpc":"2.0","method":"chainHead_unstable_followEvent","params":{"subscription":"yImvwio6XDT9EiEF","result":{"event":"operationStorageDone","operationId":"373"}}}

# rx unstable call
<< {"jsonrpc":"2.0","result":{"result":"started","operationId":"373","discardedItems":0},"id":675}

Thus, if I understand this correctly we should really wait to emit events on the subscription with operationID 373 until the unstable call has been answered?!

Relevant code: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/rpc-spec-v2/src/chain_head/chain_head.rs#L361-#L370

@lexnv
Copy link
Contributor

lexnv commented Jan 19, 2024

Thanks Josep for raising this and providing the repro case 🙏

As Niklas pointed out, this is the culprit of the behavior:

let fut = async move {
storage_client.generate_events(block_guard, hash, items, child_trie).await;
};
self.executor
.spawn_blocking("substrate-rpc-subscription", Some("rpc"), fut.boxed());
Ok(MethodResponse::Started(MethodResponseStarted {
operation_id,
discarded_items: Some(discarded),
}))
}

The race is happening because:

  • we are spawning a future that provides the responses of the chainHead_storage method with the associated operation ID on the chainHead_follow
  • then, we return the chainHead_storage response, which contains the operation ID (Ok(operationId))

It is possible that before the user receives the operationID of the storage query; the chainHead_follow already produced a response for that operationID.

To mitigate this:

  1. We could implement an ACK send from the client side. The ACK is a response that tells the server that the operationID has been received. Without this mechanism, we could end-up with a race where the jsonrpsee thinks the message was propagated to the user; when the message (operationID) is only propagated to the kernel buffers of the server.

  2. Specify in the spec, that because of the async nature of the client - server communication; the chainHead_follow could produce a response with an operationID that has not been provided to the user yet. In those cases, users are encouraged to catch the response until they can correlate it with a query.

I would opt for option 2, since we already have a precedent in the rpc-v2 spec for the stopOperation.

Would love to get your thoughts on this!

@jsdw
Copy link
Contributor

jsdw commented Jan 19, 2024

My gut feeling is that 2 is the right approach; we probably can't guarantee that nothing will return on the subscription until after the user has received a response on the separate call. Even if we make sure that messages are sent in the right order, they may not be received in the right order.

So the right approach to working with this API is probably to start buffering messages from the subscription before making the call, and then once you get a response from the call with the operation ID, you can look for it in all of the subscription messages inc those that you buffered. (I checked what we do in Subxt here and that's essentially it).

@josepot
Copy link
Author

josepot commented Jan 19, 2024

Specify in the spec, that because of the async nature of the client - server communication; the chainHead_follow could produce a response with an operationID that has not been provided to the user yet.

This is IMO absolutely unacceptable.

Even if we make sure that messages are sent in the right order, they may not be received in the right order.

WebSocket connections work over TCP, so this statement is false. Messages sent in the right order are guaranteed to be received in the right order over a WebSocket connection.

The whole point of having a JSON-RPC API is to be able to guarantee things that can't be guaranteed over lower level protocols like UDP. One of those things is the order in which the messages arrive. It is unreasonable to force the client to keep around (for how long?) some notifications that the client doesn't know to which subscription they belong yet.

@bkchr
Copy link
Member

bkchr commented Jan 19, 2024

Yeah, I'm here on line with @josepot.

From your problem description it reads like the problem is in the server implementation that doesn't wait for the chain_head_unstable_storage to return before pushing data on the stream from somewhere else.

@josepot
Copy link
Author

josepot commented Jan 19, 2024

So the right approach to working with this API is probably to start buffering messages from the subscription before making the call, and then once you get a response from the call with the operation ID, you can look for it in all of the subscription messages inc those that you buffered. (I checked what we do in Subxt here and that's essentially it).

The JSON-RPC spec defines how notifications work. It is IMO completely wrong to keep around all the non-identified notifications, because there is no guarantee that a "correlation id" will be make public at a later stage.

@jsdw
Copy link
Contributor

jsdw commented Jan 19, 2024

I suppose that since all of this traffic goes down a single websocket connection now, my concern about a race in what message the client receives first doesn't make so much sense.

My concern in any case was that if you have (pseudo) code like:

// Perform some operation and get back ID
let operationId = await chainHeadCall(..);

// Then, get a handle to your follow subscription events and filter the ones you want
let operationEvents = chainHeadFollowSub.events().filterById(operationId);

Then it's possible for an event to arrive between getting the operation ID and starting to watch the follow subscription (because when you receive the operationId, it may already be in flight).

I suspect though that in JS this isn't an issue since that task wouldn't yield control to anything else from the moment the operation ID comes back until after it's monitoring the subscription though. With my head in Rust-land, where it's possible for events with that operation to be received on a separate thread between those two lines, I still need to care about potential races here I think.

In any case, let's see what we can do to enforce that the messages are sent down the WS connection in the right order!

@jsdw
Copy link
Contributor

jsdw commented Jan 19, 2024

Also FWIW the spec doesn't say anything about which order one can expect things in, only saying about the response containing an operationId:

This return value indicates that the request has successfully started.

operationId is a string containing an opaque value representing the operation.

If we want to guarantee that this retuns before any corresponding events on the follow subscription, let's make it clear in the spec too.

@josepot
Copy link
Author

josepot commented Jan 19, 2024

My concern in any case was that if you have (pseudo) code like:

// Perform some operation and get back ID
let operationId = await chainHeadCall(..);

// Then, get a handle to your follow subscription events and filter the ones you want
let operationEvents = chainHeadFollowSub.events().filterById(operationId);

Then it's possible for an event to arrive between getting the operation ID and starting to watch the follow subscription (because when you receive the operationId, it may already be in flight).

This makes absolutely no sense... Not only because that pseudocode makes no sense (what even is the type of the variable operationEvents?). Anyways, the argument is absurd because of the idea that you may miss an event because it happened "in between".

In JS every piece of code that runs, runs as a consequence of a callback triggered by an I/O event-queue. If an event arrives "in between", then the event still gets added to the I/O event-queue. So, as long as there is a listener interested, then the event that came "in between" will trigger the callback of the listener. If your argument is that it's technically possible to attach the the listener "too late", well then: sure, it's possible to write buggy code. However, in JS you will have to try very hard to shoot yourself in the foot in order to attach the listener too late. I mean, sure, it's well stablished that developers can mess up and write buggy code, but the whole idea that in JS you may miss events because you add their listener while the event is coming is absolutely nonsensical, especially in this case where the listener is a "child" listener of an already attached event-listener.

Also FWIW the spec doesn't say anything about which order one can expect things in, only saying about the response containing an operationId:

This return value indicates that the request has successfully started.
operationId is a string containing an opaque value representing the operation.

If we want to guarantee that this retuns before any corresponding events on the follow subscription, let's make it clear in the spec too.

I mean, the spec does say things like:

When an operation needs to be performed, it is put in a FIFO queue

and

In order to provide a good quality of service, the time between the moment a request is received by the server and the moment the response is sent back must be short.

So, sure... Perhaps it could be even more explicit, but IMO it's already mentioning that the order must be preserved.

On top of that, the documentation of the event clearly states that:

operationId is a string returned by chainHead_unstable_storage.

Ergo, the operationId must have been returned by the time that the event happens.

@lexnv
Copy link
Contributor

lexnv commented Jan 22, 2024

Is there anything stopping the user from having separate WebSocket connections?

  • one socket connection to chainHead_follow
  • a separate socket connection to chainHead_call

In this case, we can only guarantee the chainHead_call response arrived in the kernel buffer of the server; and not to the application layer of the user. That is, without introducing an ACK method chainHead_userHasReceivedThisOperationID.

@josepot @bkchr would love to hear your thoughts on this 🙏

@josepot
Copy link
Author

josepot commented Jan 22, 2024

Is there anything stopping the user from having separate WebSocket connections?

Different WebSocket connections implies different connections, which implies different subscriptions, which means: how is this even relevant to this discussion?

  • one socket connection to chainHead_follow
  • a separate socket connection to chainHead_call

This makes absolutely no sense. You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection. They are different subscriptions. Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

@niklasad1
Copy link
Member

niklasad1 commented Jan 22, 2024

This makes absolutely no sense. You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection. They are different subscriptions. Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

It's technically possible to create a subscription on different connection than the chainHead_call but that's really an edge-case and nothing I personally care about. Totally possible to fetch the subscription ID from a separate connection

Let's get back to main point, for a single connection we should enforce the order of the messages period.
No need to discuss it further here, we are on the same page.

We would need a more low-level API in jsonrpsee because the response is sent about when the RPC method returns and we emit these "events at same time" and race-conditions can occur.

@lexnv
Copy link
Contributor

lexnv commented Jan 22, 2024

The chainHead_follow gives you a subscription ID at the application layer from one socket.
The subscription ID is the identifier of the subscription, not the originated socket connection.

Users then can make a chainHead_call from a different socket connection using the above subscription ID identifier.

This makes absolutely no sense... Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

There is nothing in the server implementation or the spec; that would stop or restrict this behavior.

You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection.

To make the call to the chainHead_call (or any other method); you'd need two things:

  • the follow_subscription subscription ID that identifies the chainHead_follow subscription that receives events
  • the pinned block hash reported by chainHead_follow

The server doesn't look at the socket connection or the IP of the user etc:

let mut block_guard = match self.subscriptions.lock_block(&follow_subscription, hash, 1) {

Any change to the spec should take into account this information. Or restrict / clarify user connections.

And for the time being, we should be looking at fixing the "single connection" use-case since we have more control there 👍

@josepot
Copy link
Author

josepot commented Jan 22, 2024

It's technically possible to create a subscription on different connection than the chainHead_call but that's really an edge-case and nothing I personally care about.

Is it possible to create different subscriptions on different connections? Sure!

Can a user use the subscriptionId returned in the context of one connection within the context of another connection?

Obviously, the answer is: no, it can not, because WebSocket connections -unlike REST APIs- are stateful. So, this is a moot point.

@jsdw
Copy link
Contributor

jsdw commented Feb 5, 2024

Just to echo Niklas's reply on paritytech/json-rpc-interface-spec#125:

The following PR in jsonrpsee should be merged soon, which will give jsonrpsee server RPC method impls the ability to wait until some message is queued for delivery from one RPC method before performing some action:

paritytech/jsonrpsee#1281

When that is merged, @niklasad1 will cherry pick that into a small jsonrpsee release and open a PR here to address this issue! We are aiming to have this PR ready to merge this week.

@lexnv
Copy link
Contributor

lexnv commented Feb 5, 2024

Kudos to @niklasad1 for adding a nice API to handle the single connection race in: paritytech/jsonrpsee#1281.

From the substrate perspective, I have created this issue to allow chainHead operations to be made only from a single connection:

At the time being, we can use the subscriptionId of one connection in another connection to make calls (see chainhead-cli tool for more info)

github-merge-queue bot pushed a commit that referenced this issue Feb 14, 2024
Close #2992 

Breaking changes:
- rpc server grafana metric `substrate_rpc_requests_started` is removed
(not possible to implement anymore)
- rpc server grafana metric `substrate_rpc_requests_finished` is removed
(not possible to implement anymore)
- rpc server ws ping/pong not ACK:ed within 30 seconds more than three
times then the connection will be closed

Added
- rpc server grafana metric `substrate_rpc_sessions_time` is added to
get the duration for each websocket session
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
…tech#3230)

Close paritytech#2992 

Breaking changes:
- rpc server grafana metric `substrate_rpc_requests_started` is removed
(not possible to implement anymore)
- rpc server grafana metric `substrate_rpc_requests_finished` is removed
(not possible to implement anymore)
- rpc server ws ping/pong not ACK:ed within 30 seconds more than three
times then the connection will be closed

Added
- rpc server grafana metric `substrate_rpc_sessions_time` is added to
get the duration for each websocket session
@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-api-2024-q1-update/7462/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants