[BUG] Unexpected failure while sending request, -84 is not a valid id #4494
Comments
Previous issue linked to #3771, which was supposed to have been fixed, but there's at least one other report of this error in 2.14 here: #3771 (comment). CC: @cwperks |
Possibly related to opensearch-project/performance-analyzer#606. Still-open PR to fix: opensearch-project/performance-analyzer#609 |
I'm seeing the same issue upgrading from 2.12 to 2.14 |
@Jakob3xD It would be helpful for debugging to understand the setup that produced this error. What plugins or features are enabled? When this error was seen on upgrade from < 2.11 to 2.11, it was because PerformanceAnalyzer had wrapped an instance of a TransportChannel and was not delegating. Backwards compatibility for serialization requires the receiving or transmitting node of a transport request to know the correct version of the target node, the target node being the node that the transport request is being sent to (from the transmitter end) or the node that sent the transport request (on the receiver end). It would be helpful to know the concrete className for security/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java Line 155 in 9caf5cb
It would also be helpful to know the concrete className for security/src/main/java/org/opensearch/security/ssl/transport/SecuritySSLRequestHandler.java Line 96 in 9caf5cb
|
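One way to capture the concrete class names requested above would be a temporary log statement at the two referenced spots. The sketch below is only illustrative; the class, method, and logger names are assumptions, not the plugin's actual code.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical debugging helper: logs the runtime class of whatever handler or
// channel the security plugin is about to wrap, so it shows up in the node logs.
public final class ConcreteClassLogger {
    private static final Logger LOG = LogManager.getLogger(ConcreteClassLogger.class);

    private ConcreteClassLogger() {}

    // Call with the TransportResponseHandler in SecurityInterceptor (line 155)
    // or the TransportChannel in SecuritySSLRequestHandler (line 96).
    public static void logConcreteClass(String location, Object wrapped) {
        LOG.info("{} concrete class: {}", location, wrapped.getClass().getName());
    }
}
```

The logged name would show whether another plugin (for example PerformanceAnalyzer) has wrapped the channel or handler with its own class.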
We're also seeing this upgrading from
PA is disabled at startup:
error does reference
|
The plugins are listed in the issue. PA is completely removed during the image build process via
To give more details on how I encountered the bug: I upgraded all my remote clusters from 2.13 to 2.14 without any issues, and afterwards I wanted to upgrade the main clusters, where all the remote targets are configured. The first two node reboots worked without any issue, but after the third node things started to fail with the already named exception.
Not sure how I would get those? I am not familiar with Java. |
Based on the stack trace it appears the issue is this: 2.12 node ------- transport request ------> 2.14 node. When the 2.12 node is serializing, it's opting to use JDK serialization, but the 2.14 node is trying to deserialize as custom. The 2.12 node should be opting to use custom serialization, since the target node is >= 2.11: https://github.com/opensearch-project/security/blob/2.12/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java#L153 |
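For reference, the 2.12-side decision described in the comment above looks roughly like the sketch below. This is a hedged illustration using OpenSearch's `Version` utility, not the actual code at the linked line; class and method names are made up.

```java
import org.opensearch.Version;

// Hedged sketch of the transmitter-side check on the 2.12 branch: custom
// serialization is expected whenever the target node is on 2.11 or later,
// which includes a 2.14 target.
public final class SerializationChoiceSketch {

    private SerializationChoiceSketch() {}

    public static boolean useCustomSerialization(Version targetNodeVersion) {
        return targetNodeVersion.onOrAfter(Version.V_2_11_0);
    }
}
```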
No, we don't have cross-cluster search configured on the cluster that had the issue. I was also able to recover by doing a full upgrade instead of a rolling one (shut all nodes down and start up with |
Nope, only happened on the main cluster. |
I could not reproduce the error with the security plugin disabled, FWIW |
[Triage] Hi @Jakob3xD, thanks for filing this issue. This looks like a good bug to get fixed. Going to mark as triaged. |
Also check - #4521 (comment) |
Just to add, we didn't have PA enabled but still saw the issue happening while upgrading from < 2.10 to 2.11.1, and now the same while upgrading to 2.15. For ref: #4085 |
Reposting with a better summary: in 2.14, custom serialization was disabled and the plugin moved back to JDK serialization with #4264. Hence, I think a cluster upgrading from 2.11/2.12/2.13 to any 2.14+ version can run into this issue due to the different serialization methods. Similarly, a cluster upgrading from < 2.11 to 2.11-2.13 can run into the same issue. In 2.11/2.12/2.13 the serialization method is decided using a version check; in 2.14+ it is decided using the version check in SerializationFormat. To fix the issue, one possible option is to pick the serialization method using the source node version of the transport request. The fix might be similar to #3826 |
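A minimal sketch of the option proposed above, deciding the format from the source node version of the transport request; the enum, method name, and exact version window are assumptions for illustration, not the plugin's actual API.

```java
import org.opensearch.Version;

// Illustrative sketch: both ends derive the wire format from the version of
// the node that originated the transport request, so they agree regardless of
// which release enabled or disabled custom serialization by default.
public final class SourceVersionBasedFormat {

    public enum WireFormat { JDK, CUSTOM }

    private SourceVersionBasedFormat() {}

    public static WireFormat pickFormat(Version sourceNodeVersion) {
        // Assumption for this sketch: custom serialization was only the
        // default for sources on 2.11 through 2.13.
        if (sourceNodeVersion.onOrAfter(Version.V_2_11_0)
                && sourceNodeVersion.before(Version.V_2_14_0)) {
            return WireFormat.CUSTOM;
        }
        return WireFormat.JDK;
    }
}
```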
Hey everyone, is anyone working on fixing this? It would be great if someone could help pick this up, as it's hampering upgrades to newer versions after 2.10. |
@Dhruvan1217 One thing to consider is jumping to 2.14 and above, which has custom serialization disabled by default. I have not been able to reproduce this issue. Can you share a little more information about the requests that receive this error? One thing I have been looking into is whether it's possible that there are multiple transport hops happening while serving a request, i.e. Client ----> 2.14 node (coordinator node) -----> 2.14 node -----> 2.12 node. The header that is causing problems, which @Jakob3xD pointed out, is first populated on the 2.14 node that receives the transport request from another 2.14 node and uses JDK serialization. If there's another transport hop from the 2.14 node -> 2.12 node, then it re-uses the header, which is JDK serialized. When the 2.12 node receives the request it assumes it's custom serialized, because the logic on the 2.12 branch looks to see if the transmitter is >= 2.11. This is all hypothetical though, I haven't been able to replicate it. An example of an API that uses multiple transport hops is indexing a document. When making a request to insert a document, the coordinator node forwards the request to the nodes with the primaries, which then forward the request to the nodes containing the replicas. I looked into whether indexing a document would encounter this error, but OpenSearch ensures that replicas are always on the same node version or greater than the primaries. |
Update: I was able to reproduce it with the scenario from above. It happens when replica shards are on an older node than the primary. |
Not sure if I can reproduce it again, but these were the steps I followed:
1. Start with 3 2.12 nodes
2. Reboot the node with a primary shard as 2.14
3. Check shard allocation
4. Send a request to a node that does not have a shard for the |
@cwperks This situation makes sense to me.
I think during the upgrade there is a chance of the primary being on an updated node and the replica on an old node. |
Describe the bug
With the rolling upgrade from 2.13.0 to 2.14, one of my OpenSearch clusters started to fail, throwing exceptions.
After restarting the third of six hardware nodes and upgrading it from 2.13 to 2.14, some indices went from yellow to red, caused by shard failures with the following exception:
Related component
Other
To Reproduce
Not known.
The only big difference from my other clusters is that the failing cluster is doing cross-cluster search and therefore has remote targets configured.
Expected behavior
The upgrade from 2.13 to 2.14 should happen without such exceptions.
Additional Details
Plugins
Host/Environment (please complete the following information):
Additional context
A similar exception already occurred in the past when upgrading from 2.10 to 2.11.
opensearch-project/OpenSearch#11491
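One observation about the error message itself, inferred from the message rather than stated anywhere in this thread: the JDK serialization stream magic is 0xACED, and its first byte 0xAC is -84 when read as a signed Java byte, which is consistent with a deserializer that expects a custom format interpreting the start of a JDK-serialized payload as a type id.

```java
// Small demonstration of where the "-84" plausibly comes from: the first byte
// of java.io.ObjectStreamConstants.STREAM_MAGIC (0xACED) is 0xAC, i.e. -84 as
// a signed byte.
public class MagicByteDemo {
    public static void main(String[] args) {
        short streamMagic = java.io.ObjectStreamConstants.STREAM_MAGIC; // 0xACED
        byte firstByte = (byte) (streamMagic >> 8);
        System.out.println(firstByte); // prints -84
    }
}
```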