json: cannot unmarshal string into Go struct field ShardFailure._nodes.failures.reason of type map[string]interface {} #1494

Closed
t-botz opened this issue May 24, 2021 · 14 comments

@t-botz

t-botz commented May 24, 2021

Please use the following questions as a guideline to help me answer
your issue/question without further inquiry. Thank you.

Which version of Elastic are you using?

[x] elastic.v7 (for Elasticsearch 7.x)
[ ] elastic.v6 (for Elasticsearch 6.x)
[ ] elastic.v5 (for Elasticsearch 5.x)
[ ] elastic.v3 (for Elasticsearch 2.x)
[ ] elastic.v2 (for Elasticsearch 1.x)

Please describe the expected behavior

When calling client.ClusterStats().Do(ctx) on a cluster with a failing node, I expect to get a valid ClusterStatsResponse.

The response from the cluster for /_cluster/stats is :

{
    "_nodes": {
        "total": 2,
        "successful": 1,
        "failed": 1,
        "failures": [
            {
                "type": "failed_node_exception",
                "reason": "Failed node [mhUZF1sPTcu2b-pIJfqQRg]",
                "node_id": "mhUZF1sPTcu2b-pIJfqQRg",
                "caused_by": {
                    "type": "node_not_connected_exception",
                    "reason": "[es02][172.27.0.2:9300] Node not connected"
                }
            }
        ]
    },
    "cluster_name": "es-docker-cluster",
    "cluster_uuid": "r-OkEGlJTFOE8wP36G-VSg",
    "timestamp": 1621834319499,
    "indices": {
        "count": 0,
        "shards": {},
        "docs": {
            "count": 0,
            "deleted": 0
        },
        "store": {
            "size_in_bytes": 0,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 0,
            "memory_in_bytes": 0,
            "terms_memory_in_bytes": 0,
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 0,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory_in_bytes": 0,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0,
            "max_unsafe_auto_id_timestamp": -9223372036854775808,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        }
    },
    "nodes": {
        "count": {
            "total": 1,
            "coordinating_only": 0,
            "data": 1,
            "data_cold": 1,
            "data_content": 1,
            "data_hot": 1,
            "data_warm": 1,
            "ingest": 1,
            "master": 1,
            "ml": 1,
            "remote_cluster_client": 1,
            "transform": 1,
            "voting_only": 0
        },
        "versions": [
            "7.10.0"
        ],
        "os": {
            "available_processors": 6,
            "allocated_processors": 6,
            "names": [
                {
                    "name": "Linux",
                    "count": 1
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "CentOS Linux 8 (Core)",
                    "count": 1
                }
            ],
            "mem": {
                "total_in_bytes": 2084679680,
                "free_in_bytes": 590282752,
                "used_in_bytes": 1494396928,
                "free_percent": 28,
                "used_percent": 72
            }
        },
        "process": {
            "cpu": {
                "percent": 0
            },
            "open_file_descriptors": {
                "min": 260,
                "max": 260,
                "avg": 260
            }
        },
        "jvm": {
            "max_uptime_in_millis": 1042623,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 1
                }
            ],
            "mem": {
                "heap_used_in_bytes": 208299344,
                "heap_max_in_bytes": 314572800
            },
            "threads": 32
        },
        "fs": {
            "total_in_bytes": 62725623808,
            "free_in_bytes": 26955173888,
            "available_in_bytes": 23738458112
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 1
            },
            "http_types": {
                "security4": 1
            }
        },
        "discovery_types": {
            "zen": 1
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 1
            }
        ],
        "ingest": {
            "number_of_pipelines": 0,
            "processor_stats": {}
        }
    }
}

Please describe the actual behavior

Unmarshalling the response into ClusterStatsResponse fails with the error:
json: cannot unmarshal string into Go struct field ShardFailure._nodes.failures.reason of type map[string]interface {}

Any steps to reproduce the behavior?

Launch a 2-node Elasticsearch cluster, kill -9 the master node, then ask the other node how it feels :)

@t-botz
Author

t-botz commented May 24, 2021

FWIW I had a similar issue in production. Because of this error I can't tell what actually happened there or what kind of failure it was, but the stack trace was similar to the one I reproduced.

@olivere olivere added this to the 7.0.25 milestone Jun 16, 2021
@olivere
Owner

olivere commented Jun 16, 2021

Sad node on the other end ;-)

Should be fixed now.

@zhaozong

zhaozong commented Jun 29, 2021

After upgrading to 7.0.25, I get this error:

“cannot unmarshal object into Go struct field FailedNodeException._shards.failures.reason of type string”

My Elasticsearch version is 7.10.1.
@olivere

@olivere
Owner

olivere commented Jul 4, 2021

@zhaozong Oh my, seems like Elasticsearch has a breaking change in the response structure then, in a minor version update. I will look into this, but it looks like we can't change it nicely and make everyone happy.

@gboddin

gboddin commented Jul 5, 2021

We had to revert to .24 for now, using ES 7.12.1

@olivere
Owner

olivere commented Jul 6, 2021

That's unfortunate. I will look into an alternative for the next release. The problem is that it would probably be a breaking change for everyone :-(

@olivere olivere reopened this Jul 6, 2021
@olivere olivere modified the milestones: 7.0.25, 7.0.26 Jul 6, 2021
@gboddin

gboddin commented Jul 6, 2021

Ah damn, what's the breaking change? By the way, it happened on a single (healthy) node.

Let me know if you need me to run some tests !

( https://github.com/LeakIX/yql-elastic , matching a single term (like test:citizen) seems to trigger the issue)

olivere added a commit that referenced this issue Jul 7, 2021
This commit adds a few more test cases for happy/unhappy responses of
Cluster Stats API across different ES 7.x versions.

We also added a Docker Compose file to start a cluster in a specific
version.

See #1494
@olivere
Owner

olivere commented Jul 7, 2021

The issue seems to be that different kinds of errors are returned in a similar structure. Some operations return exceptions on the shard level (ShardOperationFailedException in Java), some return exceptions on the operation level (a general ElasticsearchException in the Java source). I might have mixed them up because they're very similar, e.g. both return a failures property but with a different structure. It seems this has only now been uncovered due to a subtle difference in the reason property (for ShardOperationFailedException it's a JSON object, for ElasticsearchException it's a string).

Worse, I can only observe this in the failure case, because only then does the server properly populate the failures structure. I will see if and how this can be fixed.

Any fully runnable and reproducible test case will help.
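
One way to tolerate both shapes of reason is a type with a custom UnmarshalJSON that inspects the first byte of the value. This is only a sketch under assumptions: `flexibleReason`, `failure`, and `parseFailure` are hypothetical names made up for illustration, and this is not necessarily how the library resolved the problem.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// flexibleReason is a hypothetical helper type (not part of the library).
// It accepts "reason" either as a JSON string or as a JSON object.
type flexibleReason struct {
	Text   string                 // set when "reason" is a JSON string
	Object map[string]interface{} // set when "reason" is a JSON object
}

// UnmarshalJSON dispatches on the first byte of the raw JSON value.
func (r *flexibleReason) UnmarshalJSON(data []byte) error {
	data = bytes.TrimSpace(data)
	if len(data) == 0 || bytes.Equal(data, []byte("null")) {
		return nil // treat a missing or null reason as empty
	}
	switch data[0] {
	case '"':
		return json.Unmarshal(data, &r.Text)
	case '{':
		return json.Unmarshal(data, &r.Object)
	default:
		return fmt.Errorf("reason: unexpected JSON value %s", data)
	}
}

// failure is a hypothetical, reduced failures entry.
type failure struct {
	Reason flexibleReason `json:"reason"`
}

// parseFailure decodes one failures entry of either shape.
func parseFailure(data []byte) (failure, error) {
	var f failure
	err := json.Unmarshal(data, &f)
	return f, err
}

func main() {
	asString, _ := parseFailure([]byte(`{"reason":"Failed node [abc]"}`))
	asObject, _ := parseFailure([]byte(`{"reason":{"type":"illegal_argument_exception"}}`))
	fmt.Println(asString.Reason.Text)
	fmt.Println(asObject.Reason.Object["type"])
}
```

The trade-off is that callers then have to check two fields instead of one, which is itself a change in the public shape of the type.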

@gboddin

gboddin commented Jul 7, 2021

OK, so in my case:

There was an error before. I'm using a highlighter, and here's the query exchange:

POST /l9leakip%2Cl9leakdomain/_search HTTP/1.1
Host: 192.168.10.2:9200
User-Agent: elastic/7.0.24 (linux-amd64)
Transfer-Encoding: chunked
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"from":0,"highlight":{"fields":{"events.summary":{}},"post_tags":[""],"pre_tags":[""]},"query":{"bool":{"must":{"bool":{"should":{"bool":{"should":[{"nested":{"path":"events","query":{"match":{"events.hostname":{"query":"mysearch"}}}}},{"nested":{"path":"events","query":{"match":{"events.summary":{"query":"mysearch"}}}}},{"match":{"plugins":{"query":"mysearch"}}},{"nested":{"path":"events","query":{"match":{"events.ip.keyword":{"query":"mysearch"}}}}}]}}}}}},"size":20,"sort":[{"creation_date":{"order":"desc"}},{"_score":{"order":"desc"}}],"track_total_hits":true}

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
transfer-encoding: chunked

{
   "took":1146,
   "timed_out":false,
   "_shards":{
      "total":8,
      "successful":6,
      "skipped":0,
      "failed":2,
      "failures":[
         {
            "shard":1,
            "index":"l9leakip-0000001",
            "node":"AsQq1Dh2QxCSTRSLTg0vFw",
            "reason":{
               "type":"illegal_argument_exception",
               "reason":"The length [1119437] of field [events.summary] in doc[2524900]/index[l9leakip-0000001] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
            }
         },
         {
            "shard":3,
            "index":"l9leakip-0000001",
            "node":"AsQq1Dh2QxCSTRSLTg0vFw",
            "reason":{
               "type":"illegal_argument_exception",
               "reason":"The length [1023566] of field [events.summary] in doc[2168434]/index[l9leakip-0000001] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
            }
         }
      ]
   },
   "hits":{}
}

Can we return the results AND the error?

It's a breaking change, but it could make sense. You wouldn't have to check results.Shards.Failures unless you get a shard error from your query.

@olivere
Owner

olivere commented Jul 7, 2021

@gboddin OK, that's helpful. I was looking at how Cluster Stats API changed, but my change affected other locations in the code as well. Thanks for the example.

The _shards property is already returned in the search response, so there's no breaking change. The problem is that there is a failures structure that returns reason as string and another (your example) where it is returned as an object (which itself has a reason of type string).

I wasn't aware of that, hence I broke it in 7.0.25.

I have a version that will revert the change in 7.0.25 and still make Cluster Stats API work fine. But I fear that there might be more locations in the code where I'm using the wrong failures struct.
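
The distinction between the two failures structures can be sketched with one Go type per shape. All type and function names below are hypothetical and reduced to the fields visible in the two example responses in this thread; they are not the library's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeFailure models an entry of "_nodes.failures" (Cluster Stats API),
// where "reason" is a plain string.
type nodeFailure struct {
	Type     string                 `json:"type"`
	Reason   string                 `json:"reason"`
	NodeID   string                 `json:"node_id"`
	CausedBy map[string]interface{} `json:"caused_by"`
}

// shardFailure models an entry of "_shards.failures" (search response),
// where "reason" is an object that itself carries a string "reason".
type shardFailure struct {
	Shard  int                    `json:"shard"`
	Index  string                 `json:"index"`
	Node   string                 `json:"node"`
	Reason map[string]interface{} `json:"reason"`
}

// clusterStats and searchResult keep only the parts relevant here.
type clusterStats struct {
	Nodes struct {
		Failed   int           `json:"failed"`
		Failures []nodeFailure `json:"failures"`
	} `json:"_nodes"`
}

type searchResult struct {
	Shards struct {
		Failed   int            `json:"failed"`
		Failures []shardFailure `json:"failures"`
	} `json:"_shards"`
}

func decodeClusterStats(data []byte) (clusterStats, error) {
	var cs clusterStats
	err := json.Unmarshal(data, &cs)
	return cs, err
}

func decodeSearch(data []byte) (searchResult, error) {
	var sr searchResult
	err := json.Unmarshal(data, &sr)
	return sr, err
}

func main() {
	cs, err := decodeClusterStats([]byte(`{"_nodes":{"failed":1,"failures":[{"type":"failed_node_exception","reason":"Failed node [abc]","node_id":"abc"}]}}`))
	fmt.Println(cs.Nodes.Failures, err)

	sr, err := decodeSearch([]byte(`{"_shards":{"failed":1,"failures":[{"shard":1,"index":"l9leakip-0000001","reason":{"type":"illegal_argument_exception","reason":"too long"}}]}}`))
	fmt.Println(sr.Shards.Failures, err)
}
```

Keeping the two shapes in separate types means each response decodes without a custom unmarshaller, at the cost of having to pick the right type for every API that returns a failures property.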

@olivere olivere closed this as completed in fb654ed Jul 7, 2021
@olivere
Owner

olivere commented Jul 7, 2021

@gboddin Do you have the time and option to test your code with the latest release-branch.v7 I just committed with fb654ed? Otherwise, I'll release 7.0.26 later.

@olivere olivere reopened this Jul 7, 2021
@gboddin

gboddin commented Jul 7, 2021

Done.

It's returning the results, err is nil, and results.Shards.Failures is populated with the 2 failures!

Thanks!

@olivere
Owner

olivere commented Jul 8, 2021

Thanks @gboddin. Will release 7.0.26 in a minute that will hopefully fix this issue.

@olivere olivere closed this as completed Jul 8, 2021
@gboddin

gboddin commented Jul 8, 2021

Thank you so much. ( For the fix and the awesome library ! )

dungnx pushed a commit to dungnx/elastic that referenced this issue Sep 16, 2021
This commit fixes a change in the cluster stats response structure.

Close olivere#1494
dungnx pushed a commit to dungnx/elastic that referenced this issue Sep 16, 2021
This commit adds a few more test cases for happy/unhappy responses of
Cluster Stats API across different ES 7.x versions.

We also added a Docker Compose file to start a cluster in a specific
version.

See olivere#1494
dungnx pushed a commit to dungnx/elastic that referenced this issue Sep 16, 2021
This commit fixes a few issues regarding different response structures
with a `failures` property. E.g. the `_shards` response structure has a
`failures` property which returns different failures for the `reason`
property than the `failures[x].reason` property of `_nodes` (returned
from Cluster Stats API). I was confused due to this and messed up
7.0.25 because of it.

This hopefully fixes olivere#1494.