json: cannot unmarshal string into Go struct field ShardFailure._nodes.failures.reason of type map[string]interface {} #1494

t-botz · 2021-05-24T06:19:30Z

Please use the following questions as a guideline to help me answer
your issue/question without further inquiry. Thank you.

Which version of Elastic are you using?

[x] elastic.v7 (for Elasticsearch 7.x)
[ ] elastic.v6 (for Elasticsearch 6.x)
[ ] elastic.v5 (for Elasticsearch 5.x)
[ ] elastic.v3 (for Elasticsearch 2.x)
[ ] elastic.v2 (for Elasticsearch 1.x)

Please describe the expected behavior

When calling client.ClusterStats().Do(ctx) on a cluster with a failing node, I expect to get a valid ClusterStatsResponse.

The response from the cluster for /_cluster/stats is:

{
    "_nodes": {
        "total": 2,
        "successful": 1,
        "failed": 1,
        "failures": [
            {
                "type": "failed_node_exception",
                "reason": "Failed node [mhUZF1sPTcu2b-pIJfqQRg]",
                "node_id": "mhUZF1sPTcu2b-pIJfqQRg",
                "caused_by": {
                    "type": "node_not_connected_exception",
                    "reason": "[es02][172.27.0.2:9300] Node not connected"
                }
            }
        ]
    },
    "cluster_name": "es-docker-cluster",
    "cluster_uuid": "r-OkEGlJTFOE8wP36G-VSg",
    "timestamp": 1621834319499,
    "indices": {
        "count": 0,
        "shards": {},
        "docs": {
            "count": 0,
            "deleted": 0
        },
        "store": {
            "size_in_bytes": 0,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 0,
            "memory_in_bytes": 0,
            "terms_memory_in_bytes": 0,
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 0,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory_in_bytes": 0,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0,
            "max_unsafe_auto_id_timestamp": -9223372036854775808,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        }
    },
    "nodes": {
        "count": {
            "total": 1,
            "coordinating_only": 0,
            "data": 1,
            "data_cold": 1,
            "data_content": 1,
            "data_hot": 1,
            "data_warm": 1,
            "ingest": 1,
            "master": 1,
            "ml": 1,
            "remote_cluster_client": 1,
            "transform": 1,
            "voting_only": 0
        },
        "versions": [
            "7.10.0"
        ],
        "os": {
            "available_processors": 6,
            "allocated_processors": 6,
            "names": [
                {
                    "name": "Linux",
                    "count": 1
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "CentOS Linux 8 (Core)",
                    "count": 1
                }
            ],
            "mem": {
                "total_in_bytes": 2084679680,
                "free_in_bytes": 590282752,
                "used_in_bytes": 1494396928,
                "free_percent": 28,
                "used_percent": 72
            }
        },
        "process": {
            "cpu": {
                "percent": 0
            },
            "open_file_descriptors": {
                "min": 260,
                "max": 260,
                "avg": 260
            }
        },
        "jvm": {
            "max_uptime_in_millis": 1042623,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 1
                }
            ],
            "mem": {
                "heap_used_in_bytes": 208299344,
                "heap_max_in_bytes": 314572800
            },
            "threads": 32
        },
        "fs": {
            "total_in_bytes": 62725623808,
            "free_in_bytes": 26955173888,
            "available_in_bytes": 23738458112
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 1
            },
            "http_types": {
                "security4": 1
            }
        },
        "discovery_types": {
            "zen": 1
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 1
            }
        ],
        "ingest": {
            "number_of_pipelines": 0,
            "processor_stats": {}
        }
    }
}

Please describe the actual behavior

ClusterStatsResponse is failing unmarshalling with the error:
json: cannot unmarshal string into Go struct field ShardFailure._nodes.failures.reason of type map[string]interface {}
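The mismatch can be reproduced with the standard library alone. The following is a minimal sketch, not the library's code: `nodeFailure`, `nodesInfo`, `statsResponse`, and `isTypeMismatch` are hypothetical names, and the struct only assumes what the error message states, namely that `reason` is declared as `map[string]interface{}` while the server sends a string.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// nodeFailure is a hypothetical stand-in for a failure struct whose
// Reason field expects a JSON object, as the error message suggests.
type nodeFailure struct {
	Type   string                 `json:"type"`
	Reason map[string]interface{} `json:"reason"`
	NodeID string                 `json:"node_id"`
}

// nodesInfo mirrors only the parts of "_nodes" needed for this sketch.
type nodesInfo struct {
	Total    int            `json:"total"`
	Failed   int            `json:"failed"`
	Failures []*nodeFailure `json:"failures"`
}

// statsResponse is a hypothetical, heavily trimmed response type.
type statsResponse struct {
	Nodes *nodesInfo `json:"_nodes"`
}

// payload is a trimmed copy of the "_nodes" section shown above.
const payload = `{"_nodes":{"total":2,"successful":1,"failed":1,"failures":[
  {"type":"failed_node_exception",
   "reason":"Failed node [mhUZF1sPTcu2b-pIJfqQRg]",
   "node_id":"mhUZF1sPTcu2b-pIJfqQRg"}]}}`

// isTypeMismatch reports whether decoding body fails because a JSON
// value has a different type than the Go field it is decoded into.
func isTypeMismatch(body string) bool {
	var resp statsResponse
	err := json.Unmarshal([]byte(body), &resp)
	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr)
}

func main() {
	fmt.Println("type mismatch:", isTypeMismatch(payload))
}
```

The exact wording of the error depends on the struct and field names involved, so the sketch only checks for the error type.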

Any steps to reproduce the behavior?

Launch a 2 node elasticsearch cluster, kill -9 the master node, ask the other node how it feels :)


t-botz · 2021-05-24T06:22:12Z

FWIW I had a similar issue in production, and because of this error I can't tell what actually happened there or what kind of failure it was, but the stack trace was similar to the one I reproduced.

olivere · 2021-06-16T10:06:42Z

Sad node on the other end ;-)

Should be fixed now.

zhaozong · 2021-06-29T10:51:44Z

After upgrading to 7.0.25, it reports an error here

“cannot unmarshal object into Go struct field FailedNodeException._shards.failures.reason of type string”

my elasticsearch version is 7.10.1
@olivere

olivere · 2021-07-04T10:46:43Z

@zhaozong Oh my, seems like Elasticsearch has a breaking change in the response structure then, in a minor version update. I will look into this, but it looks like we can't change it nicely and make everyone happy.

gboddin · 2021-07-05T16:27:36Z

We had to revert to .24 for now, using ES 7.12.1

olivere · 2021-07-06T07:02:10Z

That's unfortunate. I will look into an alternative for the next release. The problem is that it probably would be a breaking change for anyone :-(

gboddin · 2021-07-06T21:00:19Z

Ah damn, what's the breaking change ? btw it was on a single (healthy) node.

Let me know if you need me to run some tests !

( https://github.com/LeakIX/yql-elastic , matching a single term (like test:citizen) seems to trigger the issue)

This commit adds a few more test cases for happy/unhappy responses of Cluster Stats API across different ES 7.x versions. We also added a Docker Compose file to start a cluster in a specific version. See #1494

olivere · 2021-07-07T10:05:10Z

The issue seems to be that there are different kinds of errors returned in a similar structure. Some operations return exceptions on the shard level (ShardOperationFailedException in Java), some return exceptions on the operation level (a general ElasticsearchException in the Java source). I might have mixed them up because they're very similar, e.g. both return a failures property but with a different structure. And it seems this has only just now been uncovered due to a subtle difference in the reason property (for ShardOperationFailedException it's a JSON object, for ElasticsearchException it's a string).
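One way to picture the two shapes is to model them as separate Go types. This is only a sketch based on the two JSON examples in this thread: the type names (`errorCause`, `shardFailure`, `nodeFailure`) and the decode helpers are hypothetical and are not taken from the library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorCause is the object form: a type plus a human-readable reason.
type errorCause struct {
	Type   string `json:"type"`
	Reason string `json:"reason"`
}

// shardFailure models an entry of "_shards.failures",
// where "reason" is a JSON object.
type shardFailure struct {
	Shard  int         `json:"shard"`
	Index  string      `json:"index"`
	Node   string      `json:"node"`
	Reason *errorCause `json:"reason"`
}

// nodeFailure models an entry of "_nodes.failures",
// where "reason" is a plain string.
type nodeFailure struct {
	Type     string      `json:"type"`
	Reason   string      `json:"reason"`
	NodeID   string      `json:"node_id"`
	CausedBy *errorCause `json:"caused_by"`
}

// decodeShardFailure decodes a single "_shards.failures" entry.
func decodeShardFailure(raw string) (*shardFailure, error) {
	var f shardFailure
	if err := json.Unmarshal([]byte(raw), &f); err != nil {
		return nil, err
	}
	return &f, nil
}

// decodeNodeFailure decodes a single "_nodes.failures" entry.
func decodeNodeFailure(raw string) (*nodeFailure, error) {
	var f nodeFailure
	if err := json.Unmarshal([]byte(raw), &f); err != nil {
		return nil, err
	}
	return &f, nil
}

func main() {
	nf, err := decodeNodeFailure(`{"type":"failed_node_exception","reason":"Failed node [x]","node_id":"x"}`)
	fmt.Println(nf, err)

	sf, err := decodeShardFailure(`{"shard":1,"index":"idx","node":"n","reason":{"type":"illegal_argument_exception","reason":"too long"}}`)
	fmt.Println(sf, err)
}
```

Feeding either payload to the other decoder fails with a type error, which matches the two error messages reported in this issue.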

Worse is that I can only find this in the failure case, because only then the failures structure is properly populated by the server. I will see if and how to do this.

Any fully runnable and reproducible test case will help.

gboddin · 2021-07-07T13:45:42Z

Ok, so in my case :

There was an error before. I'm using a highlighter, and here's the query exchange:

POST /l9leakip%2Cl9leakdomain/_search HTTP/1.1
Host: 192.168.10.2:9200
User-Agent: elastic/7.0.24 (linux-amd64)
Transfer-Encoding: chunked
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"from":0,"highlight":{"fields":{"events.summary":{}},"post_tags":[""],"pre_tags":[""]},"query":{"bool":{"must":{"bool":{"should":{"bool":{"should":[{"nested":{"path":"events","query":{"match":{"events.hostname":{"query":"mysearch"}}}}},{"nested":{"path":"events","query":{"match":{"events.summary":{"query":"mysearch"}}}}},{"match":{"plugins":{"query":"mysearch"}}},{"nested":{"path":"events","query":{"match":{"events.ip.keyword":{"query":"mysearch"}}}}}]}}}}}},"size":20,"sort":[{"creation_date":{"order":"desc"}},{"_score":{"order":"desc"}}],"track_total_hits":true}

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-encoding: gzip
transfer-encoding: chunked

{
   "took":1146,
   "timed_out":false,
   "_shards":{
      "total":8,
      "successful":6,
      "skipped":0,
      "failed":2,
      "failures":[
         {
            "shard":1,
            "index":"l9leakip-0000001",
            "node":"AsQq1Dh2QxCSTRSLTg0vFw",
            "reason":{
               "type":"illegal_argument_exception",
               "reason":"The length [1119437] of field [events.summary] in doc[2524900]/index[l9leakip-0000001] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
            }
         },
         {
            "shard":3,
            "index":"l9leakip-0000001",
            "node":"AsQq1Dh2QxCSTRSLTg0vFw",
            "reason":{
               "type":"illegal_argument_exception",
               "reason":"The length [1023566] of field [events.summary] in doc[2168434]/index[l9leakip-0000001] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
            }
         }
      ]
   },
   "hits":{}
}

Can we return the results AND the error ?

It's a breaking change, but it could make sense. You won't have to check results.Shards.Failures unless you get a shard error from your query.

olivere · 2021-07-07T14:42:12Z

@gboddin OK, that's helpful. I was looking at how Cluster Stats API changed, but my change affected other locations in the code as well. Thanks for the example.

The _shards property is already returned in the search response, so there's no breaking change. The problem is that there is a failures structure that returns reason as string and another (your example) where it is returned as an object (which itself has a reason of type string).

I wasn't aware of that, hence I broke it in 7.0.25.

I have a version that will revert the change in 7.0.25 and still make Cluster Stats API work fine. But I fear that there might be more locations in the code where I'm using the wrong failures struct.
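For code outside the library that has to cope with both shapes, one option is a field type with a custom `UnmarshalJSON` that accepts either a string or an object. This is a sketch of that general technique, not the fix that went into the library; `flexReason`, `failure`, and `reasonText` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// flexReason is a hypothetical type that accepts "reason" either as a
// JSON string or as a JSON object with its own "reason" member.
type flexReason struct {
	Text    string                 // the human-readable reason, if any
	Details map[string]interface{} // the full object, when one was sent
}

// UnmarshalJSON inspects the first byte to decide which shape was sent.
func (r *flexReason) UnmarshalJSON(data []byte) error {
	data = bytes.TrimSpace(data)
	if len(data) > 0 && data[0] == '"' {
		// String form, as in "_nodes.failures".
		return json.Unmarshal(data, &r.Text)
	}
	// Object form, as in "_shards.failures" (null leaves Details nil).
	if err := json.Unmarshal(data, &r.Details); err != nil {
		return err
	}
	if s, ok := r.Details["reason"].(string); ok {
		r.Text = s
	}
	return nil
}

// failure is a minimal failure entry that uses the flexible field.
type failure struct {
	Reason flexReason `json:"reason"`
}

// reasonText decodes one failure entry and returns its reason text.
func reasonText(raw string) (string, error) {
	var f failure
	if err := json.Unmarshal([]byte(raw), &f); err != nil {
		return "", err
	}
	return f.Reason.Text, nil
}

func main() {
	fmt.Println(reasonText(`{"reason":"Failed node [x]"}`))
	fmt.Println(reasonText(`{"reason":{"type":"illegal_argument_exception","reason":"too long"}}`))
}
```

The trade-off is that callers lose a precise static type for `reason`, which is why keeping two distinct failure structs can still be the cleaner choice inside a client library.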

olivere · 2021-07-07T15:17:13Z

@gboddin Do you have the time and option to test your code with the latest release-branch.v7 I just committed with fb654ed? Otherwise, I'll release 7.0.26 later.

gboddin · 2021-07-07T16:44:19Z

Done,

It's returning the results, err is nil, and results.Shards.Failures is populated with the 2 failures !
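The same check can be illustrated without the client: a search response with HTTP 200 may still carry shard failures, so it is worth inspecting `_shards.failures` even when no error is returned. The sketch below decodes a trimmed version of the response posted earlier with hypothetical local types (`searchResponse`, `shardsInfo`, `shardFailure`); the reason text in `sample` is shortened for this sketch, and none of these names are the library's own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorCause is the object found under "reason" in a shard failure.
type errorCause struct {
	Type   string `json:"type"`
	Reason string `json:"reason"`
}

// shardFailure is one entry of "_shards.failures".
type shardFailure struct {
	Shard  int         `json:"shard"`
	Index  string      `json:"index"`
	Node   string      `json:"node"`
	Reason *errorCause `json:"reason"`
}

// shardsInfo is the "_shards" section of a search response.
type shardsInfo struct {
	Total      int            `json:"total"`
	Successful int            `json:"successful"`
	Failed     int            `json:"failed"`
	Failures   []shardFailure `json:"failures"`
}

// searchResponse keeps only the fields this sketch needs.
type searchResponse struct {
	TimedOut bool       `json:"timed_out"`
	Shards   shardsInfo `json:"_shards"`
}

// sample is a trimmed response; the reason text is shortened here.
const sample = `{"took":1146,"timed_out":false,"_shards":{"total":8,"successful":6,"skipped":0,"failed":2,"failures":[
  {"shard":1,"index":"l9leakip-0000001","node":"AsQq1Dh2QxCSTRSLTg0vFw",
   "reason":{"type":"illegal_argument_exception","reason":"shortened for this sketch"}},
  {"shard":3,"index":"l9leakip-0000001","node":"AsQq1Dh2QxCSTRSLTg0vFw",
   "reason":{"type":"illegal_argument_exception","reason":"shortened for this sketch"}}]},"hits":{}}`

// failureSummaries returns one line per shard failure in body.
func failureSummaries(body string) ([]string, error) {
	var resp searchResponse
	if err := json.Unmarshal([]byte(body), &resp); err != nil {
		return nil, err
	}
	var lines []string
	for _, f := range resp.Shards.Failures {
		kind := "unknown"
		if f.Reason != nil {
			kind = f.Reason.Type
		}
		lines = append(lines, fmt.Sprintf("shard %d of %s: %s", f.Shard, f.Index, kind))
	}
	return lines, nil
}

func main() {
	lines, err := failureSummaries(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```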

Thanks !

olivere · 2021-07-08T08:46:11Z

Thanks @gboddin. Will release 7.0.26 in a minute that will hopefully fix this issue.

gboddin · 2021-07-08T17:36:29Z

Thank you so much. ( For the fix and the awesome library ! )

This commit fixes a change in the cluster stats response structure. Close olivere#1494


This commit fixes a few issues regarding different response structures with a `failures` property. E.g. the `_shards` response structure has a `failures` property which returns different failures for the `reason` property than the `failures[x].reason` property of `_nodes` (returned from Cluster Stats API). I was confused due to this and messed up 7.0.25 because of it. This hopefully fixes olivere#1494.

olivere added this to the 7.0.25 milestone Jun 16, 2021

olivere closed this as completed in 107c379 Jun 16, 2021

olivere reopened this Jul 6, 2021

olivere modified the milestones: 7.0.25, 7.0.26 Jul 6, 2021

olivere closed this as completed in fb654ed Jul 7, 2021

olivere reopened this Jul 7, 2021

olivere closed this as completed Jul 8, 2021

dungnx pushed a commit to dungnx/elastic that referenced this issue Sep 16, 2021

Fix Cluster Stats response structure

dfe9b81

This commit fixes a change in the cluster stats response structure. Close olivere#1494
