sorted cardinality results don't include the largest bucket #67782

@LeeDr

Elasticsearch version (bin/elasticsearch --version): 8.0.0, 7.12.0, 7.11.0

Plugins installed: [] none, default distribution

JVM version (java -version): built-in JDK

OS version (uname -a if on a Unix-like system): all (this is my current master source running but this impacts 7.x and 7.11 branches as well)

{
  "name" : "LEEDR-XPS",
  "cluster_name" : "es-test-cluster",
  "cluster_uuid" : "s-GANDlNSZ2nNdr00SQw3g",
  "version" : {
    "number" : "8.0.0-SNAPSHOT",
    "build_flavor" : "oss",
    "build_type" : "zip",
    "build_hash" : "3454a094f73e7696446dbd2c0525041293dd4460",
    "build_date" : "2021-01-19T19:31:16.897887417Z",
    "build_snapshot" : true,
    "lucene_version" : "8.8.0",
    "minimum_wire_compatibility_version" : "7.12.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

Description of the problem including expected versus actual behavior: A terms aggregation ordered by a cardinality sub-aggregation no longer returns the term with the largest cardinality. Which buckets come back depends on the terms "size".

Almost 3 years ago we automated the Shakespeare Kibana getting started tutorial: https://www.elastic.co/guide/en/kibana/6.8/tutorial-load-dataset.html
The test passed with the same expected results until about Oct 29, 2020, when the results returned by the aggregation changed. Unfortunately, the test was skipped so that Kibana could take the new Elasticsearch snapshot, and the change wasn't investigated until now.

Steps to reproduce:

  1. Download this data: https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json
  2. Create this mapping:
PUT /shakespeare
{
 "mappings": {
   "properties": {
     "speaker": {
       "type": "keyword"
     },
     "play_name": {
       "type": "keyword"
     },
     "line_id": {
       "type": "integer"
     },
     "speech_number": {
       "type": "integer"
     }
   }
 }
}
  3. Load the data:
    curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json
  4. Count the docs to make sure we have the same data:
    curl -XGET 'localhost:9200/shakespeare/_count'
    The response should contain "count":111396
  5. Run the same query as the Kibana visualization test:
GET /shakespeare/_search
{
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "speaker"
          }
        }
      }
    }
  },
  "size": 0,
  "fields": [],
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "_source": {
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
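
For reference, the request above asks for the 5 plays with the most distinct speakers, largest first. Below is a minimal sketch of those expected semantics in plain Python. It is not how Elasticsearch computes the result: it counts with exact sets, whereas the cardinality aggregation is approximate. The helper name `top_terms_by_cardinality` and the toy documents are made up for illustration and are not taken from the Shakespeare data.

```python
from collections import defaultdict


def top_terms_by_cardinality(docs, term_field, distinct_field, size):
    """Return the `size` terms with the most distinct `distinct_field` values.

    Hypothetical reference helper: exact counting, sorted by distinct count
    descending, ties broken by term name.
    """
    distinct = defaultdict(set)
    for doc in docs:
        distinct[doc[term_field]].add(doc[distinct_field])
    ranked = sorted(distinct.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(term, len(values)) for term, values in ranked[:size]]


# Toy documents, invented for illustration only.
docs = [
    {"play_name": "A", "speaker": "s1"},
    {"play_name": "A", "speaker": "s2"},
    {"play_name": "A", "speaker": "s3"},
    {"play_name": "B", "speaker": "s1"},
    {"play_name": "B", "speaker": "s1"},
    {"play_name": "C", "speaker": "s1"},
    {"play_name": "C", "speaker": "s2"},
]

top1 = top_terms_by_cardinality(docs, "play_name", "speaker", size=1)
top3 = top_terms_by_cardinality(docs, "play_name", "speaker", size=3)
```

Whatever the `size`, the smaller result should be a prefix of the larger one, and the term with the highest distinct count should always come first. That is the property the responses in this report violate.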

The results I get on the latest master are incorrect; the bucket with the largest value (Richard III, 71) is missing:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "2" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 94454,
      "buckets" : [
        {
          "key" : "Henry VI Part 2",
          "doc_count" : 3334,
          "1" : {
            "value" : 65
          }
        },
        {
          "key" : "Coriolanus",
          "doc_count" : 3992,
          "1" : {
            "value" : 62
          }
        },
        {
          "key" : "Antony and Cleopatra",
          "doc_count" : 3862,
          "1" : {
            "value" : 55
          }
        },
        {
          "key" : "Henry VI Part 1",
          "doc_count" : 2983,
          "1" : {
            "value" : 53
          }
        },
        {
          "key" : "Julius Caesar",
          "doc_count" : 2771,
          "1" : {
            "value" : 51
          }
        }
      ]
    }
  }
}

If we increase the terms agg size to 12, the results include the largest bucket value of 71. That is the value the Kibana test has expected since it was written almost 3 years ago, and it is what 7.10 returns:

      "buckets" : [
        {
          "key" : "Richard III",
          "doc_count" : 3941,
          "1" : {
            "value" : 71
          }
        },
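
The mismatch between the two responses can be checked mechanically. The sketch below uses the bucket values reported in this issue, flattened to `key`/`value` pairs (an assumed shape for illustration, not the raw response format). The helper `missing_larger_buckets` is hypothetical. Only the one size-12 bucket quoted above is included, since the rest of that response is not shown here.

```python
def missing_larger_buckets(small, large):
    """Keys in `large` that outrank the smallest bucket of `small` but are absent from it.

    A correct descending sort should make this list empty.
    """
    small_keys = {bucket["key"] for bucket in small}
    floor = min(bucket["value"] for bucket in small)
    return [
        bucket["key"]
        for bucket in large
        if bucket["value"] > floor and bucket["key"] not in small_keys
    ]


# Buckets returned with "size": 5, as reported in this issue.
size_5 = [
    {"key": "Henry VI Part 2", "value": 65},
    {"key": "Coriolanus", "value": 62},
    {"key": "Antony and Cleopatra", "value": 55},
    {"key": "Henry VI Part 1", "value": 53},
    {"key": "Julius Caesar", "value": 51},
]

# The only "size": 12 bucket quoted in this issue.
size_12_known = [{"key": "Richard III", "value": 71}]

missing = missing_larger_buckets(size_5, size_12_known)
print(missing)
```

A non-empty result means the smaller request dropped a bucket that should have ranked above at least one of the buckets it returned.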

Provide logs (if relevant):
