[BUG] Performance regression in top_hits aggregation #1647

@andrross

Description

I believe I have isolated a regression in performance of the top_hits aggregation between ES 7.9 and ES 7.10.2 (which means the regression is present in OpenSearch 1.0 as well).

To Reproduce
I added the following operation to the default nyc_taxis track in esrally:

{
  "name": "top_hits_agg",
  "warmup-iterations": 10,
  "iterations": 50,
  "clients": 2,
  "operation": {
    "operation-type": "search",
    "body": {
      "size": 0,
      "query": {
        "range": {
          "dropoff_datetime": {
            "from": "2015-01-01 00:00:00",
            "to": "2015-01-01 12:00:00"
          }
        }
      },
      "aggs": {
        "2": {
          "terms": {
            "field": "payment_type",
            "size": 1000
          },
          "aggregations": {
            "3": {
              "date_histogram": {
                "field": "dropoff_datetime",
                "fixed_interval": "5s"
              },
              "aggregations": {
                "1": {
                  "top_hits": {
                    "size": 1,
                    "sort": [
                      {
                        "dropoff_datetime": {
                          "order": "desc"
                        }
                      }
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
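For a sense of scale, the arithmetic below estimates how many date_histogram buckets this query can produce per payment_type term. It is only a rough upper bound: it assumes every 5-second interval in the 12-hour window contains at least one matching document, which the dataset may not satisfy.

```python
from datetime import datetime

# Range bounds and interval copied from the operation above.
FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2015-01-01 00:00:00", FMT)
end = datetime.strptime("2015-01-01 12:00:00", FMT)
interval_seconds = 5  # "fixed_interval": "5s"

window_seconds = int((end - start).total_seconds())

# Number of full 5-second intervals in the window. Assumption: every interval
# is non-empty, so this is an upper bound rather than a measured bucket count.
full_intervals = window_seconds // interval_seconds

print(f"up to {full_intervals} date_histogram buckets per payment_type term")
```

Each non-empty bucket carries its own top_hits sub-aggregation with size 1, so the per-bucket cost of top_hits is multiplied by the bucket count for every payment_type value.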

After populating both a 7.9 and 7.10.2 cluster with the nyc_taxis dataset, I then ran a race with the following commands:

esrally race --pipeline=benchmark-only --distribution-version=7.9.0 --track=nyc_taxis --target-host=127.0.0.1:9790 --report-file=es_790_report_$(date +%s) --user-tag='version:7.9,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

esrally race --pipeline=benchmark-only --distribution-version=7.10.2 --track=nyc_taxis --target-host=127.0.0.1:9710 --report-file=es_710_report_$(date +%s) --user-tag='version:7.10.2,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

Expected behavior
The expected behavior is very similar performance for the top_hits aggregation between the two versions. The following data compares ES 7.9 as the baseline and ES 7.10.2 as the contender. As expected, the performance of the date_histogram_agg operation is more or less at parity between the two. However, top_hits_agg performance is consistently worse in 7.10.2 than in 7.9: throughput is roughly 22-23% lower and latency is roughly 30-34% higher across the reported percentiles.

| Metric | Task | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 125.509 | 110.375 | -15.1333 | min |
| Min cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Median cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Max cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge count of primary shards | | 93 | 89 | -4 | |
| Min cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Median cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Max cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge throttle time of primary shards | | 4.12925 | 4.25008 | 0.12083 | min |
| Min cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Median cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Max cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Cumulative refresh time of primary shards | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | |
| Min cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Median cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Max cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative flush time of primary shards | | 5.45558 | 5.48922 | 0.03363 | min |
| Cumulative flush count of primary shards | | 35 | 34 | -1 | |
| Min cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Median cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Max cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Total Young Gen GC time | | 0.258 | 1.24 | 0.982 | s |
| Total Young Gen GC count | | 32 | 136 | 104 | |
| Total Old Gen GC time | | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | |
| Store size | | 25.0414 | 24.3655 | -0.67586 | GB |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB |
| Heap used for segments | | 0.193222 | 0.0988235 | -0.0944 | MB |
| Heap used for doc values | | 0.0354309 | 0.034874 | -0.00056 | MB |
| Heap used for terms | | 0.0408325 | 0.0370789 | -0.00375 | MB |
| Heap used for norms | | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0.116959 | 0.0268707 | -0.09009 | MB |
| Segment count | | 32 | 30 | -2 | |
| Min Throughput | date_histogram_agg | 1.50023 | 1.50023 | -1e-05 | ops/s |
| Median Throughput | date_histogram_agg | 1.50035 | 1.50035 | -0 | ops/s |
| Max Throughput | date_histogram_agg | 1.50054 | 1.50054 | 0 | ops/s |
| 50th percentile latency | date_histogram_agg | 548.563 | 549.301 | 0.73863 | ms |
| 90th percentile latency | date_histogram_agg | 557.053 | 559.417 | 2.36497 | ms |
| 99th percentile latency | date_histogram_agg | 571.816 | 569.581 | -2.23542 | ms |
| 100th percentile latency | date_histogram_agg | 578.69 | 575.319 | -3.37139 | ms |
| 50th percentile service time | date_histogram_agg | 547.429 | 548.28 | 0.85141 | ms |
| 90th percentile service time | date_histogram_agg | 555.608 | 558.266 | 2.65809 | ms |
| 99th percentile service time | date_histogram_agg | 570.955 | 568.385 | -2.57038 | ms |
| 100th percentile service time | date_histogram_agg | 577.466 | 573.309 | -4.1576 | ms |
| error rate | date_histogram_agg | 0 | 0 | 0 | % |
| Min Throughput | top_hits_agg | 0.720015 | 0.559013 | -0.161 | ops/s |
| Median Throughput | top_hits_agg | 0.747501 | 0.577278 | -0.17022 | ops/s |
| Max Throughput | top_hits_agg | 0.759315 | 0.582273 | -0.17704 | ops/s |
| 50th percentile latency | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile latency | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile latency | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile latency | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| 50th percentile service time | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile service time | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile service time | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile service time | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| error rate | top_hits_agg | 0 | 0 | 0 | % |
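
To make the size of the regression explicit, the sketch below converts two of the top_hits_agg rows into relative changes. The numbers are copied from the comparison data above; the helper name `relative_change` is only illustrative.

```python
# Median throughput (ops/s) and 50th percentile latency (ms) for top_hits_agg,
# copied from the baseline (7.9) and contender (7.10.2) columns.
baseline = {"median_throughput": 0.747501, "p50_latency": 2569.6}
contender = {"median_throughput": 0.577278, "p50_latency": 3373.35}


def relative_change(base: float, new: float) -> float:
    """Return the change from base to new as a percentage of base."""
    return (new - base) / base * 100.0


for metric in baseline:
    change = relative_change(baseline[metric], contender[metric])
    print(f"{metric}: {change:+.1f}%")
```

With these inputs the script reports about -22.8% for median throughput and +31.3% for 50th percentile latency.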

Plugins
None

Host/Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Version: ES 7.9 and ES 7.10.2
  • Host type: c6i.8xlarge

Additional context
The Elasticsearch distributions are the stock versions, downloaded as tar files, extracted, and started with the following command:

ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch

The only non-default setting in the yml configuration file is the port number. Both instances are running on the same machine, though load is only driven against one at a time (not concurrently). The service is bound to localhost and esrally is running on the same machine.
