Add support for dynamic pruning to cardinality aggregations on low-cardinality keyword fields. #92060

Merged: 14 commits merged into elastic:main from the cardinality_dynamic_pruning branch on May 1, 2023

Conversation

@jpountz (Contributor) commented Dec 2, 2022

On low-cardinality keyword fields, the cardinality aggregation currently uses the global_ordinals execution mode most of the time, which consists of collecting all documents that match the query, reading ordinals of the values that these documents contain, and setting bits in a bitset for these ordinals.

This commit introduces a feedback loop between the query and the cardinality aggregator, which allows the query to skip documents that only contain values that have already been seen by the cardinality aggregator. On the nyc_taxis dataset, with a match_all query and the vendor_id field (2 unique values), the cardinality aggregation went from 3s to 3ms. The speedup would certainly not be as good in all cases, but I would still expect it to be very significant in many cases.
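
To illustrate the mechanism, here is a minimal, self-contained sketch in plain Java (hypothetical class and variable names, not the actual Elasticsearch/Lucene code): the aggregator records the global ordinals it has already seen in a bitset, and once every ordinal of the field has been observed, no remaining document can change the result, so collection can stop.

    import java.util.BitSet;
    import java.util.List;

    public class DynamicPruningSketch {
        public static void main(String[] args) {
            // Hypothetical segment: each document carries the ordinals of its keyword values.
            List<int[]> docOrdinals = List.of(
                new int[] { 0 }, new int[] { 1 }, new int[] { 0 }, new int[] { 1 }, new int[] { 0 }
            );
            int maxOrd = 2; // number of unique values of the field, e.g. vendor_id

            BitSet visitedOrds = new BitSet(maxOrd);
            int visited = 0;
            for (int[] ords : docOrdinals) {
                // Feedback loop: once every ordinal has been seen, no further document can
                // change the result, so the remaining documents can be skipped.
                if (visitedOrds.cardinality() == maxOrd) {
                    break;
                }
                for (int ord : ords) {
                    visitedOrds.set(ord);
                }
                visited++;
            }
            System.out.println("cardinality = " + visitedOrds.cardinality() + ", docs visited = " + visited);
        }
    }

As the review threads below show, the actual implementation goes further: it hands the query a CompetitiveIterator backed by the postings of the not-yet-seen terms, so documents containing only already-seen values are skipped during collection rather than merely short-circuited at the end.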

Add support for dynamic pruning to cardinality aggregations on low-cardinality keyword fields.

On low-cardinality keyword fields, the `cardinality` aggregation currently uses
the `global_ordinals` execution mode most of the time, which consists of
collecting all documents that match the query, reading ordinals of the values
that these documents contain, and setting bits in a bitset for these ordinals.

This commit introduces a feedback loop between the query and the `cardinality`
aggregator, which allows the query to skip documents that only contain values
that have already been seen by the `cardinality` aggregator. On the `nyc_taxis`
dataset, with a `match_all` query and the `vendor_id` field (2 unique values),
the `cardinality` aggregation went from 3s to 3ms. The speedup would certainly
not be as good in all cases, but I would still expect it to be very significant
in many cases.
@elasticsearchmachine (Collaborator) commented:

Hi @jpountz, I've created a changelog YAML for you.

@jpountz jpountz marked this pull request as ready for review December 2, 2022 13:56
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 2, 2022
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

@jpountz (Contributor, Author) commented Dec 7, 2022

I've run some benchmarks on a synthetically generated index that has 100M documents with the following values for the i-th indexed document (a small generation sketch follows the list):

  • k0 is i % 1000 < 50
  • k1 is i % 500 < 25
  • k2 is i % 500
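
For reference, a minimal sketch of how such a bulk payload could be generated. This is only an illustration of the data described above: the `synthetic` index name and the use of `_bulk` NDJSON are my assumptions (the PR does not include the generation script), and the boolean conditions are encoded as the keyword strings "0"/"1" to match the term queries below.

    public class GenerateSyntheticDocs {
        public static void main(String[] args) {
            long numDocs = 100_000_000L; // 100M documents, as in the benchmark
            for (long i = 0; i < numDocs; i++) {
                // Keyword values as described above, booleans encoded as "0"/"1".
                String k0 = (i % 1000 < 50) ? "1" : "0";
                String k1 = (i % 500 < 25) ? "1" : "0";
                String k2 = Long.toString(i % 500);
                // Two NDJSON lines per document for the _bulk API; "synthetic" is a made-up index name.
                System.out.println("{\"index\":{\"_index\":\"synthetic\"}}");
                System.out.println("{\"k0\":\"" + k0 + "\",\"k1\":\"" + k1 + "\",\"k2\":\"" + k2 + "\"}");
            }
        }
    }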

Then the following queries:

match_all-cardinality:

      {
        "size": 0,
        "track_total_hits": false,
        "aggs": {
          "k2_cardinality": {
            "cardinality": {
              "field": "k2",
              "execution_hint": "global_ordinals"
            }
          }
        }
      }

k0-cardinality. This runs a cardinality aggregation on a term query that would collect every unique value of k2, so it should be able to exit early. This is meant to simulate computing the cardinality of the agents sending data over a date range, since all agents are expected to show up given a sufficiently long time range.

      {
        "size": 0,
        "track_total_hits": false,
        "query": {
          "term": {
            "k0": "0"
          }
        },
        "aggs": {
          "k2_cardinality": {
            "cardinality": {
              "field": "k2",
              "execution_hint": "global_ordinals"
            }
          }
        }
      }

k1-cardinality. This runs a cardinality aggregation on a term query that does NOT see every value of k2. So it can prune some hits, but it cannot end early: it needs to go through the whole index. This would typically happen when there is a correlation between the query and the aggregated field, e.g. computing the cardinality of agents that reported errors: some monitored hosts are probably more likely to have errors than others.

      {
        "size": 0,
        "track_total_hits": false,
        "query": {
          "term": {
            "k1": "0"
          }
        },
        "aggs": {
          "k2_cardinality": {
            "cardinality": {
              "field": "k2",
              "execution_hint": "global_ordinals"
            }
          }
        }
      }

Here is what Rally reported (baseline = without this change, contender = with it):

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|                                                Min Throughput | match_all-cardinality |    0.57575     | 444.294       |   443.718   |  ops/s | +77067.82% |
|                                               Mean Throughput | match_all-cardinality |    0.578541    | 444.294       |   443.715   |  ops/s | +76695.54% |
|                                             Median Throughput | match_all-cardinality |    0.57887     | 444.294       |   443.715   |  ops/s | +76651.97% |
|                                                Max Throughput | match_all-cardinality |    0.580407    | 444.294       |   443.713   |  ops/s | +76448.66% |
|                                       50th percentile latency | match_all-cardinality | 1719.91        |   1.88433     | -1718.02    |     ms |    -99.89% |
|                                       90th percentile latency | match_all-cardinality | 1753.1         |   2.1355      | -1750.96    |     ms |    -99.88% |
|                                      100th percentile latency | match_all-cardinality | 1791.7         |   2.63628     | -1789.06    |     ms |    -99.85% |
|                                  50th percentile service time | match_all-cardinality | 1719.91        |   1.88433     | -1718.02    |     ms |    -99.89% |
|                                  90th percentile service time | match_all-cardinality | 1753.1         |   2.1355      | -1750.96    |     ms |    -99.88% |
|                                 100th percentile service time | match_all-cardinality | 1791.7         |   2.63628     | -1789.06    |     ms |    -99.85% |
|                                                    error rate | match_all-cardinality |    0           |   0           |     0       |      % |      0.00% |
|                                                Min Throughput |        k0-cardinality |    0.5367      | 473.317       |   472.78    |  ops/s | +88090.11% |
|                                               Mean Throughput |        k0-cardinality |    0.538432    | 473.317       |   472.778   |  ops/s | +87806.55% |
|                                             Median Throughput |        k0-cardinality |    0.538581    | 473.317       |   472.778   |  ops/s | +87782.16% |
|                                                Max Throughput |        k0-cardinality |    0.539811    | 473.317       |   472.777   |  ops/s | +87581.97% |
|                                       50th percentile latency |        k0-cardinality | 1846.59        |   1.83855     | -1844.75    |     ms |    -99.90% |
|                                       90th percentile latency |        k0-cardinality | 1891.9         |   2.06683     | -1889.83    |     ms |    -99.89% |
|                                      100th percentile latency |        k0-cardinality | 1953.52        |   2.39496     | -1951.12    |     ms |    -99.88% |
|                                  50th percentile service time |        k0-cardinality | 1846.59        |   1.83855     | -1844.75    |     ms |    -99.90% |
|                                  90th percentile service time |        k0-cardinality | 1891.9         |   2.06683     | -1889.83    |     ms |    -99.89% |
|                                 100th percentile service time |        k0-cardinality | 1953.52        |   2.39496     | -1951.12    |     ms |    -99.88% |
|                                                    error rate |        k0-cardinality |    0           |   0           |     0       |      % |      0.00% |
|                                                Min Throughput |        k1-cardinality |    0.529033    |   1.876       |     1.34697 |  ops/s |   +254.61% |
|                                               Mean Throughput |        k1-cardinality |    0.532735    |   1.88155     |     1.34881 |  ops/s |   +253.19% |
|                                             Median Throughput |        k1-cardinality |    0.533744    |   1.8789      |     1.34515 |  ops/s |   +252.02% |
|                                                Max Throughput |        k1-cardinality |    0.534619    |   1.89503     |     1.36041 |  ops/s |   +254.46% |
|                                       50th percentile latency |        k1-cardinality | 1871.84        | 535.195       | -1336.64    |     ms |    -71.41% |
|                                       90th percentile latency |        k1-cardinality | 1901.33        | 539.858       | -1361.47    |     ms |    -71.61% |
|                                      100th percentile latency |        k1-cardinality | 1945.91        | 542.866       | -1403.04    |     ms |    -72.10% |
|                                  50th percentile service time |        k1-cardinality | 1871.84        | 535.195       | -1336.64    |     ms |    -71.41% |
|                                  90th percentile service time |        k1-cardinality | 1901.33        | 539.858       | -1361.47    |     ms |    -71.61% |
|                                 100th percentile service time |        k1-cardinality | 1945.91        | 542.866       | -1403.04    |     ms |    -72.10% |
|                                                    error rate |        k1-cardinality |    0           |   0           |     0       |      % |      0.00% |

The query on k1 is between 3x and 4x faster, and the queries on match_all and k0 are ~1000x faster.

I also pushed some more tests that make sure that the optimization kicks in by using the debug info, so I think it's ready for someone to have a closer look at how it works.

@not-napoleon not-napoleon self-requested a review December 19, 2022 14:28
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@martijnvg (Member) commented:

@elasticmachine update branch

@not-napoleon (Member) left a comment:

Sorry it's taken me so long to look at this. I think this is fine to merge.

// we'd be paying the overhead of dynamic pruning without getting any benefits.
private static final int MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING = 1024;

// Only start dynamic pruning when 128 ordinals or less have not been seen yet.
Member:

I'm confused by this comment. Do we prune when we have less than 128 or more than 128 ordinals?

Member:

The way I understand it: if we have less than or equal to 128 unseen ordinals, then we prune.

@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@martijnvg (Member) commented:

@elasticmachine update branch

@martijnvg (Member) left a comment:

I think we should get this change merged. LGTM

// we'd be paying the overhead of dynamic pruning without getting any benefits.
private static final int MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING = 1024;

// Only start dynamic pruning when 128 ordinals or less have not been seen yet.
Member:

The way I understand it: if we have less than or equal to 128 unseen ordinals, then we prune.

void startPruning() throws IOException {
dynamicPruningSuccess++;
nonVisitedOrds = new HashMap<>();
// TODO: iterate the bitset using a `nextClearBit` operation?
Member:

Letting this loop be led by nextClearBit() is a good idea 👍
I think we can add that method to the bitset and make use of it in a follow-up change.
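
As an illustration of that pattern, here is a small sketch using java.util.BitSet, which already exposes nextClearBit; the BitArray used in the PR would need an equivalent method, as suggested above.

    import java.util.BitSet;

    public class NextClearBitSketch {
        public static void main(String[] args) {
            int maxOrd = 16; // hypothetical number of ordinals in the segment
            BitSet visitedOrds = new BitSet(maxOrd);
            visitedOrds.set(0);
            visitedOrds.set(3);
            visitedOrds.set(7);

            // Jump from one unseen ordinal to the next instead of testing every bit.
            for (int ord = visitedOrds.nextClearBit(0); ord < maxOrd; ord = visitedOrds.nextClearBit(ord + 1)) {
                System.out.println("non-visited ordinal: " + ord);
            }
        }
    }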

noData++;
return LeafBucketCollector.NO_OP_COLLECTOR;
}
// Otherwise we might be aggregating e.g. an IP field, which indexes data using points rather than an inverted index.
Member:

Maybe in a follow-up change we can also do the same trick here with points?

final CompetitiveIterator competitiveIterator;

{
// This optimization only works for top-level cardinality aggregations that collect bucket 0, so we can retrieve
Member:

If this is the only current limitation, then I think we can get around it by creating the CompetitiveIterator instance lazily? We would then need a CompetitiveIterator per bucket ordinal. Not sure if this can work, but it could be explored in a follow-up PR.

this.docsWithField = docsWithField;
}

private Map<Long, PostingsEnum> nonVisitedOrds;
Member:

Given that this optimisation only applies to fields with <= 1024 terms, there is no need to use a LongObjectPagedHashMap here.

if (indexTerms != null) {
BitArray bits = visitedOrds.get(0);
final int numNonVisitedOrds = maxOrd - (bits == null ? 0 : (int) bits.cardinality());
if (maxOrd <= MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING || numNonVisitedOrds <= MAX_TERMS_FOR_DYNAMIC_PRUNING) {
Member:

So the optimisation also kicks in on fields with more than 1024 unique values, if there are 128 or fewer terms left to be processed. The brute-force leaf bucket collector implementation does update the bit sets used here to determine numNonVisitedOrds.
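
To spell out the two thresholds being discussed, here is a small sketch of the decision. The constant values come from the hunks quoted above and from the "128 ordinals or less" comment; the method name is mine, not the PR's.

    public class PruningDecisionSketch {
        // Values taken from the quoted hunks and the "128 ordinals or less" comment.
        static final int MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING = 1024;
        static final int MAX_TERMS_FOR_DYNAMIC_PRUNING = 128;

        // Mirrors the quoted condition: prune on low-cardinality fields, or once few
        // enough ordinals remain unseen regardless of the field's cardinality.
        static boolean dynamicPruningApplies(int maxOrd, int numVisitedOrds) {
            int numNonVisitedOrds = maxOrd - numVisitedOrds;
            return maxOrd <= MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING
                || numNonVisitedOrds <= MAX_TERMS_FOR_DYNAMIC_PRUNING;
        }

        public static void main(String[] args) {
            System.out.println(dynamicPruningApplies(500, 0));     // true: field has <= 1024 unique values
            System.out.println(dynamicPruningApplies(5000, 4900)); // true: only 100 ordinals left unseen
            System.out.println(dynamicPruningApplies(5000, 100));  // false: high cardinality, too many unseen
        }
    }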

@martijnvg martijnvg merged commit 79ad42c into elastic:main May 1, 2023
@martijnvg (Member) commented:

This improvement also yielded a significant speedup for another benchmark that is under development. This search request counts the number of unique deployments using the cardinality aggregation:

{
    "aggs": {
      "0": {
        "cardinality": {
          "field": "kubernetes.deployment.name"
        }
      }
    },
    "size": 0,
    "fields": [
      {
        "field": "@timestamp",
        "format": "date_time"
      },
      {
        "field": "event.ingested",
        "format": "date_time"
      },
      {
        "field": "process.cpu.start_time",
        "format": "date_time"
      },
      {
        "field": "system.process.cpu.start_time",
        "format": "date_time"
      }
    ],
    "script_fields": {},
    "stored_fields": [
      "*"
    ],
    "runtime_mappings": {},
    "_source": {
      "excludes": []
    },
    "query": {
      "bool": {
        "must": [],
        "filter": [
          {
            "match_phrase": {
              "data_stream.dataset": "kubernetes.pod"
            }
          },
          {
            "range": {
              "@timestamp": {
                "format": "strict_date_optional_time",
                "gte": "{{info[1]}}",
                "lte": "{{end_time}}"
              }
            }
          }
        ],
        "should": [],
        "must_not": []
      }
    }
  }  

This search request has a top-level cardinality aggregation, which is a requirement for the improvements in this change to kick in. Results:

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|                                                 Min Throughput |            unique_deployment_count_15_minutes |     24.6054      |    284.15        |     259.545       |  ops/s | +1054.83% |
|                                               Mean Throughput |            unique_deployment_count_15_minutes |     24.8917      |    284.15        |     259.258       |  ops/s | +1041.55% |
|                                             Median Throughput |            unique_deployment_count_15_minutes |     24.9475      |    284.15        |     259.203       |  ops/s | +1038.99% |
|                                                Max Throughput |            unique_deployment_count_15_minutes |     25.0225      |    284.15        |     259.128       |  ops/s | +1035.58% |
|                                       50th percentile latency |            unique_deployment_count_15_minutes |     39.1605      |      2.79196     |     -36.3685      |     ms |   -92.87% |
|                                       90th percentile latency |            unique_deployment_count_15_minutes |     40.4239      |      3.40919     |     -37.0147      |     ms |   -91.57% |
|                                       99th percentile latency |            unique_deployment_count_15_minutes |     49.6284      |      6.01068     |     -43.6178      |     ms |   -87.89% |
|                                      100th percentile latency |            unique_deployment_count_15_minutes |     55.0896      |      8.12486     |     -46.9647      |     ms |   -85.25% |
|                                  50th percentile service time |            unique_deployment_count_15_minutes |     39.1605      |      2.79196     |     -36.3685      |     ms |   -92.87% |
|                                  90th percentile service time |            unique_deployment_count_15_minutes |     40.4239      |      3.40919     |     -37.0147      |     ms |   -91.57% |
|                                  99th percentile service time |            unique_deployment_count_15_minutes |     49.6284      |      6.01068     |     -43.6178      |     ms |   -87.89% |
|                                 100th percentile service time |            unique_deployment_count_15_minutes |     55.0896      |      8.12486     |     -46.9647      |     ms |   -85.25% |
|                                                    error rate |            unique_deployment_count_15_minutes |      0           |      0           |       0           |      % |     0.00% |
|                                                Min Throughput |               unique_deployment_count_2_hours |      8.43276     |    299.224       |     290.791       |  ops/s | +3448.35% |
|                                               Mean Throughput |               unique_deployment_count_2_hours |      8.44431     |    299.224       |     290.78        |  ops/s | +3443.50% |
|                                             Median Throughput |               unique_deployment_count_2_hours |      8.44032     |    299.224       |     290.784       |  ops/s | +3445.17% |
|                                                Max Throughput |               unique_deployment_count_2_hours |      8.4756      |    299.224       |     290.748       |  ops/s | +3430.42% |
|                                       50th percentile latency |               unique_deployment_count_2_hours |    117.179       |      2.79249     |    -114.386       |     ms |   -97.62% |
|                                       90th percentile latency |               unique_deployment_count_2_hours |    121.86        |      3.29555     |    -118.565       |     ms |   -97.30% |
|                                       99th percentile latency |               unique_deployment_count_2_hours |    131.548       |      3.82721     |    -127.721       |     ms |   -97.09% |
|                                      100th percentile latency |               unique_deployment_count_2_hours |    135.695       |      6.89987     |    -128.795       |     ms |   -94.92% |
|                                  50th percentile service time |               unique_deployment_count_2_hours |    117.179       |      2.79249     |    -114.386       |     ms |   -97.62% |
|                                  90th percentile service time |               unique_deployment_count_2_hours |    121.86        |      3.29555     |    -118.565       |     ms |   -97.30% |
|                                  99th percentile service time |               unique_deployment_count_2_hours |    131.548       |      3.82721     |    -127.721       |     ms |   -97.09% |
|                                 100th percentile service time |               unique_deployment_count_2_hours |    135.695       |      6.89987     |    -128.795       |     ms |   -94.92% |
|                                                    error rate |               unique_deployment_count_2_hours |      0           |      0           |       0           |      % |     0.00% |
|                                                Min Throughput |              unique_deployment_count_24_hours |      4.26713     |    308.831       |     304.563       |  ops/s | +7137.44% |
|                                               Mean Throughput |              unique_deployment_count_24_hours |      4.27294     |    308.831       |     304.558       |  ops/s | +7127.59% |
|                                             Median Throughput |              unique_deployment_count_24_hours |      4.27199     |    308.831       |     304.559       |  ops/s | +7129.19% |
|                                                Max Throughput |              unique_deployment_count_24_hours |      4.28726     |    308.831       |     304.543       |  ops/s | +7103.46% |
|                                       50th percentile latency |              unique_deployment_count_24_hours |    231.968       |      2.71129     |    -229.257       |     ms |   -98.83% |
|                                       90th percentile latency |              unique_deployment_count_24_hours |    242.298       |      3.03284     |    -239.266       |     ms |   -98.75% |
|                                       99th percentile latency |              unique_deployment_count_24_hours |    253.986       |      3.7028      |    -250.283       |     ms |   -98.54% |
|                                      100th percentile latency |              unique_deployment_count_24_hours |    255.924       |      7.39493     |    -248.53        |     ms |   -97.11% |
|                                  50th percentile service time |              unique_deployment_count_24_hours |    231.968       |      2.71129     |    -229.257       |     ms |   -98.83% |
|                                  90th percentile service time |              unique_deployment_count_24_hours |    242.298       |      3.03284     |    -239.266       |     ms |   -98.75% |
|                                  99th percentile service time |              unique_deployment_count_24_hours |    253.986       |      3.7028      |    -250.283       |     ms |   -98.54% |
|                                 100th percentile service time |              unique_deployment_count_24_hours |    255.924       |      7.39493     |    -248.53        |     ms |   -97.11% |
|                                                    error rate |              unique_deployment_count_24_hours |      0           |      0           |       0           |      % |     0.00% |

@martijnvg martijnvg mentioned this pull request May 3, 2023
7 tasks
@jpountz jpountz deleted the cardinality_dynamic_pruning branch June 9, 2023 16:09
kkrik-es added a commit to kkrik-es/elasticsearch that referenced this pull request Sep 22, 2023
const_keyword fields don't show up in the leafReader, since they have
a const value. elastic#92060 modified the logic to return no results in case
the leaf reader contains no information about the requested field in a
cardinality aggregation. This is wrong for const_keyword fields, as they
contain up to 1 distinct value.

To fix this, we fall back to the old logic in this case that can
handle const_keyword fields properly.

Fixes elastic#99776
kkrik-es added a commit that referenced this pull request Sep 25, 2023
* Fix cardinality agg for const_keyword

const_keyword fields don't show up in the leafReader, since they have
a const value. #92060 modified the logic to return no results in case
the leaf reader contains no information about the requested field in a
cardinality aggregation. This is wrong for const_keyword fields, as they
contain up to 1 distinct value.

To fix this, we fall back to the old logic in this case that can
handle const_keyword fields properly.

Fixes #99776

* Update docs/changelog/99814.yaml

* Update skip ranges for broken releases.
kkrik-es added a commit to kkrik-es/elasticsearch that referenced this pull request Sep 25, 2023
* Fix cardinality agg for const_keyword

const_keyword fields don't show up in the leafReader, since they have
a const value. elastic#92060 modified the logic to return no results in case
the leaf reader contains no information about the requested field in a
cardinality aggregation. This is wrong for const_keyword fields, as they
contain up to 1 distinct value.

To fix this, we fall back to the old logic in this case that can
handle const_keyword fields properly.

Fixes elastic#99776

* Update docs/changelog/99814.yaml

* Update skip ranges for broken releases.
elasticsearchmachine pushed a commit that referenced this pull request Sep 25, 2023
* Fix cardinality agg for const_keyword

const_keyword fields don't show up in the leafReader, since they have
a const value. #92060 modified the logic to return no results in case
the leaf reader contains no information about the requested field in a
cardinality aggregation. This is wrong for const_keyword fields, as they
contain up to 1 distinct value.

To fix this, we fall back to the old logic in this case that can
handle const_keyword fields properly.

Fixes #99776

* Update docs/changelog/99814.yaml

* Update skip ranges for broken releases.
floragunncom pushed a commit to floragunncom/search-guard that referenced this pull request Mar 3, 2024
floragunncom pushed a commit to floragunncom/search-guard that referenced this pull request Mar 3, 2024
Labels: :Analytics/Aggregations, >enhancement, Team:Analytics, v8.9.0