@JonasKunz
Contributor

Adds a ValueExtractor and a ResultBuilder for exponential histograms. Sorting on histograms is not supported, as they don't have a natural order.

I really want to keep the implementation detail of how the sub-components of exponential histograms are laid out (e.g. the zeroThreshold being a separate column instead of part of the encoded BytesRef) local to the block, block-builder and block-loader. The layout comes from the FieldMapper and I don't want changes there to ripple through the entire ES|QL code base. For that reason I added functions to serialize/deserialize individual histograms to the block and builder and used those from the TopN ValueExtractor and ResultBuilder.

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v9.3.0 labels Oct 29, 2025
@JonasKunz JonasKunz marked this pull request as ready for review October 29, 2025 10:20
@JonasKunz JonasKunz requested a review from dnhatn October 29, 2025 10:20
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 29, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@JonasKunz JonasKunz marked this pull request as draft October 30, 2025 11:44
@JonasKunz
Contributor Author

This should wait on #137368

# Conflicts:
#	x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ExponentialHistogramArrayBlock.java
#	x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ExponentialHistogramBlock.java
#	x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ExponentialHistogramBlockAccessor.java
@JonasKunz JonasKunz marked this pull request as ready for review October 30, 2025 14:32
@JonasKunz
Contributor Author

#137368 is merged and conflicts have been fixed, so this one is ready for review again.

Member

@dnhatn dnhatn left a comment

I left some smaller comments, but LGTM. Thanks Jonas!

}

@Override
public void serializeExponentialHistogram(int valueIndex, SerializedOutput out, BytesRef scratch) {
Member

valueIndex should be position instead.

Contributor Author

The intended usage pattern for serializeExponentialHistogram is just like for getters on blocks:

for (int i = 0; i < block.getPositionCount(); i++) {
  for (int j = 0; j < block.getValueCount(i); j++) {
    block.serializeExponentialHistogram(block.getFirstValueIndex(i) + j, ...);
  }
}

So valueIndex instead of position is correct here.
Right now exponential histogram blocks do use the position directly as the valueIndex, but that will change once we support multi-values.

See also this comment: #133393 (comment)

I'll add a comment in ExponentialHistogramArrayBlock explaining the mapping of positions and valueIndices to positions in the sub-blocks.
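
The position-to-valueIndex mapping described in this reply can be illustrated with a small stand-alone sketch. `ToyBlock` and its fields are hypothetical and only mimic the getter names used in the loop above; this is not the Elasticsearch block implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a multi-valued block; not the Elasticsearch class. */
final class ToyBlock {
    private final long[] values;           // flat storage, addressed by valueIndex
    private final int[] firstValueIndexes; // values of position p live in [firstValueIndexes[p], firstValueIndexes[p + 1])

    ToyBlock(long[] values, int[] firstValueIndexes) {
        this.values = values;
        this.firstValueIndexes = firstValueIndexes;
    }

    int getPositionCount() {
        return firstValueIndexes.length - 1;
    }

    int getFirstValueIndex(int position) {
        return firstValueIndexes[position];
    }

    int getValueCount(int position) {
        return firstValueIndexes[position + 1] - firstValueIndexes[position];
    }

    long getLong(int valueIndex) {
        return values[valueIndex];
    }

    /** Visits every value by translating each position into its value indices. */
    static List<String> describe(ToyBlock block) {
        List<String> out = new ArrayList<>();
        for (int p = 0; p < block.getPositionCount(); p++) {
            for (int j = 0; j < block.getValueCount(p); j++) {
                int valueIndex = block.getFirstValueIndex(p) + j;
                out.add("position=" + p + " valueIndex=" + valueIndex + " value=" + block.getLong(valueIndex));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Position 0 holds three values, position 1 holds none, position 2 holds one.
        ToyBlock block = new ToyBlock(new long[] { 10, 11, 12, 20 }, new int[] { 0, 3, 3, 4 });
        describe(block).forEach(System.out::println);
    }
}
```

In this toy layout, position 2 maps to valueIndex 3, so passing a position where a valueIndex is expected would read the wrong entry as soon as any position holds more or fewer than one value.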

@Override
public void serializeExponentialHistogram(int valueIndex, SerializedOutput out, BytesRef scratch) {
long valueCount = valueCounts.getLong(valueCounts.getFirstValueIndex(valueIndex));
out.appendLong(valueCounts.getLong(valueCounts.getFirstValueIndex(valueIndex)));
Member

valueCounts.getLong(valueCounts.getFirstValueIndex(valueIndex)) -> valueCount?

Contributor Author

I think you are confusing this with the value count returned via getValueCount(position). This is a different value-count: It is the number of samples the histogram was generated from. I've added a comment to avoid this confusion.

* @param out
* @param scratch
*/
void serializeExponentialHistogram(int valueIndex, SerializedOutput out, BytesRef scratch);
Member

valueIndex -> position

Contributor Author

/**
* Abstraction to use for writing individual values via {@link #serializeExponentialHistogram(int, SerializedOutput, BytesRef)}.
*/
interface SerializedOutput {
Member

I'd prefer not to have these interfaces, but they're okay.

Contributor Author

I really don't want to expose the zero-threshold from the block, to prevent it from becoming a maintenance nightmare if we make changes to the disk format. That's why I want to keep the knowledge of its existence local to the block implementation.

I thought pulling in a direct dependency on the TopNEncoder would not be a good idea; that's why I added these interfaces in between. If you prefer, I can use the TopNEncoder directly in serializeExponentialHistogram in a follow-up.
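
As a rough illustration of the indirection discussed here, the sketch below shows a block writing its sub-components through a narrow output interface, so the caller never learns that the zero-threshold is a separate column. All types are hypothetical stand-ins: only `appendLong` appears in the reviewed diff, while `appendDouble`, `appendBytes`, and the `ByteBuffer`-backed adapter are assumptions made for this example.

```java
import java.nio.ByteBuffer;

/** Hypothetical narrow output abstraction; only appendLong is visible in the reviewed diff. */
interface ToySerializedOutput {
    void appendLong(long value);

    void appendDouble(double value); // assumed for illustration

    void appendBytes(byte[] bytes); // assumed for illustration
}

/** Hypothetical block that keeps its column layout private. */
final class ToyHistogramBlock {
    private final long[] valueCounts;      // number of samples each histogram was built from
    private final double[] zeroThresholds; // kept in a separate column, never exposed
    private final byte[][] encodedBuckets; // opaque encoded bucket data

    ToyHistogramBlock(long[] valueCounts, double[] zeroThresholds, byte[][] encodedBuckets) {
        this.valueCounts = valueCounts;
        this.zeroThresholds = zeroThresholds;
        this.encodedBuckets = encodedBuckets;
    }

    /** Writes one histogram; the order of sub-components is a private detail of the block. */
    void serialize(int valueIndex, ToySerializedOutput out) {
        out.appendLong(valueCounts[valueIndex]);
        out.appendDouble(zeroThresholds[valueIndex]);
        out.appendBytes(encodedBuckets[valueIndex]);
    }
}

/** Adapter standing in for whatever encoder the caller actually uses. */
final class BufferOutput implements ToySerializedOutput {
    final ByteBuffer buffer = ByteBuffer.allocate(256);

    @Override
    public void appendLong(long value) {
        buffer.putLong(value);
    }

    @Override
    public void appendDouble(double value) {
        buffer.putDouble(value);
    }

    @Override
    public void appendBytes(byte[] bytes) {
        // Length prefix so a reader knows how many bytes follow.
        buffer.putInt(bytes.length).put(bytes);
    }
}

final class SerializedOutputDemo {
    public static void main(String[] args) {
        ToyHistogramBlock block = new ToyHistogramBlock(
            new long[] { 42 },
            new double[] { 0.5 },
            new byte[][] { { 1, 2, 3 } }
        );
        BufferOutput out = new BufferOutput();
        block.serialize(0, out);
        System.out.println("bytes written: " + out.buffer.position());
    }
}
```

With this shape, a change to how the block stores its columns would only touch `ToyHistogramBlock.serialize` and its matching reader, at the cost of one extra interface between the block and the encoder.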

@JonasKunz JonasKunz enabled auto-merge (squash) November 6, 2025 09:15
@JonasKunz JonasKunz merged commit bcb861a into elastic:main Nov 6, 2025
34 checks passed
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json

* upstream/main:
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690
  [LTR] Fix feature display order when using explain. (elastic#137671)
  Remove extra RemoteClusterService instances in unit test (elastic#137647)
  Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669)
  Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233)
  Update bundled JDK to 25.0.1 (elastic#137640)
  resolve indices for prefixed _all expressions (elastic#137330)
  ESQL: Add TopN support for exponential histograms (elastic#137313)
  allows field caps to be cross project (elastic#137530)
  ESQL: Add exponential histogram percentile function (elastic#137553)
  Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636)
  Tighten on when THROTTLE decision can be returned (elastic#136794)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655
  Add a test for two little known conditional processor paths (elastic#137645)
  Extract a common ORIGIN constant (elastic#137612)
  Remove early phase failure in batched (elastic#136889)
  Returning correct index mode from get data streams api (elastic#137646)
  [ML] Manage AD results indices (elastic#136065)
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025

Labels

:Analytics/ES|QL AKA ESQL external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

3 participants