Add support for top-level arithmetic ops to TS|STATS#140135

Merged
felixbarny merged 49 commits into elastic:main from felixbarny:ts-binary-ops
Jan 13, 2026

@felixbarny
Member

@felixbarny felixbarny commented Jan 2, 2026

This is what's happening at a high level:

  • TranslateTimeSeriesAggregate now handles not only AggregateFunctions but all Functions, including BinaryScalarFunctions
  • Going into TranslateTimeSeriesAggregate, the aggregates are no longer split up into evals. The TranslateTimeSeriesAggregate rule now runs earlier in the optimizer (before ReplaceAggregateNestedExpressionWithEval and friends).
    • This enables adding all TimeSeriesAggregateFunctions to the first aggregation phase, without some TimeSeriesAggregateFunctions being placed in nested Evals.
    • It also ensures we can properly insert the default last_over_time function for expressions like foo + 1 or max(foo + 1), before the inner foo + 1 is extracted into an eval.
  • Nested expressions in the groupings of the TimeSeriesAggregate are still replaced with an eval to make time bucket handling easier.
  • Extracts the injection of the default last_over_time function out of TranslateTimeSeriesAggregate and into the analysis phase, so that InsertFromAggregateMetricDouble runs after the insertion of last_over_time. If the injection ran later, the last_over_time function couldn't be resolved for downsampling indices where metrics are of type aggregate_metric_double. The injection needs to run after field resolution because the choice of the default inner agg is type-dependent - we have a different strategy for histograms.
  • TimeSeriesGroupByAll has been moved from the initialize to the resolution phase of the analyzer - after InsertDefaultInnerTimeSeriesAggregate, so that it can take that into account. That also fixes a missing reference issue in the nested eval for queries like network.total_bytes_in * 8.
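
To make the default-insertion step concrete, here is a minimal Java sketch on a toy expression tree. The `Expr` records, the `insertDefault` helper, and the `TIME_SERIES_AGGS` set are hypothetical and are not the actual ES|QL classes. The sketch also ignores the type-dependent handling (such as histograms) described above. It only illustrates why the wrapping has to see the bare metric reference before `foo + 1` is extracted into an eval.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model only: these types are hypothetical and are NOT Elasticsearch classes.
final class DefaultInnerAggSketch {
    interface Expr {}
    record Metric(String name) implements Expr {}
    record Literal(long value) implements Expr {}
    record Call(String function, List<Expr> args) implements Expr {}
    record Binary(String op, Expr left, Expr right) implements Expr {}

    // Functions treated as per-time-series aggregates in this sketch (an assumed, incomplete set).
    static final Set<String> TIME_SERIES_AGGS = Set.of("last_over_time", "rate", "max_over_time");

    // Wraps every metric that is not already inside a time series aggregate with last_over_time.
    static Expr insertDefault(Expr e, boolean insideTimeSeriesAgg) {
        if (e instanceof Metric m) {
            return insideTimeSeriesAgg ? m : new Call("last_over_time", List.of(m));
        }
        if (e instanceof Binary b) {
            return new Binary(b.op(),
                insertDefault(b.left(), insideTimeSeriesAgg),
                insertDefault(b.right(), insideTimeSeriesAgg));
        }
        if (e instanceof Call c) {
            boolean inside = insideTimeSeriesAgg || TIME_SERIES_AGGS.contains(c.function());
            List<Expr> args = c.args().stream()
                .map(a -> insertDefault(a, inside))
                .collect(Collectors.toList());
            return new Call(c.function(), args);
        }
        return e; // literals are left unchanged
    }

    // Renders the tree as text; binary operands are not parenthesized in this sketch.
    static String render(Expr e) {
        if (e instanceof Metric m) return m.name();
        if (e instanceof Literal l) return Long.toString(l.value());
        if (e instanceof Binary b) return render(b.left()) + " " + b.op() + " " + render(b.right());
        Call c = (Call) e;
        return c.function() + "("
            + c.args().stream().map(DefaultInnerAggSketch::render).collect(Collectors.joining(", "))
            + ")";
    }

    public static void main(String[] args) {
        Expr bare = new Binary("+", new Metric("foo"), new Literal(1));
        Expr wrapped = new Call("max", List.of(bare));
        System.out.println(render(insertDefault(bare, false)));    // last_over_time(foo) + 1
        System.out.println(render(insertDefault(wrapped, false))); // max(last_over_time(foo) + 1)
    }
}
```

In this toy model, an explicit `last_over_time(foo)` is left untouched because the traversal tracks whether it is already inside a time series aggregate.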

Queries that are supported now but weren't before:

  • Bare metric (with group-by-all)
    • TS k8s | STATS network.cost
  • Group by all now supports post-processing
    • TS k8s | STATS network.cost | SORT network.cost
    • Previously, there was a bug that complained about missing references as the id of the alias changed
  • Top-level arithmetic operations between metric and scalar
    • TS k8s | STATS 10 + max(10 + network.total_bytes_in)
    • Also supports implicit last_over_time and group-by-all
    • TS k8s | STATS network.total_bytes_in * 8
  • Top-level arithmetic operations between metric and metric
    • Also supports implicit last_over_time and group-by-all
    • TS k8s | STATS in_n_out=network.eth0.rx + network.eth0.tx
    • TS k8s | STATS max(last_over_time(network.eth0.tx::double) / (last_over_time(network.eth0.tx::double) + last_over_time(network.eth0.rx::double)))
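
As a worked example of the two aggregation phases, the following Java sketch evaluates something shaped like `10 + max(10 + network.total_bytes_in)` over hand-made samples. The `Sample` record, the series names, and the values are all made up for illustration, and the sketch assumes a single time bucket with no groupings. It is not how the ES|QL compute engine is implemented.

```java
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical types and data, not Elasticsearch code.
final class TwoPhaseStatsSketch {
    record Sample(long timestamp, double value) {}

    // Phase 1 (per time series): the implicit last_over_time picks the most recent sample.
    // Assumes a non-empty sample list.
    static double lastOverTime(List<Sample> samples) {
        Sample latest = samples.get(0);
        for (Sample s : samples) {
            if (s.timestamp() > latest.timestamp()) {
                latest = s;
            }
        }
        return latest.value();
    }

    // Evaluates 10 + max(10 + last_over_time(metric)) across all series of one bucket.
    static double evaluate(Map<String, List<Sample>> seriesById) {
        double max = Double.NEGATIVE_INFINITY;
        for (List<Sample> samples : seriesById.values()) {
            double perSeries = 10 + lastOverTime(samples); // inner arithmetic, per series
            max = Math.max(max, perSeries);                // phase 2: max across series
        }
        return 10 + max;                                   // top-level arithmetic
    }

    public static void main(String[] args) {
        Map<String, List<Sample>> series = Map.of(
            "pod-a", List.of(new Sample(1, 100.0), new Sample(2, 120.0)),
            "pod-b", List.of(new Sample(1, 300.0), new Sample(2, 250.0)));
        System.out.println(evaluate(series)); // 270.0
    }
}
```

The result is 270.0 rather than 320.0 because the per-series step takes the latest value of each series (120 and 250), not the largest raw sample (300).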

closes #139570, #138702, #139580

Child PRs

PromQL support will be added in a follow-up:

@elasticsearchmachine elasticsearchmachine added the external-contributor and v9.4.0 labels Jan 2, 2026
@felixbarny felixbarny linked an issue Jan 2, 2026 that may be closed by this pull request
@dnhatn dnhatn self-requested a review January 2, 2026 19:37
@pabloem
Contributor

pabloem commented Jan 4, 2026

this is awesome. Thanks for tackling it @felixbarny

@astefan astefan requested a review from costin January 5, 2026 16:56
@dnhatn
Member

dnhatn commented Jan 5, 2026

@felixbarny Thank you for tackling this. I've spent quite some time on this. I think TranslateTimeSeriesAggregate should be a rule in the Analyzer, not an optimizer rule. As you found, it should execute before we substitute expressions around aggregations. However, I think the approach you proposed can be fragile, since we might scatter the substitutions for aggregations and groupings. I played with these rules and I think we can move TranslateTimeSeriesAggregate to the Analyzer. Are you okay to continue working on this issue? Otherwise, we can share the work.

@felixbarny
Member Author

> I think TranslateTimeSeriesAggregate should be a rule in the Analyzer, not an optimizer rule.

I guess this implies that TranslatePromqlToTimeSeriesAggregate will also need to be executed during analysis so that it can run before TranslateTimeSeriesAggregate.

When I asked @costin why we don't run the PromQL translation during analysis, this was his response:

  1. Preservation of Query Integrity: The optimization process must assume a valid input query and should not modify the abstract syntax tree (AST). This is critical to ensure that any validation failures are reported directly and accurately to the user, without obfuscation from query transformation.
  2. Node Self-Sufficiency and Output Alignment: Each node within the query structure must fully and explicitly describe its own output. The existing discrepancy between the output reported by the PromQL tree and its actual translation needs to be resolved. This resolution should be handled during the tree assembly phase.

> However, I think the approach you proposed can be fragile, since we might scatter the substitutions for aggregations and groupings.

Fair. Re-using ReplaceAggregateNestedExpressionWithEval but just for groupings was an attempt to simplify and reduce the surface area of the change. But I can try to do without it. I guess it'll still imply creating a nested eval for the time bucket so that both the first and the second pass groupings can re-use it. This will be very similar to what ReplaceAggregateNestedExpressionWithEval is doing, but isolated to just time bucket groupings. Is that what you had in mind?

What do you think of phasing the change so that, in the first step, the PromQL and time series aggregate translation still happens in the optimization phase, but runs early and doesn't use ReplaceAggregateNestedExpressionWithEval internally? We can then follow up and separately discuss whether to move it to the analysis phase, which should then also be a simpler and incremental change. I'm happy to keep working on it.

@dnhatn
Member

dnhatn commented Jan 6, 2026

> What do you think of phasing the change so that, in the first step, the PromQL and time series aggregate translation still happens in the optimization phase, but runs early and doesn't use ReplaceAggregateNestedExpressionWithEval internally? We can then follow up and separately discuss whether to move it to the analysis phase, which should then also be a simpler and incremental change. I'm happy to keep working on it.

++ Moving the PromQL and translate-time-series rules to the beginning of the logical optimizer rules is a good step toward moving them to the analysis phase.

@felixbarny felixbarny self-assigned this Jan 8, 2026
@felixbarny felixbarny marked this pull request as ready for review January 12, 2026 09:07
@felixbarny felixbarny requested a review from kkrik-es January 12, 2026 09:07
@elasticsearchmachine elasticsearchmachine added the needs:triage label Jan 12, 2026
@felixbarny felixbarny added the >enhancement and :StorageEngine/ES|QL labels Jan 12, 2026
@elasticsearchmachine
Collaborator

Hi @felixbarny, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine removed the needs:triage label Jan 12, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Contributor

@kkrik-es kkrik-es left a comment

I like the direction, and the logic seems a bit cleaner. Will leave it to Nhat to approve.

Contributor

@sidosera sidosera left a comment

This looks great! I also love that we move PromQL translation earlier in the chain.

I'm happy to accept to unblock, but would still love to hear Nhat's take when they get a chance.

Member

@dnhatn dnhatn left a comment

One question, but this looks great. Thanks Felix for all the iterations.

new PruneUnusedIndexMode(),
// after translating metric aggregates, we need to replace surrogate substitutions and nested expressions again.
// re-executing the next two rules is a relic of when time series aggregates were translated after surrogate substitution
// removing this would fail in ccs scenarios where the remote cluster is on an older version (caught by bwc tests)
Member

I don't think we need to call SubstituteSurrogateAggregations twice consecutively.

Member Author

This is how I had it initially, but omitting it caused bwc tests to fail with class cast exceptions between different block instances.
Maybe we can look for a solution after merging this PR?

Member

That works.

@felixbarny felixbarny merged commit e82b7ca into elastic:main Jan 13, 2026
36 checks passed
felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Jan 13, 2026
Examples of queries that are supported now:
* `network.bytes_in * 8`
* `network.eth0.rx + network.eth0.tx`
* `max(network.total_bytes_in) * 8`
* `network.total_bytes_in{cluster!="prod"} / network.total_bytes_in{cluster!="staging"}`

Follow-up from elastic#140135
felixbarny added a commit that referenced this pull request Jan 14, 2026
Examples of queries that are supported now:
* `network.bytes_in * 8`
* `network.eth0.rx + network.eth0.tx`
* `max(network.total_bytes_in) * 8`
* `network.total_bytes_in{cluster!="prod"} / network.total_bytes_in{cluster!="staging"}`

Follow-up from #140135
eranweiss-elastic pushed a commit to eranweiss-elastic/elasticsearch that referenced this pull request Jan 15, 2026
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
Examples of queries that are supported now:
* `network.bytes_in * 8`
* `network.eth0.rx + network.eth0.tx`
* `max(network.total_bytes_in) * 8`
* `network.total_bytes_in{cluster!="prod"} / network.total_bytes_in{cluster!="staging"}`

Follow-up from elastic#140135
Labels

>enhancement
external-contributor (Pull request authored by a developer outside the Elasticsearch team)
:StorageEngine/ES|QL (Timeseries / metrics / PromQL / logsdb capabilities in ES|QL)
Team:StorageEngine
v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ES|QL: Better validation for last_over_time
Arithmetic operation support in STATS

7 participants