ES|QL query approximation: move sample correction to data node#144005
ES|QL query approximation: move sample correction to data node#144005jan-elastic merged 10 commits intomainfrom
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
luigidellaquila
left a comment
There was a problem hiding this comment.
LGTM, thanks @jan-elastic
As for the off-line discussion, if you have further fixes to add here, please do, I'll have a look
x-pack/plugin/esql/qa/testFixtures/src/main/resources/approximation.csv-spec
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/approximation.csv-spec
Outdated
Show resolved
Hide resolved
luigidellaquila
left a comment
There was a problem hiding this comment.
LGTM, thanks @jan-elastic!
As a follow-up, I'd suggest to add approximation to the Generative tests. See how we do it for unmapped fields, it should be more or less the same.
| // so return null instead. TODO: this criterion is not ideal, and should be revisited. | ||
| // Allow a little bit of numerical imprecision in the consistency check, which can happen | ||
| // due to round-off errors when aggregating zero-variance stats (e.g. AVG(x) BY x). | ||
| if (lower - 1e-12 * Math.abs(lower) <= bestEstimate && bestEstimate <= upper + 1e-12 * Math.abs(upper)) { |
There was a problem hiding this comment.
Is there is a specific calculation behind the choice of 1e-12 as tolerance? I guess we just consider it small enough?
Maybe we should track this TODO with an issue
There was a problem hiding this comment.
A double has 52 bits for the significant digits, so that's approx 10^-16 of relative precision.
10^-12 is a very small number, but still a lot larger than this precision.
|
|
||
| import java.util.List; | ||
|
|
||
| public class CountApproximateAggregatorFunction implements AggregatorFunction { |
There was a problem hiding this comment.
This is basically CountAggregatorFunction with Long -> Double.
Do we need to prevent this code duplication somehow?
There was a problem hiding this comment.
I don't have a strong opinion, apart from countMasked() and blockIndex() the rest of the code looks the same but the types are different, so we won't save much with a refactoring
There was a problem hiding this comment.
OK, I'll leave it like this. The one thing I can see doing is generating both from a StringTemplate file, like some other classes are.
|
The failing test is unrelated to my PR, and was already failing before this PR. So merging (bypassing rules). |
…elocations * upstream/main: (72 commits) [Test] Randomly disable sequence numbers in CcrTimeSeriesDataStreamsIT (elastic#143930) Fix AsyncSearchIndexServiceTests.testCircuitBreaker failure (elastic#144058) Refine GenerativeIT some more, this time with accounting for some added (elastic#144220) ESQL: Physical Planning on the Lookup Node (elastic#143707) Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:approximation.Approximate stats by with zero variance} elastic#144240 Trigger counter metrics in test for delta temporality measurements (elastic#144193) fix capabiltiy approximation_v3 (elastic#144230) [ci] Add PR pipeline for testing ipv6 and fix tests not working with ipv6 (elastic#140473) update (elastic#144095) Make from/to optional in TBUCKET when Kibana timestamp filter is present (elastic#144057) Extract reroute behavior from create-index request classes (elastic#144140) ESQL: Fix release build only failures (elastic#144122) ES|QL query approximation: move sample correction to data node (elastic#144005) Add indexing pressure tracking to OTLP endpoints (elastic#144009) Fix replica writes after _seq_no doc values are pruned (elastic#144180) allow tests to configure supportsLoadingConfig (elastic#144061) [ES|QL] Unmute testGiantTextFieldInSubqueryIntermediateResultsWithSort (elastic#144126) [ESQL][DOCS] Add CPS page (unpublished for moment) (elastic#144206) ESQL: Forbid "load" unmapped_fields for certain commands (elastic#144115) Add CCS Remote Views Detection (elastic#143384) ...
…ic#144005) * looser bounds * Move sample correction to data nodes. * Buckets equal best estimate for exact counts (-> zero-width CI). * update capability * ignore test * Don't round before divisions. * CSV test with approximation=false * looser CSV test bounds * Fix ApproximationTests * small refactor / renaming
…ic#144005) * looser bounds * Move sample correction to data nodes. * Buckets equal best estimate for exact counts (-> zero-width CI). * update capability * ignore test * Don't round before divisions. * CSV test with approximation=false * looser CSV test bounds * Fix ApproximationTests * small refactor / renaming
Currently, sample correction happens on the coordinator node. When the data nodes send exact stats, they send null buckets, which indicates no sample correction is needed. This works if all data nodes send sampled or exact stats, but fails when some send sampled stats and others exact stats. (The coordinator receives null and non-null buckets, aggregates them, gets a non-null total bucket, and corrects it for sampling.)
This is solved by moving the sampling to the data nodes. In the case of exact stats, all buckets equal to the exact stats are sent, which gives zero variance.
At the moment, the rounding also happens on the data nodes, leading to round-off errors for some stats. This will be solved in a follow-up PR.The rounding is still done on the coordinator node to minimize round-off errors. Therefore, a new aggregation
CountApproximateis introduced.