Refactor compound block types#140219
Conversation
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
…ound-block-types' into refactor-compound-block-types Conflicts: x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/AbstractDelegatingCompoundBlock.java
There was a problem hiding this comment.
LGTM!
There are several more functions that look like good refactoring targets, but have slightly different behavior. For example, AggregateMetricDoubleArrayBlock checks all sub blocks in areAllValuesNull, while the histogram blocks only check a single sub block. I think it's worth thinking through a bit more if the behavior for all of these is correct.
This is an interesting one. I think it essentially boils down to the question of whether a given component of the values the composite block represents (e.g. a histogram) is nullable: Can the component be null, even if the composite value it is part of is non-null?
For example for empty histograms sum, min and max will be null, but the BytesRef for the encoded histogram data will never be null.
That's why I think currently the implementations both for the histograms and AMD are correct:
For the histograms we delegate the null check to a component (the histogram bytes) which is known to be null if and only if the histogram is actually null.
For aggregate metric double, there is no such column. So we simply define "null" as all components being null.
This by the way implies that it is impossible to have an empty aggregate metric double containing no metrics.
If we want to generalize, I think we would need to give some meta information to the AbstractDelegatingCompoundBlock.
E.g. add a record for defining the layout of a composite block:
record Layout<B extends Block>(
List<Component<B>> components,
Function<Block[], B> constructor
){
enum Nullability {
NON_NULL,
NULLABLE
}
record Component<B>(String name, Nullability nullability, Function<B, Block> accessor) {}
}
And have each specialization return a singleton describing it's layout:
private static final Layout LAYOUT = new Layout<>(
List.of(
new Layout.Component<>("min", Layout.Nullability.NULLABLE, block -> block.minima),
new Layout.Component<>("max", Layout.Nullability.NULLABLE, block -> block.maxima),
new Layout.Component<>("sum", Layout.Nullability.NULLABLE, block -> block.sums),
new Layout.Component<>("count", Layout.Nullability.NON_NULL, block -> block.valueCounts),
new Layout.Component<>("zero_threshold", Layout.Nullability.NON_NULL, block -> block.zeroThresholds),
new Layout.Component<>("encoded_histogram", Layout.Nullability.NON_NULL, block -> block.encodedHistograms)
),
blocks -> new ExponentialHistogramArrayBlock(
(DoubleBlock) blocks[0],
(DoubleBlock) blocks[1],
(DoubleBlock) blocks[2],
(DoubleBlock) blocks[3],
(DoubleBlock) blocks[4],
(BytesRefBlock) blocks[5]
)
);
This way you can even get rid of getSubBlocks and have AbstractDelegatingCompoundBlock just operate on the layout. And it now has the meta information, such as nullability available directly to correctly implement areAllValuesNull / isNull.
We could even make it type-safe by providing specialized constructors for 2, 3, 4, 5.. sub components:
<C1 extends Block, C2 extends Block>
Layout(Component<C1> c1, Component<C2> c2, BiFunction<C1, C2, B> constructor) {
this(
List.of(c1, c2),
blocks -> constructor.apply(
(C1) blocks[0],
(C2) blocks[1]
)
);
}
|
@JonasKunz yeah, I agree about the different null behavior. I thought about doing something like that layout, but it seemed excessive for now. I think I'm okay with merging this as is, fixing |
* upstream/main: Add hook for blocking termination (elastic#133555) Delegate to ES93ScalarQuantizedVectorsFormat rather than copying behaviour (elastic#139834) Refactor compound block types (elastic#140219) Flush the rate buffer when the slice index changes (elastic#138856) Log linked project connection errors at debug during shutdown (elastic#140239) Periodic FIPS 140-3 buildkite pipelines (elastic#139909) ES|QL - Remove TERM function (elastic#139953) Fix name of started time field in shutdown status (elastic#139910) Drop `project_routing` from query params (elastic#140272) Fix flaky test: AllocationDecidersTests (elastic#140271) Add List Reindex API (elastic#140184) Expose _tier metadata attribute in ESQL (elastic#139894) Tweak TSDBRestEsqlIT#testTimeSeriesQuerying(...) (elastic#140210) Fix an OOM error when creating to many chained synonym graph token filter. (elastic#140026) Suppress Azure SDK error logs (elastic#139730) Rewritten integer sorts need to use SortedNumericSortField (elastic#139538) (elastic#139700) Adjust index versions for skippers for time series (elastic#139670) Fix host.name skippers index version range (elastic#139636) Remove BWC shim for a broken commit Fix index.mapping.use_doc_values_skipper defaults in serverless (elastic#139532)
TDigestArrayBlock, ExponentialHistogramArrayBlock, and AggregateMetricDoubleArrayBlock all work by composing several primitive blocks and delegating most operations to them. This PR abstracts that pattern out to a base class, which the three compound block types can implement. --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
TDigestArrayBlock,ExponentialHistogramArrayBlock, andAggregateMetricDoubleArrayBlockall work by composing several primitive blocks and delegating most operations to them. This PR abstracts that pattern out to a base class, which the three compound block types can implement.For a first step, I just pulled out the parts that worked identically in all three classes. There are several more functions that look like good refactoring targets, but have slightly different behavior. For example,
AggregateMetricDoubleArrayBlockchecks all sub blocks inareAllValuesNull, while the histogram blocks only check a single sub block. I think it's worth thinking through a bit more if the behavior for all of these is correct.