ESQL: Load many fields column-at-a-time #141926
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
I'm adding an integration test now. Will see if there's real performance improvement too. It's likely it'll be in a fun query like
So, yes, there is some performance improvement: 6764 -> 5654. That's not much, but I didn't expect much. The load is 25122339ns -> 79986530ns. FWIW most of the time for that is spent in the top n operator, something I'll be looking into in a few weeks.
@martijnvg, a bunch of the time series tests fail in this PR. When I comment out the reproduces it.
```java
BlockLoader loader;
// TODO rework this bit of mutable state into something harder to forget
// Seriously, I've tripped over this twice.
```
I'll grab this in a follow-up change.
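One possible direction for that follow-up (purely a sketch; `LoaderSlot` is an invented name, not an existing Elasticsearch type) is to wrap the mutable loader field so that reading it before it is set fails loudly instead of silently using stale state:

```java
// Hypothetical sketch: wrap mutable state so it is harder to forget.
// LoaderSlot is an invented illustration, not part of the real codebase.
final class LoaderSlot<T> {
    private T value;

    void set(T value) {
        this.value = java.util.Objects.requireNonNull(value);
    }

    /** Fails fast instead of silently operating on a missing loader. */
    T get() {
        if (value == null) {
            throw new IllegalStateException("loader was never set");
        }
        return value;
    }
}
```

A bare field defaults to `null` and the mistake only surfaces far from its cause; a fail-fast accessor moves the error to the first misuse.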
Looked into this a bit. I'm not yet sure if the fix is to make a more robust/expensive form of `isDense`, or something else. Anyway, will follow up on this tomorrow.
@dnhatn, @martijnvg, @parkertimmins, and I met to talk about this. If you use Lucene's reader interfaces you can read duplicate doc ids. We really want the performance we can get by not being tolerant of duplicates. Specifically, we need that performance for the "first load" of fields. I'm going to block this PR on another one I'm starting now.
Blocked on #142055
Unblocked! @parkertimmins, could you have a look at this one? And, could you make a follow-up with unit tests for y'all's fancy
Thanks for adding the
```java
    boolean toInt,
    boolean binaryMultiValuedFormat
) throws IOException {
    if (docs.mayContainDuplicates()) {
```
I think we can go further even with duplicates, but Martijn or Parker can follow up on it.
Yes, we can look into removing these if statements in follow ups.
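The guard under discussion follows a pattern worth spelling out. Below is a simplified sketch of it; `Docs`, `DenseColumnReader`, and `tryRead` are invented stand-ins for illustration, not the real Elasticsearch `BlockLoader` API:

```java
// Illustrative sketch only: these names are simplified stand-ins,
// not the actual Elasticsearch BlockLoader interfaces.
import java.util.List;

interface Docs {
    List<Integer> docIds();

    // Whether the same doc id may appear more than once.
    boolean mayContainDuplicates();
}

final class DenseColumnReader {
    /**
     * Bulk-load values for the given docs, or return null to tell the
     * caller to fall back to the slower row-by-row path. The fast path
     * assumes strictly increasing doc ids, so any input that may
     * contain duplicates is rejected up front.
     */
    String tryRead(Docs docs) {
        if (docs.mayContainDuplicates()) {
            return null; // caller falls back to row-stride loading
        }
        return "dense-block-of-" + docs.docIds().size();
    }
}
```

Returning `null` rather than throwing keeps the decision cheap: the caller already has a row-stride path to fall back to, so the check costs one branch per page of docs.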
martijnvg
left a comment
Thanks Nik! One minor comment, LGTM 👍.
```java
 * </li>
 * </ul>
 */
boolean mayContainDuplicates();
```
A quick look indicates that most implementations return false here. Maybe have a default implementation that returns false?
I thought about it but figured it was kinder to make the implementer think about the choice when implementing.
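The trade-off both sides are weighing can be shown in a few lines. This is a hypothetical illustration (the interface names are invented), not the actual code:

```java
// Hypothetical illustration of the trade-off: a default method silently
// applies to every implementer, while an abstract method forces each new
// implementation to make an explicit choice at compile time.
interface DocsWithDefault {
    // Convenient: most implementations would just return false anyway.
    default boolean mayContainDuplicates() {
        return false;
    }
}

interface DocsExplicit {
    // Stricter: a new implementation cannot compile until its author
    // has decided whether duplicates are possible.
    boolean mayContainDuplicates();
}

final class UnsortedDocs implements DocsExplicit {
    @Override
    public boolean mayContainDuplicates() {
        return true; // a forced, deliberate decision
    }
}
```

With the default, an implementer who forgets the method silently inherits `false`, which is exactly the wrong outcome for an unsorted source; keeping it abstract turns that mistake into a compile error.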
…duplicates (elastic#142409) Adds a test to the es819 codec tests to verify the changes from elastic#141926. It checks that situations which require incoming docs to not contain duplicates return null from tryRead when passed docs with duplicates. Also updates DenseBinaryDocValues to return null if mayContainDuplicates.
In #141926 I deprecated the `AllReader` because we no longer need to make a `BlockLoader` work both row-by-row and column-at-a-time. Now it's fine for a `BlockLoader` to work in either mode. And `AllReader` was the tool that we used to support working both ways. So it can go! This removes it.
ESQL: Load many fields column-at-a-time

Adds support for `ColumnAtATimeReader` in the case where we're loading from many segments. This should marginally speed up loading many documents after a top n. More importantly, it lets #141672 kick in when loading from many fields. This should save significant memory when loading thousands of fields after a `SORT | LIMIT` sequence.

Finally, this changes the rules for `BlockLoader`. Previously you could return `null` from `columnAtATimeReader` but must never return `null` from `rowStrideReader`. Now the rule is that you may return `null` from either of the two, but not both. This should let us delete a bunch of code. While we're at it, we should add a `read(builder, docs, offset, nullsFiltered)` override to save a copy.
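The revised contract can be sketched as follows. All names here are invented stand-ins for illustration, not the real Elasticsearch types: either factory method may return `null`, but never both, and callers prefer the column-at-a-time path when it exists:

```java
// Hedged sketch (invented names) of the revised BlockLoader contract:
// either reader factory may return null, but not both.
interface ColumnReader { String mode(); }
interface RowReader { String mode(); }

interface BlockLoaderSketch {
    ColumnReader columnAtATimeReader(); // may be null
    RowReader rowStrideReader();        // may be null, unless the above is too
}

final class ReaderPicker {
    /**
     * Prefer bulk column loading, fall back to row-by-row, and reject
     * loaders that support neither mode (the new invariant).
     */
    static String pick(BlockLoaderSketch loader) {
        ColumnReader column = loader.columnAtATimeReader();
        if (column != null) {
            return column.mode(); // fast path: load whole columns at once
        }
        RowReader row = loader.rowStrideReader();
        if (row == null) {
            throw new IllegalStateException("loader must support at least one mode");
        }
        return row.mode();
    }
}
```

Centralizing the fallback in one caller is what lets the per-loader "support both modes" plumbing (like the deprecated `AllReader`) be deleted.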