
[Prototype & Benchmark] Support directly writing to output block builder for scalar functions#9638

Closed
wenleix wants to merge 2 commits into prestodb:master from wenleix:write2bb_bench

@wenleix wenleix commented Dec 28, 2017

Introduction

Today the return convention for scalar functions is always to return the value on the stack. The caller then either appends that value to the result BlockBuilder (for the outermost function call) or uses it to invoke another function (for an inner/nested call such as g(x) in f(g(x))).

While this return convention works well for primitive types, it's not optimal for structural types since it always has to copy the result block.

Proposed Solution

To address this inefficiency, one idea is to introduce a new return convention that writes directly to the output block builder. This requires two pieces of work:

A preliminary proof-of-concept of the InvocationAdapter can be found in commit wenleix@d0fb108
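The difference between the two return conventions can be sketched with a toy model. This is only an illustration under stated assumptions: ToyArrayBlockBuilder, doubledOnStack, and doubledDirect are hypothetical names, not Presto's BlockBuilder API or the code in this PR. The sketch counts how many elements are copied from an already-built result into the output.

```java
import java.util.Arrays;

// Toy stand-in for an output block builder: a growable int buffer.
// It is NOT Presto's BlockBuilder; every name here is hypothetical.
final class ToyArrayBlockBuilder {
    private int[] values = new int[16];
    private int size;
    int copiedElements; // elements copied in from an already-built result

    void writeInt(int v) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = v;
    }

    // Used by the return-on-stack path: copy a finished array into the builder.
    void appendCopy(int[] finished) {
        for (int v : finished) {
            writeInt(v);
        }
        copiedElements += finished.length;
    }

    int[] build() {
        return Arrays.copyOf(values, size);
    }
}

final class ReturnConventions {
    // Return-on-stack: the function materializes its result, the caller copies it.
    static int[] doubledOnStack(int[] input) {
        int[] result = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            result[i] = input[i] * 2;
        }
        return result;
    }

    // Direct write: the function appends each element to the caller's builder.
    static void doubledDirect(int[] input, ToyArrayBlockBuilder out) {
        for (int v : input) {
            out.writeInt(v * 2);
        }
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        ToyArrayBlockBuilder viaStack = new ToyArrayBlockBuilder();
        viaStack.appendCopy(doubledOnStack(input));
        ToyArrayBlockBuilder direct = new ToyArrayBlockBuilder();
        doubledDirect(input, direct);
        System.out.println(Arrays.toString(viaStack.build()) + " copied: " + viaStack.copiedElements);
        System.out.println(Arrays.toString(direct.build()) + " copied: " + direct.copiedElements);
    }
}
```

Both paths produce the same output; the direct-write path simply skips the intermediate array and the second pass over it.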

Benchmark Result

In this PR we prototype preliminary support for writing directly to the output block builder and benchmark the potential performance gain. The implementation is not complete: adapting a caller that expects a return-on-stack result to a callee that writes directly to a block requires more work.
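One possible shape for the missing adaptation is sketched below, purely as an assumption about how it could look. A callee that writes into a builder is wrapped so that a caller expecting a returned value can still use it, at the cost of a temporary builder. ToyInvocationAdapter, toStackReturn, and upperDirect are hypothetical names, and a List<String> stands in for a block builder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical adapter sketch; none of these names come from Presto.
final class ToyInvocationAdapter {
    // Wraps a direct-write callee so that it looks like a return-on-stack function.
    static <T> Function<T, String> toStackReturn(BiConsumer<T, List<String>> directWriteCallee) {
        return argument -> {
            List<String> scratch = new ArrayList<>(1); // temporary "builder"
            directWriteCallee.accept(argument, scratch);
            return scratch.get(0); // read the single result back out
        };
    }

    // A callee using the direct-write convention.
    static void upperDirect(String input, List<String> out) {
        out.add(input.toUpperCase(Locale.ROOT));
    }

    // A caller-side function that expects its argument as a plain value.
    static int length(String value) {
        return value.length();
    }

    public static void main(String[] args) {
        // f(g(x)) where g writes directly and f expects a value on the stack.
        Function<String, String> g = toStackReturn(ToyInvocationAdapter::upperDirect);
        System.out.println(length(g.apply("presto")));
    }
}
```

The temporary builder in the adapter is exactly the overhead that makes the direct-write convention most attractive for the outermost call, where no adaptation is needed.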

To see the potential performance gain, we add a fake array_identity function which copies the array (see the second commit), and benchmark its performance:

Benchmark                                        (name)  Mode  Cnt   Score   Error  Units
BenchmarkArrayIdentity.benchmark         array_identity  avgt   20  22.721 ± 1.298  ns/op
BenchmarkArrayIdentity.benchmark  array_identity_direct  avgt   20  10.557 ± 0.466  ns/op

We see about a 2x performance gain (22.7 ns/op vs. 10.6 ns/op) from writing directly to the output block builder instead of putting the result block on the stack and copying it into the final block.

This is based on #8747.


sopel39 commented Dec 29, 2017

Nice!
Have you also thought about reversing the ownership of objects returned by a scalar? Currently the object returned by a scalar is owned by the caller. If the object were still owned by the scalar, we could use mutable accumulators to store the scalar result. We would save on object allocation (and memory zeroing for Slice), and the JIT could probably optimize such code better.

Alternatively, for scalars, the returned object could be passed in as an out parameter.
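A minimal sketch of the out-parameter idea, assuming a toy 128-bit value in place of the Slice that backs a long decimal. ToyInt128, addAllocating, and addInto are hypothetical names used only for illustration.

```java
// Toy 128-bit unsigned value stored as two longs; a stand-in for the Slice
// behind a long decimal. All names here are hypothetical.
final class ToyInt128 {
    long low;
    long high;

    ToyInt128 set(long low, long high) {
        this.low = low;
        this.high = high;
        return this;
    }
}

final class OutParameterSketch {
    // Current style: a fresh object is allocated for every call.
    static ToyInt128 addAllocating(ToyInt128 a, ToyInt128 b) {
        ToyInt128 result = new ToyInt128();
        addInto(a, b, result);
        return result;
    }

    // Out-parameter style: the caller supplies the object that receives the result.
    // Safe when out aliases a or b, because all inputs are read before any write.
    static void addInto(ToyInt128 a, ToyInt128 b, ToyInt128 out) {
        long low = a.low + b.low;
        long carry = Long.compareUnsigned(low, a.low) < 0 ? 1 : 0; // unsigned overflow of the low word
        long high = a.high + b.high + carry;
        out.low = low;
        out.high = high;
    }

    public static void main(String[] args) {
        ToyInt128 accumulator = new ToyInt128().set(0, 0);
        ToyInt128 step = new ToyInt128().set(Long.MAX_VALUE, 0);
        for (int i = 0; i < 4; i++) {
            addInto(accumulator, step, accumulator); // no allocation inside the loop
        }
        System.out.println(accumulator.high + " " + Long.toUnsignedString(accumulator.low));
    }
}
```

Reusing one mutable object across calls is what removes the per-call allocation; the trade-off is that callers must not hold on to the object between calls.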


wenleix commented Jan 2, 2018

Thank you @sopel39 ! :)

  • Speaking of making the scalar function own the returned objects, the main benefit is to save object allocation, right? I think CachedInstanceBinder already helps with this. In typical usage, the cached instance (the function state) is a PageBuilder with the output type. The output is always appended to this PageBuilder, and the last block is sliced off as the return value.

    In summary, CachedInstanceBinder helps the performance with the following two aspects:

    • Instead of allocating many small chunks of memory (BlockBuilder), it does one big allocation (a PageBuilder) and slices blocks from it.
    • When a page is full and a new page is allocated (through PageBuilder.reset()), it can use the stats collected in PageBuilderStatus to pre-allocate the underlying blocks for future batches. This should reduce the chance that an underlying BlockBuilder gets resized (which requires copying the current data, etc.).

    It doesn't help with:

    • Avoiding the data copy into the output BlockBuilder.
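The pattern described above (one large allocation, results sliced out of it, sizing hints carried over on reset, and a remaining copy into the output) can be modeled roughly as follows. This is a toy under stated assumptions: ToyPooledArrayWriter and its methods are hypothetical and do not reproduce PageBuilder, PageBuilderStatus, or CachedInstanceBinder.

```java
import java.util.Arrays;

// Toy model of "one big allocation, slice results out of it".
// Hypothetical names; this is not Presto's PageBuilder.
final class ToyPooledArrayWriter {
    private int[] pool;
    private int used;

    ToyPooledArrayWriter(int expectedSize) {
        pool = new int[Math.max(1, expectedSize)];
    }

    // Models a function writing one array result into the pooled buffer.
    // Returns {offset, length}, a view on the pool rather than a new array.
    int[] append(int... result) {
        if (used + result.length > pool.length) {
            // Resizing copies everything written so far; good sizing avoids this.
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + result.length));
        }
        System.arraycopy(result, 0, pool, used, result.length);
        int[] view = {used, result.length};
        used += result.length;
        return view;
    }

    // The step pooling does not remove: copying a view into the final output.
    void copyViewTo(int[] view, int[] output, int outputOffset) {
        System.arraycopy(pool, view[0], output, outputOffset, view[1]);
    }

    // Start a new batch, sizing the pool from what the previous batch used.
    void reset() {
        pool = new int[Math.max(1, used)];
        used = 0;
    }

    int capacity() {
        return pool.length;
    }

    public static void main(String[] args) {
        ToyPooledArrayWriter writer = new ToyPooledArrayWriter(2);
        writer.append(1, 2);
        int[] view = writer.append(3, 4, 5);
        int[] output = new int[3];
        writer.copyViewTo(view, output, 0);
        System.out.println(Arrays.toString(output));
        writer.reset();
        System.out.println(writer.capacity());
    }
}
```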


sopel39 commented Jan 3, 2018

> Speaking of making scalar function owning the returning objects, the main benefit is to save object allocation right?

That's correct. Specifically, I was thinking about fast decimals, which are represented by a Slice at larger precision. Currently we return a new Slice for every scalar execution that involves long decimals. I remember doing a POC, and it improved decimal performance by a double-digit percentage.


wenleix commented Jan 5, 2018

@sopel39: With this new return convention, we can also support writing a Slice into the output block :). So an outermost function that returns a long decimal can avoid allocating a new Slice and get all the savings (and even avoid the copying).

For a function in the middle of chained calls (e.g. the g(x) in f(g(x))), the result is written to a temporary block, and the engine gets the Slice from that block, which might still incur allocation overhead. One way to avoid this is to introduce a calling convention that passes in BLOCK_INPUT and BLOCK_INDEX, similar to aggregation functions.
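To illustrate why a block-plus-index calling convention avoids the intermediate allocation, here is a rough sketch with a toy variable-width block. ToyVarWidthBlock, countByteFromValue, and countByteFromBlock are hypothetical names; the real convention would operate on Presto's Block types, which are not modeled here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy variable-width block: all entries share one byte[] plus an offsets array.
// Hypothetical; this is not Presto's Block interface.
final class ToyVarWidthBlock {
    private final byte[] bytes;
    private final int[] offsets; // entry i spans [offsets[i], offsets[i + 1])

    ToyVarWidthBlock(String... entries) {
        offsets = new int[entries.length + 1];
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int i = 0; i < entries.length; i++) {
            byte[] encoded = entries[i].getBytes(StandardCharsets.UTF_8);
            buffer.write(encoded, 0, encoded.length);
            offsets[i + 1] = buffer.size();
        }
        bytes = buffer.toByteArray();
    }

    int length(int position) {
        return offsets[position + 1] - offsets[position];
    }

    byte byteAt(int position, int index) {
        return bytes[offsets[position] + index];
    }

    // Materializes a copy of one entry: the allocation the block/index style avoids.
    byte[] copyEntry(int position) {
        return Arrays.copyOfRange(bytes, offsets[position], offsets[position + 1]);
    }
}

final class BlockPositionSketch {
    // Value style: the engine first materializes the entry, then calls the function.
    static int countByteFromValue(byte[] value, byte target) {
        int count = 0;
        for (byte b : value) {
            if (b == target) {
                count++;
            }
        }
        return count;
    }

    // Block/index style: the function reads straight from the block.
    static int countByteFromBlock(ToyVarWidthBlock block, int position, byte target) {
        int count = 0;
        for (int i = 0; i < block.length(position); i++) {
            if (block.byteAt(position, i) == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ToyVarWidthBlock block = new ToyVarWidthBlock("banana", "kiwi");
        System.out.println(countByteFromValue(block.copyEntry(0), (byte) 'a'));
        System.out.println(countByteFromBlock(block, 0, (byte) 'a'));
    }
}
```

Both functions compute the same answer; only the value-style call requires a per-entry copy before the function runs.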


stale bot commented Oct 4, 2019

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the task, make sure you've addressed reviewer comments, and rebase on the latest master. Thank you for your contributions!

@stale stale bot added the stale label Oct 4, 2019
@stale stale bot closed this Oct 11, 2019
@wenleix wenleix changed the title [Benchmark] Benchmark support directly writing to output block builder for scalar functions [Design] Support directly writing to output block builder for scalar functions Jun 16, 2020
@wenleix wenleix changed the title [Design] Support directly writing to output block builder for scalar functions [Prototype & Benchmark] Support directly writing to output block builder for scalar functions Jun 16, 2020