Conversation

@yingsu00 (Contributor) commented Jun 21, 2020

We used Blocks' sizeInBytes or logicalSizeInBytes to estimate the max capacity of the BlockEncodingBuffers. However, there were errors in calculating the max capacity from decodedBlock.estimatedSerializedSizeInBytes: the exclusive portion (excluding children BlockEncodingBuffers) of the current BlockEncodingBuffer was mistakenly passed to the children BlockEncodingBuffers as the inclusive portion. Also, the max capacity for nested blocks was incorrectly calculated when they are RLE or Dictionary blocks. This PR fixes these two problems. With these fixes, the CPU time for the reported regressed query in T67972617 was reduced from 100s to 20s.

== NO RELEASE NOTE ==

@yingsu00 yingsu00 requested a review from a team June 22, 2020 07:19
@mbasmanova mbasmanova added the aria Presto Aria performance improvements label Jun 22, 2020
@mbasmanova (Contributor) left a comment

@yingsu00 Would you update PR description to describe the problem and the fix?

@yingsu00 (Contributor, Author)

@yingsu00 Would you update PR description to describe the problem and the fix?

hi @mbasmanova I just updated the PR message. Let me know if it explains your questions. Thanks!

@mbasmanova (Contributor) left a comment

@yingsu00 I don't understand this change. I'm seeing a new scale factor being introduced, but it is always 1 (1.0f). Would you share an example that illustrates the problem and show how this change fixes it? Would it be possible to code it into a test to avoid this being broken accidentally by future changes?

@yingsu00 (Contributor, Author)

@mbasmanova Hi Masha, the new scale factor childBlockEstimatedSerializedSizeScaleFactor is not always 1. It's passed to the child decodeBlock as childBlockEstimatedSerializedSizeScaleFactor * decodedBlock.getPositionCount() / dictionary.getPositionCount() for DictionaryBlock, and childBlockEstimatedSerializedSizeScaleFactor * decodedBlock.getPositionCount() for an RLE block. Take an RLE block over a VariableWidthBlock, for example: the top-level RLE block has 100 positions and the VariableWidthBlock has 1 position with a 10-byte value. logicalSizeInBytes for the RLE block is 1500 bytes = (10 bytes value + 4 bytes offset + 1 byte null) * 100, while logicalSizeInBytes for the VariableWidthBlock is just 15 bytes. We used logicalSizeInBytes as estimatedSerializedSize for the blocks, so the top-level RLE block got 1500 but the child got only 15. This estimatedSerializedSize of the child is then used to estimate the children buffer sizes, so they only got 15 bytes. To fix this, a scale factor of 100 (the RLE block's positionCount) is passed to decodeBlock, so that the estimatedSerializedSize for the VariableWidthBlock becomes 15 * 100 = 1500. Then in appendData, this 1500 is used as the estimate for the sliceBuffer and offsetsBuffer in VariableWidthBlockEncodingBuffer.

I will add some comments and tests to the code.
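The arithmetic above can be sketched as follows. This is a simplified size model, not the actual Presto Block API; the byte counts and names mirror the example in the comment (100 RLE positions, one 10-byte value):

```java
public class RleSerializedSizeSketch
{
    public static void main(String[] args)
    {
        int rlePositionCount = 100;  // positions in the top-level RLE block
        int valueBytes = 10;         // the single VariableWidthBlock value
        int offsetBytes = 4;         // per-position offset entry
        int nullBytes = 1;           // per-position null flag

        // logicalSizeInBytes of the RLE block: per-position cost times position count
        long rleLogicalSize = (long) (valueBytes + offsetBytes + nullBytes) * rlePositionCount;

        // logicalSizeInBytes of the child VariableWidthBlock: only one physical position
        long childLogicalSize = valueBytes + offsetBytes + nullBytes;

        // The fix: pass the RLE position count down as a scale factor so the
        // child's estimated serialized size reflects the expansion
        long scaleFactor = rlePositionCount;
        long childEstimatedSerializedSize = childLogicalSize * scaleFactor;

        System.out.println(rleLogicalSize);               // 1500
        System.out.println(childLogicalSize);             // 15
        System.out.println(childEstimatedSerializedSize); // 1500
    }
}
```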

@yingsu00 (Contributor, Author)

Actually, the above approach built the tree of DecodedBlockNode and added up the estimatedSerializedSizeInBytes in a bottom-up way, which makes it difficult to populate the correct sizes since RLE and Dictionary blocks are not leaf nodes. I'm thinking of doing this in a top-down manner so that the logical size of RLE and Dictionary blocks can be passed down to the children. I'll see if it makes the code easier.

@yingsu00 (Contributor, Author)

@mbasmanova I just realized I opened a can of worms. The current getLogicalSizeInBytes() is not 100% correct. Suppose there is an ArrayBlock of an RLEBlock of a VariableWidthBlock. The top-level ArrayBlock.getLogicalSizeInBytes() just returns getSizeInBytes() and doesn't consider whether the child block is an RLEBlock or not. Thus it could return a much smaller logical size than the actual one. To fix this, we need to override getLogicalSizeInBytes() in ArrayBlock, MapBlock and RowBlock. We also need to implement something like getLogicalRegionSizeInBytes() for all Blocks. I need to think about whether it's worthwhile to do all this. We have these options:

  1. Fix getLogicalSizeInBytes() and implement getLogicalRegionSizeInBytes() in all blocks.
  2. Implement a new getEstimatedSizeInBytes() in Block that does the correct size estimation.
  3. Live with the faulty logicalSizeInBytes. This could result in some CPU regression for rare cases like the one mentioned above.
  4. Revert the original fix that introduced estimated max capacity. This could result in a 20-30% memory increase.

What's your preference on this?
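The under-estimation described above can be illustrated with a toy size model (illustrative only; the byte counts are made up and this is not the real Block interface):

```java
public class NestedLogicalSizeSketch
{
    public static void main(String[] args)
    {
        int positionCount = 100;  // ArrayBlock rows, each holding one element
        int leafEntryBytes = 15;  // one VariableWidthBlock entry (value + offset + null)
        int arrayOverheadBytes = positionCount * (4 + 1); // ArrayBlock offsets + nulls

        // Physical size: the RLE child stores its single leaf entry once
        long rleSizeInBytes = leafEntryBytes;

        // Logical size: the entry counts once per covered position
        long rleLogicalSize = (long) leafEntryBytes * positionCount;

        // Buggy: ArrayBlock.getLogicalSizeInBytes() falls back to getSizeInBytes(),
        // so the RLE child contributes only its physical size
        long buggyLogicalSize = arrayOverheadBytes + rleSizeInBytes;

        // Fixed: the ArrayBlock asks its child for the logical (inflated) size
        long fixedLogicalSize = arrayOverheadBytes + rleLogicalSize;

        System.out.println(buggyLogicalSize); // 515
        System.out.println(fixedLogicalSize); // 2000
    }
}
```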

@mbasmanova (Contributor)

Here is how I'm thinking about this. #4 gets us back to stable state quickly. From there we can work on a new fix for memory usage. I'd start with that. Then, I'd consider #3. This requires running perf evaluation on a sample of production workload to see how big the regression is.

@mbasmanova mbasmanova requested a review from a team June 24, 2020 10:45
@yingsu00 (Contributor, Author)

Here is how I'm thinking about this. #4 gets us back to stable state quickly. From there we can work on a new fix for memory usage. I'd start with that. Then, I'd consider #3. This requires running perf evaluation on a sample of production workload to see how big the regression is.

@mbasmanova Thank you Masha. If we take option 4, what do you think the new fix for memory usage would be?

@mbasmanova (Contributor)

what would you think would be the new fix for memory usage?

I don't know off the top of my head.

@yingsu00 (Contributor, Author)

I tend to choose 1) and 3), since my latest tests show that if the max capacity is not underestimated, the CPU performance is not affected and it can generally save 20-30% of buffer memory. I'll see how much work is required to get 1) right.

@yingsu00 (Contributor, Author)

@mbasmanova Hi Masha, I actually fixed getLogicalSizeInBytes and added some tests. With these changes I saw a 5x CPU gain on the regressed query. I can add some more tests in TestBlockEncodingBuffers tomorrow. I also simplified DecodedBlockNode and put most of the logic in decodeBlock(). Appreciate your review again!

@mbasmanova (Contributor) left a comment

The Fix getLogicalSizeInBytes() for Blocks commit looks good modulo some comments.

Contributor

What's the motivation to have the default implementation? It seems incorrect to report region-size as region-logical-size.

Contributor Author

@mbasmanova For leaf blocks (i.e. non-Array/Map/Row/Dictionary/RLE blocks), logicalSizeInBytes is the same as sizeInBytes. See the following code:

/**
 * Returns the size of the block contents, regardless of internal representation.
 * The same logical data values should always have the same size, no matter
 * what block type is used or how they are represented within a specific block.
 *
 * This can differ substantially from {@link #getSizeInBytes} for certain block
 * types. For RLE, it will be {@code N} times larger. For dictionary, it will be
 * larger based on how many times dictionary entries are reused.
 */
default long getLogicalSizeInBytes()
{
    return getSizeInBytes();
}

Similarly, the regional logical size for leaf blocks is the same as the regional size. We have a default implementation here so that we don't have to implement the same thing in all leaf blocks.

Contributor

consider replacing comments with variable names, e.g.

  • Block arrayOfLong =
  • Block arrayOfRleOfLong =
  • Block arrayOfRleOfArrayOfLong =
    ...

Contributor Author

@mbasmanova I renamed the variables. However it's not as straightforward as the comment:

// Row(Dictionary(LongArrayBlock), Dictionary(Row(LongArrayBlock, LongArrayBlock)))
Block rowOfDictionaryOfLongAndDictionaryOfRowOfLongAndLong = ...

So I kept both the comments and the renamed variables.

@mbasmanova (Contributor) left a comment

Allow additional error margin for estimatedMaxCapacity

typo in commit message: graceFactorFordMaxCapacity -> graceFactorForMaxCapacity

Contributor

  • all caps with underscores
  • consider making this configurable

Contributor Author

@mbasmanova I will send a separate PR to make it configurable.

Contributor

This is a generic method that can be used in many places. However, the commit says that the change applies only to one specific use case. I'd expect the caller to apply this new factor when computing estimatedMaxCapacity.

  • use Math.toIntExact instead of (int)

Contributor Author

This is a generic method that can be used in many places. However, the commit says that the change applies only to one specific use case. I'd expect the caller to apply this new factor when computing estimatedMaxCapacity.

Moved the application of this new factor to setupDecodedBlockAndMapPositions() where the estimatedMaxCapacity is calculated.

  • use Math.toIntExact instead of (int)

It's actually casting a double to an int; Math.toIntExact only takes a long.
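For the record, the distinction can be shown with a small sketch (values are illustrative): Math.toIntExact only accepts a long, so converting the double result requires either a plain (int) cast or rounding to a long first.

```java
public class CastSketch
{
    public static void main(String[] args)
    {
        double targetBufferSize = 800.4;
        float graceFactor = 1.2f;

        double estimate = targetBufferSize * graceFactor; // roughly 960.48

        // (int) truncates the double silently
        int truncated = (int) estimate;

        // Math.toIntExact(long) throws on int overflow, but it needs a long,
        // so the double has to be rounded (or cast) to a long first
        int checked = Math.toIntExact(Math.round(estimate));

        System.out.println(truncated); // 960
        System.out.println(checked);   // 960
    }
}
```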

@mbasmanova (Contributor)

@yingsu00

With these changes I saw 5x CPU gain on the regressed query.

To clarify, is the query running 5x faster than before the regression? E.g. before the regression CPU time was N, after the regression it was 10N; is it now 0.2N or 2N?

@yingsu00 (Contributor, Author)

@yingsu00

With these changes I saw 5x CPU gain on the regressed query.

To clarify, is the query running 5x faster than before the regression? E.g. before the regression CPU time was N, after the regression it was 10N; is it now 0.2N or 2N?

Hi Masha, it is 2N.

@mbasmanova (Contributor)

@yingsu00 How much regression is left after this change?

@yingsu00 (Contributor, Author) commented Jul 1, 2020

@yingsu00 How much regression is left after this change?

@mbasmanova I tested on vll1_verifier1 and there is no regression any more.
Without optimized_repartitioning: 20200701_020416_00019_t29du PartitionedOutputOperator 8.9 min
With fixed optimized_repartitioning: 20200701_020453_00021_t29du PartitionedOutputOperator 2.93 min
With un-fixed optimized_repartitioning: 20200701_021836_00002_qpi5g PartitionedOutputOperator 2.95 min

@yingsu00 (Contributor, Author) commented Jul 1, 2020

@mbasmanova Masha, I still need to touch up the test for BlockEncodingBuffers a bit. I will update the PR tomorrow.

@mbasmanova (Contributor)

@mbasmanova Masha, I still need to touch up the test for BlockEncodingBuffers a bit. I will update the PR tomorrow.

@yingsu00 Thank you for the heads up.

@mbasmanova (Contributor)

@yingsu00 How much regression is left after this change?

@mbasmanova I tested on vll1_verifier1 and there is no regression any more.
Without optimized_repartitioning: 20200701_020416_00019_t29du PartitionedOutputOperator 8.9 min
With fixed optimized_repartitioning: 20200701_020453_00021_t29du PartitionedOutputOperator 2.93 min
With un-fixed optimized_repartitioning: 20200701_021836_00002_qpi5g PartitionedOutputOperator 2.95 min

I'm confused. fixed and un-fixed are the same: 2.93m vs. 2.95m. What is un-fixed here? Is it the version that used more memory than original repartitioning? E.g. the "fix" refers to fixing memory usage?

@yingsu00 (Contributor, Author) commented Jul 2, 2020

@yingsu00 How much regression is left after this change?

@mbasmanova I tested on vll1_verifier1 and there is no regression any more.
Without optimized_repartitioning: 20200701_020416_00019_t29du PartitionedOutputOperator 8.9 min
With fixed optimized_repartitioning: 20200701_020453_00021_t29du PartitionedOutputOperator 2.93 min
With un-fixed optimized_repartitioning: 20200701_021836_00002_qpi5g PartitionedOutputOperator 2.95 min

I'm confused. fixed and un-fixed are the same: 2.93m vs. 2.95m. What is un-fixed here? Is it the version that used more memory than original repartitioning? E.g. the "fix" refers to fixing memory usage?

@mbasmanova Hi Masha, yes, un-fixed refers to the version from early this year without the memory reduction fixes. Fixed means this PR + other CPU regression fixes + all previous memory reduction PRs. I can test the regressed version too (all previous memory reduction PRs but no CPU regression fixes).

@yingsu00 yingsu00 force-pushed the fixEstimatedSize branch from 71ce03c to 61f21f7 Compare July 2, 2020 12:41
@yingsu00 (Contributor, Author) commented Jul 2, 2020

@mbasmanova Hi Masha, I just updated the PR with the following changes:

  • Fixed a bug in RowBlockEncodingBuffer.setupDecodedBlockAndMapPositions() in 3861db42e4 Fix serialized size estimation in BlockEncodingBuffers where childrenEstimatedSerializedSizeInBytes was not added up.
  • Fixed TestMapBlock.test() in Fix getLogicalSizeInBytes() for Blocks
  • Added 47b093162d Add tests for max buffer capacity estimation
  • Added 22132d19fb Always make space for nullsBuffer and hashTablesBuffer
  • Moved the application of the new scale factor to setupDecodedBlockAndMapPositions() in 61f21f7f2c Allow additional error margin for estimatedMaxCapacity

Thank you very much for reviewing!

@mbasmanova (Contributor) left a comment

@yingsu00 LGTM.

Contributor

nit: perhaps, refactor to extract a helper method to avoid copy-paste

Contributor Author

@mbasmanova did you mean something like this?

setEstimatedNullsBufferMaxCapacity(getEstimatedBufferMaxCapacity(targetBufferSize, Byte.BYTES, POSITION_SIZE));
estimatedValueBufferMaxCapacity = getEstimatedBufferMaxCapacity(targetBufferSize, Byte.BYTES, POSITION_SIZE);

and in AbstractBlockEncodingBuffer:

protected static int getEstimatedBufferMaxCapacity(double targetBufferSize, int unitSize, int positionSize)
{
    return (int) (targetBufferSize * unitSize / positionSize * GRACE_FACTOR_FOR_MAX_BUFFER_CAPACITY);
}

Contributor

@yingsu00 Yes, this might reduce copy-paste and make it easier to read and ensure we don't forget GRACE_FACTOR_FOR_MAX_BUFFER_CAPACITY somewhere.

Contributor Author

@mbasmanova Hi Masha, I just updated the PR with a new commit e8511df636 Refactor buffer max capacity calculation. Thank you again, and happy long weekend!

@yingsu00 yingsu00 force-pushed the fixEstimatedSize branch from 61f21f7 to e8511df Compare July 2, 2020 23:02
Ying Su added 8 commits July 4, 2020 02:13
getLogicalSizeInBytes() was supposed to return the inflated sizes of the
blocks if they are DictionaryBlock or RunLengthEncodedBlock. However,
if the nested blocks are DictionaryBlock or RunLengthEncodedBlock,
the size was not correctly calculated. This commit fixes the issue.

When a block passed to OptimizedPartitionedOutputOperator is an RLE or
Dictionary block, we used to estimate the serialized size using
getLogicalSize(), which returns the size of the block after inflation.
However, the child block of the RLE or Dictionary block was using plain
sizeInBytes without considering that it is going to be expanded. This
commit fixes the problem by adding a scale factor that estimates how many
times the child blocks are going to be expanded.

Block.getSizeInBytes() and Block.getLogicalSizeInBytes() always add up
the sizes of the nulls buffer even if the block cannot contain nulls. When
estimating the max buffer capacity for BlockEncodingBuffers, we can also
leave space for the nullsBuffer and hashTablesBuffer. This does not
waste memory because the buffers are not actually allocated until blocks
with nulls or hash tables come in. It makes the buffer sizes
proportional to the blocks' logical sizes, and makes the code cleaner.

In "Enforce buffer size limits for BlockEncodingBuffer" we introduced
estimatedMaxCapacity such that the growth of the buffers beyond that
value becomes slower. However, the estimated max capacity is not always
100% accurate, and an underestimated value has a negative impact on
CPU performance. This commit gives the estimatedMaxCapacity some head
room by introducing GRACE_FACTOR_FOR_MAX_BUFFER_CAPACITY with a
default value of 1.2f.
@yingsu00 yingsu00 force-pushed the fixEstimatedSize branch from e8511df to 533fd29 Compare July 4, 2020 11:03
@yingsu00 (Contributor, Author) commented Jul 6, 2020

@mbasmanova Hi Masha, the tests now all passed. Thank you for reviewing!

@mbasmanova mbasmanova merged commit 6b51cbf into prestodb:master Jul 6, 2020