Skip to content

Add cast bingtile to/from bigint#14125

Merged
arhimondr merged 2 commits intoprestodb:masterfrom
jagill:tile_bigint
Mar 11, 2020
Merged

Add cast bingtile to/from bigint#14125
arhimondr merged 2 commits intoprestodb:masterfrom
jagill:tile_bigint

Conversation

@jagill
Copy link
Contributor

@jagill jagill commented Feb 20, 2020

Externally, tiles are encoded in a string of chars '0' to '3' called a
quadkey. Internally, Presto encodes a tile in 64 bits, represented by a
BIGINT. Storing a tile as a bigint is not only more space/cpu efficient
than storing it as a quadkey, but it also avoids the bucket-skew problem
caused by the non-uniform distribution of hash(quadkey) mod 2^k.

== RELEASE NOTES ==

General Changes
* Add `cast(tile AS bigint)` and `cast(bigint_value AS bingtile)` to encode/decode Bing tiles to/from bigints.  This is a more efficient storage format that also reduces bucket skew in some cases.

@jagill
Copy link
Contributor Author

jagill commented Feb 20, 2020

@mbasmanova

@jagill
Copy link
Contributor Author

jagill commented Feb 20, 2020

One question I have is whether we should think about versioning the encoding, now that it may be used to persist tiles. We have 13 bits left to play with...

@jagill
Copy link
Contributor Author

jagill commented Feb 27, 2020

cc @tdcmeehan

@tdcmeehan
Copy link
Contributor

tdcmeehan commented Feb 27, 2020

The purpose of these functions seems to be casting from the BingTile type to/from BIGINT. If that's the case, why not just add operators to cast to/from BIGINT instead of these functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this in a loop of say 1000 just to make sure we get good signal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is testing a query, the setup/teardown meant that doing 1000 (particularly for each level) was really slow (a couple minutes on my laptop). Is there a way to reduce the overhead, or is that kind of test length worth the coverage?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of 1000, any nontrivial positive number would probably work

@jagill
Copy link
Contributor Author

jagill commented Feb 27, 2020

@mbasmanova and I chatted about whether to view this as a cast, or to use an explicit function. Our thinking was that since this is not a canonical transformation, but rather a encoding/decoding, it made sense to make the function explicit. What are your thoughts on that?

@tdcmeehan
Copy link
Contributor

We do similar things already. For example, internally HLL is represented as Slice, and we support cast to VARBINARY--which is also an encoding/decoding. To me this seems like the same sort of thing, except the internal data type is a long rather than a Slice. So if we accept this premise, then I think cast is the way to go for consistency (both in terms of how it works for other types, and also it being a consistent convention in Presto).

@jagill
Copy link
Contributor Author

jagill commented Feb 28, 2020

@tdcmeehan Ok, I replaced the functions with cast operations, and made the tests iterate 20 random tiles at each level (fewer for levels 0, 1, 2 when there are less than 20 tiles).

@jagill
Copy link
Contributor Author

jagill commented Feb 28, 2020

Did you have any thoughts on versioning the tile serialization?

@tdcmeehan
Copy link
Contributor

Did you have any thoughts on versioning the tile serialization?

I think it makes sense to reserve some of the bits for versioning. It's understandable that someone would want to leverage this feature to save storage space as well.

@jagill jagill changed the title Add bing_tile_to/from_bigint functions Add cast bingtile to/from bigint Feb 28, 2020
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw new PrestoException(INVALID_FUNCTION_ARGUMENT,
throw new PrestoException(INVALID_CAST_ARGUMENT,

@jagill jagill force-pushed the tile_bigint branch 2 times, most recently from 08599ef to e1556a1 Compare March 2, 2020 14:23
@jagill
Copy link
Contributor Author

jagill commented Mar 2, 2020

@tdcmeehan I've addressed your comments by:

  1. Making the highest 5 bits a version, moving the zoom to the highest-of-32 bits. As a nice side benefit, this allows us to extend to zoom 27 by just changing a constant.
  2. Changing the error to INVALID_CAST_ARGUMENT
  3. Making an encoding unit test, which tests up to 1000 samples for each zoom level,
  4. Reducing the functional test to up to 5 samples per zoom level.

@arhimondr arhimondr self-requested a review March 10, 2020 14:50
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unknown Bing Tile encoding version: %s"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up: Currently BingTile.decode must create an object of BingTile. The main reason for calling BingTile.decode is to validate the tile stored in a long. For efficiency I would recommend to have a dedicated method, e.g.: BingTile.validate or smthng.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: public

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extract these 3 lines into something like testRoundTrip(BingTile tile)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this code really has to be fuzzed.

How about standard test cases:

  • Min zoom
  • Max zoom
  • Min x and y
  • Max x and y
  • Several different combinations of values in between

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is only supposed to verify the integration of the tile encoding (that is tested in the test above), and the function mechanism. Instead of fuzzing i would recommend adding just a few simple test cases to verify the integration is in place.

jagill added 2 commits March 10, 2020 12:13
Externally, tiles are encoded in a string of chars '0' to '3' called a
quadkey.  Internally, Presto encodes a tile in 64 bits, represented by a
BIGINT.  Storing a tile as a bigint is not only more space/cpu efficient
than storing it as a quadkey, but it also avoids the bucket-skew problem
caused by the non-uniform distribution of `hash(quadkey) mod 2^k`.
Java hashes longs by XORing the first 32 bits with the second 32 bits.
Hive assigns buckets based on this hash.  If you have 2^k buckets, you
only keep the lowest k bits of the hash.  Often, k is 9 to 12, and the
previous encoding did not have much entropy in those low bits.  The
resulting first 5 bits were the zoom XORed with bits 5-9 of the x.
If the partition has a constant zoom (very common) and the zoom is less
than 9, several combinations of these bits would be missing, which would
mean empty buckets.

Checking the distribution over 1024 buckets for the old and new hash
function, we get:

Method | min   | mean    | max   | stddev
old    | 12659 | 50313.4 | 89397 | 22831.2
new    | 48344 | 50313.4 | 52626 | 1031.6

The stddev drops by a factor of 20x.
@jagill
Copy link
Contributor Author

jagill commented Mar 10, 2020

@arhimondr I've address your comments, modulo setting up the separate validation function. As per our conversation, doing that work require either annoying duplicated code, or additional heap allocations. So we'll deal with that if there are performance issues associated with the current method.

@arhimondr arhimondr merged commit 3977cbf into prestodb:master Mar 11, 2020
@caithagoras caithagoras mentioned this pull request Mar 29, 2020
9 tasks
@jagill jagill deleted the tile_bigint branch April 24, 2020 13:22
@caithagoras caithagoras mentioned this pull request May 4, 2020
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants