Add cast bingtile to/from bigint #14125
|
One question I have is whether we should think about versioning the encoding, now that it may be used to persist tiles. We have 13 bits left to play with... |
|
cc @tdcmeehan |
|
The purpose of these functions seems to be casting from the BingTile type to/from BIGINT. If that's the case, why not just add operators to cast to/from BIGINT instead of these functions? |
I would do this in a loop of say 1000 just to make sure we get good signal.
Since this is testing a query, the setup/teardown meant that doing 1000 (particularly for each level) was really slow (a couple minutes on my laptop). Is there a way to reduce the overhead, or is that kind of test length worth the coverage?
Instead of 1000, any nontrivial positive number would probably work
|
@mbasmanova and I chatted about whether to view this as a cast, or to use an explicit function. Our thinking was that since this is not a canonical transformation, but rather an encoding/decoding, it made sense to make the function explicit. What are your thoughts on that? |
|
We do similar things already. For example, internally HLL is represented as |
|
@tdcmeehan Ok, I replaced the functions with cast operations, and made the tests iterate over 20 random tiles at each level (fewer for levels 0, 1, and 2, where there are fewer than 20 tiles). |
|
Did you have any thoughts on versioning the tile serialization? |
I think it makes sense to reserve some of the bits for versioning. It's understandable that someone would want to leverage this feature to save storage space as well. |
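To make the versioning idea concrete, here is a minimal sketch of reserving some of the spare high bits for a version tag. Everything in it is hypothetical: the class name, the choice of 5 version bits, and the bit positions are illustrative assumptions, not the layout this PR adopts, and it throws `IllegalArgumentException` where Presto code would throw a `PrestoException`.

```java
final class VersionedTileEncodingSketch {
    // Hypothetical: use 5 of the spare bits, placed at the top of the long.
    static final int VERSION_BITS = 5;
    static final int VERSION_SHIFT = Long.SIZE - VERSION_BITS;
    static final long PAYLOAD_MASK = (1L << VERSION_SHIFT) - 1;
    static final int CURRENT_VERSION = 0;

    // Tags an already-encoded tile payload with a version number.
    static long withVersion(long payload, int version) {
        if (version < 0 || version >= (1 << VERSION_BITS)) {
            throw new IllegalArgumentException("Version out of range: " + version);
        }
        if ((payload & ~PAYLOAD_MASK) != 0) {
            throw new IllegalArgumentException("Payload overlaps the version bits");
        }
        return ((long) version << VERSION_SHIFT) | payload;
    }

    // Reads the version tag; the unsigned shift keeps the result non-negative.
    static int version(long encoded) {
        return (int) (encoded >>> VERSION_SHIFT);
    }

    // Returns the payload, rejecting any version this reader does not know.
    static long payload(long encoded) {
        int version = version(encoded);
        if (version != CURRENT_VERSION) {
            throw new IllegalArgumentException(
                    String.format("Unknown Bing Tile encoding version: %s", version));
        }
        return encoded & PAYLOAD_MASK;
    }

    public static void main(String[] args) {
        long encoded = withVersion(42L, CURRENT_VERSION);
        System.out.println(version(encoded));
        System.out.println(payload(encoded));
    }
}
```

With version 0 as the current version, values written before the version bits existed decode unchanged, provided their top bits were already zero.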
| throw new PrestoException(INVALID_FUNCTION_ARGUMENT, | |
| throw new PrestoException(INVALID_CAST_ARGUMENT, |
Force-pushed from 08599ef to e1556a1.
|
@tdcmeehan I've addressed your comments by:
|
"Unknown Bing Tile encoding version: %s"
follow up: Currently BingTile.decode must create a BingTile object. The main reason for calling BingTile.decode is to validate the tile stored in a long. For efficiency I would recommend having a dedicated method, e.g. BingTile.validate or something similar.
nit: extract these 3 lines into something like testRoundTrip(BingTile tile)
I'm not sure if this code really has to be fuzzed.
How about standard test cases:
- Min zoom
- Max zoom
- Min x and y
- Max x and y
- Several different combinations of values in between
This test is only supposed to verify the integration of the tile encoding (which is tested in the test above) with the function mechanism. Instead of fuzzing, I would recommend adding just a few simple test cases to verify the integration is in place.
Externally, tiles are encoded in a string of chars '0' to '3' called a quadkey. Internally, Presto encodes a tile in 64 bits, represented by a BIGINT. Storing a tile as a bigint is not only more space/cpu efficient than storing it as a quadkey, but it also avoids the bucket-skew problem caused by the non-uniform distribution of `hash(quadkey) mod 2^k`.
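For readers unfamiliar with quadkeys, the sketch below shows the standard Bing Maps construction: one digit per zoom level, where the corresponding bit of x contributes 1 and the bit of y contributes 2. The class and method names are illustrative, not taken from Presto's BingTile class.

```java
final class QuadKeySketch {
    // Builds the quadkey for tile (x, y) at the given zoom level,
    // most significant level first.
    static String toQuadKey(int x, int y, int zoom) {
        StringBuilder quadKey = new StringBuilder(zoom);
        for (int level = zoom; level > 0; level--) {
            int mask = 1 << (level - 1);
            char digit = '0';
            if ((x & mask) != 0) {
                digit += 1;
            }
            if ((y & mask) != 0) {
                digit += 2;
            }
            quadKey.append(digit);
        }
        return quadKey.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQuadKey(3, 5, 3)); // prints 213
    }
}
```

A quadkey needs one character per zoom level, whereas the BIGINT form is always 8 bytes.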
Java hashes longs by XORing the upper 32 bits with the lower 32 bits. Hive assigns buckets based on this hash. If you have 2^k buckets, you only keep the lowest k bits of the hash. Often, k is 9 to 12, and the previous encoding did not have much entropy in those low bits. The resulting first 5 bits were the zoom XORed with bits 5-9 of the x. If the partition has a constant zoom (very common) and the zoom is less than 9, several combinations of these bits would be missing, which would mean empty buckets.

Checking the distribution over 1024 buckets for the old and new encodings, we get:

Method | min   | mean    | max   | stddev
old    | 12659 | 50313.4 | 89397 | 22831.2
new    | 48344 | 50313.4 | 52626 | 1031.6

The stddev drops by a factor of about 22.
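The bucketing mechanism described here can be sketched as follows. This is a toy illustration under stated assumptions: it uses `Long.hashCode` and keeps the lowest k bits, as the description says, but the two sample inputs are made up to show the effect and are not the old or new tile encodings.

```java
import java.util.HashSet;
import java.util.Set;

final class BucketSkewSketch {
    // Long.hashCode(v) is (int) (v ^ (v >>> 32)); with 2^k buckets,
    // only the lowest k bits of that hash select the bucket.
    static int bucket(long value, int k) {
        return Long.hashCode(value) & ((1 << k) - 1);
    }

    // Counts how many distinct buckets a set of values lands in.
    static int distinctBuckets(long[] values, int k) {
        Set<Integer> seen = new HashSet<>();
        for (long value : values) {
            seen.add(bucket(value, k));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int k = 10; // 1024 buckets
        long[] lowBitsVary = new long[1000];
        long[] lowBitsFixed = new long[1000];
        for (int i = 0; i < 1000; i++) {
            lowBitsVary[i] = i;               // entropy in the low bits
            lowBitsFixed[i] = (long) i << 12; // low 12 bits always zero
        }
        System.out.println(distinctBuckets(lowBitsVary, k));  // prints 1000
        System.out.println(distinctBuckets(lowBitsFixed, k)); // prints 1
    }
}
```

The second input has 1000 distinct values but no variation in the bits that survive the mask, so every value lands in bucket 0. This is the extreme form of the skew shown in the table.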
|
@arhimondr I've addressed your comments, modulo setting up the separate validation function. As per our conversation, doing that work requires either annoying duplicated code or additional heap allocations. So we'll deal with that if there are performance issues associated with the current method. |