-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for better number representation #334
Comments
Should probably copy the code in, I don't know how long Go Playground posts last: // You can edit this code! import ( func Float64ToSortable(v float64) string { func main() { |
Hi @timbray, I'll gladly try. I might need a bit of help to find the best place in the existing code. And concerning go playground: I never saw something disappearing from there. I'm not a Quamina user (yet; it looks great! I'm interested in your protobuf support issue and can probably help a bit there, too). But your blog post sparked my intellectual curiosity. Thanks for your posts! Edit - if needed, you can also reach me at gmail with this login. |
I guess we never thought of that sort of approach because our automaton runs entirely on UTF-8 bytes and of course the bytes of your comparable are not UTF-8. Hmm... now I see a problem. We reserve a couple of byte values that can never appear in UTF-8 for a special meaning, the central core of Quamina's automaton is in small_table.go, have a look at that. |
smallTable is sufficiently non-obvious that it got a blog piece: https://www.tbray.org/ongoing/When/202x/2022/06/25/Small-Tables |
Yeah. It's a string, but not UTF8. And not NaN friendly. NaN might be smaller than negative infinity or larger than positive infinity. Depending on the binary representation. Which doesn't matter for JSON, though. |
For the state, there is some space. It might require reshuffling and will slow the approach down, but if we can map the NaN values into the machine states and normalize NaN to a single representation, this might still work. I don't yet know how well smallTable can be adapted... this is my first dive into the code. |
I think NaN and infinities aren't a problem, because all the numbers are read from JSON texts, which doesn't allow those values. |
I'm just now running out of time - dang. Offline is calling... I will try to work on this, but if you want to or if you have a great idea on how to integrate it, feel free. Finding my way through what's already here and finding the time to work on it will take me a while. |
No big hurry. The performance on numbers is, roughly, "good enough". But in a FA, the elapsed time is strongly related to the length of the representation, so this is worth doing. My intuition is that your idea will probably work, and the occurrence of our reserved byte values shouldn't get in the way. Also the unit-test suite is pretty strong so we'll probably discover any problems quickly. I have a couple of other things at the front of my queue so I'll leave this issue open for now in hopes that you get time to look into it. |
(Yes, I know you're away) I was thinking about this today and I think it should probably work. Yes, we have the two distinguished values byteCeiling 0xF6 and valueTerminator 0xF5, neither of which values can appear in UTF-8. Conventionally, when we make a byte-level dispatch table (see that smallTable blog) we auto-insert byteCeiling as a sentinel value. Since your comparable numbers can have bytes with values higher than 0xF5, when constructing tables, you might have to pick a better sentinel value, up to 0xFF I guess. Which leaves the problem of how to deal with byte 0xFF itself, which can appear in your numbers. I think a tiny piece of special-case code in smallTable can probably handle this. As for 0xF5, it's a sort of hack. For brevity, let's use "õ" (Unicode U+00F5) to represent it. When a Pattern looks like "foo", we actually store an automaton that matches Now, there's the issue of numeric range matching. Get yourself a big cup of coffee and look at https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ByteMachine.java#L1195 - note that the big comment lies, wherever it says '9' it means 'F'. It seems clear we should be able to do these with your comparable numbers, but thinking about it makes my head hurt so that's all for now. |
I'll note that a trivial order-preserving way to encode an arbitrary uint64 into a valid UTF-8 string is to use big endian base 128. Makes the string two bytes longer, but presumably that is not a problem for quamina. [edit] or, reading the thread more closely, perhaps it is? I haven't look at the architecture of quamina [/edit] |
Memory matters, and Quamina is OK for memory until someone adds a million (literally) numeric patterns to an instance, which hasn't happened yet but, based on my experience with its predecessor, probably will happen. Then we'd be very happy to be spending 8 bytes rather than 14 or 16. So @arnehormann's approach is interesting. But what's also interesting is the CPU cost, because the transformation has to be applied to the incoming numbers in the records being filtered, which is a very hot path. So if we get a choice between faster/smaller… it's not obvious. |
@timbray To be clear, when i say "two bytes longer", I mean "than what @arnehormann is doing". The suggestion was to do the same thing, but instead of just interpreting the 8 byte uint64 as a string, to encode it into 10 bytes that are valid UTF-8. You can go even one step further and encode them in Base 65536 into UTF-8. This is probably a close to (if not the) optimal way to encode arbitrary One step further still: @arnehormann's process maps even relatively small integer numbers into relatively large However, if memory and CPU time really matter at that level, I'd recommend doing away with encoding things into My suggestion of base 128 above was basically meant as a very easy workaround to essentially do the same thing that @arnehormann does, but solve the UTF-8 compatibility problem in a cheap and easy to understand way, at very little cost. |
@timbray this is what happens when one manages to nerd-snipe @Merovius - brilliant ideas, usually coupled with excellent execution. Changing the encoding will definitely free up the control bytes (illegal UTF-8 bytes) you need. Concerning variable length encoding, I'm skeptical. The whole point is to process everything as a sequence of single bytes. It would usually work, but afaik the automata depend on the byte position to also express the numerical magnitude. Which is why the Big-Endian encoding is used at all. Still, either Axel finds a way or he'll report back. Also for the encoding. The first byte in UTF-8 is different, after all. An optimal encoding has to consider that. And design wise, I was thinking of introducing a CompressedNumbits with a possibility to use an adapted version of Normalize which not only converts -0 and NaN but drops them and infinity all together and squashes the rest to the Lower part of the numeric range. Giving you a smaller number to encode which might in the end free up another byte... Well, either way: good for you. That Q Numbers blogpost will probably prove to be quite valuable :) |
Thought about it on my run just now. I don't think you can do better than 10 bytes, with a fixed-length encoding, if the result is supposed to be UTF-8. Because UTF-8 has a maximum storage density of 7 bit per byte (multi-byte code points "waste" more bits, to make it self-synchronizing and give it other nice properties) and we need to store 64 bit, we are going to need 9⅐ bytes of UTF-8 on average. A tiny bit less if we ditch NaN, but not a full bit. The good news is that if we're stuck with 10 bytes/70 bits anyways, we have 6 bits of entropy left to put compression information in, which is a ton (relatively speaking). For example, 6 bits can (by coincidence) store the bit length of the actual payload, implying that it should be possible to have a variable-length encoding that never uses more than 10 bytes (but may use significantly less). In fact, I'm 90% confident I have something suitable. And yes, @arnehormann, I'm aware of the order-restriction :) Coincidentally, this isn't the first time I needed order-preserving variable-length encodings. The bad news is, that I find it complex/interesting enough that I might need/want to write up an explanation, before posting it 🙃 Assuming it works. |
Okay, I think I have to throw in the towel, this is taking too much time :) Here's the idea, in case it helps dislodge something: First, let's handle positive numbers. We split the bits into exponent and mantissa. Encode both using base128 as above (i.e. using 7 bits per byte). That will yield 2 bytes for the exponent (11 bits = 7+4) and 8 bytes for the mantissa (52 bits = 7*7+3). So far, this sorts exactly the same, lexicographically, as taking Arne's encoding and stuffing it into base128. Also, from now on, to avoid confusion, "by7e" will mean "7 bit encoded into a byte", i.e. the top-bit is always cleared. The theory was, that a lot of the high bits of the mantissa won't be set. So we can truncate the leading zero by7es, as they don't contain information. But this breaks the lexicographic order: 2 will encode as [2], 128 will encode as [1,0], the latter is larger, but sorts before the former. To remedy that, we put the total length (minus one) of the encoded mantissa into the top-bits of the first by7e. n-1 is at most 7, so this needs three bits. We store n-1, because the first by7e is always there. Now, the first by7e of the mantissa will be greater, if the encoding is longer. In the example above, 2 will encode as Now, let's look at negative numbers. Negative numbers need to do two things: 1. the encoding of any negative number must sort lower than the encoding of any non-negative number and 2. the ordering of two negative numbers must be inverted. That's the fundamental trick behind Arne's encoding. We can do the same thing: If the number is negative, we invert every by7e and clear the highest bit of the first encoded by7e. Otherwise, we set the highest bit of the first encoded by7e. This description is a bit complicated, but I think it's not that hard to understand and theoretically should be pretty efficient to implement. Here is a bad implementation, though. It shows that the encoding works. However, I also discovered that I made a faulty assumption: In practice, the highest bytes of the mantissa usually are not zero. On the contrary: The more "integer" a float is, the higher the mantissa bits, which are set (unless it is a power of two, in which case the mantissa is 0). Which means that for >98% of integers in ±1M, the encoding uses the full 10 bytes and doesn't save significant space or time 🙃 I think that could still be saved, if one wanted to, by inverting the assumption and saying that the low bytes of the mantissa are zero and cutting those off. That means you don't even need the trick with the length: The way lexicographic ordering works, is that missing trailing elements are effectively assumed to be 0 anyways. However, I think that breaks with negative numbers, because there the trailing bytes should be interpreted to be 0xff, which doesn't jibe with the lexicographic ordering anymore. Perhaps that can be saved by instead storing the number of trailing zeros in the top bits - possibly inverted. I'm not sure. That needs some thinking and I shouldn't. Also, after these experiments, I'm no longer sure splitting exponent and mantissa is useful. The exponent will, for small numbers, still use two bytes anyways (because of the bias its stored with) and fixing that is bound to screw with the order. And if you end up cutting trailing bits from the mantissa, it doesn't matter if you do it before or after splitting. So, if the ordering thing with trailing zeros can be figured out, I would probably just try applying that to the entire Another random thought: If you encode NaN as the empty string, you lose the bijection, but make them order the same as In any case, the tl;dr: Probably just use Arne's encoding, but put it into 10 by7es :) |
OK, so I think the consensus is that we should try integrating Arne's Are there any disadvantages? I can't think of any… there's the underlying assumption that the FA construction and traversal code in Integrating |
@timbray I realized that the variable length version of numbits is simpler than I thought. As you have a termination marker with Here's a sketch that should be more compact for e.g. integers stored in float64 (not yet tested, though):
|
Interesting idea. I'm close to posting a PR with fixed-length numbers with numbits+base128. I'd want to look at a sample of real-life numbers (not that far from 0, relatively few significant digits), and see if there are, on average, a lot of trailing zeroes. |
If this works, it will always be same length or shorter than the regular base128ed numbits. In every case. It will never be longer. |
I ran the following test to study whether trailing 0 sequences are common in the kinds of numbers that typically show up in Quamina patterns: func TestTrailingZeroesInBasicNumbers(t *testing.T) {
basics := []float64{-0.1, 0, 0.1, 2, 3, 3.00001, 123456}
for _, basic := range basics {
nb := numbitsFromFloat64(basic)
fmt.Printf("nb %d\n", nb)
}
} The results:
I guess the explanation is that the kinds of numbers that humans typically use are closely tied to our base-10 world, while numbits track the base-2 internal representation. So there's no obvious mapping between the way numbers look to us and the numbits representation. |
but you're printing it in decimal. That won't help. Try %x instead of %d. |
Oh of course, you are correct.
|
Now I will definitely have a close look at the variable-width numbits. First, do they preserve ordering? Because we will have queries like [On reflection, I guess they should. But will fuzz to be sure.] |
They preserve ordering. They only rely on no more bytes being smaller or equal to a sequence of 0 bytes. Otherwise, you're good. |
Resolved in #349 |
Posted by @arnehormann, see https://go.dev/play/p/gCzCrd3X0w1
from https://www.tbray.org/ongoing/When/202x/2024/07/09/Q-Numbers#c1720758803.307064
The text was updated successfully, but these errors were encountered: