Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for better number representation #334

Closed
timbray opened this issue Jul 12, 2024 · 26 comments
Closed

Proposal for better number representation #334

timbray opened this issue Jul 12, 2024 · 26 comments

Comments

@timbray
Copy link
Owner

timbray commented Jul 12, 2024

Posted by @arnehormann, see https://go.dev/play/p/gCzCrd3X0w1
from https://www.tbray.org/ongoing/When/202x/2024/07/09/Q-Numbers#c1720758803.307064

@timbray
Copy link
Owner Author

timbray commented Jul 12, 2024

Should probably copy the code in, I don't know how long Go Playground posts last:

// You can edit this code!
// Click here and start typing.
package main

import (
"cmp"
"encoding/binary"
"fmt"
"math"
"strconv"
)

func Float64ToSortable(v float64) string {
const sign uint64 = 1 << 63
bin := math.Float64bits(v)
if bin < sign {
// positive values gain high-bit
bin |= sign
} else if bin != sign {
// negative values have to be iverted to be ordered.
// special case for -0 - it stays the same binary representation
// and becomes 0
bin = ^bin
}
var dst [8]byte
binary.BigEndian.PutUint64(dst[:], bin)
return string(dst[:])
}

func main() {
vals := []string{"-1.0e120", "-1.0", "-1e16", "1e16", "10000000000000000", "-1E16", "0", "-0.0", "1.0e16", "-0.1e-16", "0.1e-16"}
tests := make([]float64, len(vals))
sortable := make([]string, len(vals))
for i, v := range vals {
f, err := strconv.ParseFloat(v, 64)
if err != nil {
panic(fmt.Errorf("error test %d. with %q: %v", i, v, err))
}
tests[i] = f
sortable[i] = Float64ToSortable(f)
}
for i, ti := range tests {
for j, tj := range tests {
si, sj := sortable[i], sortable[j]
if cmp.Compare(ti, tj) != cmp.Compare(si, sj) {
fmt.Printf("Error: %q (%d) and %q (%d):\n\t%x is %v\n\t%x is %v\n", vals[i], i, vals[j], j, si, ti, sj, tj)
}
}
}
fmt.Println("no errors? All is good")
}

@arnehormann
Copy link
Collaborator

arnehormann commented Jul 12, 2024

Hi @timbray, I'll gladly try. I might need a bit of help to find the best place in the existing code. And concerning go playground: I never saw something disappearing from there.
If I don't manage to do it today, I might take about two weeks. If you lack the patience, you can take over.

I'm not a Quamina user (yet; it looks great! I'm interested in your protobuf support issue and can probably help a bit there, too). But your blog post sparked my intellectual curiosity. Thanks for your posts!

Edit - if needed, you can also reach me at gmail with this login.

@timbray
Copy link
Owner Author

timbray commented Jul 12, 2024

I guess we never thought of that sort of approach because our automaton runs entirely on UTF-8 bytes and of course the bytes of your comparable are not UTF-8. Hmm... now I see a problem. We reserve a couple of byte values that can never appear in UTF-8 for a special meaning, the central core of Quamina's automaton is in small_table.go, have a look at that.

@timbray
Copy link
Owner Author

timbray commented Jul 12, 2024

smallTable is sufficiently non-obvious that it got a blog piece: https://www.tbray.org/ongoing/When/202x/2022/06/25/Small-Tables

@arnehormann
Copy link
Collaborator

arnehormann commented Jul 12, 2024

Yeah. It's a string, but not UTF8. And not NaN friendly. NaN might be smaller than negative infinity or larger than positive infinity. Depending on the binary representation. Which doesn't matter for JSON, though.
Also, -0 gets autoconverted to 0 and the information is lost. Otherwise, it's binary-reversible back to the original float64 and it coincidentally has the wonderful property that the zero-value will never be created, so all bytes 0 means unknown or undefined.

@arnehormann
Copy link
Collaborator

For the state, there is some space. It might require reshuffling and will slow the approach down, but if we can map the NaN values into the machine states and normalize NaN to a single representation, this might still work. I don't yet know how well smallTable can be adapted... this is my first dive into the code.

@timbray
Copy link
Owner Author

timbray commented Jul 12, 2024

I think NaN and infinities aren't a problem, because all the numbers are read from JSON texts, which doesn't allow those values.

@arnehormann
Copy link
Collaborator

I'm just now running out of time - dang. Offline is calling... I will try to work on this, but if you want to or if you have a great idea on how to integrate it, feel free. Finding my way through what's already here and finding the time to work on it will take me a while.
But if you sketch something up and if are interested in my feedback, I will find the time to provide it. From my phone, probably.

@timbray
Copy link
Owner Author

timbray commented Jul 12, 2024

No big hurry. The performance on numbers is, roughly, "good enough". But in a FA, the elapsed time is strongly related to the length of the representation, so this is worth doing. My intuition is that your idea will probably work, and the occurrence of our reserved byte values shouldn't get in the way. Also the unit-test suite is pretty strong so we'll probably discover any problems quickly.

I have a couple of other things at the front of my queue so I'll leave this issue open for now in hopes that you get time to look into it.

@timbray
Copy link
Owner Author

timbray commented Jul 13, 2024

(Yes, I know you're away) I was thinking about this today and I think it should probably work. Yes, we have the two distinguished values byteCeiling 0xF6 and valueTerminator 0xF5, neither of which values can appear in UTF-8. Conventionally, when we make a byte-level dispatch table (see that smallTable blog) we auto-insert byteCeiling as a sentinel value. Since your comparable numbers can have bytes with values higher than 0xF5, when constructing tables, you might have to pick a better sentinel value, up to 0xFF I guess. Which leaves the problem of how to deal with byte 0xFF itself, which can appear in your numbers. I think a tiny piece of special-case code in smallTable can probably handle this.

As for 0xF5, it's a sort of hack. For brevity, let's use "õ" (Unicode U+00F5) to represent it. When a Pattern looks like "foo", we actually store an automaton that matches "foo"õ (yes, with the quotes). Then when we start matching a JSON field value like "bar", we actually append an "õ" so it looks like "bar"õ. So matching code doesn't have to think about whether the match is at end of input on a pattern like *foo. So in practice you'd have to append a õ to the end of your comparable number.
A bit tricky, but it would make numeric equality matching run more or less twice as fast.

Now, there's the issue of numeric range matching. Get yourself a big cup of coffee and look at https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ByteMachine.java#L1195 - note that the big comment lies, wherever it says '9' it means 'F'. It seems clear we should be able to do these with your comparable numbers, but thinking about it makes my head hurt so that's all for now.

@Merovius
Copy link

Merovius commented Jul 20, 2024

I'll note that a trivial order-preserving way to encode an arbitrary uint64 into a valid UTF-8 string is to use big endian base 128. Makes the string two bytes longer, but presumably that is not a problem for quamina. [edit] or, reading the thread more closely, perhaps it is? I haven't look at the architecture of quamina [/edit]

@timbray
Copy link
Owner Author

timbray commented Jul 20, 2024

Memory matters, and Quamina is OK for memory until someone adds a million (literally) numeric patterns to an instance, which hasn't happened yet but, based on my experience with its predecessor, probably will happen. Then we'd be very happy to be spending 8 bytes rather than 14 or 16. So @arnehormann's approach is interesting.

But what's also interesting is the CPU cost, because the transformation has to be applied to the incoming numbers in the records being filtered, which is a very hot path. So if we get a choice between faster/smaller… it's not obvious.

@Merovius
Copy link

@timbray To be clear, when i say "two bytes longer", I mean "than what @arnehormann is doing". The suggestion was to do the same thing, but instead of just interpreting the 8 byte uint64 as a string, to encode it into 10 bytes that are valid UTF-8.

You can go even one step further and encode them in Base 65536 into UTF-8. This is probably a close to (if not the) optimal way to encode arbitrary uint64 into UTF-8. This has the advantage of mapping smaller uint64 to shorter strings.

One step further still: @arnehormann's process maps even relatively small integer numbers into relatively large uint64 (because the fractional part always represents a number in [0,1), anything larger than 1 has a non-zero exponent, which sets high bits in the uint64. AIUI). You can alleviate that effect, by encoding the exponent and fractional part separately into UTF-8. I think you can do significantly better than even that by using some variable-length encoding into Base 65536. I'd have to think a bit about that (and "have to" here probably means my brain won't let me not do it). In any case, I think if you really want to, there are still relatively simple, far more space-efficient encodings that do what you want. You should probably be able to store most practical numbers (say, integers in the range up to ±1000) in one or two bytes.

However, if memory and CPU time really matter at that level, I'd recommend doing away with encoding things into string to begin with. Even @arnehormann's approach uses more memory by a factor of more than 3, compared to leaving numbers as float64 and just comparing those. Because it needs to keep the string header around and puts stuff on the heap, which also needs some extra bookkeeping-space. And it would use float64 comparison instructions, which will be quicker than string-comparisons. But this would require a bunch of other changes to the architecture of quamina.

My suggestion of base 128 above was basically meant as a very easy workaround to essentially do the same thing that @arnehormann does, but solve the UTF-8 compatibility problem in a cheap and easy to understand way, at very little cost.

@arnehormann
Copy link
Collaborator

arnehormann commented Jul 20, 2024

@timbray this is what happens when one manages to nerd-snipe @Merovius - brilliant ideas, usually coupled with excellent execution. Changing the encoding will definitely free up the control bytes (illegal UTF-8 bytes) you need.

Concerning variable length encoding, I'm skeptical. The whole point is to process everything as a sequence of single bytes. It would usually work, but afaik the automata depend on the byte position to also express the numerical magnitude. Which is why the Big-Endian encoding is used at all.

Still, either Axel finds a way or he'll report back. Also for the encoding. The first byte in UTF-8 is different, after all. An optimal encoding has to consider that. And design wise, I was thinking of introducing a CompressedNumbits with a possibility to use an adapted version of Normalize which not only converts -0 and NaN but drops them and infinity all together and squashes the rest to the Lower part of the numeric range. Giving you a smaller number to encode which might in the end free up another byte...

Well, either way: good for you. That Q Numbers blogpost will probably prove to be quite valuable :)

@Merovius
Copy link

Thought about it on my run just now.

I don't think you can do better than 10 bytes, with a fixed-length encoding, if the result is supposed to be UTF-8. Because UTF-8 has a maximum storage density of 7 bit per byte (multi-byte code points "waste" more bits, to make it self-synchronizing and give it other nice properties) and we need to store 64 bit, we are going to need 9⅐ bytes of UTF-8 on average. A tiny bit less if we ditch NaN, but not a full bit.

The good news is that if we're stuck with 10 bytes/70 bits anyways, we have 6 bits of entropy left to put compression information in, which is a ton (relatively speaking). For example, 6 bits can (by coincidence) store the bit length of the actual payload, implying that it should be possible to have a variable-length encoding that never uses more than 10 bytes (but may use significantly less). In fact, I'm 90% confident I have something suitable. And yes, @arnehormann, I'm aware of the order-restriction :) Coincidentally, this isn't the first time I needed order-preserving variable-length encodings.

The bad news is, that I find it complex/interesting enough that I might need/want to write up an explanation, before posting it 🙃 Assuming it works.

@Merovius
Copy link

Okay, I think I have to throw in the towel, this is taking too much time :) Here's the idea, in case it helps dislodge something:

First, let's handle positive numbers. We split the bits into exponent and mantissa. Encode both using base128 as above (i.e. using 7 bits per byte). That will yield 2 bytes for the exponent (11 bits = 7+4) and 8 bytes for the mantissa (52 bits = 7*7+3). So far, this sorts exactly the same, lexicographically, as taking Arne's encoding and stuffing it into base128. Also, from now on, to avoid confusion, "by7e" will mean "7 bit encoded into a byte", i.e. the top-bit is always cleared.

The theory was, that a lot of the high bits of the mantissa won't be set. So we can truncate the leading zero by7es, as they don't contain information. But this breaks the lexicographic order: 2 will encode as [2], 128 will encode as [1,0], the latter is larger, but sorts before the former. To remedy that, we put the total length (minus one) of the encoded mantissa into the top-bits of the first by7e. n-1 is at most 7, so this needs three bits. We store n-1, because the first by7e is always there. Now, the first by7e of the mantissa will be greater, if the encoding is longer. In the example above, 2 will encode as [0<<5 + 2] = [2], while 128 will encode as [1<<5+1, 0] = [33, 0] and sort higher. As the mantissa has at most 52 bits, we need 3 for the length and have 8*7=56, this gives us an order-preserving encoding that never takes more than 8 by7tes.

Now, let's look at negative numbers. Negative numbers need to do two things: 1. the encoding of any negative number must sort lower than the encoding of any non-negative number and 2. the ordering of two negative numbers must be inverted. That's the fundamental trick behind Arne's encoding. We can do the same thing: If the number is negative, we invert every by7e and clear the highest bit of the first encoded by7e. Otherwise, we set the highest bit of the first encoded by7e.

This description is a bit complicated, but I think it's not that hard to understand and theoretically should be pretty efficient to implement. Here is a bad implementation, though. It shows that the encoding works. However, I also discovered that I made a faulty assumption: In practice, the highest bytes of the mantissa usually are not zero. On the contrary: The more "integer" a float is, the higher the mantissa bits, which are set (unless it is a power of two, in which case the mantissa is 0). Which means that for >98% of integers in ±1M, the encoding uses the full 10 bytes and doesn't save significant space or time 🙃

I think that could still be saved, if one wanted to, by inverting the assumption and saying that the low bytes of the mantissa are zero and cutting those off. That means you don't even need the trick with the length: The way lexicographic ordering works, is that missing trailing elements are effectively assumed to be 0 anyways. However, I think that breaks with negative numbers, because there the trailing bytes should be interpreted to be 0xff, which doesn't jibe with the lexicographic ordering anymore. Perhaps that can be saved by instead storing the number of trailing zeros in the top bits - possibly inverted. I'm not sure. That needs some thinking and I shouldn't.

Also, after these experiments, I'm no longer sure splitting exponent and mantissa is useful. The exponent will, for small numbers, still use two bytes anyways (because of the bias its stored with) and fixing that is bound to screw with the order. And if you end up cutting trailing bits from the mantissa, it doesn't matter if you do it before or after splitting. So, if the ordering thing with trailing zeros can be figured out, I would probably just try applying that to the entire uint64.

Another random thought: If you encode NaN as the empty string, you lose the bijection, but make them order the same as cmp.Compare (i.e. all NaNs compare equal and lower than all non-NaNs).


In any case, the tl;dr: Probably just use Arne's encoding, but put it into 10 by7es :)

@timbray
Copy link
Owner Author

timbray commented Jul 20, 2024

OK, so I think the consensus is that we should try integrating Arne's numbits representation and see how that does. If it works, we should expect two benefits: 1. use less memory (8bytes vs 14) and 2. run faster, because the automaton has to traverse maximum 8 states to match.

Are there any disadvantages? I can't think of any… there's the underlying assumption that the FA construction and traversal code in value_matcher.go and nfa.go and small_table.go is always processing UTF-8 bytes. It wouldn't surprise me if there's something there that I can't think of now that will break if that's not true. Only one way to find out.

Integrating numbits really shouldn't be a lot of work, but I'm going to wait until we land a couple of the now-outstanding PRs.

@arnehormann
Copy link
Collaborator

@timbray I realized that the variable length version of numbits is simpler than I thought. As you have a termination marker with f5, you just have to end when the remaining bytes are all 0.
Note: This relies on shorter sequences that are prefixes of longer ones to always be considered smaller.

Here's a sketch that should be more compact for e.g. integers stored in float64 (not yet tested, though):

func encodeFloat(f float64) []byte {
	u := math.Float64bits(f)
	mask := (u>>63)*^uint64(0) | (1 << 63)
	num := u ^ mask
	// conversions. NaNs are not handled because they are not valid JSON
	if num == (1 << 63) - 1 {
		// change -0.0 to 0.0 because it's equal in comparisons
		num = 1 << 63
	}
	// big endian conversion with termination mark included
	const terminationMark = 0xf5
	be128 := [11]byte {
		byte(num >> 57) & 0x7f, // 57 bits remaining
		byte(num >> 50) & 0x7f, // 50 bits remaining
		byte(num >> 43) & 0x7f, // 43 bits remaining
		byte(num >> 36) & 0x7f, // 36 bits remaining
		byte(num >> 29) & 0x7f, // 29 bits remaining
		byte(num >> 22) & 0x7f, // 22 bits remaining
		byte(num >> 15) & 0x7f, // 15 bits remaining
		byte(num >> 8) & 0x7f, //  8 bits remaining
		byte(num >> 1) & 0x7f, //  1 bit  remaining
		byte(num & 1) << 6, // last bit (high bit of lower 7 bits)
		terminationMark,
	}
	// look for cutoff point for variable length, terminal 0 bytes can be ignored.
	for i := len(be128) - 2; i >= 0; i-- {
		if be128[i] != 0 {
			// insert termination mark after last non-0 byte and return
			be128[i+1] = terminationMark
			return be128[:i+2]
		}
	}
	// all zero
	be128[0] = terminationMark
	return be128[:1]
}

@timbray
Copy link
Owner Author

timbray commented Aug 18, 2024

Interesting idea. I'm close to posting a PR with fixed-length numbers with numbits+base128.

I'd want to look at a sample of real-life numbers (not that far from 0, relatively few significant digits), and see if there are, on average, a lot of trailing zeroes.

@arnehormann
Copy link
Collaborator

If this works, it will always be same length or shorter than the regular base128ed numbits. In every case. It will never be longer.
Essentially, it is exactly equivalent to numbits in base128 encoding. Just additionally truncate all trailing 0 bytes because they do not influence comparisons. And if I got you right, you need the 0xf5 marker anyway.

@timbray
Copy link
Owner Author

timbray commented Aug 19, 2024

I ran the following test to study whether trailing 0 sequences are common in the kinds of numbers that typically show up in Quamina patterns:

func TestTrailingZeroesInBasicNumbers(t *testing.T) {
	basics := []float64{-0.1, 0, 0.1, 2, 3, 3.00001, 123456}
	for _, basic := range basics {
		nb := numbitsFromFloat64(basic)
		fmt.Printf("nb %d\n", nb)
	}
}

The results:

nb 4631501856787818085
nb 9223372036854775808
nb 13815242216921733530
nb 13835058055282163712
nb 13837309855095848960
nb 13837309877613847097
nb 13906592281785270272

I guess the explanation is that the kinds of numbers that humans typically use are closely tied to our base-10 world, while numbits track the base-2 internal representation. So there's no obvious mapping between the way numbers look to us and the numbits representation.

@arnehormann
Copy link
Collaborator

but you're printing it in decimal. That won't help. Try %x instead of %d.

@timbray
Copy link
Owner Author

timbray commented Aug 19, 2024

Oh of course, you are correct.

nb 4046666666666665
nb 8000000000000000
nb bfb999999999999a
nb c000000000000000
nb c008000000000000
nb c00800053e2d6239
nb c0fe240000000000

@timbray
Copy link
Owner Author

timbray commented Aug 19, 2024

Now I will definitely have a close look at the variable-width numbits. First, do they preserve ordering? Because we will have queries like {"range": [ ">", 0, "<="3]} - it's possible if a bit difficult to build an automaton that matches this.

[On reflection, I guess they should. But will fuzz to be sure.]

@arnehormann
Copy link
Collaborator

They preserve ordering. They only rely on no more bytes being smaller or equal to a sequence of 0 bytes. Otherwise, you're good.

@timbray
Copy link
Owner Author

timbray commented Sep 10, 2024

Resolved in #349

@timbray timbray closed this as completed Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants