api: add support for a new compact float64 representation #336

arnehormann · 2024-07-19T21:38:43Z

Numbits can be constructed from float64 and can be losslessly converted back to float64.
Doing so required upgrading go.mod to 1.20.

The code is not used yet and still has to be integrated. I can try to work on that, but it will probably be faster if somebody else (Tim 😇 ) does it.

The current fast branchless conversion is possible due to a nerd-snipe of @Merovius. He also threw it at godbolt and gave it some scrutiny. Thanks, Axel!
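For readers who have not seen the idea before, here is a minimal sketch of one well-known order-preserving, branchless mapping between float64 and uint64. It is an assumption that Numbits works along these lines; `toNumbits` and `fromNumbits` are hypothetical names for illustration, not the API added in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// toNumbits maps a float64 to a uint64 whose unsigned order matches the
// numeric order of the floats (hypothetical helper, not the PR's code).
// Negative floats get all bits inverted; non-negative floats only get the
// sign bit flipped. The mask is built with an arithmetic shift, no branch.
func toNumbits(f float64) uint64 {
	u := math.Float64bits(f)
	mask := uint64(int64(u)>>63) | (1 << 63)
	return u ^ mask
}

// fromNumbits is the inverse: a clear top bit means the original float was
// negative (invert everything), a set top bit means it was non-negative.
func fromNumbits(n uint64) float64 {
	mask := uint64(int64(^n)>>63) | (1 << 63)
	return math.Float64frombits(n ^ mask)
}

func main() {
	for _, f := range []float64{-2.5, -1, 0, 1, 2.5} {
		n := toNumbits(f)
		fmt.Printf("%5.1f -> %016x -> %v\n", f, n, fromNumbits(n))
	}
}
```

Because the mapping is a bijection on the 64 bit patterns, every float64 (including NaN payloads) survives the round trip in this sketch.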

arnehormann · 2024-07-19T21:40:17Z

... resolves #334

arnehormann · 2024-07-19T21:41:54Z

Ugh. The tests fail because CI uses Go 1.19 although go.mod specifies 1.20. I need at least 1.20 for this code. Will try to sort it out tomorrow.

timbray · 2024-07-19T21:47:08Z

Thanks for this! I'm in the middle of implementing equals-ignore-case matching, which turns out to be a messy tangle of corner cases and Unicode weirdness. So I'm going to hold off on seriously studying this until I get that landed. I wonder if there is any cost to moving from Go 1.19 to 1.20? Hey @embano1, got an opinion? I see that some of the unit tests failed; I suppose this is the 1.19/1.20 problem.

timbray · 2024-07-19T21:53:37Z

One note to myself or whoever… we are going to need a decent benchmark to exhibit the runtime performance impact of moving from Q numbers to numbits. Right now our test that makes heaviest use of numeric matching is Benchmark_JsonFlattner_Evaluate_ContextFields.

Merovius · 2024-07-20T04:36:10Z

ISTM the only reason to use Go 1.20 here is that you use cmp.Compare - and only in a test, at that? I'll note that you could also copy the ten lines or so (potentially behind a build tag, with post Go 1.20 versions delegating to the standard library version).
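A minimal sketch of what such a copied helper could look like, assuming the semantics of the standard library's `cmp.Compare` (NaN sorts before every non-NaN value and equals itself). The names `ordered` and `compare` are hypothetical; the build-tag split mentioned above is not shown.

```go
package main

import "fmt"

// ordered mirrors the type set that the standard library's comparison
// helper accepts (hypothetical local copy for older toolchains).
type ordered interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
		~float32 | ~float64 | ~string
}

// compare returns -1, 0 or +1. A NaN is treated as less than any non-NaN
// value and as equal to another NaN.
func compare[T ordered](x, y T) int {
	xNaN, yNaN := x != x, y != y
	switch {
	case xNaN && yNaN:
		return 0
	case xNaN || x < y:
		return -1
	case yNaN || x > y:
		return +1
	}
	return 0
}

func main() {
	fmt.Println(compare(1, 2), compare("b", "a"), compare(3.5, 3.5))
}
```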

arnehormann · 2024-07-20T04:58:47Z

@Merovius good point. I'll change it.

arnehormann · 2024-07-20T05:59:38Z

Okay. Rebased on main, copied the compare code from cmp in Go 1.20, added a new NumbitsFromBinaryString func to sidestep problems with slice-to-array conversions not being available in Go 1.19, and replaced the tests' Compare invocations with those of the copied code. Parts of that might have been done more elegantly, but this seems to work.
Is something else required to restart the runners (e.g. a new commit instead of a full force-pushed replacement)?

arnehormann · 2024-07-20T06:13:49Z

Also @timbray: this is not integrated anywhere, yet. If you still require the "invalid UTF-8 bytes" property, the Numbits can be "compressed" after calling Normalize().
NaNs occupy about 1<<52 values at both the highest and lowest bit patterns (0 upward and ^0 downward). By default, Numbits preserves those patterns and makes them restorable (e.g. for usage in scenarios where NaN-boxing is used). For something directed at JSON, that's not needed. So you could drop NaNs (and even infinity) and subtract 1<<52 from the Numbits. And add it back before you put it into NumbitsFromXYZ.
Integration will be a lot faster when you do it, but I wanted to describe the how and I wanted to add the format itself.
Ping me for questions :-)
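The "compression" described above can be sketched as follows, assuming the common sign-flip/invert mapping for the underlying encoding. `toNumbits`, `compress` and `expand` are hypothetical names, and the sketch is only meaningful for finite inputs.

```go
package main

import (
	"fmt"
	"math"
)

// toNumbits is an assumed order-preserving mapping (not the PR's code).
func toNumbits(f float64) uint64 {
	u := math.Float64bits(f)
	return u ^ (uint64(int64(u)>>63) | (1 << 63))
}

// nanSpan is the size of the block at the low end that holds only negative
// NaNs and -Inf under the assumed mapping.
const nanSpan = 1 << 52

// compress shifts finite values down so the smallest finite float maps to 0.
// It must not be called for NaN or infinity.
func compress(n uint64) uint64 { return n - nanSpan }

// expand undoes compress before converting back to a float.
func expand(c uint64) uint64 { return c + nanSpan }

func main() {
	lo := compress(toNumbits(-math.MaxFloat64))
	hi := compress(toNumbits(math.MaxFloat64))
	fmt.Printf("lowest finite: %016x\nhighest finite: %016x\n", lo, hi)
}
```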

Oh, also... I'm calling it "Numbits". The pattern is obvious enough (at least in hindsight), it probably existed before. Still, I did not know it and did not search a lot. I just hope the name is not too far off... if required, we can rename it later to align with the rest of the world.

arnehormann · 2024-07-20T10:01:50Z

@timbray I made the required changes. Can you trigger the CI again? I haven't done that myself for GitHub Actions yet, but this looks like the relevant docs: https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs

codecov-commenter · 2024-07-20T17:47:34Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.64%. Comparing base (be1752d) to head (3251f29).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   96.59%   96.64%   +0.04%     
==========================================
  Files          19       20       +1     
  Lines        1940     1967      +27     
==========================================
+ Hits         1874     1901      +27     
  Misses         37       37              
  Partials       29       29

timbray · 2024-07-21T17:17:34Z

So what I think I should do is clone this branch and use that as a workspace to see if I can (a) integrate numbits with the finite-automaton code and (b) evaluate any performance differences. The Go benchmarking tool claims to track memory allocations, which is difficult; I wonder how accurate that is? I have not used that, will investigate. The result of replacing 14-byte with 8-byte quantities should be un-subtle. @arnehormann in your view is this now stable, or is there anything else you want to do before I grab it?

arnehormann · 2024-07-21T19:26:16Z

@timbray I am very very happy with this now and love the api. One thing that could be added is the proposal of the encoding Axel described, but that could also live in another file. And there might always be more tests - but the existing ones should cover pretty much everything including exotic edge cases.

And memory tracked by benchmarks is very usable and accurate.

It might help to wed number decoding very closely to this code. That can and should also happen externally, I chose an api that pretty much has no error cases and is a poster child of "make bad states not representable". But due to the no-error approach, it expects a float64 instead of a []byte of decimal utf8 number string currently. Or the binary version (post conversion) as [8]byte or string.
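As a small illustration of the allocation tracking mentioned above, `testing.Benchmark` can be called from an ordinary program and reports allocations alongside timing. This is a generic sketch; the measured loop and the names `sink` and `measure` are made up for the example and have nothing to do with Quamina's benchmarks.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the allocation below cannot be optimized away.
var sink []byte

// measure runs a tiny benchmark that allocates 1 KiB per iteration.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = make([]byte, 1024)
		}
	})
}

func main() {
	r := measure()
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

With `go test`, the same figures appear when a benchmark is run with `-benchmem` or calls `b.ReportAllocs()`.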

timbray · 2024-07-21T20:31:35Z

OK. The bad news is that, while the API is nice, Quamina needs very little of it. My thinking now is that when flattenJSON encounters a number, we will still use strconv.ParseFloat(), which is quite optimized. Then I would use NumbitsFromFloat64().Bytes() to get 8 (non-UTF8) bytes and try to run the automaton on that. Does that make sense?

arnehormann · 2024-07-21T22:40:16Z

That's exactly right. And if -0 is possible after parsing numbers, also Normalize(). And you have to store the Bytes() result in a variable before you can slice it with [:] (has to be addressable). If used in more places, there can also be a convenience func combining it. But it all inlines well and the assembly looks good. And I made it to be usable with more than JSON.
A major part of the reason is that I could vastly better test it with this API.
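The addressability point can be shown with a small stand-in. The type and functions below only imitate the shape of the API discussed here (a `Bytes()` method returning `[8]byte`); their bodies are assumptions, not the code from this PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Numbits is a stand-in type for illustration only.
type Numbits uint64

// NumbitsFromFloat64 uses an assumed order-preserving mapping.
func NumbitsFromFloat64(f float64) Numbits {
	u := math.Float64bits(f)
	return Numbits(u ^ (uint64(int64(u)>>63) | (1 << 63)))
}

// Bytes returns the value as a big-endian array (assumed byte order).
func (n Numbits) Bytes() [8]byte {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], uint64(n))
	return b
}

// keyFor returns the 8 bytes as a slice.
func keyFor(f float64) []byte {
	// NumbitsFromFloat64(f).Bytes()[:] does not compile: the array returned
	// by a call is not addressable, so it must be stored in a variable first.
	b := NumbitsFromFloat64(f).Bytes()
	return b[:]
}

func main() {
	fmt.Printf("% x\n", keyFor(1.5))
}
```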

timbray · 2024-07-21T23:36:12Z

It turns out that JSON allows -0 but Quamina patterns don't, because the Go JSON tokenizer doesn't. That's a bug, I think?

Hm, I guess we need a policy decision. If I make a pattern

{"x": [ -0 ]}

Should that match this event?

{"x": 0}

or vice versa?

The answer is not obvious to me. In Go (and most programming languages I think?) -0.0==0.0. But presumably if someone creates such a pattern, they care about the sign.

arnehormann · 2024-07-22T02:06:04Z

Concerning comparison (which is of higher interest to the matcher I think), both should be equal - at least according to https://en.m.wikipedia.org/wiki/Signed_zero ... concerning representation, they are distinct. But think of your blog post that started me on this - 1e1 vs 10 vs 10.0 etc are the same number.

timbray · 2024-07-22T16:41:40Z

OK, so my conclusion is that, since Go's built-in json tokenizer apparently doesn't support -0 and JSON itself doesn't support NaN, I never need to call Normalize()

arnehormann · 2024-07-22T18:49:21Z

sounds good. As long as it's for both paths, patterns and events.

ah-quant · 2024-08-06T15:27:53Z

Hi Tim, I just checked - that assumption is flawed, Normalize is still required (well, at least the -0 handling part of it is): https://go.dev/play/p/q0SdRoPJUMp

Only the compiler automatically converts -0 to 0; the runtime behavior is different.
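A short sketch of that difference, independent of the playground link above: the constant expression `-0.0` is an exact zero without a sign, while parsing the text `-0` at runtime yields a negative zero. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// constantNegZero returns the constant expression -0.0. Go constants are
// exact, so the sign is lost and this is plain positive zero.
func constantNegZero() float64 {
	return -0.0
}

// parsedNegZero parses "-0" at runtime, which keeps the sign bit.
func parsedNegZero() float64 {
	f, err := strconv.ParseFloat("-0", 64)
	if err != nil {
		panic(err)
	}
	return f
}

func main() {
	c, p := constantNegZero(), parsedNegZero()
	fmt.Println("sign bits:", math.Signbit(c), math.Signbit(p))
	fmt.Println("equal as numbers:", c == p)
	fmt.Println("equal as bit patterns:", math.Float64bits(c) == math.Float64bits(p))
}
```

The two values compare equal with `==` but differ in their bit patterns, which is why any bit-level encoding has to decide how to treat them.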

timbray · 2024-08-06T21:32:24Z

(should be getting back to this next week)

We'd need a unit test to show the potential problem with -0, because at the moment I don't think one can get through either Go's JSON reader or Quamina's custom parser.

timbray · 2024-08-09T23:28:48Z

OK, I have cloned Arne's branch and will replace qNumber with numbits and do some benchmarking.

I like numbits and it would be good if numeric comparisons had real 64-bit accuracy
Including numbits is going to complexify some central code paths, especially smallTable, which relies on an assumption that all the bytes are UTF-8 bytes and thus can't have values greater than 0xF4, allowing 0xF5 and 0xF6 to be used as sentinels and end-of-data markers. So, research goal 1: Determine how much complexity is added.
Assuming that the numbits basically work and don't lead to horrible complexifying, I'm going to create a benchmark with patterns that include multiple numeric matches, so we understand what we win/lose. My prediction is that numbits should be noticeably faster and burn less memory simply because they are 8 bytes long and qNumbers are 14, but I'm too old to believe in my own predictions.
The result of all this should show us the costs and benefits of numbits and allow us to make an evidence-based decision on whether to adopt.

timbray · 2024-08-16T03:04:08Z

Now I'm leaning against using numbits. First of all, as I think about redesigning smallTable to survive in the presence of non-UTF8 characters, I keep thinking of corner cases - this type's step and especially dStep are Q's tight central loop, they are very tiny and have excellent memory locality, and rely on stupid UTF-8 tricks, in particular the fact that the byte values higher than 0xF4 can’t appear. I’m not 100% on this and will keep thinking.

But there's another problem which is tiny and unlikely to happen that I don't know how to fix.

Suppose I have the rule

{"x": ["foo"] }

which is intended to match only JSON data that has a top-level "x":"foo" member. However, with numbits, this will also match:

{"x":-77936130622565622861113617120445728215253349187385567291125040555718211817982934064273708603570865020116281354366693784516070188526811668807680}

Because its numbits representation is [34 102 111 111 34 245 122 122] which can also be thought of as ['"','f','o','o','"',0xF5,'z','z']

(Don’t ask me how I found that number, it’s too painful. And I’ve never seen a binary search iterate 600 times before.)
[Edit: The reason I did that is that I was convinced I had a number where NumbitsFromFloat64(x).Float64() did not round-trip. But now I can't reproduce.]

The reason why? Because we use 0xF5, which can never appear in UTF-8, as an end-of-data signal. We add it to the automata for all the matching values and also to each value when we match it. It simplifies a lot of code because you never have to check whether you're at the end of the input data. But that byte value can of course appear in a Numbits.

The annoying thing is that this will almost certainly never happen, the Numbits that could possibly match strings are only these very huge negative values, this one is approximately -7.793613e+142. But it would be very hard to explain if a user managed to encounter it.
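The collision can be reproduced in a few lines, assuming the usual sign-flip/invert mapping and a big-endian byte layout (both are assumptions about Numbits, and all function names here are hypothetical). The sketch starts from the byte sequence quoted above and works backwards to the float.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// toNumbits and fromNumbits are an assumed order-preserving mapping.
func toNumbits(f float64) uint64 {
	u := math.Float64bits(f)
	return u ^ (uint64(int64(u)>>63) | (1 << 63))
}

func fromNumbits(n uint64) float64 {
	return math.Float64frombits(n ^ (uint64(int64(^n)>>63) | (1 << 63)))
}

// numbitsBytes returns the big-endian bytes of the encoded float.
func numbitsBytes(f float64) [8]byte {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], toNumbits(f))
	return b
}

// collidingFloat decodes the bytes `"foo"`, 0xF5, 'z', 'z' as a number.
func collidingFloat() float64 {
	key := []byte{'"', 'f', 'o', 'o', '"', 0xF5, 'z', 'z'}
	return fromNumbits(binary.BigEndian.Uint64(key))
}

func main() {
	f := collidingFloat()
	b := numbitsBytes(f)
	fmt.Printf("%e encodes to %q\n", f, b[:])
}
```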

My mind isn’t 100% made up. If I could work around the second problem here I'd be more motivated to go after the redesign-smallTable problem.

I will think some more.

ah-quant · 2024-08-16T09:23:02Z

This could be fixed with the strategy outlined by Axel in #334 (comment) and following.
A different encoding can make numbits utf8-safe.
By encoding numbits with base128, 10 bytes are used per numbit. base64 would use 11 bytes. If you need something from the standard lib and are averse to copying the encoding code from someplace else into quamina, you can also use #334 (comment) - probably slower than base128, but also in 10 bytes.
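The byte counts follow from the group sizes: 64 bits in 7-bit groups need ceil(64/7) = 10 bytes, and in 6-bit groups ceil(64/6) = 11. Below is a minimal sketch of a base128 encoding that keeps every byte below 0x80 (so the result is valid UTF-8) and preserves ordering. It is not Axel's code; `toBase128` and `fromBase128` are hypothetical names.

```go
package main

import "fmt"

// toBase128 splits a 64-bit value into ten 7-bit groups, most significant
// group first, so byte-wise comparison matches numeric comparison.
func toBase128(n uint64) [10]byte {
	var out [10]byte
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = byte(n & 0x7F)
		n >>= 7
	}
	return out
}

// fromBase128 reverses toBase128.
func fromBase128(b [10]byte) uint64 {
	var n uint64
	for _, c := range b {
		n = n<<7 | uint64(c)
	}
	return n
}

func main() {
	enc := toBase128(0x22666F6F22F57A7A)
	fmt.Printf("% x -> %016x\n", enc, fromBase128(enc))
}
```

Since no byte can reach 0xF5, sentinel values above 0xF4 stay unambiguous with an encoding of this shape.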

timbray · 2024-08-16T15:22:59Z

OK, that's sensible, I'll have a look. Staying inside UTF-8 bounds is a massive simplification. So we'd get full float64 precision and simultaneously shrink each numeric field from 14 bytes to 10. Sorry, you mentioned #334 twice, did you mean to mention something else the second time?

arnehormann · 2024-08-16T17:04:43Z

No, I did not. Sorry. I initially considered linking his follow-up comment, but base128 is the best choice here and probably not overly difficult to write - or copy and adapt from Axels playground link.

arnehormann · 2024-08-16T17:47:50Z

Ugh. That was wrong. The second link should have been https://pkg.go.dev/encoding/[email protected] - but I'd still prefer base128.

timbray · 2024-08-16T18:22:47Z

I agree. I just finished implementing and unit-testing Numbits.toUTF8(), stealing from Axel's code, and it seems flawless and fast. I think this is now going to be easy. Wow, that base128 text is ugly, but no human should ever see it.

timbray · 2024-08-16T18:41:25Z

Sometimes Numbits.Float64() can return NaN

func TestBrokenFLoat(t *testing.T) {
	var nb Numbits = 18445811164593300620
	f := nb.Float64()
	if math.IsNaN(f) {
		t.Error("NaN")
	}
}

Is this a problem? This Numbits was created with Numbits(rand2.Uint64()) - maybe that creates values that couldn't be generated by NumbitsFromFloat64() or NumbitsFromBytes()?

arnehormann · 2024-08-16T22:08:57Z

I don't think it's a problem. NaN and infinity are representable and can be created this way, but will never occur with valid json inputs and patterns.
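Under the assumed sign-flip/invert mapping (an assumption about how Numbits is laid out, with hypothetical helper names), the value from the test above lies beyond the encoding of +Inf, which is exactly the region that decodes to NaN:

```go
package main

import (
	"fmt"
	"math"
)

// toNumbits and fromNumbits are an assumed order-preserving mapping.
func toNumbits(f float64) uint64 {
	u := math.Float64bits(f)
	return u ^ (uint64(int64(u)>>63) | (1 << 63))
}

func fromNumbits(n uint64) float64 {
	return math.Float64frombits(n ^ (uint64(int64(^n)>>63) | (1 << 63)))
}

// sample is the value used in the test quoted above.
const sample uint64 = 18445811164593300620

func main() {
	posInf := toNumbits(math.Inf(1))
	fmt.Printf("+Inf encodes to %016x\n", posInf)
	fmt.Printf("sample is       %016x\n", sample)
	fmt.Println("sample above +Inf:", sample > posInf)
	fmt.Println("decodes to NaN:", math.IsNaN(fromNumbits(sample)))
}
```

A random uint64 lands in one of the two NaN regions only rarely, but nothing prevents it, whereas parsing valid JSON numbers never produces such a value.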

timbray · 2024-08-16T22:59:32Z

That's my belief too, so I won't let it worry me.

timbray · 2024-08-18T21:13:56Z

FYI: I have finished up replacing Quamina's 14-hex-byte qNumber construct with 10-byte numbits+base128. I still have a bunch of work to do on the README and docs before I can PR, but here are some numbers. I created a new benchmark, BenchmarkNumberMatching, that does simple number matching, of which half succeed. The proportion that succeeds is crucial, because to match, the code has to go through the whole 10 bytes to verify all match. Failed matching is much cheaper.

Using the Go benchmarking, here are before and after results.
Before:

BenchmarkNumberMatching-12    	 1586095	       741.6 ns/op	    2810 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 1541496	       752.8 ns/op	    2810 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 1554729	       746.1 ns/op	    2810 B/op	       6 allocs/op

After:

BenchmarkNumberMatching-12    	 2164224	       537.1 ns/op	    1596 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 2162497	       537.5 ns/op	    1596 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 2144856	       548.9 ns/op	    1596 B/op	       6 allocs/op

So, that's pretty good!

timbray · 2024-08-18T21:14:44Z

(Although TBH I don't like that each step is allocating this much memory, must investigate…)

timbray · 2024-08-28T21:30:08Z

Thanks for this idea, which turns out to have been very fruitful.

arnehormann force-pushed the numbits branch 2 times, most recently from 20e9f7c to 1ceb8b6 Compare July 20, 2024 05:51

arnehormann force-pushed the numbits branch 3 times, most recently from 877be83 to 6a81b27 Compare July 20, 2024 06:45

arnehormann force-pushed the numbits branch 2 times, most recently from 8d1175f to 87f4576 Compare July 20, 2024 13:22

arnehormann changed the title ~~Add support for a new compact float64 representation~~ api: add support for a new compact float64 representation Jul 20, 2024

arnehormann force-pushed the numbits branch from 87f4576 to 415335a Compare July 20, 2024 17:45

arnehormann force-pushed the numbits branch from 415335a to 14ea45c Compare July 20, 2024 17:52

embano1 force-pushed the numbits branch from 14ea45c to af097d7 Compare July 21, 2024 06:39

timbray mentioned this pull request Aug 18, 2024

numbits+base128 8-byte full-precision numbers #349

Merged

Merge branch 'main' into numbits

3251f29

timbray closed this Aug 28, 2024
