api: add support for a new compact float64 representation #336

Closed
wants to merge 2 commits into from

arnehormann
Collaborator

Numbits can be constructed from float64 and can be losslessly converted back to float64.
Doing so required upgrading go.mod to 1.20.

The code is not used yet and still has to be integrated. I can try to work on that, but it will probably be faster if somebody else (Tim 😇 ) does it.

The current fast branchless conversion is possible due to a nerd-snipe of @Merovius. He also threw it at Godbolt and gave it some scrutiny. Thanks, Axel!
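
For readers who have not seen the trick, here is a minimal sketch of what a branchless, order-preserving float64 conversion can look like. The type and function names (`numbits`, `fromFloat64`, `toFloat64`) are placeholders rather than the PR's actual API, and the transform shown is an assumption about the approach, not a copy of the submitted code.

```go
package main

import (
	"fmt"
	"math"
)

// numbits is a hypothetical stand-in for the Numbits type in this PR.
type numbits uint64

// fromFloat64 maps a float64 to an integer whose unsigned order matches
// the numeric order of the floats, without branching.
func fromFloat64(f float64) numbits {
	u := math.Float64bits(f)
	// The arithmetic shift smears the sign bit: all ones for negative inputs,
	// zero otherwise. OR-ing in 1<<63 makes the mask flip every bit of a
	// negative value and only the sign bit of a non-negative one.
	mask := uint64(int64(u)>>63) | 1<<63
	return numbits(u ^ mask)
}

// toFloat64 undoes fromFloat64.
func (n numbits) toFloat64() float64 {
	u := uint64(n)
	// High bit set means the input was non-negative: flip the sign bit back.
	// High bit clear means it was negative: flip every bit back.
	mask := (u>>63 - 1) | 1<<63
	return math.Float64frombits(u ^ mask)
}

func main() {
	for _, f := range []float64{-1e300, -2.5, -1, 0, 1e-300, 1, 2.5, 1e300} {
		nb := fromFloat64(f)
		fmt.Printf("%-8g -> %016x -> %g\n", f, uint64(nb), nb.toFloat64())
	}
}
```

The hex column should grow monotonically with the input, which is the property that makes the representation usable as a sort or matching key.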

@arnehormann
Collaborator Author

... resolves #334

@arnehormann
Collaborator Author

Ugh. The tests fail because CI runs Go 1.19 although go.mod specifies 1.20. I need at least 1.20 for this code. Will try to sort it out tomorrow.

@timbray
Owner

timbray commented Jul 19, 2024

Thanks for this! I'm in the middle of implementing equals-ignore-case matching, which turns out to be a messy tangle of corner cases and Unicode weirdness. So I'm going to hold off on seriously studying this until I get that landed. I wonder if there is any cost to moving from Go 1.19 to 1.20? Hey @embano1, got an opinion? I see that some of the unit tests failed; I suppose this is the 1.19/1.20 problem.

@timbray
Owner

timbray commented Jul 19, 2024

One note to myself or whoever… we are going to need a decent benchmark to exhibit the runtime performance impact of moving from Q numbers to numbits. Right now our test that makes heaviest use of numeric matching is Benchmark_JsonFlattner_Evaluate_ContextFields.

@Merovius

ISTM the only reason to use Go 1.20 here is that you use cmp.Compare - and only in a test, at that? I'll note that you could also copy the ten lines or so (potentially behind a build tag, with post Go 1.20 versions delegating to the standard library version).
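
A rough sketch of what such a local copy could look like, for toolchains that predate the standard `cmp` package. The names `ordered` and `compare` are made up for this illustration; the NaN handling follows the documented behaviour of `cmp.Compare` (NaN sorts before every other value and equals itself).

```go
package main

import "fmt"

// ordered is a local constraint standing in for cmp.Ordered.
type ordered interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
		~float32 | ~float64 | ~string
}

// compare returns -1, 0 or +1. A NaN is treated as less than any
// non-NaN value and equal to another NaN.
func compare[T ordered](x, y T) int {
	xNaN := x != x // only true for floating-point NaN
	yNaN := y != y
	switch {
	case xNaN && yNaN:
		return 0
	case xNaN || x < y:
		return -1
	case yNaN || x > y:
		return +1
	}
	return 0
}

func main() {
	zero := 0.0
	nan := zero / zero // NaN, produced at run time

	fmt.Println(compare(1, 2), compare("b", "a"), compare(2.5, 2.5)) // -1 1 0
	fmt.Println(compare(nan, 1.0), compare(1.0, nan), compare(nan, nan)) // -1 1 0
}
```

With a build tag, a second file could delegate to the standard library on newer toolchains, as suggested above.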

@arnehormann
Collaborator Author

@Merovius good point. I'll change it.

@arnehormann arnehormann force-pushed the numbits branch 2 times, most recently from 20e9f7c to 1ceb8b6 Compare July 20, 2024 05:51
@arnehormann
Collaborator Author

arnehormann commented Jul 20, 2024

Okay. Rebased on main, copied the compare code from cmp in Go 1.20, added a new NumbitsFromBinaryString func to sidestep problems with slice-to-array conversions not being available in Go 1.19, and replaced the tests' Compare invocations with calls to the copied code. Parts of that might have been done more elegantly, but this seems to work.
Is something else required to restart the runners (e.g. a new commit instead of a full force-pushed replacement)?

@arnehormann
Collaborator Author

arnehormann commented Jul 20, 2024

Also @timbray: this is not integrated anywhere, yet. If you still require the "invalid UTF-8 bytes" property, the Numbits can be "compressed" after calling Normalize().
NaNs occupy roughly 1<<52 bit patterns at each end of the Numbits range (from 0 upward and from ^0 downward). By default, Numbits preserves those patterns and makes them restorable (e.g. for usage in scenarios where NaN-boxing is used). For something directed at JSON, that's not needed. So you could drop NaNs (and even infinity) and subtract 1<<52 from the Numbits. And add it back before you put it into NumbitsFromXYZ.
Integration will be a lot faster when you do it, but I wanted to describe the how and I wanted to add the format itself.
Ping me for questions :-)

Oh, also... I'm calling it "Numbits". The pattern is obvious enough (at least in hindsight), it probably existed before. Still, I did not know it and did not search a lot. I just hope the name is not too far off... if required, we can rename it later to align with the rest of the world.

@arnehormann arnehormann force-pushed the numbits branch 3 times, most recently from 877be83 to 6a81b27 Compare July 20, 2024 06:45
@arnehormann
Collaborator Author

@timbray I made the required changes. Can you trigger the CI again? I haven't done that myself with GitHub Actions yet, but this looks like the relevant docs: https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs

@arnehormann arnehormann force-pushed the numbits branch 2 times, most recently from 8d1175f to 87f4576 Compare July 20, 2024 13:22
@arnehormann arnehormann changed the title from "Add support for a new compact float64 representation" to "api: add support for a new compact float64 representation" Jul 20, 2024
@codecov-commenter

codecov-commenter commented Jul 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.64%. Comparing base (be1752d) to head (3251f29).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   96.59%   96.64%   +0.04%     
==========================================
  Files          19       20       +1     
  Lines        1940     1967      +27     
==========================================
+ Hits         1874     1901      +27     
  Misses         37       37              
  Partials       29       29              


@timbray
Owner

timbray commented Jul 21, 2024

So what I think I should do is clone this branch and use that as a workspace to see if I can (a) integrate numbits with the finite-automaton code and (b) evaluate any performance differences. The Go benchmarking tool claims to track memory allocations, which is difficult; I wonder how accurate that is? I have not used that, will investigate. The result of replacing 14-byte with 8-byte quantities should be un-subtle. @arnehormann in your view is this now stable, or is there anything else you want to do before I grab it?

@arnehormann
Collaborator Author

@timbray I am very very happy with this now and love the api. One thing that could be added is the proposal of the encoding Axel described, but that could also live in another file. And there might always be more tests - but the existing ones should cover pretty much everything including exotic edge cases.

And memory tracked by benchmarks is very usable and accurate.

It might help to wed number decoding very closely to this code. That can and should also happen externally; I chose an API that pretty much has no error cases and is a poster child of "make bad states not representable". But due to the no-error approach, it currently expects a float64 rather than a []byte holding a decimal UTF-8 number string. Or the binary version (post conversion) as [8]byte or string.

@timbray
Owner

timbray commented Jul 21, 2024

OK. The bad news is that, while the API is nice, Quamina needs very little of it. My thinking now is that when flattenJSON encounters a number, we will still use strconv.ParseFloat(), which is quite optimized. Then I would use NumbitsFromFloat64().Bytes() to get 8 (non-UTF8) bytes and try to run the automaton on that. Does that make sense?

@arnehormann
Collaborator Author

arnehormann commented Jul 21, 2024

That's exactly right. And if -0 is possible after parsing numbers, also Normalize(). And you have to store the Bytes() result in a variable before you can slice it with [:] (has to be addressable). If used in more places, there can also be a convenience func combining it. But it all inlines well and the assembly looks good. And I made it to be usable with more than JSON.
A major part of the reason is that I could vastly better test it with this API.

@timbray
Owner

timbray commented Jul 21, 2024

It turns out that JSON allows -0 but Quamina patterns don't, because the Go JSON tokenizer doesn't. That's a bug, I think?

Hm, I guess we need a policy decision. If I make a pattern

{"x": [ -0 ]}

Should that match this event?

{"x": 0}

or vice versa?

The answer is not obvious to me. In Go (and most programming languages I think?) -0.0==0.0. But presumably if someone creates such a pattern, they care about the sign.

@arnehormann
Collaborator Author

arnehormann commented Jul 22, 2024

Concerning comparison (which is of higher interest to the matcher I think), both should be equal - at least according to https://en.m.wikipedia.org/wiki/Signed_zero ... concerning representation, they are distinct. But think of your blog post that started me on this - 1e1 vs 10 vs 10.0 etc are the same number.

@timbray
Owner

timbray commented Jul 22, 2024

OK, so my conclusion is that, since Go's built-in json tokenizer apparently doesn't support -0 and JSON itself doesn't support NaN, I never need to call Normalize()

@arnehormann
Collaborator Author

sounds good. As long as it's for both paths, patterns and events.

@ah-quant

ah-quant commented Aug 6, 2024

Hi Tim, I just checked - that assumption is flawed, Normalize is still required (well, at least the -0 handling part of it is): https://go.dev/play/p/q0SdRoPJUMp

Only the compiler automatically converts -0 to 0; the runtime behavior is different.

@timbray
Owner

timbray commented Aug 6, 2024

(should be getting back to this next week)

We'd need a unit test to show the potential problem with -0, because at the moment I don't think one can get through either Go's JSON reader or Quamina's custom parser.

@timbray
Owner

timbray commented Aug 9, 2024

OK, I have cloned Arne's branch and will replace qNumber with numbits and do some benchmarking.

  1. I like numbits and it would be good if numeric comparisons had real 64-bit accuracy
  2. Including numbits is going to complexify some central code paths, especially smallTable, which rely on an assumption that all the bytes are UTF-8 bytes and thus can't have values greater than 0xF4, allowing 0xF5 and 0xF6 to be used as sentinels and end-of-data markers. So, research goal 1: Determine how much complexity is added.
  3. Assuming that the numbits basically work and don't lead to horrible complexifying, I'm going to create a benchmark with patterns that include multiple numeric matches, so we understand what we win/lose. My prediction is that numbits should be noticeably faster and burn less memory simply because they are 8 bytes long and qNumbers are 14, but I'm too old to believe in my own predictions.
  4. The result of all this should show us the costs and benefits of numbits and allow us to make an evidence-based decision on whether to adopt.

@timbray
Owner

timbray commented Aug 16, 2024

Now I'm leaning against using numbits. First of all, as I think about redesigning smallTable to survive in the presence of non-UTF8 characters, I keep thinking of corner cases - this type's step and especially dStep are Q's tight central loop, they are very tiny and have excellent memory locality, and rely on stupid UTF-8 tricks, in particular the fact that the byte values higher than 0xF4 can’t appear. I’m not 100% on this and will keep thinking.

But there's another problem which is tiny and unlikely to happen that I don't know how to fix.

Suppose I have the rule

{"x": ["foo"] }

which is intended to match only JSON data that has a top-level "x":"foo" member. However, with numbits, this will also match:

{"x":-77936130622565622861113617120445728215253349187385567291125040555718211817982934064273708603570865020116281354366693784516070188526811668807680}

Because its numbits representation is [34 102 111 111 34 245 122 122] which can also be thought of as ['"','f','o','o','"',0xF5,'z','z']

(Don’t ask me how I found that number, it’s too painful. And I’ve never seen a binary search iterate 600 times before.)
[Edit: The reason I did that is that I was convinced I had a number where NumbitsFromFloat64(x).Float64() did not round-trip. But now I can't reproduce it.]

The reason why? Because we use 0xF5, which can never appear in UTF-8, as an end-of-data signal. We add it to the automata for all the matching values and also to each value when we match it. It simplifies a lot of code because you never have to check whether you're at the end of the input data. But that byte value can of course appear in a Numbits.

The annoying thing is that this will almost certainly never happen: the Numbits that could possibly match strings are only these very huge negative values (this one is approximately -7.793613e+142). But it would be very hard to explain if a user managed to encounter it.
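
The collision can be checked in the other direction, starting from the bytes. This is a sketch under two assumptions: that the byte form is big-endian and that the transform flips all bits for negatives and only the sign bit otherwise. `toFloat64` and `floatFromKey` are illustrative names, not Quamina functions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// toFloat64 inverts the assumed order-preserving transform.
func toFloat64(nb uint64) float64 {
	mask := (nb>>63 - 1) | 1<<63
	return math.Float64frombits(nb ^ mask)
}

// floatFromKey reads eight bytes as a big-endian numbits value.
func floatFromKey(key [8]byte) float64 {
	return toFloat64(binary.BigEndian.Uint64(key[:]))
}

func main() {
	// The quoted string "foo", the 0xF5 end-of-data marker, then two filler bytes.
	key := [8]byte{'"', 'f', 'o', 'o', '"', 0xF5, 'z', 'z'}
	fmt.Println(key)
	fmt.Printf("%e\n", floatFromKey(key))
}
```

Under those assumptions the printed value should land near the figure quoted above, which shows that the first six bytes of a legitimate number can be indistinguishable from a terminated string value.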

My mind isn’t 100% made up. If I could work around the second problem here I'd be more motivated to go after the redesign-smallTable problem.

I will think some more.

@ah-quant

This could be fixed with the strategy outlined by Axel in #334 (comment) and following.
A different encoding can make numbits utf8-safe.
By encoding numbits with base128, 10 bytes are used per numbit. base64 would use 11 bytes. If you need something from the standard lib and are averse to copying the encoding code from someplace else into quamina, you can also use #334 (comment) - probably slower than base128, but also in 10 bytes.

@timbray
Owner

timbray commented Aug 16, 2024

OK, that's sensible, I'll have a look. Staying inside UTF-8 bounds is a massive simplification. So we'd get arbitrary precision and simultaneously reduce numeric fields from 14 bytes to 10. Sorry, you mentioned #334 twice, did you mean to mention something else the second time?

@arnehormann
Collaborator Author

No, I did not. Sorry. I initially considered linking his follow-up comment, but base128 is the best choice here and probably not overly difficult to write, or to copy and adapt from Axel's playground link.

@arnehormann
Collaborator Author

Ugh. That was wrong. The second link should have been https://pkg.go.dev/encoding/[email protected] - but I'd still prefer base128.

@timbray
Owner

timbray commented Aug 16, 2024

I agree. I just finished implementing and unit-testing Numbits.toUTF8(), stealing from Axel's code, and it seems flawless and fast. I think this is now going to be easy. Wow, that base128 text is ugly, but no human should ever see it.

@timbray
Owner

timbray commented Aug 16, 2024

Sometimes Numbits.Float64() can return NaN

func TestBrokenFLoat(t *testing.T) {
	var nb Numbits = 18445811164593300620
	f := nb.Float64()
	if math.IsNaN(f) {
		t.Error("NaN")
	}
}

Is this a problem? This Numbits was created with Numbits(rand2.Uint64()) - maybe that creates values that couldn't be generated by NumbitsFromFloat() or NumbitsFromBytes()?

@arnehormann
Collaborator Author

I don't think it's a problem. NaN and infinity are representable and can be created this way, but will never occur with valid json inputs and patterns.

@timbray
Owner

timbray commented Aug 16, 2024

That's my belief too, so I won't let it worry me.

@timbray
Owner

timbray commented Aug 18, 2024

FYI: I have finished up replacing Quamina's 14-hex-byte Qnumber construct with 10-byte numbits+base128. I still have a bunch of work to do on the README and docs before I can PR, but here are some numbers. I created a new benchmark, BenchmarkNumberMatching, that does simple number matching, of which half succeed. The proportion that succeeds is crucial, because to match, the code has to go through the whole 10 bytes to verify that all of them match. Failed matching is much cheaper.

Using the Go benchmarking, here are before and after results.
Before:

BenchmarkNumberMatching-12    	 1586095	       741.6 ns/op	    2810 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 1541496	       752.8 ns/op	    2810 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 1554729	       746.1 ns/op	    2810 B/op	       6 allocs/op

After:

BenchmarkNumberMatching-12    	 2164224	       537.1 ns/op	    1596 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 2162497	       537.5 ns/op	    1596 B/op	       6 allocs/op
BenchmarkNumberMatching-12    	 2144856	       548.9 ns/op	    1596 B/op	       6 allocs/op

So, that's pretty good!

@timbray
Owner

timbray commented Aug 18, 2024

(Although TBH I don't like that each step is allocating this much memory, must investigate…)

@timbray
Owner

timbray commented Aug 28, 2024

Thanks for this idea, which turns out to have been very fruitful.

@timbray timbray closed this Aug 28, 2024