huff0: translate asm implementation into avo program #543
Conversation
Currently register allocation fails for some reason. Fixes #529

Some notes:
The "Compiling" version: https://gist.github.com/klauspost/8f8dbbd9745662464dfac37d00cbd5f6 |
Force-pushed from 418c6e8 to 45efab9.
There are significant regressions. The asm code cached more values in GPRs.
@WojciechMula There must be something else. There is no way that reading/writing L1-cached values slows things down that much; maybe a percent or two. It will be a couple of days before I can look at this.
I managed to fix the obvious mistakes, but there are still regressions. I will investigate further; for now I'm just dumping the current state.
Hint: Disabling BMI2 brings back most of the performance. The 8-bit version also appears worse.
It seems like BMI just isn't a gain here. It seems we have enough regs to move this back out of the main loop: […] Also, the 8-bit variant is a tiny bit slower here. Does it improve things on your side?
If you want a good speedup, decode directly to the destination. Technically you don't even need to return between loops with that.
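A sketch of what that suggestion amounts to; every name below is hypothetical (the package's real decoder entry points differ), but the point is that writing into the final destination removes one memcopy per block:

```go
// decodeAll is a sketch with hypothetical names. decodeBlock stands in for
// a per-block decoder that fills the slice it is given and returns the
// number of bytes produced.
func decodeAll(dst, src []byte, decodeBlock func(out, in []byte) int) []byte {
	// Instead of decoding into a scratch buffer and copying:
	//   n := decodeBlock(scratch[:], src)
	//   dst = append(dst, scratch[:n]...)
	// decode straight into the destination's spare capacity:
	n := decodeBlock(dst[len(dst):cap(dst)], src)
	return dst[:len(dst)+n]
}
```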
I need to investigate it; there might be some strange code generated.
True, but we had to introduce these dereferences due to a register-allocation failure.
Will check it.
Are you still having problems with the register allocator?
Constraint(buildtags.Not("appengine").ToConstraint()) | ||
Constraint(buildtags.Not("noasm").ToConstraint()) | ||
Constraint(buildtags.Term("gc").ToConstraint()) | ||
Constraint(buildtags.Not("noasm").ToConstraint()) |
You can use `ConstraintExpr` and just provide a string (in the old `+build` syntax).
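A minimal sketch of that suggestion, assuming a typical avo generator layout with the build package dot-imported; the single expression collapses the constraints quoted above:

```go
//go:build ignore

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
	// One expression in the old +build syntax replaces the individual
	// Constraint(...) calls: comma means AND, ! means NOT.
	ConstraintExpr("!appengine,!noasm,gc")

	// ... TEXT/RET function definitions go here ...

	Generate()
}
```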
huff0/_generate/gen.go
```go
peekBits := GP64()
buffer := GP64()
table := GP64()

Comment("Preload values")
{
	Load(Param("peekBits"), peekBits)
	Load(Param("buf"), buffer)
	Load(Param("tbl"), table)
}
```
`Load` returns the register. You can do it like this:

```go
peekBits := Load(Param("peekBits"), GP64())
```
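Applied to the whole preload block above, that collapses each allocate-then-load pair into one line (the same three loads, just restated):

```go
// Load returns its destination register, so allocation and load fuse.
peekBits := Load(Param("peekBits"), GP64())
buffer := Load(Param("buf"), GP64())
table := Load(Param("tbl"), GP64())
```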
Yes, I saw this in the examples. For me, keeping the allocation separate from the use is a bit cleaner.
First of all, thank you for such a great tool! I haven't worked on this PR recently; I'm getting back to it this week.
Let me know. As @klauspost alluded to, sometimes it's just a matter of writing the code to limit the number of live variables. However, there's a chance there are bugs/inefficiencies in avo.
Force-pushed from 4d03601 to 1b8091b.
So today's findings are quite strange. The BMI2 functions are significantly slower, despite the fact that I now manage to keep all the values in registers, as in master's version. I spent some time comparing the generated assembly with the current version and couldn't spot any differences (other than different registers being used). I'm continuing tomorrow. @mmcloughlin Would you please explain how I can convert an arbitrary pointer stored in a reg into a […]?
Sorry, can you elaborate on what you're trying to do?
Oh, sorry for not being precise. When we have a pointer to a struct as a parameter, then reading fields is obvious: […] But we read the pointer directly: […] And we want to interpret that bare pointer as a structure, to use […]
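For context, a sketch of the two situations in avo; the struct and field names are hypothetical, not taken from the PR. When the pointer is a declared parameter, avo resolves field offsets from the type; for a bare pointer already in a register there is no type information, so the usual fallback is a raw memory operand with a hand-computed displacement:

```go
// Case 1: pointer parameter; avo knows the type, so fields resolve by name.
br := Dereference(Param("br")) // hypothetical *bitReader parameter
value := Load(br.Field("value"), GP64())

// Case 2: bare pointer in a register; no type info, so the field offset
// must be supplied manually (assuming "value" lives at offset 8 here).
p := Load(Param("br"), GP64())
value2 := GP64()
MOVQ(Mem{Base: p, Disp: 8}, value2)
```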
@WojciechMula Do you mean as we did in zstd: https://github.com/klauspost/compress/blob/master/zstd/_generate/gen.go#L165-L166
@klauspost No, something else. OK, this is an actual snippet from this branch: […]
Ah, OK. I don't know how to do that.
I find the same, so we shouldn't enable them. Let's analyze with numbers from https://github.com/InstLatx64/InstLatx64:

BMI: […] X86: […]

MOVs are just register renames, so these can be expected to perform the same. Should be the same again, though BMI doesn't have to wait for CX if the MOV can't be executed ahead of time (which I expect it would be). So I don't really see this being the cause. Either way, I trust the benchmarks on this. Let's keep BMI disabled.
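For reference, the two sequences being compared look roughly like this in avo (a fragment from inside a generator function body, with illustrative register names; the operand order of SHRXQ follows Go assembler convention, count first, destination last):

```go
peekBits, bits, val := GP64(), GP64(), GP64()

// Classic path: the count must sit in CL and SHR shifts in place, so both
// a count move and a value copy are needed. The MOVs are register renames
// and close to free.
MOVQ(peekBits, RCX)
MOVQ(bits, val)
SHRQ(CL, val) // val >>= CL

// BMI2 path: SHRX reads the count from any register and writes to a
// separate destination, folding both MOVs away.
SHRXQ(peekBits, bits, val) // val = bits >> peekBits
```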
I'm comparing the old assembly with the avo-generated one, instruction by instruction. For the non-BMI version the assembly is almost identical, with slightly different encodings in a few cases due to different register choices.
@WojciechMula The "old" is not using BMI unless you set the v3 env var for Go 1.18 |
I didn't figure out the reasons for the regressions. The assembly code generated by avo […]
I see at most a 1-3% regression. Nothing to really worry about.
Instead, if you can eliminate the memcopies, that would make a much bigger difference: instead of decoding to […]. You could also look into branchless filling similar to #550; I can't remember if I already looked at this. It is not that I don't care about 2%, but I think there are bigger fish to catch.
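A sketch of the branchless-refill idea behind that suggestion, in pure Go with hypothetical names; for simplicity this reader runs forward and MSB-first, whereas the package's real bit readers consume the stream backwards:

```go
package bits

import "encoding/binary"

// bitReader is a hypothetical forward, MSB-first bit reader.
type bitReader struct {
	in       []byte
	off      int    // byte offset of the current 64-bit window
	value    uint64 // bit container; next unread bit at the MSB
	bitsRead uint   // bits consumed from value since the last fill
}

// fillBranchless refills without branching on how many bits remain.
// It assumes the input is padded so in[off:off+8] is always readable.
func (b *bitReader) fillBranchless() {
	b.off += int(b.bitsRead >> 3) // advance by whole bytes consumed
	b.bitsRead &= 7               // keep only the sub-byte remainder
	v := binary.BigEndian.Uint64(b.in[b.off:])
	b.value = v << b.bitsRead // next unread bit back at the MSB
	// 64-7 = 57 valid bits are now guaranteed, which covers the
	// "unconditionally fill 56 bits" goal mentioned further down (4x11 peeks).
}
```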
Hm, on my Ice Lake machines there are 15-20% regressions in a few cases. But if you are happy with the current shape, we may merge this PR. Then I'll eliminate the mem copying as you suggested.
Running this branch, I get these numbers on […]
Just tried removing the "peekBits" variable shift and creating a version that has 9, 10 and 11 bits of fixed peek. That was 200MB/s worse than the variable shift for the relevant benchmarks. THAT is a surprise.
It's weird. I'm starting to suspect that there's something odd in the tests.
You are welcome to check, but I am pretty sure it holds up. It just shows that you can never trust intuition or what "makes sense" to be true, and should always benchmark every small change (see the edits below). Simplifying and comparing these 3 variants: […]

The first is by far the fastest: […]

The existing code gave […]

EDIT: Actually, it seems I should have picked this up. Looking at Zen 2 timings: […]

There are more shift pipelines for variable shifts; see how the throughput is a bit higher. It may be able to do 2 fixed shifts per cycle but 3 variable shifts per cycle, indicating that different pipelines are used. It could also be that the pipelines handling fixed shifts are already busy with other work.

EDIT 2: It seems Intel (Tiger Lake here) has the opposite: […]

Fixed shifts have 2x the throughput of variable ones...
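For concreteness, the difference between the two variants boils down to something like this in avo (a fragment with illustrative registers; the PR's actual code differs):

```go
br := GP64()  // bit container
val := GP64() // receives the peeked table index

// Variable-shift variant: one function serves every table size, and
// peekBits (holding 64-n) is loaded from a parameter at runtime.
peekBits := Load(Param("peekBits"), GP64())
MOVQ(peekBits, RCX)
MOVQ(br, val)
SHRQ(CL, val) // val = br >> peekBits

// Fixed-shift variant: one generated function per table size, so the
// count can be an immediate (64-11 = 53 for the 11-bit version).
MOVQ(br, val)
SHRQ(Imm(53), val)
```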
@WojciechMula I factored out the bit-filling code and made versions for 9, 10 and 11 bits of peek: https://gist.github.com/klauspost/617e149f31f8967bc184f5a48c3834f4 This is the same speed, but mainly for future extensions. I would like to unconditionally fill 56 bits, so we have enough for 4x11, but I haven't gotten it to work yet. There should also be less register use.
@klauspost Great, and thank you for the answers. Yeah, I keep forgetting that intuition too often does not match reality. :) Let me recheck the code on Ice Lake once more; I'm giving myself 2-3 hours. If I don't find any obvious mistake, I propose we merge this code. Then I will pick up #576. It may give a significant boost, as you wrote.
This PR got replaced by #577. |