-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: enhancements to the gc inliner and optimizer #9337
Comments
gc does recognize constant rotates. It's not done by the inliner, Fortunately, most modern crypto/hashing algorithms use constant I agree that inlining of function closure that doesn't escape will |
@minux Thanks for the info about the rotates, much appreciated. I think the comment in walk.c says what I was trying to say. Still I think it bears repeating that I saw it a comment somewhere that GCC recognizes this I guess I can get what I want today for the ROT instruction if I am willing to flush the function call to So Could the compiler do the &31 or &63 for the shift/rotate amounts for you? |
I expect the deficiencies in the current gc compiler will go away when |
One of my points was a lot could be gained from the current compiler by increasing I would like to see this added to 1.5. It makes sense (to me) to do it at the same time as the rewrite. |
We want the rewrite to be purely mechanical. Let's revisit this after the rewrite is complete. |
FYI, you can use -l=5 instead of repeating -l 5 times. |
cc @randall77 |
How much do I have to pay to get just a switch set In return I'd have a much easier time getting speedups in Go for some hash functions. My kingdom for a switch. |
Command-line flags are tricky. The problem is "go get". How do you tell it what flags you want? What if two packages you depend on want different flags? Incompatible flags? The ecosystem is much nicer if there is just one compiler for everything. The current compiler does have some flags (like -l), but they are for debugging the compiler only, not for any production use. We can certainly revisit the built-in settings for hariness and level. |
As I said above I understand the desire to eschew all flags. On the other hand optimizers often have bugs and I have found the ability to have some control over them vital to my productivity over the years. e.g. I would have been blocked but I could proceed with the optimizer turned off. You point about go get and packages it well taken and true. My opinion is you will have a very hard time finding universal constants for hariness and level and people like me will be frustrated with your choice because you end up saying, "geez if hariness was just 1o more this would inline" and ditto for level. Or there is some onscure case that's broken for level n and the expediant fix is to decrement level by one until some other issue is fixed. Here is a proposal. Could source files contain a comment ala build tags that describes what kind of optimizations and other flags are needed? The benefits are probably obvious but I'll state them to be complete.
"Comment control" was just added for "go generate" so there is a precendent. |
No, thanks. Let's not introduce more magic comments. //go:generate is not for a precedent for this, because the |
|
|
|
Turning off optimization is easy, -gcflags "-N -l" will do. Extreme inlining will bloat the binary, we need to do more investigation to Also, just because a function could be inlined, that doesn't mean we should PS: any non-leaf function will likely remain to be not inlined, otherwise |
I'll summarize the issues and requests below and sign off on this one. I apologize if this issue wandered around a bit. Everything here is related to the inliner and optimization.
Must upsetting to me is the perception that I am complaining. I think frustrated would be better word. With a little effort the Go inliner could be much more effective. It's low hanging fruit.
|
That's work for debug info. A single PC can be associated with a whole stack of inlined functions, and DWARF supports this. That will improve current situation as well, because currently we miss frames in CPU profiles and during nil derefs in inlined functions. |
On Sun, Jan 11, 2015 at 3:50 AM, Dmitry Vyukov [email protected]
If we support showing backtraces for inlined functions, ideally I still |
It's easy to do it our own traceback format. Yes, arguments of inlined functions will be missing. |
Sorry I was gone for awhile. Thanks to @rsc for keeping this one open. A few comments:
What can I do to help? |
Correct backtraces pretty much is a prerequisite, because things like the standard library's log package rely on being able to count the number of frames they get from runtime.Callers. These can't be casually broken. We might be able to use some different approach, but that would have to be invented, proposed, planned, and implemented. Or, we could implement inlining such that backtraces work. |
IDK. I always thought that inlining and a stack frame were antithetical. From my perspective if function A is inlined into B then function A ceases to exist as a function. Function A becomes more like a textual hygienic macro that was expanded. Do you agree with that? The whole idea of inlining is twofold:
I guess the question is what's the overhead for maintaining a backtrace without a stack frame or will an inlined function require a stack frame just for debugging? If the latter, I will be bummed out. I have one idea that comes to mind. Suppose functions declared inside other functions could be inlined without a backtrace entry? Obviously for this to work closures would need to be eligible for inlining. Perhaps this would offer the right amount of flexibility and control without adding any kind of syntax adornments to the language. Top level functions always get a valid stack trace but local functions don't have the guarantee. That would work for me. Comments? |
I ran into this issue while trying to improve the speed of bcrypt. Running something like the following benchmarks (https://bitbucket.org/snippets/SamWhited/aKK6k): package main
import (
"sync"
"testing"
"golang.org/x/crypto/bcrypt"
)
var pass = []byte(`superfly`)
const cost = 11
func BenchmarkBcrypt(b *testing.B) {
for n := 0; n < b.N; n++ {
crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
_ = crypt
}
}
// One should probably generally limit CPU bound work to a handful of goroutines/threads,
// but for the purpose of example, do it the naive way and just emulate someone who has
// run their bcrypts directly in the requests without gating the max number that can run at once…
func runParallelBcrypts(num int) {
var wg sync.WaitGroup
wg.Add(num)
for i := 0; i < num; i++ {
go func() {
crypt, _ := bcrypt.GenerateFromPassword(pass, cost)
_ = crypt
wg.Done()
}()
}
wg.Wait()
}
func BenchmarkBcrypt10Parallel(b *testing.B) {
for n := 0; n < b.N; n++ {
runParallelBcrypts(10)
}
}
func BenchmarkBcrypt100Parallel(b *testing.B) {
for n := 0; n < b.N; n++ {
runParallelBcrypts(100)
}
} resulted in a small, but noticeable (especially at higher levels of concurrency) improvement in performing lots of bcrypts (the improvement mostly appears to come from the blowfish Numbers and graph: https://docs.google.com/spreadsheets/d/1aUpR62qS7CJkPEO2CzNwY4R0MHFxM1WbdBJZAeRmYK8/edit?usp=sharing the only change between benchmarks here was setting maxBudget to 1000 (which is enough to inline the EDIT: Slightly different format for the report I put together that is self contained and doesn't fall down hard on low bandwidth connections: http://bl.ocks.org/SamWhited/ebe4f5923526c0d9220f1b5b23b56eba |
Change https://golang.org/cl/72490 mentions this issue: |
|
We also now do non-constant rotates. You need a &31 or &63 in the right place, so in practice you should use |
Background
I recently tried optimizing some hash functions in Go to find out what is possible without resorting to using assembler. I knew gc had an "inliner" but I didn't have any experience with it. I spent some time reading the "inl.c" code.
Summary
Go's inliner seems to work well and I got speed increase of 30% on some hash functions using it. However, the way it is currently configured and controlled effectively turns if off for most cases. Most users won't figure out how to set the
level
(-gcflags="-l -l -l -l -l -l -m") and even less will be willing to recompile the tool chain to change the hard codedhairiness
variable set to more than 40. The inliner would be more effective if it could handle closures and variadic functions. Finally, a simple idiom recognizer would allow a simple function such asrot()
to be turned into a single instruction. Taken together these changes could yield significant performance improvements to certain kinds of Go packages.Here are my impressions, suggestions, and requests:
A simple code example is here, it's just one one Jenkin's Hash functions:
https://play.golang.org/p/rmDG7MR5F9
This comes from:
https://github.com/tildeleb/hashland/blob/master/jenkins/jenkins.go
The main project is:
https://github.com/tildeleb/hashland/
Observations and Impressions
go build -gcflags="-l -l -l -l -l -l -m"
hairiness
. It is current set to 40 and that's not very hairy. In the example above. I had to breakmix()
andfinal()
into 3 functions of two lines each, to get them to inline. 40 really seems like a very small budget to me. Hard coding a parameter like this is just wrong.mix()
andfinal()
which close over the hash state variables a, b, and c need to moved out of the function "HashWords". Since the state variables a, b, and c are lo longer closed over they need to be passed to and returned from themix()
andfinal() functions
. For example:a, b, c = mix(a, b, c)
. If all hash functions had 3 state variables that might be OK.However, other more advanced hash functions have more state. For example in Jenkin's SpookyV2 I had to write:
h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11 = Mix(inul, h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11)
Which I hope we can all agree is out of control and is why I believe why closures are the preferred solution to many inlining based use cases.
The downside of the closures is that if you have multiple versions of a hash function that use the same mix() and final() code, those closures need to be duplicated inside the top level functions.
Again, I prefer the closures. The source code is cleaner, easier to read, and probably easier to optimize because the optimizer doesn't need to optimize out the calling and return stack push/pop sequences to get good code quality.
Go needs an idiom recognizer. Consider the example again. The function
rot(x, k)
gets inlined but it doesn't generate a rot insturction. Other compilers are able to do this. Actually, I think a case can be made to add a rotate (no carry) operator (>>>
or<<<
) to the language for this particular case, but that's another "issue" as they say. BTW Go generates about 22 instructions to get the rotate done.Variadic functions aren't inlined. I am considering proposing the following as a new interface for hash functions that would be complementary to the current hash streaming interface.
Hash32(b []byte, seeds ...uint32) uint32
Hash64(b []byte, seeds ...uint64) uint64
Hash128(b []byte, seeds ...uint64) (uint64, uint64)
Hash(in []byte, out []byte, seeds ...uint64) []byte
Because of this I would like to see inlining of variadic functions.
FWIW I still have to benchmark this vs other alternative calling sequences.
Suggestions
In the short term add temporary native command line flags to set
hairiness
andlevel.
I know the Go team eschews command line flags but some control over the optimization process has been around since the unix epoch and is probably inevitable.For example:
% go build -ilbudget=120 illevel=6
rot
in the example above can be turned into single instructions, really (2-3 instructions counting theMOVQ
to the registers instead of the 20+ now generated to get a rot.The text was updated successfully, but these errors were encountered: