Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ideas for more efficient arithmetic on 64 bit #20

Open
cfbolz opened this issue May 21, 2015 · 2 comments
Open

ideas for more efficient arithmetic on 64 bit #20

cfbolz opened this issue May 21, 2015 · 2 comments

Comments

@cfbolz
Copy link
Contributor

cfbolz commented May 21, 2015

@arigo was suggesting some improvements to get more efficient 32bit arithmetic on 64bit machines. The IRC logs of the discussion are here:

https://github.com/cornell-brg/pydgin/issues/new

The most immediately practical idea is to write overflow_from_add (and ..._sub) in a branchless way, and maybe special-case them on 64bit machines.

@berkinilbeyi
Copy link
Contributor

Thanks for the ideas @cfbolz and @arigo! The link doesn't seem to work, so here are the logs:

11:06 < berkin> an aside: supporting arithmetic ops on r_uint32 would actually be *very* useful for emulating 32-bit machines
11:06 < arigato> berkin: types like r_uint32 were probably a bad idea; we should have said that local variables must be full-sized integers, and have the smaller types only for storing into structures
11:06 < arigato> (or arrays)
11:07 < arigato> berkin: yes, and also for languages like Javascript whose efficient jitting requires 32-bit overflow checks
11:07 < arigato> but it's all unclear
11:07 < arigato> we could solve the latest problem by adding "ovfcheck32(a + b)" for example, which does the addition and overflow checking as 32-bit numbers
11:08 -!- amaury [amauryfa@nat/google/x-hjbtmjuzradhlheq] has quit [Ping timeout: 265 seconds]
11:08 < arigato> and keeping full 64-bit numbers as local variables
11:08 < arigato> simply ignoring the higher bits
11:08 -!- tifo [[email protected]] has quit [Ping timeout: 246 seconds]
11:08 < arigato> that solution would be less work in the JIT's backends
11:08 < berkin> arigato: i'm wondering if it's possible to do such tricks in the backend, where 32-bit ops on a 64-bit machine usually involves some operation and then masking back to 32-bit etc
11:09 < berkin> so use more efficient asm for such patterns
11:09 < arigato> maybe, but I think we can simply ignore the high bits most of the time
11:09 < arigato> and use the existing "int_signext" operation in the rare cases were we can't (e.g. before a right shift)
11:10 < arigato> or even just add "int_rshift32"
11:10 < arigato> yes, probably a better idea to add int_rshift32 because we'd still need "int_eq32" etc.
11:11 < cfbolz> interesting idea to just ignore the higher bits
11:11 -!- Remi_M [[email protected]] has quit [Quit: See you!]
11:12 < arigato> easier than sprinkling int_signext everywhere... and it works if we're careful in producing the jitcodes (e.g. we need int_signext when we cast a 32-bit int to 64-bit)
11:12 < arigato> maybe we could do the same with bools, which is a good idea at least on x86
11:12 -!- krono_ [~krono@unaffiliated/krono] has joined #pypy
11:13 < arigato> ...well, unclear if it's a good idea actually
11:13 < arigato> if we compare the simple operation "load a byte from memory" and the apparently more complex operation "load a byte and zero-extend it", then actually the latter might be more efficient
11:14 < arigato> because the former adds some dependency on the old content of the rest of the target register, or something
11:14 < cfbolz> arigato: I'm not sure in the concrete use case of the ARM emulator this aproach would work though
11:14 < cfbolz> e.g the memory should really be 32bit ints, for space reasons
11:15 -!- marky1991 [[email protected]] has joined #pypy
11:15 -!- marky1991 [[email protected]] has quit [Changing host]
11:15 -!- marky1991 [~lgfdev@unaffiliated/marky1991] has joined #pypy
11:15 < arigato> ah, all accesses by the emulator are done via an array and a zero-extended 32-bit offset
11:15 < cfbolz> yes
11:15 < arigato> but 32-bit overflow checks are signed
11:16 < arigato> bit of a mess
11:16 < cfbolz> also, implementing the ADD opcode is really not fun, because you need to set 32bit overflow flags etc
11:16 < cfbolz> and the carry, and whatnot
11:16 -!- krono [~krono@unaffiliated/krono] has quit [Ping timeout: 265 seconds]
11:17 -!- krono_ [~krono@unaffiliated/krono] has quit [Ping timeout: 265 seconds]
11:17 < arigato> right, but at least you hope that the next ADD opcode just afterwards will overwrite the flags and make the earlier computation not needed
11:17 < arigato> but you kind of have to write it all in a branch-less way
11:17 < cfbolz> true
11:18 -!- DragonSA [[email protected]] has joined #pypy
11:18 < cfbolz> interesting point!
11:18 < cfbolz> I think atm there is a branch
11:18 -!- DragonSA [[email protected]] has quit [Changing host]
11:18 -!- DragonSA [~DragonSA@freebsd/developer/dbn] has joined #pypy
11:18 -!- amaury [amauryfa@nat/google/x-jgtiaxqgkctebabk] has joined #pypy
11:19 < cfbolz> arigato: yes: https://github.com/cornell-brg/pydgin/blob/master/arm/utils.py#L184
11:19 < cfbolz> arigato: so the theory is that replacing the "and" with "|" might improve stuff
11:19 < arigato> "&"
11:19 < arigato> but there is "==" in RPython
11:20 < cfbolz> ?
11:20 < cfbolz> I thought we stopped making that use a guard?
11:20 < arigato> right, ok
11:20 < arigato> fwiw, a possibly more efficient way to write it now would be to shift all the values left 32 bits
11:21 < arigato> then the natural int_add_ovf applies
11:21 < arigato> (obscure)
11:21 < cfbolz> "all the values"?
11:22 < cfbolz> like right before you operate on them, but also store them like that in the register file?
11:22 < arigato> unclear
11:22 < cfbolz> :-)
11:23 -!- mstuchli [[email protected]] has quit [Ping timeout: 244 seconds]
11:23 < arigato> actually you can check overflow_from_add() more easily if you know you are running on 64-bit
11:23 < arigato> with code like "x != (x << 32) >> 32"
11:24 < arigato> or even better:
11:24 < arigato> x != intmask(r_int32(x))
11:25 < arigato> the last line JITs to one "int_signext" operation followed by "int_eq"
11:26 < cfbolz> arigato: overflow_from_sub is the same, I guess?
11:26 < arigato> yes
11:26 < arigato> you can even store "x - intmask(r_int32(x))" in the overflow "flag", which is interpreted later as either zero or non-zero
11:27 -!- DragonSA [~DragonSA@freebsd/developer/dbn] has quit [Ping timeout: 258 seconds]
11:28 < cfbolz> why is that better?
11:28 < cfbolz> int_eq vs int_sub
11:28 < arigato> int_eq is more costly in terms of assembler
11:28 < cfbolz> eh, ok
11:28 < arigato> at least if it's not immediately followed by the guard on it
11:29 < cfbolz> which it isn't, we hope
11:29 < arigato> ah, or it is
11:29 < arigato> unclear if it is
11:29 < arigato> I mean in common cases we'd expect the whole int_signext/int_eq to be killed
11:29 < arigato> but not if there is an unrelated guard
11:30 < arigato> in that case indeed a int_sub is better :-)
11:30 < cfbolz> I think we should look at traces, really
11:30 < arigato> yes
11:30 < cfbolz> which I don't have right now
11:30 < cfbolz> dmlockhart: should I make an issue out of these ideas?
11:31 < arigato> obviously, a good enough backend should be able to optimize "int_sub / int_nonzero / guard_true" the same as "int_ne / guard_true"...
11:31 < cfbolz> yes, well, at some point in the future we may need a better backend :-P
11:32 < arigato> or else it can be the optimizer's job to replace "int_sub / int_nonzero" with "int_ne"
11:32 < arigato> :-)
11:41  * arigato -> dinner
11:41 -!- arigato [[email protected]] has quit [Quit: Leaving]
11:47 -!- tinyblak_ [[email protected]] has quit [Remote host closed the connection]
11:52 -!- tinyblak [[email protected]] has joined #pypy
11:53 < berkin> cfbolz, arigato: thanks for the ideas
11:54 < berkin> cfbolz: does "branchless" mean converting "and" to "&" so that it doesn't have a guard?
11:54 < cfbolz> berkin: yes 
11:54 < cfbolz> Because and short circuits
11:55 -!- MrScout [~MrScout@unaffiliated/mrscout] has joined #pypy
11:56 < berkin> cfbolz: ah yeah, short circuiting is probably worse in jit because of the guards
11:56 < berkin> interesting
11:56 < cfbolz> Yes
11:56 < cfbolz> Bit non intuitive 
11:56 < cfbolz> But we should have noticed when looking at the traces 
11:57 < berkin> cfbolz: i'm sure it was there somewhere, this will probably improve the performance slightly
11:58 < cfbolz> berkin: maybe a bit more than slightly, because it means that the flags computation can be fully removed if the flag is never read 
12:00 < berkin> cfbolz: ah i see, when there is a guard, you can't optimize the flag computation away
12:00 -!- tinyblak [[email protected]] has quit [Remote host closed the connection]
12:00 < cfbolz> Yes 

@arigo
Copy link

arigo commented May 21, 2015

If I'm reading the code correctly, I think it assumes anyway that the host is 64-bit. With a 32-bit host the existing overflow detection doesn't work. In that case, the most efficient way is to use "x == intmask(r_int32(x))"... well, or at least if r_int32 existed in rarithmetic. Add it there with the line:

r_int32 = build_int('r_int32', True, 32)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants