ideas for more efficient arithmetic on 64 bit #20

cfbolz · 2015-05-21T15:39:47Z

@arigo was suggesting some improvements to get more efficient 32bit arithmetic on 64bit machines. The IRC logs of the discussion are here:

https://github.com/cornell-brg/pydgin/issues/new

The most immediately practical idea is to write overflow_from_add (and ..._sub) in a branchless way, and maybe special-case them on 64bit machines.

The text was updated successfully, but these errors were encountered:

berkinilbeyi · 2015-05-21T16:03:31Z

Thanks for the ideas @cfbolz and @arigo! The link doesn't seem to work, so here are the logs:

11:06 < berkin> an aside: supporting arithmetic ops on r_uint32 would actually be *very* useful for emulating 32-bit machines
11:06 < arigato> berkin: types like r_uint32 were probably a bad idea; we should have said that local variables must be full-sized integers, and have the smaller types only for storing into structures
11:06 < arigato> (or arrays)
11:07 < arigato> berkin: yes, and also for languages like Javascript whose efficient jitting requires 32-bit overflow checks
11:07 < arigato> but it's all unclear
11:07 < arigato> we could solve the latest problem by adding "ovfcheck32(a + b)" for example, which does the addition and overflow checking as 32-bit numbers
11:08 -!- amaury [amauryfa@nat/google/x-hjbtmjuzradhlheq] has quit [Ping timeout: 265 seconds]
11:08 < arigato> and keeping full 64-bit numbers as local variables
11:08 < arigato> simply ignoring the higher bits
11:08 -!- tifo [[email protected]] has quit [Ping timeout: 246 seconds]
11:08 < arigato> that solution would be less work in the JIT's backends
11:08 < berkin> arigato: i'm wondering if it's possible to do such tricks in the backend, where 32-bit ops on a 64-bit machine usually involves some operation and then masking back to 32-bit etc
11:09 < berkin> so use more efficient asm for such patterns
11:09 < arigato> maybe, but I think we can simply ignore the high bits most of the time
11:09 < arigato> and use the existing "int_signext" operation in the rare cases were we can't (e.g. before a right shift)
11:10 < arigato> or even just add "int_rshift32"
11:10 < arigato> yes, probably a better idea to add int_rshift32 because we'd still need "int_eq32" etc.
11:11 < cfbolz> interesting idea to just ignore the higher bits
11:11 -!- Remi_M [[email protected]] has quit [Quit: See you!]
11:12 < arigato> easier than sprinkling int_signext everywhere... and it works if we're careful in producing the jitcodes (e.g. we need int_signext when we cast a 32-bit int to 64-bit)
11:12 < arigato> maybe we could do the same with bools, which is a good idea at least on x86
11:12 -!- krono_ [~krono@unaffiliated/krono] has joined #pypy
11:13 < arigato> ...well, unclear if it's a good idea actually
11:13 < arigato> if we compare the simple operation "load a byte from memory" and the apparently more complex operation "load a byte and zero-extend it", then actually the latter might be more efficient
11:14 < arigato> because the former adds some dependency on the old content of the rest of the target register, or something
11:14 < cfbolz> arigato: I'm not sure in the concrete use case of the ARM emulator this aproach would work though
11:14 < cfbolz> e.g the memory should really be 32bit ints, for space reasons
11:15 -!- marky1991 [[email protected]] has joined #pypy
11:15 -!- marky1991 [[email protected]] has quit [Changing host]
11:15 -!- marky1991 [~lgfdev@unaffiliated/marky1991] has joined #pypy
11:15 < arigato> ah, all accesses by the emulator are done via an array and a zero-extended 32-bit offset
11:15 < cfbolz> yes
11:15 < arigato> but 32-bit overflow checks are signed
11:16 < arigato> bit of a mess
11:16 < cfbolz> also, implementing the ADD opcode is really not fun, because you need to set 32bit overflow flags etc
11:16 < cfbolz> and the carry, and whatnot
11:16 -!- krono [~krono@unaffiliated/krono] has quit [Ping timeout: 265 seconds]
11:17 -!- krono_ [~krono@unaffiliated/krono] has quit [Ping timeout: 265 seconds]
11:17 < arigato> right, but at least you hope that the next ADD opcode just afterwards will overwrite the flags and make the earlier computation not needed
11:17 < arigato> but you kind of have to write it all in a branch-less way
11:17 < cfbolz> true
11:18 -!- DragonSA [[email protected]] has joined #pypy
11:18 < cfbolz> interesting point!
11:18 < cfbolz> I think atm there is a branch
11:18 -!- DragonSA [[email protected]] has quit [Changing host]
11:18 -!- DragonSA [~DragonSA@freebsd/developer/dbn] has joined #pypy
11:18 -!- amaury [amauryfa@nat/google/x-jgtiaxqgkctebabk] has joined #pypy
11:19 < cfbolz> arigato: yes: https://github.com/cornell-brg/pydgin/blob/master/arm/utils.py#L184
11:19 < cfbolz> arigato: so the theory is that replacing the "and" with "|" might improve stuff
11:19 < arigato> "&"
11:19 < arigato> but there is "==" in RPython
11:20 < cfbolz> ?
11:20 < cfbolz> I thought we stopped making that use a guard?
11:20 < arigato> right, ok
11:20 < arigato> fwiw, a possibly more efficient way to write it now would be to shift all the values left 32 bits
11:21 < arigato> then the natural int_add_ovf applies
11:21 < arigato> (obscure)
11:21 < cfbolz> "all the values"?
11:22 < cfbolz> like right before you operate on them, but also store them like that in the register file?
11:22 < arigato> unclear
11:22 < cfbolz> :-)
11:23 -!- mstuchli [[email protected]] has quit [Ping timeout: 244 seconds]
11:23 < arigato> actually you can check overflow_from_add() more easily if you know you are running on 64-bit
11:23 < arigato> with code like "x != (x << 32) >> 32"
11:24 < arigato> or even better:
11:24 < arigato> x != intmask(r_int32(x))
11:25 < arigato> the last line JITs to one "int_signext" operation followed by "int_eq"
11:26 < cfbolz> arigato: overflow_from_sub is the same, I guess?
11:26 < arigato> yes
11:26 < arigato> you can even store "x - intmask(r_int32(x))" in the overflow "flag", which is interpreted later as either zero or non-zero
11:27 -!- DragonSA [~DragonSA@freebsd/developer/dbn] has quit [Ping timeout: 258 seconds]
11:28 < cfbolz> why is that better?
11:28 < cfbolz> int_eq vs int_sub
11:28 < arigato> int_eq is more costly in terms of assembler
11:28 < cfbolz> eh, ok
11:28 < arigato> at least if it's not immediately followed by the guard on it
11:29 < cfbolz> which it isn't, we hope
11:29 < arigato> ah, or it is
11:29 < arigato> unclear if it is
11:29 < arigato> I mean in common cases we'd expect the whole int_signext/int_eq to be killed
11:29 < arigato> but not if there is an unrelated guard
11:30 < arigato> in that case indeed a int_sub is better :-)
11:30 < cfbolz> I think we should look at traces, really
11:30 < arigato> yes
11:30 < cfbolz> which I don't have right now
11:30 < cfbolz> dmlockhart: should I make an issue out of these ideas?
11:31 < arigato> obviously, a good enough backend should be able to optimize "int_sub / int_nonzero / guard_true" the same as "int_ne / guard_true"...
11:31 < cfbolz> yes, well, at some point in the future we may need a better backend :-P
11:32 < arigato> or else it can be the optimizer's job to replace "int_sub / int_nonzero" with "int_ne"
11:32 < arigato> :-)
11:41  * arigato -> dinner
11:41 -!- arigato [[email protected]] has quit [Quit: Leaving]
11:47 -!- tinyblak_ [[email protected]] has quit [Remote host closed the connection]
11:52 -!- tinyblak [[email protected]] has joined #pypy
11:53 < berkin> cfbolz, arigato: thanks for the ideas
11:54 < berkin> cfbolz: does "branchless" mean converting "and" to "&" so that it doesn't have a guard?
11:54 < cfbolz> berkin: yes 
11:54 < cfbolz> Because and short circuits
11:55 -!- MrScout [~MrScout@unaffiliated/mrscout] has joined #pypy
11:56 < berkin> cfbolz: ah yeah, short circuiting is probably worse in jit because of the guards
11:56 < berkin> interesting
11:56 < cfbolz> Yes
11:56 < cfbolz> Bit non intuitive 
11:56 < cfbolz> But we should have noticed when looking at the traces 
11:57 < berkin> cfbolz: i'm sure it was there somewhere, this will probably improve the performance slightly
11:58 < cfbolz> berkin: maybe a bit more than slightly, because it means that the flags computation can be fully removed if the flag is never read 
12:00 < berkin> cfbolz: ah i see, when there is a guard, you can't optimize the flag computation away
12:00 -!- tinyblak [[email protected]] has quit [Remote host closed the connection]
12:00 < cfbolz> Yes

arigo · 2015-05-21T21:15:07Z

If I'm reading the code correctly, I think it assumes anyway that the host is 64-bit. With a 32-bit host the existing overflow detection doesn't work. In that case, the most efficient way is to use "x == intmask(r_int32(x))"... well, or at least if r_int32 existed in rarithmetic. Add it there with the line:

r_int32 = build_int('r_int32', True, 32)

as suggested in issue #20

berkinilbeyi added a commit that referenced this issue May 22, 2015

[arm] using bitwise "and" and "not" to avoid guards due to flags

94e8432

as suggested in issue #20

berkinilbeyi mentioned this issue May 22, 2015

Crash in 94e8432ae1b4fdddb1ba99dbbd5c8eaf740e6802 #21

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ideas for more efficient arithmetic on 64 bit #20

ideas for more efficient arithmetic on 64 bit #20

cfbolz commented May 21, 2015

berkinilbeyi commented May 21, 2015

arigo commented May 21, 2015

ideas for more efficient arithmetic on 64 bit #20

ideas for more efficient arithmetic on 64 bit #20

Comments

cfbolz commented May 21, 2015

berkinilbeyi commented May 21, 2015

arigo commented May 21, 2015