-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make greater use of MADs #34
Comments
I need to add that I'm aware of the discussion on #16 and the corresponding explanation in a Medium post (need to rediscover and re-read it). So yes, the 32-bit integer MULs currently performed may be multiple underlying MUL instructions on some GPUs, meaning that at instruction level the current MUL/s throughput might be better than what I estimated from the hashrate. However, with my proposed much greater focus on MADs in general I think we also need to revisit this and once again try to switch to MAD or MUL operation(s) that has/have full throughput (one per cycle). Latency doesn't matter much. |
Regarding floating-point, I don't buy the argument about NaNs given in https://medium.com/@ifdefelse/the-cost-of-asic-design-a44f9a065b72 - it would apply for arbitrary mix of operations including floating-point ones, but we can instead limit ourselves to FP32 MADs and we can occasionally (such as once per a full sequence of tens of Math/Merge) limit the range (the exponents) just enough for special values to never occur. I think use of FP is worthy of research and experimentation, and is wrong to dismiss just yet. |
As I said in #16 I'm pretty sure such an operation doesn't exist across AMD and Nvidia. We easily target Turing's single-cycle IMAD instruction, but that would hurt Polaris GPUs which are already compute limited. If you have a suggested change that includes perf numbers we'd be happy to consider it.
The random values in the cache will naturally include some NaNs. This means that either the floating point instructions sequence can't interact with the cache loads, significantly limiting the point of the cache loads, or every cache load will need to be checked if it's a NaN and converted. Even if a range check is done only once per loop it needs to be done per register - that's a lot of additional instructions. An ASIC could trivially implement the clamp as an input modifier to their FP ALU, which would be significantly higher performance than the equivalent GPU sequence. |
Thank you for the comments, @ifdefelse. I think we're on the same page. Yes, it's tough to target multiple GPU architectures with integer MADs - they might very well need to be of different size for maximum throughput. To me, that's one more reason to consider FP32. The published FP32 GFLOPS figures for modern GPUs are universally the number of 32-bit "cores" times 2 times the clock rate, suggesting these instructions have a throughput of one per cycle across all modern GPUs. And IEEE gives us more similarity between different GPUs than we have for integer multiplication. Yes, I also already arrived at the "cache issue" - if we want to stay within valid and stable floating-point, then can't efficiently use content of the existing cache in those computations. I have two ideas on how to tackle this issue:
I think I like the second idea much better. (And we can then approach possibly re-adding some INT32 processing to the mix as a separate task.) Yes, I also realized that "one per loop" is actually costly since we use many registers. We'd need to look into a way to make this "range check" less frequent. One way to achieve that might be through tracking worst-case minimum and maximum exponent per register during our code generation, and adjusting the current set of random numbers until the generated instruction is such that it'd fit the constraints on never overflowing until our planned next "range check" (a do-while loop per instruction generated). The runtime "range check" should not actually be a check - not in the form of an "if this do that" - but rather some kind of masking (easy at low-level, not so great at CUDA and OpenCL level where we'd have to use a I see no show-stoppers there - it's just significant effort to design this well, then test, tweak, then rinse and repeat. |
Another issue with a chain of FP operations is that FADD with random inputs will normally do nothing. For an FP32 operation the exponent is 8 bits so can have 256 values. When calculating FADD(A, B) if the exponent of B is more than +/-24 from the exponent of A then the smaller value gets squashed to 0 and the result is just MAX(A, B). Actual addition will only happen 48/256 or <20% of the time. We handle this in ProgPoW by structuring the merge() functions to produce good, random values in registers even if the input to the merge has low entropy (is all 1s, all 0s, or just a pass-through of one input). Creating an equivalent merge() function in FP I'm not sure is possible. A sequence of FP ops can also easily accumulate a bias, causing certain elements of the DAG to be accessed more often than others. |
@ifdefelse All good points, thanks! I think we're effectively starting to compile a list of constraints or a checklist for a good design using FP, which will be far from trivial to come up with, yet maybe practical. |
Likewise unconstrained random FMUL(A, B) has a ~25% chance of over/under flowing. If exp_a is 127 then any exp >= 1 will cause overflow - 127/256 I might be off-by-1 in this, but you get the idea. The average comes out to about 25% of random FMULs will over/underflow. This is why you can't clamp once per loop. If you clamp exponents to [63, -63] it just takes 2 sequential MULs to get over/underflow. You could do something like clamp the exponent to an extremely short range, like [7, -7]. However that would drastically simplify a hardware floating point unit, providing a significant ASIC optimization opportunity. |
Right. So should have few sequential MULs (a code generator constraint) and/or clamp to short range of initial exponent, which I think isn't that bad if our alternative is building upon at most 24-bit integer MULs of same throughput. A more advanced code generator tracking worst cases might end up e.g. interleaving MULs by registers that it knows were initialized with positive vs. negative exponents, to allow for slightly longer sequences. Tricky stuff indeed. |
So I experimented with FP in a ProgPoW hack a little bit. I am currently leaning towards the following design aspects if we actually switch to FP:
I'd appreciate comments. |
JFYI: Staying with integers but maximizing the ratio of arbitrary 32-bit MULs, I am able to get up to ~630G MUL/s on the Vega 64 while also doing the cache lookups, having merge() maintain entropy of its destination, and maintaining adequate memory bandwidth usage (slightly lower than default ProgPoW's, though). This is 7x+ higher than ProgPoW 0.9.2's defaults, but 9x+ lower than the theoretical maximum throughput for FP32 MULs. |
In #26, I obtained/confirmed some benchmark results for ProgPoW 0.9.2, including e.g. 22.7M at block 7M on Vega 64 actually running at 1401 MHz when running that benchmark.
Simulating this version with c-progpow, I get 36864 Merge and 20480 Math operations per hash computed. This gives
22.7M*(20480+36864) = ~1302G
of those operations per second on Vega 64. While this is a sane number considering that many of those operations correspond to multiple instructions, let's recall that a Vega 64 at 1401 MHz is capable of computing4096*1401 = ~5738G
operations per cycle (or virtually ~11477G FP32 FLOPS due to how those are traditionally counted with MUL and ADD halves of a MAD as separate FLOPS).Besides register file and local memory lookups, MULs or MADs are pretty much the only other operations we have on GPUs that consume relatively non-trivial resources in custom hardware. As once again confirmed in a recent comment in #24, all other operations of ProgPoW add relatively little.
I think ProgPoW would be greatly improved by cutting down on its limited programmability and instead making greater use of MADs. Even if we count a MAD as just one operation (unlike it's historically done for FLOPS), we have up to a 4.4x potential improvement here (as ~1302G to ~5738G).
Further, Math only performs MULs 2/11 of the time, and Merge only performs specialized multiplication by 33 (equivalent to shifting the number left by 5 bits and adding that to itself). With this, ProgPoW currently performs only
22.7M*20480*2/11 = ~84.5G
arbitrary MULs per second. The potential improvement in terms of arbitrary MUL/s is thus up to 68x (~84.5G to ~5738G).This is much more important than the limited programmability that ProgPoW now has. That said, some programmability can be retained (thus justifying the name) - it will remain in routing of different registers as inputs and outputs of each MAD. Such routing, and not the choice of instruction (which is just a MUX or such), is the relatively heavier component.
I recognize that we might not be able to run a lengthy stream consisting exclusively of MADs without many intermediate values degrading to zeroes, etc. too often (although frequent register file and local memory lookups should help here - we have lots of internal state, so it won't degrade fully too quickly). So the actual improvement we can achieve will be somewhat less than those theoretical maximums I listed. But we can probably get close to them, certainly much closer than we're now.
We might also want to revisit use of FP32 MADs in addition to integer ones. (IEEE might provide us sufficient guarantees.) This should double the throughput on NVIDIA Turing and Volta, but then we'd need to decide whether we'd make use of that throughput (maybe not fully?) not to slow down other GPUs too much. On NVIDIA's diagrams, the FP32 units are shown as occupying 2x the area of the INT32 ones; if that corresponds to actual die area, that's yet another reason to use them.
Edit: changed "M" (was an error) to "G" in op/s figures.
The text was updated successfully, but these errors were encountered: