Add rough draft of "good parallel computer" post #104
base: main
Conversation
This is a *very* rough draft and will be substantially rewritten. But I'm sharing it in this form because I'm not sure when I'll finish it, and there might be some useful discussion.
Hope this helps!
Much of my research over the past few years has been 2D vector graphics rendering on GPUs. That work goes well, but I am running into the limitations of GPU hardware and programming interfaces, and am starting to see hints that a much better parallel computer may be possible. At the same time, I see some challenges regarding actually getting there. This essay will explore both in depth.
I should qualify: the workload I care about is unusual in a number of respects. Most game workloads involve rasterization of a huge number of triangles, and most AI workloads involve multiplication of large matrices, both very conceptually simple operations. By contrast, 2D rendering has a lot of intricate, conditional logic, and is very compute intensive compared with the raw memory bandwidth needed. Compute shaders on modern GPUs can handle the conditional logic quite well, but lack *agility,* which to me means the ability to make fine-grained scheduling decisions. I believe agility
FYI unfinished sentence.
YYY gpu vs agility. Does this kind of code also hamper parallelism?
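To make "agility" concrete, here is a minimal sketch of the kind of fine-grained scheduling decision I mean, written as ordinary CPU code. The types, the flatness test, and the numbers are invented for illustration and are not Vello's actual pipeline; the point is that each work item can decide, based on its own data, to generate more work immediately, whereas a GPU compute pipeline typically has to express the same pattern as separate dispatches with barriers in between.

```rust
use std::collections::VecDeque;

// Hypothetical work item: a line segment that may need further subdivision.
struct Segment {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
    depth: u32,
}

// Stand-in for a real flatness / size test (hypothetical).
fn needs_subdivision(s: &Segment) -> bool {
    let len = ((s.x1 - s.x0).powi(2) + (s.y1 - s.y0).powi(2)).sqrt();
    len > 1.0 && s.depth < 8
}

fn main() {
    let mut queue = VecDeque::new();
    queue.push_back(Segment { x0: 0.0, y0: 0.0, x1: 10.0, y1: 3.0, depth: 0 });

    let mut emitted = 0u32;
    while let Some(seg) = queue.pop_front() {
        if needs_subdivision(&seg) {
            // Fine-grained scheduling decision: generate more work right now,
            // only for this segment, with no global barrier or extra dispatch.
            let mx = 0.5 * (seg.x0 + seg.x1);
            let my = 0.5 * (seg.y0 + seg.y1);
            queue.push_back(Segment { x0: seg.x0, y0: seg.y0, x1: mx, y1: my, depth: seg.depth + 1 });
            queue.push_back(Segment { x0: mx, y0: my, x1: seg.x1, y1: seg.y1, depth: seg.depth + 1 });
        } else {
            emitted += 1;
        }
    }
    println!("emitted {emitted} flat segments");
}
```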
The complexity of the GPU ecosystem has many downstream effects. Drivers and shader compilers are buggy and [insecure], and there is probably no path to really fixing that. Core APIs tend to be very limited in functionality and performance, so there's a dazzling array of extensions that need to be detected at runtime, and the most appropriate permutation selected. This in turn makes it far more likely to run into bugs that appear only with specific combinations of features, or on particular hardware.
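As a sketch of what that feature dance looks like in practice: the capability names and shader variants below are invented for illustration (a real renderer queries dozens of optional features through Vulkan, Metal, D3D12, or wgpu), but the shape of the problem is the same, and every additional flag multiplies the number of configurations that can hide bugs.

```rust
// Hypothetical capability set detected at startup.
#[derive(Clone, Copy, Debug)]
struct GpuCaps {
    subgroups: bool,
    f16: bool,
    bindless: bool,
}

// Hypothetical precompiled shader permutations for one pipeline stage.
#[derive(Debug)]
enum FinePermutation {
    SubgroupF16Bindless, // fastest path, needs all three capabilities
    SubgroupOnly,
    Baseline,            // portable fallback that works everywhere
}

fn select_fine_shader(caps: GpuCaps) -> FinePermutation {
    match (caps.subgroups, caps.f16, caps.bindless) {
        (true, true, true) => FinePermutation::SubgroupF16Bindless,
        (true, _, _) => FinePermutation::SubgroupOnly,
        _ => FinePermutation::Baseline,
    }
}

fn main() {
    // Three of the possible capability combinations; each selects a different binary.
    for caps in [
        GpuCaps { subgroups: true, f16: true, bindless: true },
        GpuCaps { subgroups: true, f16: false, bindless: false },
        GpuCaps { subgroups: false, f16: true, bindless: true },
    ] {
        println!("{caps:?} -> {:?}", select_fine_shader(caps));
    }
}
```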
All this is in fairly stark contrast to the CPU world. A modern CPU is also dazzlingly complex, with billions of transistors, but it is rooted in a much simpler computational model. From a programmer perspective, coding for an Apple M3 isn't that different from, say, a Cortex M0, which can be made with about 48,000 transistors. Similarly, a low-performance RISC-V implementation is a reasonable student project. Obviously the M3 is doing a lot more with branch prediction, superscalar issue, memory hierarchies, op fusion, and other performance tricks, but it's recognizably doing the same thing as a vastly simpler chip.
Specialized coprocessors are typically less generic, precisely because they serve to optimize special cases. It's kind of a tautology? Consider the behavior and interface of other ASICs...
## Big grid of RISC-V
There are many, many AI accelerators in the pipeline – see the [New Silicon for Supercomputers] talk for a great survey. One approach (definitely the one taken by the original [Google TPU]) is to sacrifice agility and make hardware that's specialized just for doing big matrix multiplications and essentially nothing else. Another approach, suitable for the low end, is a fairly vanilla VLIW microprocessor with big vector units, an architecture actually quite similar to existing DSPs. That is the approach taken by the [Qualcomm Hexagon]. Neither of these is suitable for running a workload like Vello.
Sp: accelerators
In the past, there were economic pressures towards replacing special-purpose circuitry with general purpose compute performance, but those incentives are shifting. Basically, if you're optimizing for number of transistors, then somewhat less efficient general purpose compute can be kept busy almost all the time, while special purpose hardware is only justified if there is high enough utilization in the workload. However, as Dennard scaling has ended and we're more constrained by power than transistor count, special purpose hardware starts winning more; it can simply be powered down if it isn't used by the workload. The days of a purely RISC computational model are probably over. What I'd *like* to see replacing it is an agile core (likely RISC-V) serving as the control function for a bunch of special-purpose accelerator extensions. That certainly is the model of the [Vortex] project among others.
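A back-of-envelope illustration of that shift, with entirely hypothetical numbers:

```rust
fn main() {
    // All numbers below are hypothetical, purely to illustrate the two regimes.
    let f: f64 = 0.05;           // fraction of the workload the special-purpose block handles
    let eff: f64 = 20.0;         // its energy (and speed) advantage on that fraction
    let generic_alt: f64 = 1.15; // throughput gain if the same area bought more generic compute

    // Transistor-constrained regime: compare throughput for the same area.
    // Amdahl-style speedup from the special-purpose block:
    let accel_speedup = 1.0 / ((1.0 - f) + f / eff);
    println!(
        "area-constrained: block gives {accel_speedup:.2}x, extra generic compute gives {generic_alt:.2}x"
    );

    // Power-constrained regime: the block is power-gated when idle, so its
    // area is nearly free and any energy it saves on its fraction is a win.
    let energy = (1.0 - f) + f / eff;
    println!(
        "power-constrained: {:.1}% less energy per unit of work, ~zero cost when idle",
        (1.0 - energy) * 100.0
    );
}
```

With these made-up numbers, spending the area on more general purpose compute wins when transistors are the constraint, but the special-purpose block still wins when power is the constraint, because it costs nearly nothing while idle.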
As a counterpoint, in the world of Big Data, we kinda have this. You write kernel-like functions, upload them to your favorite parallel system (e.g. Snowflake) and run them on zillions of cores and I/O channels. It works because the logic is kept pretty simple, at least in the inner loop, and the universal goal is to radically reduce the size of the dataset as early as possible in the pipeline. Query optimizers hide complexity on one end, and application logic on the other.
"Agility" is compromised but that's taken as fundamental to parallel computing.
### Esperanto
Another approach is [Esperanto], which is about 1000 efficiency RISC-V cores on a chip. The company was founded by Dave Ditzel, previously of Transmeta. The linked paper goes into a fair amount of detail and quantitative measurement. Not surprisingly, it focuses on the AI acceleration use case, but it also appears suitable for HPC workloads. Because it is at heart many CPUs, each running a program independently, it promises great agility. Unfortunately, there's no indication their software stack is open source, so it's hard for me to find out more.
Typo: it is at heard
Each running a program independently => the same program? Same lifetime? IPC? How is this more agile?
Also, from the paper: "Four of these 8-core neighborhoods are put together along with 4 MB of memory to form a 32-core “Minion Shire.” The memory is implemented as four 1 MB SRAM banks connected to the neighborhoods through a 512-bit crossbar switch. These SRAM banks operate near the process-nominal supply voltage to allow higher density than the smaller caches within each core. Each bank can be partitioned by software to provide a mix of scratchpad memory, L2 cache private to the Shire, or L3 cache globally accessible across the entire chip via a global shared address space" => I have questions!! :-)