-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathicestorm.html
79 lines (54 loc) · 4.17 KB
/
icestorm.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
<html><head><meta charset="UTF-8"><title>Icestorm Overview</title><style>body { background-color: #E7F2F8 }</style></head><body><pre><strong>Apple Microarchitecture Research</strong> by <a href="https://twitter.com/dougallj">Dougall Johnson</a>
M1/A14 P-core (Firestorm): <a href="firestorm.html">Overview</a> | <a href="firestorm-int.html">Base Instructions</a> | <a href="firestorm-simd.html">SIMD and FP Instructions</a>
M1/A14 E-core (Icestorm): <a href="icestorm.html">Overview</a> | <a href="icestorm-int.html">Base Instructions</a> | <a href="icestorm-simd.html">SIMD and FP Instructions</a>
</pre><p>This is an attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of <a href="https://uops.info/">Andreas Abel</a>,
<a href="https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2">Andrei Frumusanu</a>, <a href="https://github.com/Veedrac/microarchitecturometer">@Veedrac</a>,
<a href="https://github.com/travisdowns/robsize">Travis Downs</a>, <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">Henry Wong</a>,
<a href="https://agner.org/optimize/">Agner Fog</a> and <a href="https://twitter.com/handleym99/status/1437537535018684417">Maynard Handley</a>.
This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes.
No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors,
entries in the instruction tables link to the experiments and results (~35k tables of counter values).</p>
<p>Icestorm is the high-efficiency microarchitecture used by the four E-cores in the M1. Low-power ARM cores are generally a bit less novel, so the notes here are a bit less thorough.</p>
<h2>Icestorm Units (ports)</h2>
<p>These are refered to as "units", to try to avoid confusion if Apple releases official documentation, as they
probably refer to them as "ports" or "pipes", and order them differently. (If this just causes more confusion,
I apologise.)</p>
<pre>
Integer units:
1: alu + br + mrs
2: alu + br + div + ptrauth
3: alu + mul + bfm + crc
Load and store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):
4: load/store + amx
5: load
FP/SIMD units:
6: fp/simd
7: fp/simd + fdiv + to-int + div + recp + sqrt + sha + jcvtzs
</pre>
<h2>Elimination</h2>
<p>Icestorm eliminates <code>movz</code> + <code>movk</code> (pair only, but any shift on both) and <code>adr</code>/<code>adrp</code> as well as the Firestorm things.</p>
<h2>Instruction Fusion</h2>
<p>Mostly the same as Firestorm. Icestorm also has <code>movz</code> + <code>movk</code> elimination, but still
<strong>not</strong> <code>adrp</code> + <code>add</code> fusion (although it is one uop on account of <code>adr</code>/<code>adrp</code> elimination).</p>
<h2>Complex Latencies</h2>
<p>Several instructions have latencies that aren't adequately described in the instruction tables:</p>
<ul>
<li>MADD's output can be passed to its third operand (the addend) with 1c latency, but if it's chained with other instructions it has 3c latency.</li>
<li>Loads may be passed to the base address of other loads with 3c latency (which is nice for linked lists), but chaining with ALU operations gives a latency of 4c. (Although the second destination register in an LDP always has 4c latency.)</li>
</ul>
<h2>Other limits</h2>
<ul>
<li>Retires (uops that retire) per cycle: 4</li>
<li>Coalesced retire queue size: ~60</li>
<li>In-flight renames: ~111</li>
<li>Integer physical register file size: ~79</li>
<li>FP/SIMD physical register file size: ~87</li>
<li>Load buffers: 30</li>
<li>Store buffers: 18</li>
</ul>
<p>These numbers mostly come from my <a href="https://gist.github.com/dougallj/5bafb113492047c865c0c8cfbc930155">M1 buffer size measuring tool</a>.
The M1 seems to use something other than an entirely conventional reorder buffer, which
complicates measurements a bit. So these may or may not be accurate.
(This paragraph previously said "it seems to use something along the lines of a validation buffer".
I think the VB hypothesis has since been disproven.)</p>
</body></html>