-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathopcode speeds
103 lines (82 loc) · 2.77 KB
/
opcode speeds
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
i-cache appears to always be on?
nop takes 1 cycle
v32ld HY(0++,0),(r1+=r2) REPx, 11 cycle startup (for L1 hit), plus 2*x, given that (r2%64)==0
vector loads of uint32_t[16][64]
VPU at 216mhz:
from 0-alias and 4-alias, 139(128+11) clocks
from 8-alias, 307 clocks
from c-alias, 300 clocks
VPU at 432mhz:
from 0-alias and 4-alias, 139 clocks
from 8-alias, 600 clocks
from c-alias, 593 clocks
from the 0 alias, repeating to garantee a hit
v16ld HX(0++,0),(r1+=r2) REP64 == 165-170 clocks?
v16ld HX(0++,0),(r1+=r2) REP32 == 85-90 clocks?
v16ld HX(0++,0),(r1+=r2) REP16 == 45-50 clocks?
v16ld HX(0++,0),(r1+=r2) REP8 == 25-30 clocks?
v16ld HX(0++,0),(r1+=r2) REP4 == 15-20 clocks?
v16ld HX(0++,0),(r1+=r2) REP2 == 10-15 clocks?
v16ld HX(0++,0),(r1+=r2) == 7-12 clocks?
nop wrap removed, still 0 alias, counts for whole loop
v16ld HX(0++,0),(r1+=r2) == 12
v16ld HX(0++,0),(r1+=r2) REP64 == 170
v16ld HY(0++,0),(r1+=r2) REP64 == 170
v32ld HY(0++,0),(r1+=r2) REP64 == 171
adjusting overlap of each slice of v32ld HY(0++,0),(r1+=r2) REP64
stride 0 == 139 clocks
1-6 = 195
32 = 202
36 = 210
64 = 139
68 = 243
proper 64 byte stride, v32ld HY(0++,0),(r1+=r2) REP64, but adjusting the alignment of the whole set
0-1 = 139
sub r0, bne r0 == 3 cycles
sub r5, sub r0, bne r0 == 3 cycles, pipeline gave a free cycle
sub r5, sub r5, sub r0, bne r0 == 5 cycles, pipeline stalled out due to read after write
sub r5, sub r6, sub r0, bne r0 == 4 cycles, pipeline stalled less
sub r5, sub r6, sub r7, sub r5, sub r0, bne r0, 6 opcodes in 5 clocks
2021-04-02 03:22:05 < doug16k> pair, pair, single, 2 cycle branch would explain it
sub r5, sub r6, sub r7, sub r5, sub r6, sub r0, bne r0, 7 opcodes in 5 clocks
2021-04-02 03:23:25 < clever> i think your right, i added another sub, and it stayed the same
2021-04-02 03:24:20 < doug16k> it's similar to Pentium pairing
sub r0, bne r0 == 3 cycles
nop, sub r0, bne r0 == 4
nop, nop, sub r0, bne r0 == 5 cycles
nop, v16ld HX(0++,0),(r1+=r2) REP64, nop, sub r0, bne r0 == 170 cycles
nop, v16ld HX(0++,0),(r1+=r2) REP32, nop, sub r0, bne r0 == 90 cycles
stm r0-lr,(--sp)
stm r0-pc,lr,(--sp)
stm r0-sp,lr,(--sp)
stm gp,lr,(--sp)
stm gp-r28,lr,(--sp)
stm gp-r5,lr,(--sp)
1 clock cycle:
shl r0, r1, 0x3
lsr r0, 0x3
add r0, r1
add r0, r1, 42
subscale r0,r0,r1<<1
nop
v32mov HY(0,0),0x0
v32add HY(0,0),HY(1,0),r3
v32add HY(0,0),HY(0,0),42
v32dist HY(0,0),HY(0,0),HY(0,0)
v32shl HY(0,0),HY(0,0),r3
v32shl HY(0,0),HY(0,0),42
v32min HY(0,0),HY(0,0),HY(0,0)
2 clock cycles:
mulhd.uu
vmul32.uu HY(0,0),HY(0,0),HY(0,0)
3 clock cycles:
repeated loads from the 0 alias
4 clock cycles:
v32mov HY(0++,0),0 REP4
10 clock cycles:
repeated loads from the 4 and 8 alias
13 clock cycles:
a load from GPIO_GPLEV0
21 clock cycles:
div.uu
div.ss