The article "From slow to SIMD: A Go optimization story" on Sourcegraph's blog walks through how Camden Cheek optimized a hot Go function using SIMD. It retraces the performance work step by step: it begins by identifying a slow but crucial piece of code, the dot product at the heart of a cosine-similarity calculation, then applies loop unrolling, bounds-checking elimination, quantization, and finally SIMD itself. At each step the article links to resources for further reading.
This is a detailed, content-rich article on performance optimization, focused on speeding up a deceptively simple dot product function.
I won't dwell on the definition of the dot product; anyone with a basic grasp of Go can read the function in question. At first glance it is hard to imagine what optimization could possibly be squeezed out of seven lines of multiplications and additions.
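The function itself did not survive into this note (it was likely an image on the original page), so here is a minimal sketch of the naive dot product as the article describes it; treat the details as a reconstruction rather than the article's exact code:

func DotNaive(a, b []float32) float32 {
    sum := float32(0)
    // Multiply element-wise and accumulate; the && guard keeps
    // both index expressions in bounds.
    for i := 0; i < len(a) && i < len(b); i++ {
        sum += a[i] * b[i]
    }
    return sum
}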
Things start to get interesting with the first section, "Loop unrolling". The author rewrites the loop so that several high-latency multiply instructions can execute concurrently (for background, see Latency vs. Throughput), and at the same time amortizes the fixed per-iteration costs (increment and comparison) over multiple operations. Benchmarking showed that unrolling by 4 was optimal on the author's hardware. This change alone raised throughput by 37%.
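The rewritten loop is not reproduced in this note; a minimal sketch of 4-way unrolling in the spirit of the article's DotUnroll4 (assuming the length is a multiple of 4; the bounds checks the compiler adds to this are shown in the next section):

func DotUnroll4(a, b []float32) float32 {
    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        // Four independent multiplies per iteration: the CPU can
        // overlap their latencies instead of waiting on each one.
        s0 := a[i] * b[i]
        s1 := a[i+1] * b[i+1]
        s2 := a[i+2] * b[i+2]
        s3 := a[i+3] * b[i+3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}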
The next section, "Bounds-checking elimination", moves into territory less familiar to me.
In order to keep out-of-bounds slice accesses from becoming a security vulnerability (like the famous Heartbleed exploit), the Go compiler inserts checks before each read. You can see them in the generated assembly (look for runtime.panic).
The compiled code makes it look like we wrote something like this:
func DotUnroll4(a, b []float32) float32 {
    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        if i >= cap(b) {
            panic("out of bounds")
        }
        s0 := a[i] * b[i]
        if i+1 >= cap(a) || i+1 >= cap(b) {
            panic("out of bounds")
        }
        s1 := a[i+1] * b[i+1]
        if i+2 >= cap(a) || i+2 >= cap(b) {
            panic("out of bounds")
        }
        s2 := a[i+2] * b[i+2]
        if i+3 >= cap(a) || i+3 >= cap(b) {
            panic("out of bounds")
        }
        s3 := a[i+3] * b[i+3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}
To pare these checks down, the author asserts up front that the slices have equal lengths and that the length is a multiple of four, then re-slices each 4-element chunk so the compiler can prove every access is in bounds. This effectively moves the bounds checking to the start of the loop.
func DotBCE(a, b []float32) float32 {
    if len(a) != len(b) {
        panic("slices must have equal lengths")
    }
    if len(a)%4 != 0 {
        panic("slice length must be multiple of 4")
    }
    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        aTmp := a[i : i+4 : i+4]
        bTmp := b[i : i+4 : i+4]
        s0 := aTmp[0] * bTmp[0]
        s1 := aTmp[1] * bTmp[1]
        s2 := aTmp[2] * bTmp[2]
        s3 := aTmp[3] * bTmp[3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}
Nice: minimizing the bounds checks nets a 9% improvement.
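As an aside, you don't have to scan the generated assembly for runtime.panic calls by hand; the gc compiler can report surviving bounds checks directly. A standard toolchain debug flag (not something from the article) does it:

go build -gcflags="-d=ssa/check_bce" ./...

Each reported line ("Found IsInBounds", "Found IsSliceInBounds") marks a check the compiler could not eliminate, so DotBCE should report far fewer than DotUnroll4.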
The third section, "Quantization", initially left me somewhat perplexed. The focus here is shrinking the vectors by converting them ahead of time from float32 to int8 (for details, see: integer quantization).
This change yields a 4x reduction in memory usage at the cost of some accuracy.
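The note doesn't show the quantized kernel, so here is a minimal sketch under the article's premise that vectors are converted to int8 ahead of time; the quantize helper is my illustration, not the article's exact scheme:

// DotInt8 is the same dot product over pre-quantized vectors.
// Accumulating in int32 keeps the int8 products from overflowing.
func DotInt8(a, b []int8) int32 {
    sum := int32(0)
    for i := range a {
        sum += int32(a[i]) * int32(b[i])
    }
    return sum
}

// quantize maps a float32 assumed to lie in [-1, 1] onto the int8
// range. Real schemes also track a scale factor so magnitudes can
// be recovered later.
func quantize(x float32) int8 {
    return int8(x * 127)
}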
In the fourth section, "SIMD", the author rewrites the hot loop with AVX2 instructions while keeping the code portable. The result is a whopping 530% increase in throughput! Switching to the more capable VPDPBUSD instruction then buys a further 21%.
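The hand-written assembly is beyond the scope of this note, but the portability mechanism is worth sketching: the SIMD kernel lives in a .s file behind a plain Go declaration, and the program falls back to pure Go when the CPU lacks support. A hedged sketch of that dispatch pattern (dotAVX2 is a hypothetical stub name; the feature flags come from golang.org/x/sys/cpu):

import "golang.org/x/sys/cpu"

// dotAVX2 would be implemented in an accompanying .s file
// using AVX2 instructions.
//
//go:noescape
func dotAVX2(a, b []int8) int32

// Dot picks the fastest implementation the CPU supports.
func Dot(a, b []int8) int32 {
    if cpu.X86.HasAVX2 {
        return dotAVX2(a, b)
    }
    return DotInt8(a, b) // portable pure-Go fallback
}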
Ultimately the results are striking: a 9.3x increase in throughput and a 4x reduction in memory usage. The Chinese idiom "精益求精", roughly "the relentless pursuit of excellence", fits this story well. I learned a great deal from this article and came away with plenty of inspiration and fresh perspectives. Many thanks for such a generous piece of knowledge-sharing.