# Go performance tricks: How I Learned to Stop Worrying and Love Go in production Pavel Tišnovský https://github.com/RedHatOfficial/GoCourse ## Simple but useful performance improvements in Go * Go compiler is pretty straightforward (= dumb) * It needs some help from developers ## Some interesting areas backed by experience 1. Pass structures by reference, not by value 1. Pass receivers by reference, not by value 1. Maps pre-allocation 1. Arrays vs slices * Most things mentioned above are not idiomatic! ## Pass structures by reference, not by value * All data types are passed by value into functions and methods - that's good from "state-space" perspective - not so good from performance point of view * Java/Python have different semantic! * And no nice solution like `const` and `mut` exists in Go ## Problem definition * Large structures passed by value everywhere in the code - even within inner loops * It causes overhead that is super easy to avoid - but at cost: wrong semantic is used ## Real example ```toml [logging] debug = true log_level = "info" [kafka_broker] enabled = true address = "kafka:29092" #provide in deployment env or as secret security_protocol = "PLAINTEXT" cert_path = "not-set" sasl_mechanism = "PLAIN" sasl_username = "not-used" sasl_password = "not-used" topic = "platform.notifications.ingress" #provide in deployment env or as secret timeout = "60s" likelihood_threshold = 0 impact_threshold = 0 severity_threshold = 0 total_risk_threshold = 3 event_filter = "totalRisk >= totalRiskThreshold" [service_log] client_id = "a-service-id" client_secret = "a-secret" token_url = "" enabled = false url = "https://api.openshift.com/api/service_logs/v1/cluster_logs/" timeout = "15s" likelihood_threshold = 0 impact_threshold = 0 severity_threshold = 0 total_risk_threshold = 0 event_filter = "totalRisk >= totalRiskThreshold" rule_details_uri = "https://console.redhat.com/openshift/insights/advisor/recommendations/{module}|{error_key}" [storage] db_driver = "postgres" pg_username = "postgres" #provide in deployment env or as secret pg_password = "postgres" #provide in deployment env or as secret pg_host = "localhost" #provide in deployment env or as secret pg_port = 5432 #provide in deployment env or as secret pg_db_name = "notification" #provide in deployment env or as secret pg_params = "sslmode=disable" log_sql_queries = true [dependencies] content_server = "localhost:8082" #provide in deployment env or as secret content_endpoint = "/api/v1/content" #provide in deployment env or as secret template_renderer_server = "localhost:8083" #provide in deployment env or as secret template_renderer_endpoint = "/v1/rendered_reports" #provide in deployment env or as secret [notifications] insights_advisor_url = "https://console.redhat.com/openshift/insights/advisor/clusters/{cluster_id}" cluster_details_uri = "https://console.redhat.com/openshift/details/{cluster_id}#insights" rule_details_uri = "https://console.redhat.com/openshift/details/{cluster_id}/insights/{module}/{error_key}" # valid units are SQL epoch time units: months days hours minutes seconds" cooldown = "24 hours" [metrics] job_name = "ccx_notification_service" # The metrics in Prometheus will be $namespace_$subsystem_$name namespace = "ccx_notification_service" subsystem = "to_notification_backend" gateway_url = "localhost:9091" gateway_auth_token = "" retries = 3 # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". retry_after = "60s" [cleaner] # valid units are SQL epoch time units: months days hours minutes seconds" max_age = "90 days" [processing] filter_allowed_clusters = false allowed_clusters = [] filter_blocked_clusters = false blocked_clusters = [] ``` ## Representation of such structure in Go .code optimizations/config_struct_1.go ## Sub-structure(s) .code optimizations/config_struct_2.go .code optimizations/config_struct_3.go ## 'Getter' used in the code ```go func GetStorageConfigurationByValue(configuration ConfigStruct) StorageConfiguration { return configuration.Storage } ``` * Pros - pretty idiomatic - basically say the struct is immutable on 1st level * Cons - structure is passed by value - structure size is about 700 bytes! ## Dirty solution * the only solution possible in Go - different semantic: - passing by reference -> struct can be changed within * unfortunately we can't distinguish between `const` and `mut` - mutation (or no mutation) has to be checked by unit tests ```go func GetStorageConfigurationByReference(configuration *ConfigStruct) StorageConfiguration { return configuration.Storage } ``` ## Is it worth to do it? .code optimizations/benchmark_1_1.go ## Helper function to load configuration .code optimizations/benchmark_1_2.go ## Helper function to load configuration .code optimizations/benchmark_1_3.go ## Actual benchmarks .code optimizations/benchmark_1_4.go ## Benchmark results ``` $ go test -bench=. -benchtime=1000000000x -cpuprofile profile.out -v config-struct/conf goos: linux goarch: amd64 pkg: config-struct/conf cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz BenchmarkGetStorageConfigurationFunctionByValue BenchmarkGetStorageConfigurationFunctionByValue-8 1000000000 13.20 ns/op BenchmarkGetStorageConfigurationFunctionByReference BenchmarkGetStorageConfigurationFunctionByReference-8 1000000000 0.2405 ns/op PASS ok config-struct/conf 27.166s ``` ## Benchmark results * 13.20 ns/op vs 0.2405 ns/op * That is 5488% speed increase :) - memcpy operation elimination ## Graphical output ![Benchmark1](images/benchmark1.png) ## Table output ![Benchmark1](images/benchmark2.png) ## Is it still relevant? * 0.2405 ns vs. 13.20 ns from CPU point of view vs. human point of view ![CPU](images/computer_latency.png) ## Memory is getting "slower" ![CPU2](images/cpu.png) ## What about methods? * The same issue + simple fix ```go func (configuration ConfigStruct) GetStorageConfigurationByValue() StorageConfiguration { return configuration.Storage } func (configuration *ConfigStruct) GetStorageConfigurationByReference() StorageConfiguration { return configuration.Storage } ``` ## Benchmark for methods ```go func BenchmarkGetStorageConfigurationMethodByValue(b *testing.B) { b.StopTimer() configuration := mustLoadBenchmarkConfiguration(b) b.StartTimer() for i := 0; i < b.N; i++ { configuration.GetStorageConfigurationByValue() } } func BenchmarkGetStorageConfigurationMethodByReference(b *testing.B) { b.StopTimer() configuration := mustLoadBenchmarkConfiguration(b) b.StartTimer() for i := 0; i < b.N; i++ { configuration.GetStorageConfigurationByReference() } } ``` ## Run the benchmark ``` $ go test -bench=. -benchtime=1000000000x -cpuprofile profile.out -v config-struct/conf goos: linux goarch: amd64 pkg: config-struct/conf cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz BenchmarkGetStorageConfigurationFunctionByValue BenchmarkGetStorageConfigurationFunctionByValue-8 1000000000 13.20 ns/op BenchmarkGetStorageConfigurationFunctionByReference BenchmarkGetStorageConfigurationFunctionByReference-8 1000000000 0.2405 ns/op BenchmarkGetStorageConfigurationMethodByValue BenchmarkGetStorageConfigurationMethodByValue-8 1000000000 13.24 ns/op BenchmarkGetStorageConfigurationMethodByReference BenchmarkGetStorageConfigurationMethodByReference-8 1000000000 0.3596 ns/op PASS ok config-struct/conf 27.166s ``` ## Maps pre-allocation * In Go it is possible to specify number of map items during map allocation ```go m1 := make(map[UUID]time.Time) m2 := make(map[UUID]time.Time, b.N) ``` * Not strictly needed, so developers took "Python approach" * Does it make sense to try to estimate number of items? * Who knows? Probably only the benchmark... ## Benchmark ```go type UUID string func BenchmarkInsertIntoPreallocatedMapUUIDKey(b *testing.B) { m := make(map[UUID]time.Time, b.N) t := time.Now() for i := 0; i < b.N; i++ { b.StopTimer() id := UUID(uuid.New().String()) b.StartTimer() m[id] = t } } func BenchmarkInsertIntoEmptyMapUUIDKey(b *testing.B) { m := map[UUID]time.Time{} t := time.Now() for i := 0; i < b.N; i++ { b.StopTimer() id := UUID(uuid.New().String()) b.StartTimer() m[id] = t } } ``` ## Benchmark results ``` goos: linux goarch: amd64 pkg: map-bench cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz BenchmarkInsertIntoEmptyMapUUIDKey-8 1000000 354.7 ns/op BenchmarkInsertIntoPreallocatedMapUUIDKey-8 1000000 163.9 ns/op PASS ``` ## Graphical output ![Profiler1](images/profiler1.png) ## Maps with large keys * Keys can be of any type in Go * So keys might be pretty large (imagine structs) ```go type key struct { ID int payload [100]byte } m1 := make(map[key]value) m2 := make(map[key]value, capacity) ``` ## Benchmark ```go type key struct { ID int payload [100]byte } type value struct{} func BenchmarkInsertIntoPreallocatedMapCompoundKey(b *testing.B) { m := make(map[key]value, b.N) for i := 0; i < b.N; i++ { k := key{ ID: i, } m[k] = value{} } } func BenchmarkInsertIntoEmptyMapCompoundKey(b *testing.B) { m := map[key]value{} for i := 0; i < b.N; i++ { k := key{ ID: i, } m[k] = value{} } } ``` ## Benchmark results for maps with large keys ``` goos: linux goarch: amd64 pkg: map-bench cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz BenchmarkInsertIntoEmptyMapCompoundKey-8 1000000 332.2 ns/op BenchmarkInsertIntoPreallocatedMapCompoundKey-8 1000000 177.7 ns/op PASS ``` ## Memory requirements * For us much more relevant - as pods might be killed due to OOM - and we have to pay for more memory ## Go profiler to the rescue ![Profiler2](images/profiler2.png) ## Go profiler to the rescue ![Profiler3](images/profiler3.png) ## Set implementations * Implementation - first devel instinct: "map" of course - but we tried linear data structure too! ## Set implementations (1/2) ```go type ID int func genID(i int) ID { return ID(3*i + 1) } func fillInMap(b *testing.B, items int) map[ID]struct{} { b.StopTimer() m := make(map[ID]struct{}, items) for i := 0; i < items; i++ { id := genID(i) m[id] = struct{}{} } b.StartTimer() return m } ``` ## Set implementations (2/2) ```go func fillInSlice(b *testing.B, items int) []ID { b.StopTimer() s := make([]ID, items) for i := 0; i < items; i++ { id := genID(i) s[i] = id } b.StartTimer() return s } ``` ## Benchmark for read/in-set operation ```go func performBenchmarkFindInMap(b *testing.B, m map[ID]struct{}) { items := len(m) for i := 0; i < b.N; i++ { _, found := m[genID(i%items)] if !found { b.Fatal("not found") } } } ``` ## Benchmark for read/in-set operation ```go func performBenchmarkFindInSlice(b *testing.B, s []ID) { items := len(s) for i := 0; i < b.N; i++ { found := false id := genID(i % items) for _, p := range s { if p == id { found = true break } } if !found { b.Fatal("not found") } } } ``` ## Benchmark results ``` BenchmarkFindInMap1-8 100000000 12.01 ns/op BenchmarkFindInSlice1-8 100000000 7.208 ns/op BenchmarkFindInMap5-8 100000000 12.61 ns/op BenchmarkFindInSlice5-8 100000000 8.346 ns/op BenchmarkFindInMap10-8 100000000 14.57 ns/op BenchmarkFindInSlice10-8 100000000 9.498 ns/op BenchmarkFindInMap20-8 100000000 14.28 ns/op BenchmarkFindInSlice20-8 100000000 11.61 ns/op BenchmarkFindInMap100-8 100000000 14.63 ns/op BenchmarkFindInSlice100-8 100000000 35.57 ns/op BenchmarkFindInMap1000-8 100000000 22.53 ns/op BenchmarkFindInSlice1000-8 100000000 281.4 ns/op ``` ## Graphical interpretation ## Low number of items ![Set1](images/set1.png) ## Higher number of items ![Set2](images/set2.png) ## Map vs slice for set implementation ![Set3](images/set3.png) ## Use slices instead of arrays (if possible) * C-proselytes and Java-proselytes seems to use arrays a lot * But in fact arrays are not used much in idiomatic Go code * Array in Go - passed by value (COPY!) - not the case in C/Java/Python - semantic is totally different ## Slices in Go * Struct/record with three items - pointer to data (array) - length - capacity * Passed by value - but it is not deep clone ## Benchmark (1/4) ```go func changeMe1(values []int) { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE } func changeMe2(values [MAX_VALS]int) { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE } func changeMe3(values *[MAX_VALS]int) { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE } ``` ## Benchmark (2/4) ```go func BenchmarkPassSlice(b *testing.B) { var values []int = make([]int, MAX_VALS) for i := 0; i < b.N; i++ { changeMe1(values) } if values[0] != FIRST_VALUE { b.Fatal() } if values[MAX_VALS-1] != LAST_VALUE { b.Fatal() } } ``` ## Benchmark (3/4) ```go func BenchmarkPassArrayByValue(b *testing.B) { var values [MAX_VALS]int = [MAX_VALS]int{DEFAULT_VALUE} for i := 0; i < b.N; i++ { changeMe2(values) } if values[0] != DEFAULT_VALUE { b.Fatal() } if values[MAX_VALS-1] != DEFAULT_VALUE { b.Fatal() } } ## Benchmark (4/4) ```go func BenchmarkPassArrayByReference(b *testing.B) { var values [MAX_VALS]int = [MAX_VALS]int{DEFAULT_VALUE} for i := 0; i < b.N; i++ { changeMe3(&values) } if values[0] != FIRST_VALUE { b.Fatal() } if values[MAX_VALS-1] != LAST_VALUE { b.Fatal() } } ``` ## Benchmark results ``` BenchmarkPassSlice-8 100000000 0.4799 ns/op BenchmarkPassArrayByValue-8 100000000 0.2371 ns/op BenchmarkPassArrayByReference-8 100000000 0.4740 ns/op ``` * What? - compiler detected that local changes are not visible outside function - and the whole function was dropped ## Is it really the case? ![Profiler4](images/profiler4.png) ## Now with side effect ```go func changeMe1(values []int) int { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE return values[MAX_VALS/2] } func changeMe2(values [MAX_VALS]int) int { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE return values[MAX_VALS/2] } func changeMe3(values *[MAX_VALS]int) int { values[0] = FIRST_VALUE values[MAX_VALS-1] = LAST_VALUE return values[MAX_VALS/2] } ``` ## Benchmark (1/3) ``` func BenchmarkPassSlice(b *testing.B) { var values []int = make([]int, MAX_VALS) for i := 0; i < b.N; i++ { r := changeMe1(values) if r != DEFAULT_VALUE { b.Fatal() } } if values[0] != FIRST_VALUE { b.Fatal() } if values[MAX_VALS-1] != LAST_VALUE { b.Fatal() } } ``` ## Benchmark (2/3) ``` func BenchmarkPassArrayByValue(b *testing.B) { var values [MAX_VALS]int = [MAX_VALS]int{DEFAULT_VALUE} for i := 0; i < b.N; i++ { r := changeMe2(values) if r != DEFAULT_VALUE { b.Fatal() } } if values[0] != DEFAULT_VALUE { b.Fatal() } if values[MAX_VALS-1] != DEFAULT_VALUE { b.Fatal() } } ``` ## Benchmark (3/3) ``` func BenchmarkPassArrayByReference(b *testing.B) { var values [MAX_VALS]int = [MAX_VALS]int{DEFAULT_VALUE} for i := 0; i < b.N; i++ { r := changeMe3(&values) if r != DEFAULT_VALUE { b.Fatal() } } if values[0] != FIRST_VALUE { b.Fatal() } if values[MAX_VALS-1] != LAST_VALUE { b.Fatal() } } ``` ## Benchmark results ``` BenchmarkPassSlice-8 100000000 0.4768 ns/op BenchmarkPassArrayByValue-8 100000000 135.3 ns/op BenchmarkPassArrayByReference-8 100000000 0.5629 ns/op ``` ## Benchmark results ![benchmark3](images/benchmark3.png) ## for-each implementation (arrays, slices) * idiomatic * more performant ## for-each implementations ```go for _, item := range items { sum += item.value } ``` ```go for j := 0; j < len(items); j++ { sum += items[j].value } ``` ## Benchmark results ``` BenchmarkCountValues1-8 100000 70310 ns/op BenchmarkCountValues2-8 100000 10687 ns/op ``` ## Benchmark results ![benchmark4](images/benchmark4.png) ## for-each implementation (maps) * idiomatic * more performant ## for-each implementations ```go for key, value := range items { ... } ``` ```go for key, _ := range items { ... items[key] ... } ``` ## Benchmark results ``` BenchmarkCountValues1-8 100000 144225 ns/op BenchmarkCountValues2-8 100000 12300 ns/op ``` ## Benchmark results ![benchmark5](images/benchmark5.png) `` ## Synchronization * using channels * using mutexes ## Synchronization by mutexes ```go func (value *valueWithMutex) lock() { value.mutex.Lock() } func (value *valueWithMutex) unlock() { value.mutex.Unlock() } ``` ## Synchronization by using channel ```go func (value *valueWithChannel) lock() { // blocking <-value.channel } func (value *valueWithChannel) unlock() { value.channel <- struct{}{} } ``` ## Benchmarks ``` BenchmarkMutexVariant-8 100000 29044 ns/op BenchmarkChannelVariant-8 100000 49570 ns/op ``` ## String builders * very frequent operation - templates - JSON and XML serializers * usually 'hidden' inside libraries ## Strings are immutable ```go type stringStruct struct { str unsafe.Pointer len int } ``` ## String concatenation is slow ```go func concatstrings(buf *tmpBuf, a []string) string { idx := 0 l := 0 count := 0 for i, x := range a { n := len(x) if n == 0 { continue } if l+n < l { throw("string concatenation too long") } l += n count++ idx = i } if count == 0 { return "" } // If there is just one string and either it is not on the stack // or our result does not escape the calling frame (buf != nil), // then we can return that string directly. if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) { return a[idx] } s, b := rawstringtmp(buf, l) for _, x := range a { copy(b, x) b = b[len(x):] } return s } ``` ## So how to append strings? ```go s := "" for i := 0; i < n; i++ { s += "*" } ``` ## Bytes buffer object ```go var bb bytes.Buffer for i := 0; i < n; i++ { bb.WriteRune('*') } return bb.String() ``` ## String builder object ```go var sb strings.Builder for i := 0; i < n; i++ { sb.WriteRune('*') } return sb.String() ``` ## Memory pre-allocation (buffer) ```go var bb bytes.Buffer bb.Grow(n) for i := 0; i < n; i++ { bb.WriteRune('*') } return bb.String() ``` ## Memory pre-allocation (builder) ```go var bb bytes.Builder bb.Grow(n) for i := 0; i < n; i++ { bb.WriteRune('*') } return bb.String() ``` ## Benchmark results ![benchmark6](images/benchmark6.png) ## Benchmark results ![benchmark7](images/benchmark7.png) ## Benchmark results ![benchmark8](images/benchmark8.png) ## SIMD instructions * Sets of special instructions - operate on short fixed-length arrays - specialized data-processing operations - (not available outside SIMD world) - min, max, abs, saturated arithmetic * Very important feature of today CPUs - sometimes more important than multicore(s) ## SIMD extensions on x86-64 CPUs * MMX * 3Dnow! * SSE - SSE 2, SSE 3, SSE 4.1, SSE 4.2 * AVX - split into multiple sets - F16C etc. ## How are these used? * Auto-vectorization - performed by compiler * Intrinsic - available made by compiler/headers/packages * Assembler - built-in (if available) - external one ## Typical vectors used in SIMD ![SIMD](images/simd.png) ## Operations with vectors (GCC) ```c typedef signed int v16si __attribute__((vector_size(16))); void add16ui(v16si x, v16si y, v16si * z) { *z = x + y; } v16si x = { 0, 1, 2, 3 }; v16si y = { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }; v16si z; add16ui(x, y, z); int i; puts("vector of signed ints"); for (i = 0; i < sizeof(v16si) / sizeof(signed int); i++) { printf("%d %d\n", i, z[i]); } ``` ## Intrinsic ``` Technology Header file -- MMX mmintrin.h SSE1 xmmintrin.h SSE2 emmintrin.h SSE4.1 smmintrin.h AVX2 avx2intrin.h ``` ## Intrinsic usage ```c #include <xmmintrin.h> int main(void) { __v4sf x = { 1.0, 2.0, 3.0, 4.0 }; __v4sf y = { 0.1, 0.1, 0.1, 0.1 }; __v4sf z; int i; z = __builtin_ia32_addps(x, y); for (i = 0; i < sizeof(x) / sizeof(float); i++) { printf("%2d %f %f %f\n", i, x[i], y[i], z[i]); } } ``` ## Generated machine code ```asm z = __builtin_ia32_addps(x, y); 31: 0f 28 45 d0 movaps xmm0,XMMWORD PTR [rbp-0x30] 35: 0f 28 4d c0 movaps xmm1,XMMWORD PTR [rbp-0x40] 39: 0f 58 c1 addps xmm0,xmm1 3c: 0f 29 45 e0 movaps XMMWORD PTR [rbp-0x20],xmm0 ``` ## SIMD support in Go * Auto-vectorization - not provided by Google's compiler - gccgo can use it * Intrinsic - not available in Go - and very probably never will! * Assembler - built-in N/A in Go - external one is cumbersome to use - gccgo + GNU assembler (gas) ## SIMD support in Go * Not ideal situation today - we don't invest into this approach - anyway most processing is performed against UTF-8 chars # Thank you! # Questions?