Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use non cryptographic hash for unique ID instead of MD5 #169

Open
tegk opened this issue Aug 8, 2019 · 10 comments
Open

Use non cryptographic hash for unique ID instead of MD5 #169

tegk opened this issue Aug 8, 2019 · 10 comments

Comments

@tegk
Copy link
Contributor

tegk commented Aug 8, 2019

I would like to discuss to replace md5 with a non cryptographic hash function like xxhash or murmur3 as they are much faster.

For example we use md5 to generate an ID for the queued envelop.

func queuedID(clientID uint64) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}

func NewEnvelope(remoteAddr string, clientID uint64) *Envelope {
	return &Envelope{
		RemoteIP: remoteAddr,
		Values:   make(map[string]interface{}),
		QueuedId: queuedID(clientID),
	}
}

If we agree on the usage of xxhash I can send a pull request.

@flashmob
Copy link
Owner

Sounds good.

Could this be configurable and controlled using the config?

An idea would be:

in the mail package, we define something like this:

var QueuedIDGenerator QuidGenerator

type QuidGenerator func(clientID uint64) string

func defaultQueuedID(clientID uint64) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}

// (add whatever other generator function here, eg murmur3)

// change queuedID to delegate to our desired function
func queuedID(clientID uint64) string {
	return QueuedIDGenerator(clientID)
}

and to the init() at the top of envelope.go, we add QueuedIDGenerator = defaultQueuedID

Then we can also modify api.go to add a convenience func that allows us set to whatever func we pass (but don't let it change it once the server is running)

Then use the API call itself in server.go to set it to whatever value is in the config.

How does that sound?

@flashmob
Copy link
Owner

flashmob commented Aug 15, 2019

oh, on second look - the value from queuedID() gets ignored in the Redis, SQL and GuerrillaDBRedis example processors. Forgot about, they overwrite this value. So maybe the above suggested way is not needed. Perhaps we can just queuedID() to use whatever is the fatster golang hashing function
and make a comment that this would most likely be set by the processor anyway.

Btw, would prefer to use a hasing function that doesn't introduce new external dependency. Which one to pick? https://golang.org/pkg/hash/#pkg-subdirectories

@lord-alfred
Copy link
Contributor

we add QueuedIDGenerator = defaultQueuedID

I think, its wrong way. Need to hardcode faster hashing function and all.
Nobody need to configure it, all members who use this package - need fast and stable smtp server.

@tegk
Copy link
Contributor Author

tegk commented Aug 15, 2019

we add QueuedIDGenerator = defaultQueuedID

I think, its wrong way. Need to hardcode faster hashing function and all.
Nobody need to configure it, all members who use this package - need fast and stable smtp server.

I agree, this not something anybody really wants to configure. We should just replace MD5 with xxHash and leave the current logic in place.

Not introducing an external dependency is a fair reason not to do it though as there is not a non crypto hash function in the standard lib.

@lord-alfred
Copy link
Contributor

Btw, would prefer to use a hasing function that doesn't introduce new external dependency

And what is the problem of adding a new dependency? Anyway need to compile the executable file.

@flashmob
Copy link
Owner

@lord-alfred new dependencies can add technical debt. Also, since this package is made to be used as a library, it's better if the user of the library can decide on their own dependencies. (And maintain them) while keeping our dependencies as small as possible.

@tegk please ignore my first comment. It appears that the queuedID function is actually somewhat redundant. The hashing has been delegated to some the backends, and you can see that the SQL processor sets its own queuedID value. So let's keep this issue open to fix this inconsistency.

As for any performance issues, it would be great if we could start using pprof tool to generate some graphs/reports where we can identify any bottlenecks. Right now, indentifying without profiling is kind of like shooting in the dark.

@tegk
Copy link
Contributor Author

tegk commented Aug 15, 2019

Here is a benchmark of hashing strings 3 to 254 characters long with md5 and xxhash.
3 is the min and 254 max length of an email address.

The change to xxHash would make sense in terms of latency (-84.8% less!) for sure.

goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5minEmailAddress-12    	 5000000	       279 ns/op	     136 B/op	       4 allocs/op
BenchmarkXXminEmailAddress-12     	20000000	        66.6 ns/op	      32 B/op	       1 allocs/op
BenchmarkMD5maxEmailAddress-12    	 2000000	       665 ns/op	     384 B/op	       4 allocs/op
BenchmarkXXmaxEmailAddress-12     	20000000	       101 ns/op	      32 B/op	       1 allocs/op
PASS
ok  	md5vsxxhash	7.262s


package main

import (
	"crypto/md5"
	"io"
	"math/rand"
	"strconv"
	"testing"

	"github.com/cespare/xxhash"
)

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

var minEmailAddress string = randSeq(3)
var maxEmailAddress string = randSeq(254)

func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func genMD5(s string) string {
	h := md5.New()
	io.WriteString(h, s)
	return string(h.Sum(nil))
}

func genXXhash(s string) string {
	h := xxhash.Sum64String(s)
	return strconv.FormatUint(h, 10)
}

func BenchmarkMD5minEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genMD5(minEmailAddress)
	}
}

func BenchmarkXXminEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genXXhash(minEmailAddress)
	}
}

func BenchmarkMD5maxEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genMD5(maxEmailAddress)
	}
}

func BenchmarkXXmaxEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genXXhash(maxEmailAddress)
	}
}

The results also are confirmed when bench-marking the hash backed:

goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5hash-12    	    2000	    800720 ns/op	  304740 B/op	    3009 allocs/op
BenchmarkXXhash-12     	   10000	    150928 ns/op	  256604 B/op	    1006 allocs/op
PASS
ok  	md5vsxxhash	3.214s
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"math/rand"
	"strconv"
	"strings"
	"testing"
	"time"

	"github.com/cespare/xxhash"
)

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

var maxEmailAddress string = randSeq(254)
var maxSubject string = randSeq(78)

func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func genMD5(s string) string {
	h := md5.New()
	io.WriteString(h, s)
	return string(h.Sum(nil))
}

func genXXhash(s string) string {
	h := xxhash.Sum64String(s)
	return strconv.FormatUint(h, 10)
}

func genMD5forEmail() {
	h := md5.New()
	ts := fmt.Sprintf("%d", time.Now().UnixNano())
	_, _ = io.Copy(h, strings.NewReader(maxEmailAddress))
	_, _ = io.Copy(h, strings.NewReader(maxSubject))
	_, _ = io.Copy(h, strings.NewReader(ts))
	// using the base hash, calculate a unique hash for each recipient
	for i := 0; i < 1000; i++ {
		h2 := h
		_, _ = io.Copy(h2, strings.NewReader(maxEmailAddress))
		h2.Sum([]byte{})
	}
}

func genXXforEmail() {
	h := xxhash.New()
	ts := fmt.Sprintf("%d", time.Now().UnixNano())
	h.Write([]byte(maxEmailAddress))
	h.Write([]byte(maxSubject))
	h.Write([]byte(ts))
	// using the base hash, calculate a unique hash for each recipient
	for i := 0; i < 1000; i++ {
		h2 := h
		h2.Write([]byte(maxEmailAddress))
		h2.Sum64()
	}
}

func BenchmarkMD5hash(b *testing.B) {
	for n := 0; n < b.N; n++ {
		genMD5forEmail()
	}
}

func BenchmarkXXhash(b *testing.B) {
	for n := 0; n < b.N; n++ {
		genXXforEmail()
	}
}


@flashmob
Copy link
Owner

flashmob commented Aug 15, 2019 via email

@h1z1
Copy link

h1z1 commented Feb 20, 2020

You can use xor too and go REALLY fast.. doesn't mean you should though. At least maybe warn users?

@flashmob
Copy link
Owner

flashmob commented Feb 20, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants