sdhash is a tool that processes binary data and produces similarity digests using Bloom filters. Two binary files that share common parts produce similar digests. sdhash can compare two similarity digests to produce a score: a score close to 0 means that the two files are very different, and a score of 100 means that the two files are equal.
- calculate similarity digests of many files in a short time
- compare a large number of digests using precalculated indexes
- comparison can also be performed while the digests are being computed
- produces the same results as the original sdhash with similar performance, but is entirely rewritten in Go
The sdhash package is available as binaries and as a library.
The binaries for all platforms are available on the Releases page.
- Install the sdhash package with the command below
$ go get -u github.com/eciavatta/sdhash
- Import it in your code and start playing around
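package main

import (
	"fmt"
	"log"

	"github.com/eciavatta/sdhash"
)

func main() {
	// Build a digest factory for each input file and stop on I/O errors.
	factoryA, err := sdhash.CreateSdbfFromFilename("a.bin")
	if err != nil {
		log.Fatal(err)
	}
	factoryB, err := sdhash.CreateSdbfFromFilename("b.bin")
	if err != nil {
		log.Fatal(err)
	}
	// Compute the similarity digest of each file.
	sdbfA := factoryA.Compute()
	sdbfB := factoryB.Compute()
	// Print both digests, then their similarity score (0 to 100).
	fmt.Println(sdbfA.String())
	fmt.Println(sdbfB.String())
	fmt.Println(sdbfA.Compare(sdbfB))
}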
The library documentation is published at pkg.go.dev/github.com/eciavatta/sdhash. How sdhash works is described in this paper, and here you can find a tutorial of the original version of sdhash.
sdhash was originally created by Vassil Roussev and Candice Quates and is licensed under the Apache-2.0 License. The Go implementation was written by Emiliano Ciavatta and is also licensed under the Apache-2.0 License.