-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor registry system: no direct dependencies; expose standard hash.Hash; be a data carrier. #136
Conversation
(The test that's currently failing is for murmur3. Input needed on what we're going to do with that thing.) |
96aa53c
to
fdbce65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM but we need to do something about those constants everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it easy enough to fix the binary in the multihash
folder? Seems like we'd just need it to import all the registry entries here (including miniosha256).
registry.go
Outdated
// | ||
// Hashers which are available in the golang stdlib will be registered automatically. | ||
// Others can be added using the Register function. | ||
var registry = make(map[uint64]func() hash.Hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps a stupid question, but why are we preferring to use uint64
, instead of depending on the Code
type from go-multicodec which aliases uint64, throughout the registry code? Does it help us out with backwards compatibility/consistency things (e.g. the existing Names = map[string]uint64
)?
I was thinking that in the future if go-multicodec added support for typed metadata (e.g. names) then you could do things like code.Name()
and it would look it up for you behind the scenes which might be nice. Feel free to say that this is a terrible idea though 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for bringing up this whole range of questions. (They are good. So far, I made these diffs while trying to evade them! But that does not represent a stance so much as it was just short-term diff-scope-minimization.)
This came up in enough places throughout this PR that I opened a whole new issue to talk through it: #137
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... okay, my time & energy budget for recursing up through that concern tree has expired.
- I'm going to keep
uint64
as a map key (rather than any typedefs) because that's what this repo did before; - I'm not going to add the https://github.com/multiformats/go-multicodec repo for symbols right now because it's not in this repo right now;
- I'm going to replace all the current use of bare constants with the named constants already in this package and leave them as they are.
- I'm going to make a concerted effort to do nothing to the
Names
andCodes
maps in the rest of this PR because the things we'd probably want to do with them would almost certainly involve bringing in the above, which... E_SCOPE_LIMIT.
Additional boat rocking about constants can come some other day and some other PR.
fdbce65
to
abbba7f
Compare
…h.Hash; be a data carrier. --- Dependencies: Previously, depending on the go-mulithash package would create direct dependencies to several other modules for their various hash function implementations. This meant that instead of go-multihash being a lightweight, easy-to-accept dependency itself, it became something which would noticably increase the size of your go.mod file, your package graph, your download sizes during development, and most concerningly, your compile output size in final products. Now, there is a registry system (see the Register function), and the main go-multihash package *only* populates the registry with hashes that are available from the golang standard library by default. This means you gain no transitive dependencies on other libraries by importing the go-multihash package, and your binaries will not be bloated by hashers you don't use. (Your go.mod file may still show more repos; but they don't end up in your builds unless you actually refer to them). There are now several new packages in `go-multihash/register/*`. These can be imported to register the hashes in those packages. If you want all the hashes that were previously available, just make sure to import "go-mulithash/register/all" somewhere in your program. (You can register hashes, too, without making PRs to this library; these packages are just here for convenience and easy use.) **This is a breaking change** if you used hashes not found in the golang stdlib, such as blake2 and sha3. However, to update, all you need to do is ensure the relevant `go-multihash/register/*` package is imported anywhere in your program -- an easy change. A `go-multihash/register/all` package can be imported to get a hasher registered for all of the same multihash codes as before (but will correspondingly add the dependency weight back too, of course). --- Standard hash.Hash: Previously, go-multihash had its own definition of a `HashFunc` interface, and only exposed hashing through the `multihash.Sum` method. The problem with this was these definitions did not support streaming: one had to have an entire chunk of memory loaded at once, in a single contiguous byte slice, in order to hash it. (A second, admittedly much more minor, problem with this was that one often had to write glue code to turn a `hash.Hash` into a `multihash.HashFunc`, and since most of golang uses the standard lib `hash.Hash` definition already, this was generally avoidable friction.) Now, the Register function operates in terms of standard `hash.Hash`, and there is a `GetHasher` function which can get you a `hash.Hash`. (Okay, to be more precise, these functions take and return a factory function for a `hash.Hash`. You get the idea.) Since the standard `hash.Hash` interface can operate streamingly, now it's easy to use go-multihash in a streaming way. The `multihash.Sum` method works the same as always. --- Be a data carrier: Previously, go-multihash contained checks that any multihash indicator codes being handled were required to have a hash function registered for them. This made it very difficult to use go-multihash in a "forward compatible" way (and it also made a lot of practical bumps for this dependency-extraction refactor). Now, go-multihash is willing to carry data, even if it doesn't know what kind of hash function would be associated with an indicator code. (Methods that you'd expect to parse things do still parse the varints, making sure they're sanely formatted. They just don't inspect and whitelist the actual integers anymore.) I removed the `ValidCode` predicate entirely. It doesn't seem to serve any good purpose anymore. --- Other: I have not touched the `Codes` and `Names` maps in this diff. I think we should probably review (and probably remove) these, and instead direct people to use the go-multicodec package instead, which has the two advantages of decoupling registration of an implementation versus simply having a description, and also being automatically generated from the multiformats table. However, I wanted to check on feelings about this before doing the work (especially because they're somewhat entangled with a bunch of the tests in this package, making their removal somewhat nontrivial). Most of the test files are now `package multihash_test`. This makes for some colorful diffs, but is not otherwise interesting. The reason for this is because the dependency separation process now requires the tests to import those `register/*` packages, and to avoid a cycle, that means, well, `package mulithash_test`. I think there's probably more work to be done in making this library really shine. For example, in reviewing the `Encode` function, I see some allocations that look very likely to be avoidable... if the function was redesigned to be more aware of how it's likely to be used. However, I took no action on this, in part because this diff is big enough already, and in part because I think it might be reasonable to re-examine the relationship of this code to go-cid at the same time. I dropped `TestSmallerLengthHashID`. It appeared to be testing an API that wasn't actually exported... and the nearest API that *is* exported (Sum) has a general contract of truncating a hash upon short length, so it was overall unclear what this test should be checking. Review might be needed on this. The situation for murmur3 is still in need of resolution. It's commented out entirely for now. Questions are noted in the diff. There's a 'register/miniosha256' package which sets the sha2-256 implementation to a non-stdlib one. If you don't import this package, you still get a sha2-256; it's just the stdlib one. I did not include this in the 'register/all' group. (Maybe it's faster; maybe it's not; but it's definitely not required, and I'm getting some reports it also shows weird on profiles, so I tend to think maybe one should really have to explicitly ask for this one.)
It seems likely that we should replace many of these with values from https://github.com/multiformats/go-multicodec in the future, however, we have a new issue to track that: #137 And as noted in #136 (comment) it's time to reign in the amount of work going on in this PR.
Avoid allocations with more reuse the same slice for results.
Agreed upon in discussion at: #136 (comment) The "REVIEW" comments in the outgoing diff identify the concerns that led to this choice.
Per discussion in: #136 (comment)
abbba7f
to
10afd22
Compare
is this ready for a round of review? are there any loose things that still need to happen? |
Leaving the error return in, per #136 (comment) May also be interesting to note: I did check if ValidCode (or, roughly the same thing but perhaps renamed to KnownCode) could be reintroduced, and the answer is... actually no, not without broaching other issues. A body of `_, ok := registry[code]` is enough to open a can of worms. See #136 (comment)
This creates a hasher long enough to ask it its properties. This is arguably creating garbage; on the other hand, I don't think this is a codepath ever likely to be used in a hot loop anywhere. We can extract this to something that caches the properties later if it proves necessary. It quietly defaults to zero for unknown codes. I don't know if this makes sense, but it's what the old code would have done, so it's what the new code will do, and I'm not looking deeper into it. At this point I'm just trying to make surgically minimal alterations and get this changeset as a whole wrapped up so things can move on.
Surely no one would try to do this, nor then be surprised if it creates problems. On the other hand: if someone *does* do this, and the error doesn't appear until arbitrarily far away when the map is read, it's a pain to diagnose... and checking it up front is cheap. So, here we go: check it up front.
The number that's missing is now reported in the error. This is done using the golang error wrap feature. We're keeping a constant value for ErrSumNotSupported for compatibility and change avoidance reasons. Code that cares about this exact error will still have to change; it is now necessary to use `errors.Is` to detect it. The text in the ErrSumNotSupported value is also updated, because the text no longer made sense given an open registry system. As a target-of-opportunity fix, it also now follows golang normative conventions for error messages (no capitalization, no punctuation).
…ultihash that rejects truncations. See #136 (comment) for discussion. This change means Sum behaves slightly differently for identity multihashes than it does for any other multihash. I'm not keeping score on the number of ways identity multihash is weird anymore, just documenting it and keeping tests passing. The error message is lifted from the old `sumID` function verbatim.
Okay! This is finally green. I believe all comments are addressed. Also note that I updated a bunch of the transitive dependencies. It was forced (see 877240b / https://travis-ci.com/github/multiformats/go-multihash/builds/219591152 ). But none of our fixtures reported effects, so I presume this to be safe and joyously uninteresting. Last call for any more reviews? |
I've attempted to push this downstream through go-cid, and here's what that might look like: ipfs/go-cid@26b387a#diff-88c2c4b0892fb0172322544e3937e937bb18beccecb1b24a1b5028758a43685bR346-R375 There's some questions worth a quick review there. Namely: it's entirely possible to make all the code work without a need for Everything else looks fine -- although we do again end up seeing the |
We do need DefaultLengths, we use it in quite a few places. You can find these places by:
Let's make this PR minimally breaking so we can merge it ASAP and move on. |
…ultihash that rejects truncations. See #136 (comment) for discussion. This change means Sum behaves slightly differently for identity multihashes than it does for any other multihash. I'm not keeping score on the number of ways identity multihash is weird anymore, just documenting it and keeping tests passing. The error message is lifted from the old `sumID` function verbatim.
We'll need something a little more recent than 1.11.x to be able to use the fmt "%w" and errors.Is features.
I swear I did not mean to be doing this on a branch, but: https://travis-ci.com/github/multiformats/go-multihash/builds/219591152 - it became necessary to update the go version in CI for %w features - that apparently adds pointer alignment checking to the compiler... - which flunks some stuff in x/crypto... So, okay, we're updating libraries now!
While it was possible to remove all use of this from this repo, when attempting propagate changes to downstreams consuming it, it became apparently that other repos also rely on this symbol. Whether or not those usages are important and intentional, whether they're actually worth maintaining, and whether they'd be replacable with other approaches... is not considered at this time. (Probably we should be asking this! The first occasions where this cropped up are in other functions that have been marked "deprecated" since... 2018! But... chasing those things down and straightening them out is becoming problematic. Perhaps we'll be more ready to revisit these things at a later date.)
877240b
to
1a96911
Compare
Here's what propagating through go-cid looks like now, with all tests passing: ipfs/go-cid#118 |
I guess the final check here is: do we want to keep #131 did originally describe doing the registration rework in a new package, and keeping the go-multihash root unchanged. As it stands now, the go-multihash symbols exported are pretty much unchanged, but the code does go all-in on making the dependencies optional, so the behavior does shift. If we're scared of this and the consequent need to add the registration imports in downstream applications, and we'd rather make the dependency minimization an opt-in action that has to be done separately from just updating the library by putting it in a deeper package, we can do that. ... heck, I'll just draft that now. If we introduce a new package, who wants to name it? |
Yeah, let's minimize breakage This library is a bit popular: https://pkg.go.dev/github.com/multiformats/go-multihash?tab=importedby. A "core" package would work (I'd call it that). I guess the question is: what do you need. This appears to be blocking other work, so I want to make sure we merge that need ASAP. If you don't really need to reduce dependencies (I'm pretty sure you don't, at least right now) then I'd:
Basically, we need unblock this ASAP and we need to avoid causing a meltdown. Let's just go with the easiest approach and introduce a 2.0 (we'll get there eventually regardless). |
…s again. The transitive dependencies in the root package are still managed by this whole registry system (which now resides in the 'core' package), so we have parity and *mostly* just one suite of code to maintain. You can now use this 'core' package when interested in dependency minimization. In exchange, the "register/*" imports may be required. You can just also just yank updates to go-multihash, and if you were already using it (and happen to be using one of these now-optional(-but-only-if-you-use-core))... nothing should actually change; the "register/*" imports won't be required because the root package does them for you. Some constants are replicated. This was the minimum step necessary to avoid import cycles. I'm not spending time prettifying it because we really probably ought to be refactoring this to use the package of constants in go-multicodec that's automatically generated, and yet, that is a scope limit we're trying not to cross during this changeset.
0db0169
to
cebc9f8
Compare
Alright. I think I've run out of changes to avoid. I'll also bow out on version tagging. Whatever numbers you like, fine by me. (I'm wary of major numbers and the Different Behavior they prompt from |
I've tested with go-ipfs and the only broken test is the one that tries to use murmur (to make sure we don't allow it). |
@warpfork could you cut a minor release when ready? |
This is now available within the |
This diff revisits the registry system to use more standard interfaces for hashers, and puts it to work to reduce the transitive dependencies of go-multihash.
Dependencies:
Previously, depending on the go-mulithash package would create direct
dependencies to several other modules for their various hash function
implementations. This meant that instead of go-multihash being a
lightweight, easy-to-accept dependency itself, it became something
which would noticably increase the size of your go.mod file,
your package graph, your download sizes during development, and most
concerningly, your compile output size in final products.
Now, there is a registry system (see the
Register
function), andthe main go-multihash package only populates the registry with hashes
that are available from the golang standard library by default.
This means you gain no transitive dependencies on other libraries
by importing the go-multihash package, and your binaries will not
be bloated by hashers you don't use. (Your go.mod file may still
show more repos; but they don't end up in your builds unless you
actually refer to them).
There are now several new packages in
go-multihash/register/*
.These can be imported to register the hashes in those packages.
If you want all the hashes that were previously available, just make
sure to import "go-mulithash/register/all" somewhere in your program.
(You can register hashes, too, without making PRs to this library;
these packages are just here for convenience and easy use.)
This is a breaking change if you used hashes not found in the
golang stdlib, such as blake2 and sha3. However, to update,
all you need to do is ensure the relevant
go-multihash/register/*
package is imported anywhere in your program -- an easy change.
A
go-multihash/register/all
package can be imported to get a hasherregistered for all of the same multihash codes as before (but will
correspondingly add the dependency weight back too, of course).
Standard hash.Hash:
Previously, go-multihash had its own definition of a
HashFunc
interface, and only exposed hashing through the
multihash.Sum
method.The problem with this was these definitions did not support streaming:
one had to have an entire chunk of memory loaded at once,
in a single contiguous byte slice, in order to hash it.
(A second, admittedly much more minor, problem with this was that one
often had to write glue code to turn a
hash.Hash
into amultihash.HashFunc
, and since most of golang uses the standard libhash.Hash
definition already, this was generally avoidable friction.)Now, the Register function operates in terms of standard
hash.Hash
,and there is a
GetHasher
function which can get you ahash.Hash
.(Okay, to be more precise, these functions take and return a factory
function for a
hash.Hash
. You get the idea.)Since the standard
hash.Hash
interface can operate streamingly,now it's easy to use go-multihash in a streaming way.
The
multihash.Sum
method works the same as always.Be a data carrier:
Previously, go-multihash contained checks that any multihash indicator
codes being handled were required to have a hash function registered
for them. This made it very difficult to use go-multihash in a
"forward compatible" way (and it also made a lot of practical bumps
for this dependency-extraction refactor).
Now, go-multihash is willing to carry data, even if it doesn't know
what kind of hash function would be associated with an indicator code.
(Methods that you'd expect to parse things do still parse the varints,
making sure they're sanely formatted. They just don't inspect and
whitelist the actual integers anymore.)
I removed the
ValidCode
predicate entirely. It doesn't seem to serveany good purpose anymore.
Other:
I have not touched the
Codes
andNames
maps in this diff.I think we should probably review (and probably remove) these, and
instead direct people to use the go-multicodec package instead,
which has the two advantages of decoupling registration of an
implementation versus simply having a description, and also being
automatically generated from the multiformats table.
However, I wanted to check on feelings about this before doing the work
(especially because they're somewhat entangled with a bunch of the
tests in this package, making their removal somewhat nontrivial).
Most of the test files are now
package multihash_test
.This makes for some colorful diffs, but is not otherwise interesting.
The reason for this is because the dependency separation process now
requires the tests to import those
register/*
packages, and toavoid a cycle, that means, well,
package mulithash_test
.I think there's probably more work to be done in making this library
really shine. For example, in reviewing the
Encode
function,I see some allocations that look very likely to be avoidable... if the
function was redesigned to be more aware of how it's likely to be used.
However, I took no action on this, in part because this diff is big
enough already, and in part because I think it might be reasonable to
re-examine the relationship of this code to go-cid at the same time.
I dropped
TestSmallerLengthHashID
. It appeared to be testing an APIthat wasn't actually exported... and the nearest API that is exported
(
Sum
) has a general contract of truncating a hash upon short length,so it was overall unclear what this test should be checking.
Review might be needed on this.
The situation for murmur3 is still in need of resolution.
It's commented out entirely for now. Questions are noted in the diff.
There's a 'register/miniosha256' package which sets the sha2-256
implementation to a non-stdlib one. If you don't import this package,
you still get a sha2-256; it's just the stdlib one.
I did not include this in the 'register/all' group.
(Maybe it's faster; maybe it's not; but it's definitely not required,
and I'm getting some reports it also shows weird on profiles, so I tend
to think maybe one should really have to explicitly ask for this one.)
There are a couple other review questions marked with "// REVIEW" in the diff.