Tech Debt Cleanup and Docs Update #219

Merged · 8 commits · Sep 24, 2021

Conversation

hannahhoward
Collaborator

Goals

Do a big cleanup pass on go-graphsync to make it easier to reason about and work with. Belatedly update docs for IPLD LinkSystem.

Implementation

THIS PR LOOKS LARGE BUT IT'S NOT THAT BIG. These are the main changes:

  • I converted some functions that were simply computations on request codes into methods on the request code type. This makes them more portable.
  • I have reorganized the RequestManager and ResponseManager so they aren't single giant files, and I've divided them up so it's clear what happens inside the internal thread and what happens outside. This makes it easier to reason about where you can change state and where you can't. I've also documented in the architecture docs how they use the Actor Pattern to manage concurrency. This is where a lot of the line changes are, but it's almost entirely moving code around.
  • The only substantive change is the AsyncLoader. Simply put: I removed the internal goroutine. All this thing does is manage some state and interface between various classes. I think it makes way more sense to just wrap the state in a mutex and go forth (a rough sketch of the resulting shape follows this list). "Communicate by sharing data" may not be the Go way, but it's about 200 lines shorter and way simpler to reason about. I did a decent amount of testing to make sure nothing broke.
  • I updated the README and architecture docs for ipld.LinkSystem.
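
To make the AsyncLoader change concrete, here is a rough sketch of the mutex-wrapped shape; the field and method names are illustrative, not the actual go-graphsync code:

    import "sync"

    type asyncLoader struct {
        stateLk sync.Mutex // protects all fields below

        activeRequests  map[uint64]struct{}
        alternateQueues map[string][]byte
    }

    // StartRequest can be called from any goroutine; the mutex serializes
    // access instead of a dedicated message-processing loop.
    func (al *asyncLoader) StartRequest(requestID uint64) {
        al.stateLk.Lock()
        defer al.stateLk.Unlock()
        al.activeRequests[requestID] = struct{}{}
    }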

I can break this into multiple PRs if need be, but you can also review the individual commits if you like.

Clarify actor pattern and responsibilities across threads
refactor requestmanager to clearly designate responsibilities in actor pattern
remove the actor pattern from asyncloader, as it's not needed. it works well as simply a locked data structure and is about 10x simpler
update references in architecture doc to reflect IPLD linksystem
Comment on lines +154 to +155
al.stateLk.Lock()
defer al.stateLk.Unlock()
Collaborator

This PR adds a single mutex lock around the loader's state. Do we have reasonable confidence that we won't end up with lock contention over it?

Contributor

The old code sent a message to a queue channel, and a single goroutine owned the state and processed those messages sequentially. So the amount of contention seems basically the same before and after.

One significant difference is priority and ordering: with a channel, you were guaranteed first-come-first-served processing. With a mutex, it's always a semi-random race to decide which goroutine gets to grab the mutex, so if incoming messages keep coming, you could see high latencies for older messages that get unlucky and don't grab the mutex. It's not clear to me whether order matters here.

One last thought: sharing memory does give us more ability to be performant, if we want to. For example, using a RWMutex, or multiple mutexes, or sync/atomic. So in terms of contention, I think it can be better than a single goroutine with a message queue, if we care enough to optimize it.
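
As a hedged illustration of that last point (not code in this PR), a read-heavy path could use sync.RWMutex so concurrent readers don't block each other:

    import "sync"

    type cache struct {
        mu    sync.RWMutex // writers exclusive, readers shared
        items map[string][]byte
    }

    func (c *cache) Get(key string) ([]byte, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        v, ok := c.items[key]
        return v, ok
    }

    func (c *cache) Put(key string, v []byte) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.items[key] = v
    }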

Collaborator Author

Yea I feel quite confident about this being a better, more reliable strategy.

// to them.
type RequestManager struct {
ctx context.Context
cancel func()
Collaborator

context.CancelFunc?
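
For reference, the suggested typing would look something like this fragment (illustrative, not from this PR; context.CancelFunc is just a named func() type, so the change is purely documentary):

    type RequestManager struct {
        ctx    context.Context
        cancel context.CancelFunc // same shape as func(), but self-documenting
        // ...
    }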

Collaborator Author

currently it's just func() everywhere in the code and I should change it all at once. will do in a separate cleanup PR

for cancelMessageChannel != nil || incomingResponses != nil || incomingErrors != nil {
select {
case cancelMessageChannel <- &cancelRequestMessage{requestID, false, nil, nil}:
cancelMessageChannel = nil
Collaborator

should this close the channel as well? a comment might be useful since this pattern looks weird otherwise

Collaborator Author

If you look a couple of lines above, what we're doing here is using cancelMessageChannel = nil to send exactly once -- but the actual channel is the main message channel for the request manager.

I agree, this is overcomplicated. I think it's probably ok to do the channel send once before the loop, but for this PR I'd rather not change that functionality.
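
For readers unfamiliar with the idiom this thread discusses: a send on a nil channel blocks forever, so a nil select case can never fire, which is what makes the one-shot send work. A standalone sketch (names here are made up, not from this codebase):

    // sendOnce keeps draining in while sending msg to out exactly once.
    func sendOnce(out chan<- string, in <-chan int, msg string) {
        for out != nil || in != nil {
            select {
            case out <- msg:
                out = nil // disables this case; the send has happened
            case _, ok := <-in:
                if !ok {
                    in = nil // disables this case once in is closed
                }
            }
        }
    }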

return remainingResponses
}

func (rm *RequestManager) updateLastResponses(responses []gsmsg.GraphSyncResponse) {
Collaborator

What this method (amongst others) does isn't fully reflected in its name. Some comments on the expected behavior of these methods would be useful.

Collaborator Author

will add in cleanup PR

Contributor

@mvdan mvdan left a comment

I left two comments on one of the commits, since the whole diff was pretty large. I'm not sure how well GitHub will deal with that :)

@@ -1,6 +1,6 @@
 module github.com/ipfs/go-graphsync

-go 1.12
+go 1.13
Contributor

maybe time to get this repo onboard with unified ci :) it would take care of bumping this too

Collaborator Author

yes agreed.


stateLk sync.Mutex
Contributor

I'd explicitly document what fields this mutex protects, even if that's just "all fields below".
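
A hedged sketch of what that documentation could look like (illustrative field names, not the actual struct):

    type asyncLoader struct {
        // ctx is immutable after construction and needs no locking.
        ctx context.Context

        stateLk sync.Mutex
        // All fields below are only accessed while holding stateLk.
        activeRequests  map[uint64]struct{}
        alternateQueues map[string][]byte
    }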

Collaborator Author

resolved

// Shutdown ends processing for the want manager.
func (rm *RequestManager) Shutdown() {
rm.cancel()
}
Contributor

I'm a bit confused; if the user supplies a context when creating a RequestManager, could they not just cancel the context themselves to do a shutdown? Having RequestManager hold the cancel func and expose a Shutdown method feels unnecessary.

One reason we might have for exposing Shutdown methods is to also block until all goroutines have stopped and resources have been freed up. If we think we'll want that later on, then I think leaving these in with a TODO makes sense. Otherwise, I'd personally remove the Shutdown APIs.

Contributor

Also, if we went with the "we'll eventually want to block until all request goroutines are stopped" route, then I think these methods should gain a context parameter, too. That way, I can say "graceful shutdown for 10s to stop accepting requests, and after that, kill all requests which aren't done yet". This is how net/http does it: https://pkg.go.dev/net/http#Server.Shutdown
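
As a hedged sketch of that direction (none of this is in the PR, and the wg field tracking request goroutines is hypothetical), a context-aware Shutdown might look like:

    // Shutdown stops accepting new requests and waits for in-flight work,
    // giving up when ctx is done, similar to net/http's Server.Shutdown.
    func (rm *RequestManager) Shutdown(ctx context.Context) error {
        rm.cancel() // stop accepting new work

        done := make(chan struct{})
        go func() {
            rm.wg.Wait() // hypothetical WaitGroup tracking request goroutines
            close(done)
        }()

        select {
        case <-done:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }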

Collaborator Author

I agree this is odd. This is the existing pattern in the repo. I'd rather revisit the shutdown behavior globally in a separate ticket.

defer al.stateLk.Unlock()
_, ok := al.alternateQueues[name]
if !ok {
return errors.New("unknown persistence option")
Contributor

include the name? :)
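
e.g. something along these lines (exact wording up to the author):

    return fmt.Errorf("unknown persistence option %q", name)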

Collaborator Author

resolved

@@ -77,3 +77,47 @@ var ResponseCodeToName = map[ResponseStatusCode]string{
RequestFailedContentNotFound: "RequestFailedContentNotFound",
RequestCancelled: "RequestCancelled",
Contributor

while you're refactoring, you could automate this table and String method with https://pkg.go.dev/golang.org/x/tools/cmd/stringer
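
For reference, the usual wiring is a go:generate directive next to the type declaration; running go generate then writes a <type>_string.go file with the String method and name table (sketch, assuming the type is ResponseStatusCode as above):

    //go:generate stringer -type=ResponseStatusCode
    // go generate ./... produces responsestatuscode_string.go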

Collaborator Author

omg where was this tool all my life! You should check over at go-fil-markets and go-data-transfer for some pretty awesome comedy here.

responsecode.go Outdated
case RequestCancelled:
return RequestCancelledErr{}
default:
return fmt.Errorf("Unknown")
Contributor

return fmt.Errorf("unknown response status code: %d", c)

Collaborator Author

resolved

@hannahhoward
Collaborator Author

@willscott @mvdan

I've responded to your comments. Some of them relate to unchanged code that has just been moved around. I agree generally with all the comments, but as you can see I've suggested deferring them to minimize this changeset.

I've documented these cleanup issues here:
#223
#222
#221
#220

If I can get your approvals on this PR I'd like to go ahead and merge.
