NewStateMapConnection context support? #14

mhite · 2022-08-07T17:07:31Z

Would it potentially make sense to add context support to NewStateMapConnection so that we can receive a ctx.Done() or initiate a cancel signal? I'm pondering how to gracefully shutdown multiple StateMapConnection when one of them closes its send channel due to error. If they all shared a parent context, I was thinking a graceful shutdown could be initiated when the failure of one should initiate a shutdown of all.

The text was updated successfully, but these errors were encountered:

icedream · 2022-08-07T19:24:09Z

The intention of current handling is to simply "expect the unexpected" – both intentionally closing the network (in this case done via calling Close on the net.Conn object) or having a network failure lead to an error on read attempts which we can deal with. Since we have a state channel and an error channel, something listening to the error channel receives notice of this error and can then decide whether it wants to close other network connections, too. After that we also close the error/state channels, albeit the error channel is only closed if an actual error occurred which admittedly may be a design flaw. So what you want to do is already possible without context.Context, however it's not the library's decision on how you want to handle the error, it's yours.

Now further thoughts about synchronized closure of network connections and/or using contexts:

If you want to transfer the responsibility to this library, just using a context object is not enough as it does not provide a way to pass cancellation to other objects alone; you still need StateMapConnection to report back to some code that an error occurred and a cancel signal should be sent. This already means we can not remove the error channel – or, though it feels more like wrong use of the context library – we would have to replace it with a field to which we can pass the cancel method the context library returns, which enforces a single way of error handling that may not be desired by whoever uses the library. For example, a developer may decide that it is a better idea to reestablish the connection instead of shutting down all connections.

Aside from that, implementing support for context.Context requires separate handling of wanted cancellation of read attempts. We can not simply rely on the error object returned by Read calls, instead we also need to check the context's Done channel. Then decisions: Should net.Conn stay open after cancellation or should we consider ownership of the connection and close it ourselves? Also, once a component uses context.Context, we probably want this consistently across the API to avoid confusion.

I'm currently unfortunately a bit limited on time that I can spend on this myself to further consider how this would need to be integrated, especially since this project has assumed a sort of Proof of Concept/reference state for other projects to implement the protocol properly. I'm currently leaning, as a result of above thoughts, towards not implementing this, however if there is a stronger take on this with good reasoning I may reconsider my opinion.

mhite · 2022-08-07T20:42:05Z

As usual, thanks for your thoughtful analysis and assessment. It is truly appreciated.

My general approach for establishing multiple StateMap connections to multiple devices is to launch a goroutine for each connection and have each one monitor their connection-specific error and state channel. Every launched goroutine is also passed a common state and error channel to publish (in a wrapped struct) what it receives from its corresponding device-specific error and state channel.

So something like this:

...
type stateUpdate struct {
	index int
	state *stagelinq.State
}

type stateError struct {
	index int
	err   error
}

....

This part of the code makes the assumption that the smCh is always closed when there is an error. (You actually link to this in your comment.) This is probably code smell on my part. Maybe I should change it to a for select case approach.

	stateCh := make(chan stateUpdate, len(establishedStateMapDevices))
	stateErrCh := make(chan stateError, len(establishedStateMapDevices))

	for i, smd := range establishedStateMapDevices {
		go func(ch chan<- stateUpdate, errCh chan<- stateError, smCh <-chan *stagelinq.State, smErrCh <-chan error, i int) {
			for state := range smCh {
				ch <- stateUpdate{
					index: i,
					state: state,
				}
			}
			for err := range smErrCh {
				errCh <- stateError{
					index: i,
					err:   err,
				}
			}
		}(stateCh, stateErrCh, smd.stateMapConn.StateC(), smd.stateMapConn.ErrorC(), i)
	}

Then I watch the top-level channels for messages:

channelRead:
	for {
		select {
		case state := <-stateCh:
                     // do stuff
		case stateErr := <-stateErrCh:
			logrus.WithFields(logrus.Fields{"index": stateErr.index, "err": stateErr.err}).Error("State map error")
			break channelRead
		}
	}

	// we are done, close up shop

	for _, smd := range establishedStateMapDevices {
		err := smd.stateMapTCPConn.Close()
		if err != nil {
			logrus.Error(err)
		}
	}

If something comes through to the common error channel, we stop monitoring both channels and then close all the TCP connections.

I think I am technically leaking goroutines for the still healthy StateMap connection(s). I believe they would all eventually become blocked on writes to the common state channel that no longer has a receiver, and thus would not be able to notice the close on their device-specific state channel that is [presumably?] generated during the "close up shop" attempt of the TCP connections.

I hope this made sense and provides some background on why I was pondering how to gracefully close things down in a multi-connection scenario.

This approach may just be bad -- I'm not really sure yet. Please let me know if this screams code smell to you.

At this point I am really just experimenting with potential multi-statemap connections and how to properly handle failure.

mhite · 2022-08-08T01:21:15Z

I've pondered this a bit more and have another approach I'm experimenting with that involves waitgroups. The idea is that when at least one statemap connection goroutine exits, a watchdog goroutine shuts down all the statemap tcp connections. this should force closure of all the channels the other statemap goroutines are receiving from and they should all exit. once every goroutine has exited, a second waitgroup in the watchdog will unblock and then cleanly shutdown the top-level channels the statemap connection goroutines were sending on as they received state map updates from their own connection channels. this then allows program flow control to continue and program shutdown to commence. at least that's the theory!

	stateCh := make(chan stateUpdate, len(establishedStateMapDevices))
	stateErrCh := make(chan stateError, len(establishedStateMapDevices))

	var wg1, wg2 sync.WaitGroup
	wg1.Add(1)
	var once sync.Once

	for i, smd := range establishedStateMapDevices {
		wg2.Add(1)
		go func(ch chan<- stateUpdate, errCh chan<- stateError, smCh <-chan *stagelinq.State, smErrCh <-chan error, i int) {
			defer once.Do(func() {
				wg1.Done()
			})
			defer wg2.Done()
			for state := range smCh {
				ch <- stateUpdate{
					index: i,
					state: state,
				}
			}
			for err := range smErrCh {
				errCh <- stateError{
					index: i,
					err:   err,
				}
			}
		}(stateCh, stateErrCh, smd.stateMapConn.StateC(), smd.stateMapConn.ErrorC(), i)
	}

then we fire off a watchdog goroutine:

	go func() {
		wg1.Wait()
		// we are done, close up shop
		logrus.Debug("statemap fault detected, closing all statemap connections")
		for _, smd := range establishedStateMapDevices {
			err := smd.stateMapTCPConn.Close()
			if err != nil {
				logrus.Error(err)
			}
		}
		wg2.Wait()
		logrus.Debug("All statemap connections gracefully shutdown")
		close(stateCh)
		close(stateErrCh)
		logrus.Debug("All receiver state channels closed")
		logrus.Debug("watchdog goroutine exiting")
	}()

and then we range over our top-level state channel the goroutines are sending to:

...
	for state := range stateCh {
		// do stuff
	}

	for stateErr := range stateErrCh {
		// do stuff
	}
...
	// exit program
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NewStateMapConnection context support? #14

NewStateMapConnection context support? #14

mhite commented Aug 7, 2022 •

edited

Loading

icedream commented Aug 7, 2022 •

edited

Loading

mhite commented Aug 7, 2022 •

edited

Loading

mhite commented Aug 8, 2022 •

edited

Loading

NewStateMapConnection context support? #14

NewStateMapConnection context support? #14

Comments

mhite commented Aug 7, 2022 • edited Loading

icedream commented Aug 7, 2022 • edited Loading

mhite commented Aug 7, 2022 • edited Loading

mhite commented Aug 8, 2022 • edited Loading

mhite commented Aug 7, 2022 •

edited

Loading

icedream commented Aug 7, 2022 •

edited

Loading

mhite commented Aug 7, 2022 •

edited

Loading

mhite commented Aug 8, 2022 •

edited

Loading