
set deadline before stats write/read #918

Merged: 8 commits merged into master from stats_conn_timeout on May 28, 2018
Conversation

@replay (Contributor) commented May 18, 2018

I created this patch, but so far I was not able to reproduce the problem it is supposed to fix in my test env. I think the reason why reproducing it is hard is that when I run a service like e.g. nc -l -p 2003 and make MT connect to it, then at the moment I kill the service a FIN gets sent to the client, which notifies the client of the connection closing, and MT immediately starts trying to reconnect. In prod this might be different: when a carbon-relay-ng pod gets killed, MT is left waiting for a response.

@replay requested a review from Dieterbe May 18, 2018 18:57

@replay (Contributor, Author) commented May 18, 2018

I think I might be able to reproduce the problem by sending SIGSTOP to nc while MT is connected to it, because that should make nc do nothing at all, which might be more similar to what happens when a crng pod gets killed.

@Dieterbe (Contributor) commented
Or try iptables maybe

@replay (Contributor, Author) commented May 18, 2018

I ended up reproducing the problem by simply using a second laptop that ran nc -l -p 2003, then dropping everything on that port with iptables.

With the master branch it took over 15 minutes until it tried to reconnect:

2018/05/18 21:24:49 [W] stats failed to write to graphite: write tcp 172.18.0.11:47740->192.168.0.16:2003: write: connection timed out (took 15m50.293460176s). will retry...

With this branch, it detects that there is a problem writing after 10 seconds:

2018/05/18 21:03:31 [W] stats failed to write to graphite: write tcp 172.18.0.11:46756->192.168.0.16:2003: i/o timeout (took 10.000069578s). will retry...

If you agree with how everything is done in this branch, I'll add documentation for that config parameter.

@@ -105,6 +107,7 @@ func (g *Graphite) writer() {
	var ok bool
	for !ok {
		conn = assureConn()
		conn.SetDeadline(time.Now().Add(g.timeout))
Contributor:
can we remove the TODO for this function? looks like we can


func ConfigSetup() {
	inStats := flag.NewFlagSet("stats", flag.ExitOnError)
	inStats.BoolVar(&enabled, "enabled", true, "enable sending graphite messages for instrumentation")
	inStats.StringVar(&prefix, "prefix", "metrictank.stats.default.$instance", "stats prefix (will add trailing dot automatically if needed)")
	inStats.StringVar(&addr, "addr", "localhost:2003", "graphite address")
	inStats.IntVar(&interval, "interval", 1, "interval at which to send statistics")
	inStats.DurationVar(&timeout, "timeout", time.Second*10, "timeout after which a read/write is considered not successful")
Contributor:
it's a graphite carbon connection, doesn't look like we read from it.

@Dieterbe (Contributor) commented
How about the case of the remote closing the connection, do we handle that cleanly (without data loss)?
carbon-relay-ng has some code to handle that case: https://github.com/graphite-ng/carbon-relay-ng/blob/master/destination/conn.go#L107

@woodsaj (Member) commented May 21, 2018

I tested and confirmed that we do not currently handle closed connections cleanly. As @Dieterbe pointed out, we need to read from the connection until we get an EOF and then close it, as is done in crng.
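For reference, a minimal sketch of that read-until-EOF approach (illustrative only, not the exact patch in this PR; it uses the stdlib logger in place of the project's and assumes a plain net.Conn):

package main

import (
	"io"
	"log"
	"net"
)

// checkEOF blocks reading from conn until the remote closes it (EOF) or the
// read fails, then closes the connection so the writer can reconnect.
func checkEOF(conn net.Conn) {
	b := make([]byte, 1024)
	for {
		num, err := conn.Read(b)
		if err == io.EOF {
			log.Println("checkEOF: remote closed conn. closing conn")
			conn.Close()
			return
		}
		if err != nil {
			// e.g. "connection reset by peer": log it, close and bail out
			log.Printf("checkEOF: %s. closing conn", err)
			conn.Close()
			return
		}
		// we never expect the peer to send data back on this connection
		if num != 0 {
			log.Printf("checkEOF: read unexpected data from peer: %q", b[:num])
		}
	}
}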

@replay (Contributor, Author) commented May 21, 2018

I have tested the latest commit by making MT write its stats into a local nc process. Then I just killed nc and this was the MT output:

2018/05/21 17:55:48 [W] checkEOF conn.Read returned err != EOF, which is unexpected.  closing conn. error: read tcp 172.18.0.11:51586->192.168.0.11:2003: read: connection reset by peer

When I restarted nc, MT reconnected almost instantly and continued writing. When I then typed stuff into nc, I got it logged by MT, as expected:

2018/05/21 18:00:36 [I] checkEOF conn.Read data? did not expect that.  data: fjdl

When I then stopped nc cleanly with Ctrl+C, the EOF got received, as expected:

2018/05/21 18:00:36 [I] checkEOF conn.Read returned EOF -> conn is closed. closing conn explicitly

	for {
		num, err := conn.Read(b)
		if err == io.EOF {
			log.Info("checkEOF conn.Read returned EOF -> conn is closed. closing conn explicitly")
Contributor:
I know I wrote this, but looking at it now, it's too cryptic / implementation-detailed. We can just say "remote closed conn. closing conn" or something.

		}

		if err != io.EOF {
			log.Warn("checkEOF conn.Read returned err != EOF, which is unexpected. closing conn. error: %s\n", err)
Contributor:
This whole "checkEOF conn.Read returned err != EOF, which is unexpected" stuff isn't appropriate.
I know I wrote the original code, but looking at it now, there's nothing unexpected about getting a connection reset.

Contributor:
so just print the error and close the conn


		// just in case i misunderstand something or the remote behaves badly
		if num != 0 {
			log.Info("checkEOF conn.Read data? did not expect that. data: %s\n", b[:num])
Contributor:
can be simplified. maybe log.Warn "read unexpected data from peer: %s"

@replay (Contributor, Author) commented May 22, 2018

@Dieterbe I simplified the error msgs as you commented. I think it's better to still prefix them with something that identifies the exact location where the message was logged; otherwise it's going to be hard to tell where certain errors came from if we only print the connection error without any prefix.
3c01e69

		}

		if err != io.EOF {
			log.Warn("Graphite.checkEOF: %s\n", err)
Contributor:
also log "closing conn" for clarity and symmetry with the other error case.

@replay (Contributor, Author) May 22, 2018:

updated again 108f75e

@@ -105,6 +108,7 @@ func (g *Graphite) writer() {
	var ok bool
	for !ok {
		conn = assureConn()
		conn.SetDeadline(time.Now().Add(g.timeout))
Member:
I think this needs to just be conn.SetWriteDeadline(). Calling SetDeadline() will cause the checkEOF() read to timeout

Contributor Author:
right, updated
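To illustrate the distinction (a sketch under assumed names, not the PR's exact code): SetWriteDeadline only bounds writes, so the blocking conn.Read in checkEOF keeps waiting indefinitely, whereas SetDeadline would make that read return a timeout error after every g.timeout.

package main

import (
	"log"
	"net"
	"time"
)

// writeWithDeadline bounds only the write; a concurrent blocking Read (as in
// checkEOF) is unaffected. With SetDeadline instead, the read would also be
// cut off after the timeout.
func writeWithDeadline(conn net.Conn, buf []byte, timeout time.Duration) error {
	conn.SetWriteDeadline(time.Now().Add(timeout))
	if _, err := conn.Write(buf); err != nil {
		// a stalled peer now surfaces as "i/o timeout" after `timeout`
		// instead of hanging until the kernel gives up (the ~15 min seen above)
		log.Printf("stats failed to write to graphite: %s. will retry...", err)
		return err
	}
	return nil
}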

@@ -118,3 +122,31 @@ func (g *Graphite) writer() {
}
Contributor:
I think there's a race condition here. Note how the write routine can set conn to nil, but checkEOF requires it to be non-nil.
In particular, the conn.Close() will activate the Read in checkEOF, which will get an error and try to call Close() on a pointer that can be nil.

Contributor Author:
I think you're right... that means I'll need to put a lock around conn

Contributor Author:
that should do it: 113554f
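One way to do that (an illustrative sketch, not the code in 113554f): guard the shared conn with a mutex so the writer and checkEOF never race on closing it or setting it to nil.

package main

import (
	"net"
	"sync"
)

// connHolder serializes access to the shared connection.
type connHolder struct {
	mu   sync.Mutex
	conn net.Conn
}

// closeAndClear closes the current connection (if any) and clears it, so a
// concurrent caller never ends up calling Close on a nil conn.
func (h *connHolder) closeAndClear() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.conn != nil {
		h.conn.Close()
		h.conn = nil
	}
}

// get returns the current connection; callers must handle a nil result.
func (h *connHolder) get() net.Conn {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.conn
}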

Contributor:
Looks good, but why do we need the changes to how the conn variable is being set?

@woodsaj (Member) May 28, 2018:

Having conn = assureConn() is redundant.

The assureConn func is working with the exact same conn that we are going to write to, as they are all in the same scope.

@Dieterbe merged commit b8a7464 into master May 28, 2018
@Dieterbe (Contributor) commented
fix #908

@Dieterbe deleted the stats_conn_timeout branch September 18, 2018 09:08
@Dieterbe added this to the 0.10.0 milestone Dec 12, 2018