Conversation
I think I might be able to reproduce the problem by sending
Or try iptables maybe.
I ended up reproducing the problem by simply using a second laptop that ran the master branch; it took over 15 min until it tried to reconnect.
With this branch, it detects after 10 secs that there is a problem writing.
If you agree with how everything is done in this branch, I'll add documentation for that config parameter.
stats/out_graphite.go
Outdated
@@ -105,6 +107,7 @@ func (g *Graphite) writer() {
	var ok bool
	for !ok {
		conn = assureConn()
		conn.SetDeadline(time.Now().Add(g.timeout))
can we remove the TODO for this function? looks like we can
stats/config/init.go
Outdated
func ConfigSetup() {
	inStats := flag.NewFlagSet("stats", flag.ExitOnError)
	inStats.BoolVar(&enabled, "enabled", true, "enable sending graphite messages for instrumentation")
	inStats.StringVar(&prefix, "prefix", "metrictank.stats.default.$instance", "stats prefix (will add trailing dot automatically if needed)")
	inStats.StringVar(&addr, "addr", "localhost:2003", "graphite address")
	inStats.IntVar(&interval, "interval", 1, "interval at which to send statistics")
	inStats.DurationVar(&timeout, "timeout", time.Second*10, "timeout after which a read/write is considered not successful")
it's a graphite carbon connection, doesn't look like we read from it.
how about the case of remote closing the connection, do we handle that cleanly (without data loss)?
I tested and confirmed that we do not currently handle closed connections cleanly. As @Dieterbe pointed out, we need to read from the connection until we get an EOF and then close it, as is done in crng.
I have tested the latest commit by making MT write its stats into a local
When I restarted
When I then stopped
stats/out_graphite.go
Outdated
for {
	num, err := conn.Read(b)
	if err == io.EOF {
		log.Info("checkEOF conn.Read returned EOF -> conn is closed. closing conn explicitly")
I know I wrote this, but looking at it now, it's too cryptic / implementation-detailed. we can just say "remote closed conn. closing conn" or something
stats/out_graphite.go
Outdated
}

if err != io.EOF {
	log.Warn("checkEOF conn.Read returned err != EOF, which is unexpected. closing conn. error: %s\n", err)
this whole "checkEOF conn.Read returned err != EOF, which is unexpected" stuff isn't appropriate.
I know I wrote the original code, but looking at it now, there's nothing unexpected about getting a connection reset.
so just print the error and close the conn
stats/out_graphite.go
Outdated
// just in case i misunderstand something or the remote behaves badly
if num != 0 {
	log.Info("checkEOF conn.Read data? did not expect that. data: %s\n", b[:num])
can be simplified. maybe log.Warn "read unexpected data from peer: %s"
@Dieterbe I simplified the error msgs as you commented. I think it's better to still prefix them with something to identify the exact location where the line was logged; otherwise it's going to be hard to tell where certain errors came from if we only print the connection error without any prefix.
stats/out_graphite.go
Outdated
}

if err != io.EOF {
	log.Warn("Graphite.checkEOF: %s\n", err)
also log "closing conn" for clarity and symmetry with the other error case.
updated again: 108f75e
stats/out_graphite.go
Outdated
@@ -105,6 +108,7 @@ func (g *Graphite) writer() {
	var ok bool
	for !ok {
		conn = assureConn()
		conn.SetDeadline(time.Now().Add(g.timeout))
I think this needs to just be conn.SetWriteDeadline(). Calling SetDeadline() will cause the checkEOF() read to time out.
right, updated
@@ -118,3 +122,31 @@ func (g *Graphite) writer() {
}
I think there's a race condition here. Note how the write routine can set conn to nil, but checkEOF requires it to be non-nil.
In particular, the conn.Close() will activate the Read in checkEOF, which will get an error and try to call Close() on a pointer that can be nil.
I think you're right... that means I'll need to put a lock around conn
that should do it: 113554f
looks good but why do we need the changes to how the conn variable is being set?
Having conn = assureConn() is redundant. The assureConn func is working with the exact same conn that we are going to write to, as they are all in the same scope.
fix #908
I created this patch, but so far I was not able to reproduce the problem that this is supposed to fix in my test env. I think reproducing it is hard because when I run a service like e.g.
nc -l -p 2003
and make MT connect to that, at the moment when I kill the service a FIN gets sent to the client, which notifies the client of the connection closing, and MT immediately starts trying to reconnect. In prod this might be different: when a carbon-relay-ng pod gets killed, MT may be left waiting for a response.