mt-parrot: continuous validation by sending dummy stats and querying them back #1680
cmd/mt-parrot/monitor.go
Outdated
// total number of entries where drift occurred
fmt.Printf("parrot.monitoring.nonMatching;partition=%d; %d\n", stats.partition, stats.numNonMatching)
fmt.Println()
}
These are the proposed/sample metrics I'm thinking about emitting... do these make sense?
can we register the metrics up front?
I think this set is pretty good.
When I think about "what could problems in the data be?", I think it could be any combination of:
1. incorrect values (val != ts): the number could be off or be null.
2. too many points / not enough points / unevenly spaced points / incorrect timestamps (all points should divide by the period without remainder and be 1 period apart) / points not in sorted order
1 is well covered by the nancount and deltaSum, but 2 isn't really covered yet.
The delay (delta between last point sent and last non-null point seen) is also interesting. "lag" seems to monitor something a bit different.
> can we register the metrics up front?

So that would probably be a map of partition -> struct of partition specific metrics?

> 2 isn't really covered yet

I'll work something up on those

> The delay (delta between last point sent and last non-nullpoint seen) is also interesting. "lag" seems to want to monitor something a bit differently.

Not sure what the latter portion of that means
latter portion: parrot.monitoring.lag measures something different than what I would expect it to measure.
> So that would probably be a map of partition -> struct of partition specific metrics?

or a slice, which is just a bit more efficient to work with than a map (we can exploit the fact that partition IDs are a sequence that starts at 0, which works well as slice indices)
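A minimal sketch of the slice idea, under stated assumptions: `partitionMetrics` and `newPartitionMetrics` are hypothetical names, and plain ints stand in for whatever stats types the real code would register. The metric name in the output line is the one from the diff above.

```go
package main

import "fmt"

// partitionMetrics is a hypothetical set of per-partition counters.
type partitionMetrics struct {
	nanCount       int
	numNonMatching int
}

// newPartitionMetrics allocates one entry per partition up front.
// Partition IDs start at 0 and are contiguous, so the ID doubles as the slice index.
func newPartitionMetrics(partitionCount int) []partitionMetrics {
	return make([]partitionMetrics, partitionCount)
}

func main() {
	metrics := newPartitionMetrics(4)

	// Record three non-matching entries for partition 2: a direct index, no map lookup.
	metrics[2].numNonMatching += 3

	for partition, m := range metrics {
		fmt.Printf("parrot.monitoring.nonMatching;partition=%d; %d\n", partition, m.numNonMatching)
	}
}
```

Compared to a map, the slice needs no hashing, iterates in partition order, and makes a missing partition impossible once it is allocated.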
CI is currently failing due to parrot metrics not being defined in I feel like that's potentially confusing to document them with the main metrictank metrics.
The best solution for that would be if we can generate separate metrics.md files based on the binary.
This came up in a 1-1 with @woodsaj; his advice was to not worry about documenting parrot metrics for now, so I've disabled metrics2docs for the parrot command by removing the
Yep, fixing metrics2docs is well outside of scope for this PR, so let's not let it block us from getting this work merged. I have opened an issue for improving metrics2docs.
we wouldn't want to process old ticks with a lag as that would produce incorrect outcomes
introduce "post nan points" concept
@fitzoh I have pushed a bunch of commits that should address all my comments, as well as some feedback that was easier to just do rather than type it out.
should be good enough.
we can do further tweaks in subsequent PR's.
…for each metrictank partition.
Builds on #1665, has initial support for querying metrics back out of metrictank.
Doesn't actually emit useful stats on the results yet, but I wanted to get some feedback on the type of stats to emit.
monitor.go is probably the most interesting file at the moment.