WIP - Basic clustering support - issue #56 #74
Conversation
does it make sense for me to start reviewing or should i hold off until you're more happy with the code?
@Dieterbe hold off for a bit.
Conflicts: metric_tank/aggmetric.go
@Dieterbe this is ready for review now.
This handler is designed to be used by LB healthchecks. The handler will return a 503 unavailable response if the node is not the primary node and has not been running for at least warmUpPeriod seconds. This will ensure that metric Get requests will only be sent to the node when it has sufficient data to serve.
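To make the behaviour concrete, here is a minimal sketch of that healthcheck, assuming hypothetical names (`isPrimary`, `warmUpPeriod`, `nodeStartTime`) and a hypothetical `/cluster/status` route; the actual handler in the PR may differ:

```go
package main

import (
	"net/http"
	"time"
)

// hypothetical state, assumed for illustration only
var (
	isPrimary     = false
	warmUpPeriod  = 600 * time.Second // how long a secondary must run before serving Gets
	nodeStartTime = time.Now()
)

// getClusterStatus is what an LB healthcheck would hit: it returns 503 when
// this node is not the primary and has not yet been running for warmUpPeriod,
// so metric Get requests are only routed here once it has sufficient data.
func getClusterStatus(w http.ResponseWriter, r *http.Request) {
	if !isPrimary && time.Since(nodeStartTime) < warmUpPeriod {
		http.Error(w, "node is still warming up", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/cluster/status", getClusterStatus)
	http.ListenAndServe(":6063", nil)
}
```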
func (a *AggMetric) getChunkByT0(ts uint32) *Chunk {
Function assumes the lock is held? Put that in a comment. Also, the ByT0 suffix can be dropped; rename arg1 to t0.
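Applied to the line above, the suggestion would look roughly like this; the lookup body and field names are only a guess for illustration, not this PR's code:

```go
package sketch

// illustrative minimal types; the real AggMetric has more fields
type Chunk struct{ T0 uint32 }

type AggMetric struct{ Chunks []*Chunk }

// getChunk returns the chunk whose span starts at t0, or nil if it is not
// in memory. Assumes the caller already holds the AggMetric lock.
func (a *AggMetric) getChunk(t0 uint32) *Chunk {
	for _, c := range a.Chunks {
		if c != nil && c.T0 == t0 {
			return c
		}
	}
	return nil
}
```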
overall looks pretty good. similar to what i had in mind. how far can we go in assuring that a secondary node is able to get all messages from a primary, so that it doesn't end up saving (overwriting?) partial chunks that were already saved properly but for some reason (nsqd disconnect/timeout) it lost the message? i made a comment in the line where the nsqd consumer is created about using a non-ephemeral disk-backed channel. I think we should aim to guarantee delivery except when a nsqd machine just fatally dies or something, or is not brought back up before we want to promote to primary.
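For reference, a durable (non-ephemeral) channel with the go-nsq client would be wired up roughly like this; the topic/channel names and `handleMetricPersist` are assumptions for illustration, not necessarily what the PR uses:

```go
package main

import (
	"log"

	"github.com/nsqio/go-nsq"
)

// handleMetricPersist would apply a metric-persist message to local state;
// it is a placeholder for illustration.
func handleMetricPersist(body []byte) error { return nil }

func main() {
	cfg := nsq.NewConfig()
	// A channel name without the "#ephemeral" suffix is durable: nsqd keeps
	// (and, if needed, disk-backs) messages for it across consumer restarts,
	// which is what gives delivery guarantees short of losing an nsqd node.
	consumer, err := nsq.NewConsumer("metricpersist", "metric_tank", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		return handleMetricPersist(m.Body)
	}))
	if err := consumer.ConnectToNSQLookupd("localhost:4161"); err != nil {
		log.Fatal(err)
	}
	<-consumer.StopChan
}
```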
@@ -28,13 +28,15 @@ var (
	showVersion = flag.Bool("version", false, "print version string")
	dryRun      = flag.Bool("dry", false, "dry run (disable actually storing into cassandra")
now that we have better terminology, i think it would be better to use a var named primaryMode bool, defaulting to true, or something.
- rename SaveChunkByT0 to SyncChunkSaveState
- fix incorrect mod instead of division
- use a non-ephemeral channel for the metric-persist messages. As the metrics channel is non-ephemeral, this ensures that metric-persist messages get buffered just like metrics between restarts and outages.
- convert cluster API to just use basic form data for setting the primary state. state can now be changed with `curl -X POST http://localhost:6063/cluster?primary=true`
- rename "dryRun" cli argument to "primaryNode"
- ensure cassandra is initialized before we start consuming metrics
- update comments to ensure clarity
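A rough sketch of what such a form-data based cluster endpoint could look like; the handler and variable names are illustrative, not necessarily the commit's actual code:

```go
package main

import (
	"net/http"
	"strconv"
)

var isPrimary = true // hypothetical global primary flag

// setCluster toggles primary state from basic form data, so
// `curl -X POST http://localhost:6063/cluster?primary=true` works
// (FormValue reads both the query string and a POSTed form body).
func setCluster(w http.ResponseWriter, r *http.Request) {
	if r.Method != "POST" {
		http.Error(w, "POST required", http.StatusMethodNotAllowed)
		return
	}
	primary, err := strconv.ParseBool(r.FormValue("primary"))
	if err != nil {
		http.Error(w, "primary must be true or false", http.StatusBadRequest)
		return
	}
	isPrimary = primary
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/cluster", setCluster)
	http.ListenAndServe(":6063", nil)
}
```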
Thanks for all the feedback Dieter.
It is generally ok if a secondary node saves a chunk to Cassandra that has already been previously saved. Cassandra will just tombstone the original write and store the new write, i.e. a normal Cassandra update. The only time this is problematic is if the second save is only of a partial chunk. The only time the metricPersist message would be lost is if the primary node crashes or is stopped between writing data to Cassandra and publishing the metricPersist message. So delivery of the messages is reliable enough. What is more important is to ensure that the secondary nodes have processed all metricPersist messages before they are promoted to primary. As this is still a manual process, that can easily be controlled. I think there are still a few issues we need to resolve here, related to this problem.
This is an issue when saving chunks (we really only want to save partial chunks when the node has been primary since startup), and also when handling Get methods (if the chunk is partial, then we should fetch the chunk from cassandra instead).
or if nsqd (machine) crashes irrecoverably but i'm not too worried about that.
re your comment on stopping the program: when we stop the program, we just need to ensure it has sufficient time to send the persist messages after persisting the chunks.
yeah i figure we could just look at nsqadmin and verify that the new instance has 0 backlog on the persist topic of all of the nsqd's.
agreed on only do this when it was primary at startup.
yes, except when we're in primary mode I think. MT starting up in primary mode is a pretty good indication that what we have in RAM is the best we have and that we'll have a gap in data in cassandra (i.e. starting in primary mode is a worst case). we could just track for each chunk what's the first ts "contained within": if we created the chunk and there was a chunk before this, then we know the chunk has the data from its t0 and onwards (irrespective of whether there's a point with that ts in it), but if it was the first chunk created, then it only has data from the ts of the first point. this might still be enough to satisfy certain read requests that only need a section of the chunk. But this should be enough to add that nuance to the
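A sketch of that "first ts contained within" idea, with made-up field names rather than the real Chunk struct:

```go
package sketch

// Chunk records, besides its window start t0, the ts of the first point
// that was actually added to it.
type Chunk struct {
	T0      uint32 // start of the chunk's time window
	FirstTs uint32 // ts of the first point added to this chunk
}

// HasDataFrom reports whether the chunk is known to hold all data from ts
// onwards. If this was the first chunk created after startup, it only holds
// data from FirstTs; otherwise it is complete from its T0.
func (c *Chunk) HasDataFrom(ts uint32, firstChunkSinceStartup bool) bool {
	if firstChunkSinceStartup {
		return ts >= c.FirstTs
	}
	return ts >= c.T0
}
```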
Conflicts: metric_tank/aggmetric.go
The first chunk recorded is likely a partial chunk. When not the cluster primary, don't serve the first chunk out of memory; instead, the chunk will be fetched from Cassandra.
When there is no primary node (due to crash or shutdown) chunks will not be getting saved. When a secondary node gets promoted to primary there may be older chunks that never got saved and this commit ensures that we check for this condition. If there are older chunks that need saving, we add them to the writeQueue. Older chunks are added to the writeQueue before newer chunks.
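A minimal sketch of that promotion-time check, assuming illustrative types and names (`Closed`, `Saved`, `writeQueue`) rather than the commit's actual code:

```go
package sketch

// minimal illustrative types; the real AggMetric is more involved
type Chunk struct {
	T0     uint32
	Closed bool // chunk is full, no more points will be added
	Saved  bool // already written (or queued) to cassandra
}

type AggMetric struct {
	Chunks     []*Chunk    // ordered oldest to newest
	writeQueue chan *Chunk // consumed by the cassandra writer
}

// persistUnsaved queues any closed-but-unsaved chunks for writing, oldest
// first, which is what a freshly promoted primary needs to do to cover the
// window in which no node was saving. Assumes the caller holds the lock.
func (a *AggMetric) persistUnsaved() {
	for _, c := range a.Chunks {
		if c == nil || !c.Closed || c.Saved {
			continue
		}
		c.Saved = true
		a.writeQueue <- c
	}
}
```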
I think these latest changes should get us to a good enough position.
thoughts @Dieterbe
looks good. can be merged. 2 points:
Good catch, Dieter. I had already corrected this, but had not pushed the commit yet.
looks good to me. you have some logic that would benefit a lot from unit tests though.
this is an old optimization (?) that has been with us since a long time ago: #74 2029113

here's how it caused data loss at read time:

- when only 1 chunk of data had been filled: the "update" of the field is a no-op because len(chunks) == 1, so oldPos goes back to 0 (not sure if intentional or a bug), so reading the first chunk worked.
- once you have more than 1 chunk: the update of oldPos works. we start hitting cassandra. depending on how long the chunk takes to get saved to cassandra, we will miss data at read time. also, our chunk cache does not cache absence of data, hitting cassandra harder during this period.
- once the chunk is saved to cassandra the problem disappears.
- once the circular buffer recycles the first time (effectively removing the first chunk) this optimization no longer applies, but at that point we still hit cassandra just as before.

This problem is now solved. However, removing that code enables another avenue for data loss at read time:

- when a read node starts (without data backfill)
- or a read node starts with data backfill, but the backfill doesn't have old data for the particular metric, IOW when the data only covers 1 chunk's worth
- a read node starts with data backfill, but since backfilling starts at arbitrary positions, the first chunk will miss some data in the beginning.

In all these cases, the first chunk is a partial chunk, whereas a full version of the chunk is most likely already in cassandra. To make sure this is not a problem, if the first chunk we used was partial, we set oldest to the first timestamp, so that the rest can be retrieved from cassandra. Typically, this will cause the "same" chunk (but a full version) to be retrieved from cassandra, which is then cached and seamlessly merged via Fix()

fix #78 fix #988
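The gist of that fix, sketched with illustrative names (`FirstTs` standing in for the ts of the first point the in-memory chunk actually holds):

```go
package sketch

type Chunk struct {
	T0      uint32 // start of the chunk's window
	FirstTs uint32 // ts of the first point actually in the chunk
}

// oldestServable returns the oldest ts the in-memory data can answer for.
// If the first chunk we would read from is partial, we only claim data from
// its first point onwards; the caller then fetches anything older (typically
// a full copy of this same chunk) from cassandra and merges it via Fix().
func oldestServable(first *Chunk, partial bool) uint32 {
	if partial {
		return first.FirstTs
	}
	return first.T0
}
```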
This is not ready for merging yet.